{
  "generated_at": "2026-06-02T13:56:35.975774+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "Topic \"LLM\": 18 papers matched across the LM and Agent Runtime Security feeds; representative papers include \"POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems\" and \"MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation\".",
    "Topic \"Benchmark\": 15 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems\" and \"MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation\".",
    "Topic \"Language Model\": 7 papers matched across the LM and Agent Runtime Security feeds; representative papers include \"AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design\" and \"Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation\".",
    "Topic \"Agent\": 2 papers matched across the LM and Terminal and SWE Agents feeds; representative papers include \"AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents\" and \"SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction\".",
    "Topic \"Guardrail\": 1 paper matched in the Agent Runtime Security feed; the representative paper is \"PyFEX: Uncovering Evasive Python-based Threats via Resilient and Exhaustive Path Exploration\"."
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 18,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems",
        "MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation",
        "Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling",
        "Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation",
        "K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts",
        "Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback",
        "Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference",
        "MOC: Multi-Order Communication in LLM-based Multi-Agent Systems",
        "ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning",
        "Unified Context Evolution for LLM Agents",
        "AdaCodec: A Predictive Visual Code for Video MLLMs",
        "SimSD: Simple Speculative Decoding in Diffusion Language Models",
        "SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment",
        "Jailbreaking Multimodal Large Language Models using Multi-Clip Video",
        "SentGuard: Sentence-Level Streaming Guardrails for Large Language Models",
        "AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations",
        "SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning",
        "SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents"
      ],
      "key_points": [
        "\"POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems\" [Evaluation / Application / Method]: Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations…",
        "\"MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation\" [Evaluation / Application / Method]: The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and…"
      ]
    },
    {
      "name": "Benchmark",
      "paper_count": 15,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems",
        "MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation",
        "Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling",
        "AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design",
        "K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts",
        "AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents",
        "ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning",
        "Unified Context Evolution for LLM Agents",
        "AdaCodec: A Predictive Visual Code for Video MLLMs",
        "SimSD: Simple Speculative Decoding in Diffusion Language Models",
        "SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment",
        "SentGuard: Sentence-Level Streaming Guardrails for Large Language Models",
        "AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations",
        "SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents",
        "SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction"
      ],
      "key_points": [
        "\"POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems\" [Evaluation / Application / Method]: Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations…",
        "\"MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation\" [Evaluation / Application / Method]: The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 7,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design",
        "Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation",
        "Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback",
        "Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference",
        "MOC: Multi-Order Communication in LLM-based Multi-Agent Systems",
        "Jailbreaking Multimodal Large Language Models using Multi-Clip Video",
        "SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning"
      ],
      "key_points": [
        "\"AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design\" [Evaluation / Method]: Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback…",
        "\"Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation\" [Application / Method]: Large language models now power robo-advisors and trading agents, yet whether they carry built-in biases toward specific assets is largely untested. We ask thr…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 2,
      "feed_names": [
        "LM",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents",
        "SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction"
      ],
      "key_points": [
        "\"AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents\" [Evaluation / Application / Method]: Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes…",
        "\"SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction\" [Evaluation / Application / Method]: Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a v…"
      ]
    },
    {
      "name": "Guardrail",
      "paper_count": 1,
      "feed_names": [
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "PyFEX: Uncovering Evasive Python-based Threats via Resilient and Exhaustive Path Exploration"
      ],
      "key_points": [
        "\"PyFEX: Uncovering Evasive Python-based Threats via Resilient and Exhaustive Path Exploration\" [Application]: The rapid expansion of the Python ecosystem has fueled two distinct but converging threats: adversaries increasingly target the software supply chain via the P…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "\"POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems\" [Evaluation / Application / Method]: Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations…",
        "\"MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation\" [Evaluation / Application / Method]: The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and…",
        "\"Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling\" [Evaluation / Data / Method]: Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical…",
        "\"AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design\" [Evaluation / Method]: Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback…",
        "\"Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation\" [Application / Method]: Large language models now power robo-advisors and trading agents, yet whether they carry built-in biases toward specific assets is largely untested. We ask thr…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems",
          "summary": "Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains, a gap made legally untenable by emerging AI regulation. Existing evaluation paradigms share a common flaw: centralised judgment creates single points of failure and demands domain-specific expertise. Here we present POIROT, a protocol that repurposes a system's own agents as its diagnostic layer, leveraging the epistemic diversity already present in the architecture. Across evaluated settings, POIROT outperforms single-LLM evaluator baselines, with gains that scale with problem complexity (OR = 1.60, p = 0.008), agent count, and fault dimensionality, persisting under compound fault conditions. These results demonstrate that safety oversight need not be externalised: the agents executing a role carry sufficient collective intelligence to audit it. We release POIROT as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical multi-agent systems.",
          "authors": [
            "Iñaki Dellibarda Varela",
            "R. Sendra-Arranz",
            "Pablo Romero-Sorozabal",
            "J. M. Valverde-García",
            "Annemarie F. Laudanski",
            "Álvaro Gutiérrez",
            "Eduardo Rocon",
            "Manuel Cebrian"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02282v1",
          "abstract_url": "https://arxiv.org/abs/2606.02282v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02282v1",
          "published_at": "2026-06-01T14:05:35+00:00",
          "updated_at": "2026-06-01T14:05:35+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02282",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02282v1"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02282"
        },
        {
          "title": "MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation",
          "summary": "The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely-used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark's crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona.",
          "authors": [
            "Wenhao Wang",
            "Peizhi Niu",
            "Gongyi Zou",
            "Xiyuan Yang",
            "Jingxing Wang",
            "Haoting Shi",
            "Yaxin Du",
            "Jingyi Chai",
            "Xianghe Pang",
            "Shuo Tang",
            "Yanfeng Wang",
            "Siheng Chen"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02470v1",
          "abstract_url": "https://arxiv.org/abs/2606.02470v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02470v1",
          "published_at": "2026-06-01T16:44:10+00:00",
          "updated_at": "2026-06-01T16:44:10+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02470",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02470v1"
          },
          "relevance_score": 184,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02470"
        },
        {
          "title": "Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling",
          "summary": "Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Through controlled visual perturbations, existing multimodal judges frequently anchor on the response text instead of their own visual perception, leading to inconsistent and non-verifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.",
          "authors": [
            "Seojeong Park",
            "Jiho Choi",
            "Junyong Kang",
            "Seonho Lee",
            "Jaeyo Shin",
            "Hyunjung Shim"
          ],
          "categories": [
            "cs.CV",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02578v1",
          "abstract_url": "https://arxiv.org/abs/2606.02578v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02578v1",
          "published_at": "2026-06-01T17:59:46+00:00",
          "updated_at": "2026-06-01T17:59:46+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Data",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02578",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02578v1"
          },
          "relevance_score": 178,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02578"
        },
        {
          "title": "AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design",
          "summary": "Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback or redirect generation when a candidate violates thermodynamic or structural constraints. We introduce AgentPLM, which addresses this by equipping a pre-trained PLM with i) Reasoning-Augmented Decoding (RAD), which interleaves autoregressive generation with tool calls (ESMFold, FoldX, AutoDock Vina), and ii) Contrastive Agent Policy Optimisation (CAPO), a trajectory-level extension of direct preference optimisation that trains the policy end-to-end to learn when oracle feedback is informative rather than merely imitating high-fitness sequences. We evaluate AgentPLM on benchmark tasks spanning de novo enzyme design, antibody optimisation, thermostability, PPI interface design, and zero-shot fitness prediction with standardised oracle APIs and controlled sequence-identity splits. AgentPLM achieves state-of-the-art results with a gain in antibody top-10% hit rate over the strongest passive baseline, providing mechanistic evidence of online error correction without explicit backtracking.",
          "authors": [
            "Sahil Rahman",
            "Maxx Richard Rahman"
          ],
          "categories": [
            "cs.AI",
            "q-bio.QM"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02386v1",
          "abstract_url": "https://arxiv.org/abs/2606.02386v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02386v1",
          "published_at": "2026-06-01T15:35:02+00:00",
          "updated_at": "2026-06-01T15:35:02+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.02386",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02386v1"
          },
          "relevance_score": 165,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02386"
        },
        {
          "title": "Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation",
          "summary": "Large language models now power robo-advisors and trading agents, yet whether they carry built-in biases toward specific assets is largely untested. We ask three questions: do LLMs systematically prefer certain financial instruments; can an internal representation with causal leverage over those preferences be identified; and does that representation affect downstream financial decisions? We develop a three-level audit protocol and apply it to Bitcoin. First, a behavioral audit of eight frontier LLMs shows that Bitcoin's ranking among money-like instruments is frame-dependent: models place it around rank 5 of 8 as \"reliable money\" but near the top under crisis and autonomous-agent frames, and an attribute-swap experiment confirms rankings track functional properties, not names. Second, we open a model's internals: a search across thousands of sparse-autoencoder features in Gemma 3 identifies a dominant Bitcoin-selective feature. Amplifying it shifts the model toward the asset and suppressing it shifts the model away, even when \"Bitcoin\" never appears in the prompt. Third, we test financial consequences: amplification raises Bitcoin's portfolio share by 5.2 percentage points while suppression lowers it by 4.6 pp, with amplification reallocating within crypto and suppression cutting total crypto exposure. We characterize this as bounded behavioral leverage (leverage meaning causal influence over outputs, not financial leverage): an identifiable internal feature can be perturbed to move financial choices, but only within measurable limits. The framework links internal representations to external recommendations, validated with random controls and mechanism boundaries. As LLMs become autonomous financial agents, this is a first step toward a behavioral layer for emerging know-your-agent (KYA) standards: knowing what an agent prefers, and how far that preference can be moved.",
          "authors": [
            "Wenbin Wu"
          ],
          "categories": [
            "q-fin.GN",
            "cs.CY",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02528v1",
          "abstract_url": "https://arxiv.org/abs/2606.02528v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02528v1",
          "published_at": "2026-06-01T17:36:06+00:00",
          "updated_at": "2026-06-01T17:36:06+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.02528",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02528v1"
          },
          "relevance_score": 163,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"agent\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02528"
        },
        {
          "title": "K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts",
          "summary": "Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.",
          "authors": [
            "Nahyun Lee",
            "Dongkeun Yoon",
            "Guijin Son",
            "Geewook Kim",
            "Dayoon Ko",
            "Jeonghun Park",
            "Haneul Yoo",
            "Jaewon Cho",
            "Junghun Park",
            "Changyoon Lee",
            "Kyochul Jang",
            "Jaeyeon Kim",
            "Eunsu Kim",
            "Woojin Cho",
            "Seungone Kim"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02404v1",
          "abstract_url": "https://arxiv.org/abs/2606.02404v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02404v1",
          "published_at": "2026-06-01T15:50:03+00:00",
          "updated_at": "2026-06-01T15:50:03+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02404",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02404v1"
          },
          "relevance_score": 161,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02404"
        },
        {
          "title": "AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents",
          "summary": "Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AGENTCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.",
          "authors": [
            "Yiheng Shu",
            "Bernal Jiménez Gutiérrez",
            "Saisri Padmaja Jonnalagedda",
            "Yuguang Yao",
            "Huan Sun",
            "Yu Su"
          ],
          "categories": [
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02461v1",
          "abstract_url": "https://arxiv.org/abs/2606.02461v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02461v1",
          "published_at": "2026-06-01T16:32:59+00:00",
          "updated_at": "2026-06-01T16:32:59+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.02461",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02461v1"
          },
          "relevance_score": 144,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"evaluation\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02461"
        },
        {
          "title": "Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback",
          "summary": "Recent evidence shows that people with eating disorders (EDs) are increasingly seeking guidance, advice, and emotional support from Large Language Model (LLM)-based chat systems. Although these systems are not designed to provide clinical advice, their perceived expertise, neutrality and accessibility make them a frequent, albeit risky, source of support. This paper investigates potential patterns of interaction between users with EDs and LLMs, focusing on the potential harms arising from models that uncritically adapt to, and facilitate unsafe or self-harming user requests. We find, in consultation with clinical ED experts, that specific linguistic cues in prompts increase the likelihood of unsafe responses and, through systematically varying the degree of potential risk present in the user prompt, report the extent to which LLMs uncritically adapt to problematic, and potentially dangerous user inputs.",
          "authors": [
            "Giulia Pucci",
            "Emily Hemendinger",
            "Ruizhe Li",
            "Gavin Abercrombie",
            "Tanvi Dinkar",
            "Arabella Sinclair"
          ],
          "categories": [
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02444v1",
          "abstract_url": "https://arxiv.org/abs/2606.02444v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02444v1",
          "published_at": "2026-06-01T16:14:18+00:00",
          "updated_at": "2026-06-01T16:14:18+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.02444",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02444v1"
          },
          "relevance_score": 144,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"evaluation\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02444"
        },
        {
          "title": "Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference",
          "summary": "Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery through diverse perspectives such as code generation and domain-specific decision-making. Yet, how soft errors propagate and affect LLM inference remains largely unexplored. To bridge this gap, we present a comprehensive study on error propagation in LLM inference, enabled by our proposed LLMFI, a configurable and deterministic fault-injection framework. Using LLMFI, we systematically inject faults across three open-weighted LLMs and thirteen representative tasks, covering reasoning, multilingual, mathematical, and coding domains. In addition, we conduct fine-grained case studies that reveal critical vulnerability patterns. Overall, our study yields 17 takeaways that advance the understanding of error propagation in LLM inference and introduces four low-overhead directions to improve reliability through software-only modification, offering practical guidance for future error detection and mitigation.",
          "authors": [
            "Yafan Huang",
            "Sheng Di",
            "Guanpeng Li"
          ],
          "categories": [
            "cs.DC",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02430v1",
          "abstract_url": "https://arxiv.org/abs/2606.02430v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02430v1",
          "published_at": "2026-06-01T16:04:51+00:00",
          "updated_at": "2026-06-01T16:04:51+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.02430",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02430v1"
          },
          "relevance_score": 144,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02430"
        },
        {
          "title": "MOC: Multi-Order Communication in LLM-based Multi-Agent Systems",
          "summary": "Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination topology while largely underexploring the equally critical problem: how to transmit and optimize messages among agents effectively? Current communication schemes typically rely on the direct concatenation of first-order neighbor responses, which induces a restricted evidence receptive field and leads to the dilution of crucial insights over multi-hop paths. To address these limitations, we propose the Multi-Order Communication (MOC) scheme, which reconstructs the inter-agent communication to capture multi-hop dependencies and incorporates a structural message consolidation strategy to ensure efficiency. Specifically, we formalize the communication mechanism to construct a structured multi-order evidence stream, and subsequently design a Semantic-Topological Merging algorithm to optimize semantic fidelity within token constraints. Extensive experiments across six diverse datasets and LLM backbones of varying parameter scales demonstrate that MOC consistently improves task performance and reduces communication costs. The code is available at https://github.com/yao-guan/MOC.",
          "authors": [
            "Yao Guan",
            "Lin Wang",
            "Zhihu Lu",
            "Ziyi Wang",
            "Wenzhu Yan",
            "Qiang Duan"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02359v1",
          "abstract_url": "https://arxiv.org/abs/2606.02359v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02359v1",
          "published_at": "2026-06-01T15:06:38+00:00",
          "updated_at": "2026-06-01T15:06:38+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.02359",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02359v1"
          },
          "relevance_score": 143,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02359"
        },
        {
          "title": "ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning",
          "summary": "Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task interference and promote collaboration, recent methods often employ sparse architectures like Mixture of LoRA Experts with image-text similarity routing. However, tasks with distinct response structures could share highly similar visual-linguistic semantics and thus be wrongly routed to the same expert; image-text similarity alone is insufficient for reliable task assignment. For example, an expert in a grounding task requiring coordinate prediction may be biased toward producing short textual answers after learning semantically similar VQA tasks. This format-blind task assignment integrates heterogeneous response types into shared parameters, inducing gradient interference and ineffective expert collaboration. To address this problem, we propose ProtoAda, a prototype-guided adaptive tuning framework. ProtoAda introduces format-aware task prototypes to align task assignment and routing with both task semantics and output structure, and further consolidates format-compatible updates in a geometry-aware manner to effectively reuse and progressively refine existing parameters. Extensive experiments on multiple benchmarks demonstrate that ProtoAda achieves superior performance, especially on tasks whose answer structures are easily corrupted by sequential tuning.",
          "authors": [
            "Yu-Cheng Shi",
            "Zhen-Hao Xie",
            "Jun-Tao Tang",
            "Da-Wei Zhou"
          ],
          "categories": [
            "cs.CV",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02576v1",
          "abstract_url": "https://arxiv.org/abs/2606.02576v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02576v1",
          "published_at": "2026-06-01T17:59:13+00:00",
          "updated_at": "2026-06-01T17:59:13+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02576",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02576v1"
          },
          "relevance_score": 142,
          "match_reasons": [
            "title matched \"instruction tuning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02576"
        },
        {
          "title": "Unified Context Evolution for LLM Agents",
          "summary": "LLM-based agents can solve multi-step interactive tasks by combining reasoning with environment feedback, yet each episode starts from the same fixed context and any useful strategy discovered along the way is lost once the task ends. Existing approaches either limit learning to the current task or pool all experience into a single untyped store, without distinguishing knowledge types, tracking quality through use, or balancing what the library still lacks. We introduce Unified Context Evolution (UCE), a gradient-free framework that externalizes agent experience into an evolving library of typed Evolvable Context Units (ECUs). UCE decomposes experience into four complementary types (Memory, Strategy, Workflow, and Skill), each generated from trajectories under type-specific conditions, retrieved at decision time, scored through repeated usage outcomes, and pruned when no longer valuable. A scheduling module allocates each cycle's generation budget toward the types where the library is weakest. Across two interactive benchmarks, UCE raises ALFWorld success from 75.4% to 96.3% and WebShop task score from 45.1% to 61.3%, and the accumulated library transfers to alternative actor backbones without retraining.",
          "authors": [
            "Zixuan Zhu",
            "Yitong Hu",
            "Yong Dai",
            "Junfeng Fang",
            "Chunyang Jiang",
            "Senkang Hu",
            "Yuzhi Zhao"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02304v1",
          "abstract_url": "https://arxiv.org/abs/2606.02304v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02304v1",
          "published_at": "2026-06-01T14:25:29+00:00",
          "updated_at": "2026-06-01T14:25:29+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02304",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02304v1"
          },
          "relevance_score": 142,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02304"
        },
        {
          "title": "AdaCodec: A Predictive Visual Code for Video MLLMs",
          "summary": "Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a \\emph{predictive visual code}, and instantiate it for video MLLMs as \\textbf{AdaCodec}. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at $1/7$ the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.",
          "authors": [
            "Haowen Hou",
            "Zhen Huang",
            "Zheming Liang",
            "Qingyi Si",
            "Chenglin Li",
            "Shuai Dong",
            "Kele Shao",
            "Ruilin Li",
            "Dianyi Wang",
            "Nan Duan",
            "Jiaqi Wang"
          ],
          "categories": [
            "cs.CV",
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02569v1",
          "abstract_url": "https://arxiv.org/abs/2606.02569v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02569v1",
          "published_at": "2026-06-01T17:56:35+00:00",
          "updated_at": "2026-06-01T17:56:35+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02569",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02569v1"
          },
          "relevance_score": 141,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02569"
        },
        {
          "title": "SimSD: Simple Speculative Decoding in Diffusion Language Models",
          "summary": "Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose a simple but effective speculative decoding algorithm for diffusion language models, named SimSD, which mainly adopts a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV cache and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46x higher decoding throughput while maintaining and even improving average generation quality.",
          "authors": [
            "Junxia Cui",
            "Haotian Ye",
            "Runchu Tian",
            "Hongcan Guo",
            "Jinya Jiang",
            "Haoru Li",
            "Chaojie Ren",
            "Yiming Huang",
            "Kaijie Zhu",
            "Zhongkai Yu",
            "Kun Zhou",
            "Jingbo Shang"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02544v1",
          "abstract_url": "https://arxiv.org/abs/2606.02544v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02544v1",
          "published_at": "2026-06-01T17:46:46+00:00",
          "updated_at": "2026-06-01T17:46:46+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02544",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02544v1"
          },
          "relevance_score": 141,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02544"
        },
        {
          "title": "SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment",
          "summary": "Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution, alignment requires localized modifications rather than global trade-offs. To this end, we propose SafeSteer, which performs on-policy distillation confined to safety tokens. First, we construct a safety teacher via activation steering. Based on this teacher, we develop a safety token selection algorithm. Consequently, SafeSteer restricts the reverse KL penalty to these tokens during training to preserve general capabilities. Experimental results across diverse models show that our SafeSteer achieves a superior trade-off between safety and general capability compared with existing methods, attaining strong safety performance on seven safety benchmarks with only minimal degradation on five general capability benchmarks. Notably, SafeSteer requires only 100 harmful samples without using any general-purpose data, less than 1% of what previous baselines used, considerably reducing alignment cost. More details are on our project page at https://anjingkun.github.io/SafeSteer.",
          "authors": [
            "Hao Li",
            "Jingkun An",
            "Zijun Song",
            "Pengyu Zhu",
            "Rui Li",
            "Hao Wang",
            "Wendi Feng",
            "Yesheng Liu",
            "Lijun Li",
            "Jin-Ge Yao",
            "Lei Sha"
          ],
          "categories": [
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02530v1",
          "abstract_url": "https://arxiv.org/abs/2606.02530v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02530v1",
          "published_at": "2026-06-01T17:38:12+00:00",
          "updated_at": "2026-06-01T17:38:12+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02530",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02530v1"
          },
          "relevance_score": 141,
          "match_reasons": [
            "title matched \"alignment\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02530"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [
        "《Jailbreaking Multimodal Large Language Models using Multi-Clip Video》〔数据 / 应用 / 方法〕：As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jail…",
        "《SentGuard: Sentence-Level Streaming Guardrails for Large Language Models》〔评测 / 数据 / 方法〕：Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existin…",
        "《AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations》〔评测 / 数据 / 方法〕：Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce…",
        "《SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning》〔方法〕：As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action s…",
        "《SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents》〔评测 / 应用 / 方法〕：Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enab…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Jailbreaking Multimodal Large Language Models using Multi-Clip Video",
          "summary": "As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality.",
          "authors": [
            "Choongwon Kang",
            "Seungjong Sun",
            "Hyunmin Jun",
            "Jang Hyun Kim"
          ],
          "categories": [
            "cs.CV",
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02111v1",
          "abstract_url": "https://arxiv.org/abs/2606.02111v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02111v1",
          "published_at": "2026-06-01T11:43:53+00:00",
          "updated_at": "2026-06-01T11:43:53+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.02111",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02111v1"
          },
          "relevance_score": 63,
          "match_reasons": [
            "title matched \"jailbreak\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02111"
        },
        {
          "title": "SentGuard: Sentence-Level Streaming Guardrails for Large Language Models",
          "summary": "Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.",
          "authors": [
            "Jiaqi Yu",
            "Xin Wang",
            "Yixu Wang",
            "Jie Li",
            "Yan Teng",
            "Xingjun Ma",
            "Yingchun Wang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02041v1",
          "abstract_url": "https://arxiv.org/abs/2606.02041v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02041v1",
          "published_at": "2026-06-01T10:30:08+00:00",
          "updated_at": "2026-06-01T10:30:08+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02041",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02041v1"
          },
          "relevance_score": 62,
          "match_reasons": [
            "title matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02041"
        },
        {
          "title": "AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations",
          "summary": "Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open-source guards are trained on chat-style data rather than tool-response content. We introduce AGENTREDBENCH, a dynamic LLM-driven redteaming benchmark of 215 subtle underspecified authorization (attacks at the boundary of what the user's request authorises) scenarios across 24 enterprise integrations in nine functional families and five attack types. Across an eight-model panel (Anthropic, OpenAI, Google), no-guard ASR (attack success rate) ranges from 32% (Claude Sonnet 4.6) to 81% (Gemini 3 Flash). To keep the scenario set out of training corpora and preserve headline ASR meaning over time, we release the codebase, integration schemas, and AGENTREDGUARD model openly; the canonical scenarios are evaluated through a maintainer-mediated channel with immutable versioning. We release AGENTREDGUARD alongside the benchmark: a guard trained on an integration-diverse corpus of adversarial tool-response content. AGENTREDGUARD cuts panel ASR from 69.9% to 2.4% at 0.37% false-positive rate, outperforming every open-source baseline with non-trivial detection (Llama Guard, PromptGuard 2, ProtectAI) on both axes. Cross-integration and cross-attack type holdouts both confirm the gain transfers beyond the training subset.",
          "authors": [
            "Hiskias Dingeto",
            "Will Leeney"
          ],
          "categories": [
            "cs.CR",
            "cs.AI",
            "cs.CL",
            "cs.ET"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02240v1",
          "abstract_url": "https://arxiv.org/abs/2606.02240v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02240v1",
          "published_at": "2026-06-01T13:34:24+00:00",
          "updated_at": "2026-06-01T13:34:24+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02240",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02240v1"
          },
          "relevance_score": 61,
          "match_reasons": [
            "summary matched \"prompt injection\"",
            "summary matched \"indirect prompt injection\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02240"
        },
        {
          "title": "SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning",
          "summary": "As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a {server-side} defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.",
          "authors": [
            "Lichao Wang",
            "Zhaoxing Ren",
            "Tianzhuo Yang",
            "Jiaming Ji",
            "Chi Harold Liu",
            "Yaodong Yang",
            "Juntao Dai"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.CY"
          ],
          "paper_id": "http://arxiv.org/abs/2606.01991v1",
          "abstract_url": "https://arxiv.org/abs/2606.01991v1",
          "pdf_url": "https://arxiv.org/pdf/2606.01991v1",
          "published_at": "2026-06-01T09:48:41+00:00",
          "updated_at": "2026-06-01T09:48:41+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.01991",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.01991v1"
          },
          "relevance_score": 61,
          "match_reasons": [
            "title matched \"agent defense\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.01991"
        },
        {
          "title": "SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents",
          "summary": "Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather than the execution processes that lead to unsafe behavior. We introduce SeClaw, a framework that combines specification-driven security task synthesis with execution-based security evaluation for Autonomous agents. Spec-driven security task synthesis enables scalable and controllable construction of security tasks from structured risk specifications, while SeClaw docker provides a standardized testbed for evaluating agent behavior under diverse safety-risk scenarios. The benchmark covers risks arising from resources, user tasks, environments, and intrinsic agent behaviors, and supports trajectory-aware assessment of unsafe actions beyond final responses. By bridging systematic task synthesis and reproducible security evaluation, SeClaw provides a practical foundation for measuring, diagnosing, and comparing security failures in autonomous LLM agents. The code is available at https://github.com/seclaw-eval/seclaw-eval.",
          "authors": [
            "Hao Cheng",
            "Changtao Miao",
            "Tianle Song",
            "Yin Wu",
            "He Liu",
            "Erjia Xiao",
            "Junchi Chen",
            "Xiaoyu Shi",
            "Yichi Wang",
            "Jing Yang",
            "Taowen Wang",
            "Jinhao Duan",
            "Mengshu Sun",
            "Peiyan Dong",
            "Xuan Shen",
            "Yang Cao",
            "Renjing Xu",
            "Kaidi Xu",
            "Jindong Gu",
            "Bo Zhang",
            "Jize Zhang",
            "Chenhao Lin",
            "Philip Torr",
            "Chao Shen"
          ],
          "categories": [
            "cs.CR",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02302v1",
          "abstract_url": "https://arxiv.org/abs/2606.02302v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02302v1",
          "published_at": "2026-06-01T14:23:42+00:00",
          "updated_at": "2026-06-01T14:23:42+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.02302",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02302v1"
          },
          "relevance_score": 44,
          "match_reasons": [
            "summary matched \"agent security\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02302"
        },
        {
          "title": "PyFEX: Uncovering Evasive Python-based Threats via Resilient and Exhaustive Path Exploration",
          "summary": "The rapid expansion of the Python ecosystem has fueled two distinct but converging threats: adversaries increasingly target the software supply chain via the Python Package Index (PyPI), while also building evasive, cross-platform malicious binaries compiled from source code written in Python. Current program analysis techniques struggle to address this dual threat. Static analysis based tools are often blinded by runtime obfuscation and compiled bytecode, while dynamic analysis based ones are fragile, prone to evasion by environmental guardrails, and often terminates prematurely due to unsatisfied dependencies. To overcome these limitations, we present PyFEX, a resilient forced-execution engine. PyFEX explores a program's behavioral space systematically by forcing execution across all conditional branches to bypass evasion checks. To address the fragility of dynamic execution, it introduces a novel resilient crash recovery mechanism that synthesizes dummy objects to satisfy failed operations at the runtime, allowing analysis to proceed past fatal errors, and employs path merging to mitigate path explosion. PyFEX further incorporates an automated entry identification mechanism that proactively discovers and invokes dormant functions, exposing malicious logic hidden within uncalled APIs. To demonstrate the efficacy of this engine, we built PyFEXScan, a proof-of-concept malware detector built on top of PyFEX. Evaluated against both known malicious PyPI packages and real-world compiled binaries, PyFEX exposes critical behaviors missed by the existing state-of-the-art tools. In a live deployment on PyPI, PyFEXScan discovered 212 previously unknown malicious packages accounting for over 91,648 downloads, underscoring the necessity of resilient, exhaustive analysis for securing the Python ecosystem.",
          "authors": [
            "Meng Wang",
            "Yue Ma",
            "Majid Garoosi",
            "Wenting Fan",
            "Liwei Guo",
            "Jianqiang Wang",
            "Ali Abbasi"
          ],
          "categories": [
            "cs.CR"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02196v1",
          "abstract_url": "https://arxiv.org/abs/2606.02196v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02196v1",
          "published_at": "2026-06-01T12:51:21+00:00",
          "updated_at": "2026-06-01T12:51:21+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用"
          ],
          "topics": [
            "RAG",
            "Guardrail"
          ],
          "doi": null,
          "arxiv_id": "2606.02196",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02196v1"
          },
          "relevance_score": 42,
          "match_reasons": [
            "summary matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02196"
        }
      ]
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [
        "《SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction》〔评测 / 应用 / 方法〕：Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a v…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction",
          "summary": "Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types based on the agent workflow component targeted by the harm: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable with attack success rates up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than genuine resistance, and current defenses still fail to reliably mitigate the threat.",
          "authors": [
            "Yuting Ning",
            "Zhehao Zhang",
            "Yash Kumar Lal",
            "Boyu Gou",
            "Junyi Li",
            "Weitong Ruan",
            "Chentao Ye",
            "Rahul Gupta",
            "Diyi Yang",
            "Yu Su",
            "Huan Sun"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.02540v1",
          "abstract_url": "https://arxiv.org/abs/2606.02540v1",
          "pdf_url": "https://arxiv.org/pdf/2606.02540v1",
          "published_at": "2026-06-01T17:45:39+00:00",
          "updated_at": "2026-06-01T17:45:39+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.02540",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.02540v1"
          },
          "relevance_score": 47,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.02540"
        }
      ]
    }
  ]
}