{
  "generated_at": "2026-05-20T13:10:58.892013+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "Topic \"LLM\": 21 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models\" and \"OpenCompass: A Universal Evaluation Platform for Large Language Models\".",
    "Topic \"Language Model\": 19 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models\" and \"OpenCompass: A Universal Evaluation Platform for Large Language Models\".",
    "Topic \"Agent\": 5 papers matched across the Agent Runtime Security and Terminal and SWE Agents feeds; representative papers include \"OpenComputer: Verifiable Software Worlds for Computer-Use Agents\" and \"A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents\".",
    "Topic \"Reasoning\": 3 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding\" and \"Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models\".",
    "Topic \"Benchmark\": 3 papers matched across the Agent Runtime Security and Terminal and SWE Agents feeds; representative papers include \"TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing\" and \"RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades\"."
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 21,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models",
        "OpenCompass: A Universal Evaluation Platform for Large Language Models",
        "Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking",
        "SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models",
        "LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening",
        "Prompting language influences diagnostic reasoning and accuracy of large language models",
        "ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning",
        "From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning",
        "Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning",
        "Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning",
        "Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation",
        "CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning",
        "KoRe: Compact Knowledge Representations for Large Language Models",
        "LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models",
        "Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models",
        "OpenComputer: Verifiable Software Worlds for Computer-Use Agents",
        "A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents",
        "Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents",
        "TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing",
        "PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents",
        "Toward Training Superintelligent Software Agents through Self-Play SWE-RL"
      ],
      "key_points": [
        "\"MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models\" [Evaluation / Method]: Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattentional blindness in human co…",
        "\"OpenCompass: A Universal Evaluation Platform for Large Language Models\" [Evaluation / Data / Application / Method]: In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language mo…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 19,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models",
        "OpenCompass: A Universal Evaluation Platform for Large Language Models",
        "Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking",
        "SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models",
        "LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening",
        "Prompting language influences diagnostic reasoning and accuracy of large language models",
        "ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning",
        "FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding",
        "From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning",
        "Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning",
        "Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning",
        "Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation",
        "CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning",
        "KoRe: Compact Knowledge Representations for Large Language Models",
        "LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models",
        "Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents",
        "SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents",
        "PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents",
        "Toward Training Superintelligent Software Agents through Self-Play SWE-RL"
      ],
      "key_points": [
        "\"MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models\" [Evaluation / Method]: Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattentional blindness in human co…",
        "\"OpenCompass: A Universal Evaluation Platform for Large Language Models\" [Evaluation / Data / Application / Method]: In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language mo…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 5,
      "feed_names": [
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "OpenComputer: Verifiable Software Worlds for Computer-Use Agents",
        "A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents",
        "SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents",
        "Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study",
        "RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades"
      ],
      "key_points": [
        "\"OpenComputer: Verifiable Software Worlds for Computer-Use Agents\" [Evaluation / Application / Method]: We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four compon…",
        "\"A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents\" [Method]: Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class…"
      ]
    },
    {
      "name": "Reasoning",
      "paper_count": 3,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding",
        "Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models",
        "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next"
      ],
      "key_points": [
        "\"FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding\" [Evaluation / Application / Method]: Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehen…",
        "\"Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models\" [Evaluation / Method]: Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. H…"
      ]
    },
    {
      "name": "Benchmark",
      "paper_count": 3,
      "feed_names": [
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing",
        "RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades",
        "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next"
      ],
      "key_points": [
        "\"TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing\" [Evaluation / Application / Method]: LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request trigge…",
        "\"RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades\" [Evaluation / Method]: Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. H…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "\"MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models\" [Evaluation / Method]: Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattentional blindness in human co…",
        "\"OpenCompass: A Universal Evaluation Platform for Large Language Models\" [Evaluation / Data / Application / Method]: In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language mo…",
        "\"Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking\" [Evaluation / Method]: Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation si…",
        "\"SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models\" [Evaluation / Data / Application / Method]: Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities requ…",
        "\"LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening\" [Evaluation / Data / Application / Method]: Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly f…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models",
          "summary": "Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattentional blindness in human cognition, we investigate whether LLMs, trained on human-preferred corpora that embed attentional biases, exhibit a similar limitation: failing to attend to subtle yet important contextual cues under explicit task instructions. To evaluate this, we introduce the task of explicit-implicit reasoning and present MixRea, a benchmark of 2,246 multiple-choice questions across 9 reasoning types with varying distributions of explicit and implicit information. Evaluation of 21 advanced LLMs shows that even the best-performing reasoning model (Gemini 2.5 Pro) achieves only 42.8% consistency, revealing widespread inattentional blindness. To mitigate this, we propose Potential Relation Completion Prompting (PRCP), a prompting method that improves reasoning by recovering overlooked causal relations. Further analysis shows that this limitation persists across diverse multi-source reasoning tasks, highlighting the need for more cognitively aligned models.",
          "authors": [
            "Yuanqing Cai",
            "Ziyi Huang",
            "Minhao Liu",
            "Lixin Duan",
            "Wen Li",
            "Yanru Zhang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.20128",
          "abstract_url": "https://arxiv.org/abs/2605.20128",
          "pdf_url": "https://arxiv.org/pdf/2605.20128",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.20128",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.20128"
          },
          "relevance_score": 236,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"reasoning\"",
            "title matched \"benchmark\"",
            "summary matched \"LLM\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.20128"
        },
        {
          "title": "OpenCompass: A Universal Evaluation Platform for Large Language Models",
          "summary": "In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.",
          "authors": [
            "Maosong Cao",
            "Kai Chen",
            "Haodong Duan",
            "Yixiao Fang",
            "Tong Gao",
            "Ge Jiaye",
            "Mo Li",
            "Hongwei Liu",
            "Junnan Liu",
            "Yuan Liu",
            "Chengqi Lyu",
            "Han Lyu",
            "Ningsheng Ma",
            "Zerun Ma",
            "Yu Sun",
            "Zhiyong Wu",
            "Linchen Xiao",
            "Jun Xu",
            "Haochen Ye",
            "Zhaohui Yu",
            "Yike Yuan",
            "Songyang Zhang",
            "Yufeng Zhao",
            "Fengzhe Zhou",
            "Peiheng Zhou",
            "Dongsheng Zhu",
            "Lin Zhu",
            "Jingming Zhuo"
          ],
          "categories": [
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19276",
          "abstract_url": "https://arxiv.org/abs/2605.19276",
          "pdf_url": "https://arxiv.org/pdf/2605.19276",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Data",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19276",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19276"
          },
          "relevance_score": 232,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"evaluation\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19276"
        },
        {
          "title": "Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking",
          "summary": "Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this work, we formulate checkpoint selection as a robust decision problem under evaluation uncertainty. We propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. The evaluation system orchestrates progressive refinement via pointwise filtering, listwise ranking, and pairwise comparison. To enhance reliability, we introduce subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics while penalizing tail failures. Furthermore, we demonstrate that data quality, specifically OCR readability, is a critical determinant of evaluation validity.",
          "authors": [
            "Qinwu Xu",
            "Zhuoheng Li",
            "Jessie Salas"
          ],
          "categories": [
            "cs.LG",
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.18852",
          "abstract_url": "https://arxiv.org/abs/2605.18852",
          "pdf_url": "https://arxiv.org/pdf/2605.18852",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.18852",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18852"
          },
          "relevance_score": 214,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "title matched \"evaluation\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18852"
        },
        {
          "title": "SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models",
          "summary": "Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.",
          "authors": [
            "Yiyang Gu",
            "Junwei Yang",
            "Junyu Luo",
            "Ye Yuan",
            "Bin Feng",
            "Yingce Xia",
            "Shufang Xie",
            "Kaili Liu",
            "Bohan Wu",
            "Qi Shi",
            "Haoran Li",
            "Beier Xiao",
            "Zhiping Xiao",
            "Xiao Luo",
            "Weizhi Zhang",
            "Philip S. Yu",
            "Zequn Liu",
            "Ming Zhang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19357",
          "abstract_url": "https://arxiv.org/abs/2605.19357",
          "pdf_url": "https://arxiv.org/pdf/2605.19357",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Data",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19357",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19357"
          },
          "relevance_score": 214,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"evaluation\"",
            "summary matched \"LLM\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19357"
        },
        {
          "title": "LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening",
          "summary": "Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.",
          "authors": [
            "Ming Zhang",
            "Qiyuan Peng",
            "Yinxi Wei",
            "Yujiong Shen",
            "Kexin Tan",
            "Yuhui Wang",
            "Zhenghao Xiang",
            "Junjie Ye",
            "Zhangyue Yin",
            "Zhiheng Xi",
            "Shihan Dou",
            "Tao Gui",
            "Maxm Pan",
            "Ruizhi Yang",
            "Qi Zhang",
            "Xuanjing Huang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19597",
          "abstract_url": "https://arxiv.org/abs/2605.19597",
          "pdf_url": "https://arxiv.org/pdf/2605.19597",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Data",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19597",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19597"
          },
          "relevance_score": 196,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19597"
        },
        {
          "title": "Prompting language influences diagnostic reasoning and accuracy of large language models",
          "summary": "Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.",
          "authors": [
            "Adrien Bazoge",
            "Josselin Corvellec",
            "Sofiane Djillali Sid-Ahmed",
            "Pierre-Antoine Gourraud"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19173",
          "abstract_url": "https://arxiv.org/abs/2605.19173",
          "pdf_url": "https://arxiv.org/pdf/2605.19173",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19173",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19173"
          },
          "relevance_score": 196,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"reasoning\"",
            "summary matched \"LLM\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19173"
        },
        {
          "title": "ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning",
          "summary": "Large language models (LLMs) and agentic systems have shown promise for clinical decision support, but existing works largely assume that evidence has already been curated and handed to the model. Real-world clinical workflows instead require agents to actively seek, iteratively plan, and synthesize multimodal evidence from heterogeneous sources. In this paper, we introduce ClinSeekAgent, an automated agentic framework for dynamic multimodal evidence seeking that shifts the paradigm from passive evidence consumption to active evidence acquisition. Given only a clinical query and access to raw data sources, ClinSeekAgent gathers evidence by querying medical knowledge bases, navigating raw EHRs, and invoking medical imaging tools; refines its hypotheses as new information emerges; and integrates the collected evidence into grounded clinical decisions. ClinSeekAgent serves both as an inference-time agent for frontier LLMs and as a training-time pipeline for distilling high-quality agent trajectories into compact open-source models. To validate its inference-time effectiveness, we construct ClinSeek-Bench, which pairs Curated Input reasoning from fixed pre-selected evidence with Automated Evidence-Seeking over raw clinical data. On text-only EHR tasks, ClinSeekAgent improves Claude Opus 4.6 from 60.0 to 63.2 overall F1 and MiniMax M2.5 from 43.1 to 47.3, with positive risk-prediction gains in 7 out of 9 evaluated host models. On multimodal tasks, ClinSeekAgent improves Claude Opus 4.6 from 47.5 to 62.6 (+15.1); all evaluated models improve across the three CXR-related task groups. We further validate ClinSeekAgent as a training pipeline by distilling agentic evidence-seeking trajectories into ClinSeek-35B-A3B, which achieves 34.0 average F1 on existing AgentEHR-Bench, improving over its Qwen3.5-35B-A3B baseline by +11.9 points and approaching Claude Opus 4.6.",
          "authors": [
            "Juncheng Wu",
            "Letian Zhang",
            "Yuhan Wang",
            "Haoqin Tu",
            "Hardy Chen",
            "Zijun Wang",
            "Cihang Xie",
            "Yuyin Zhou"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.20176",
          "abstract_url": "https://arxiv.org/abs/2605.20176",
          "pdf_url": "https://arxiv.org/pdf/2605.20176",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.20176",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.20176"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.20176"
        },
        {
          "title": "FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding",
          "summary": "Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions. Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions. To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench. FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs.",
          "authors": [
            "Gueter Josmy Faure",
            "Min-Hung Chen",
            "Jia-Fong Yeh",
            "Hung-Ting Su",
            "Winston H. Hsu"
          ],
          "categories": [
            "cs.CV",
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19846",
          "abstract_url": "https://arxiv.org/abs/2605.19846",
          "pdf_url": "https://arxiv.org/pdf/2605.19846",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2605.19846",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19846"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"benchmark\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19846"
        },
        {
          "title": "From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning",
          "summary": "Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.",
          "authors": [
            "Ahmed Y. Gado",
            "Omar Y. Goba",
            "Alaa Hassanein",
            "Catherine M. Elias",
            "Ahmed Hussein"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.CV",
            "cs.RO"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19824",
          "abstract_url": "https://arxiv.org/abs/2605.19824",
          "pdf_url": "https://arxiv.org/pdf/2605.19824",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19824",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19824"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19824"
        },
        {
          "title": "Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning",
          "summary": "Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy \"forking point\" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.",
          "authors": [
            "Shuyu Wei",
            "Jian Sun",
            "Delai Qiu",
            "Yining Wang",
            "Shengping Liu",
            "Jiaen Liang",
            "Ying Fu",
            "Wei Huang",
            "Jitao Sang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19358",
          "abstract_url": "https://arxiv.org/abs/2605.19358",
          "pdf_url": "https://arxiv.org/pdf/2605.19358",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19358",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19358"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19358"
        },
        {
          "title": "Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning",
          "summary": "Tool-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations largely increase reasoning overhead and even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8\\% accuracy gain on V* benchmark compared to the base model, and a 44.9\\% improvement in efficiency over existing tool-augmented methods on POPE benchmark. Code is available at https://github.com/MQinghe/AutoTool.",
          "authors": [
            "Qinghe Ma",
            "Zhen Zhao",
            "Yiming Wu",
            "Jian Zhang",
            "Lei Bai",
            "Yinghuan Shi"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19852",
          "abstract_url": "https://arxiv.org/abs/2605.19852",
          "pdf_url": "https://arxiv.org/pdf/2605.19852",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19852",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19852"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19852"
        },
        {
          "title": "Backtracking When It Strays: Mitigating Dual Exposure Biases in LLM Reasoning Distillation",
          "summary": "Large language models (LLMs) have achieved remarkable success in complex reasoning tasks via long chain-of-thought (CoT), yet their immense computational overhead hinders real-world deployment. LLM reasoning distillation addresses this by transferring reasoning capabilities from formidable teacher models to compact student models. However, existing distillation paradigms face a fundamental dilemma. Typical off-policy distillation strictly utilizes teacher-generated golden trajectories, suffering from an exposure bias due to the mismatch between training distributions and student-generated inference contexts, which leads to error cascades in long CoT reasoning. To address this, on-policy distillation allows students to explore their own trajectories, but we demonstrate that it inherently introduces a reciprocal reversed exposure bias: the teacher model also struggles to provide positive guidance when conditioned on student-generated sub-optimal contexts. To resolve this dual exposure biases problem, we propose Monitoring Trajectories and Backtracking when it strays (MOTAB), a new LLM reasoning distillation pipeline. Specifically, MOTAB dynamically monitors the student's on-policy generation against an adaptive safety boundary. When the generation strays and exceeds this threshold, MOTAB backtracks to the last safe state and leverages teacher intervention to correct the course. This approach inherently tolerates minor student errors to mitigate exposure bias, while preventing sub-optimal contexts to circumvent reversed exposure bias. Extensive experiments on the LIMO-v2 and AceReason datasets demonstrate that MOTAB effectively alleviates the dual exposure biases, yielding a roughly 3% average performance improvement in reasoning tasks.",
          "authors": [
            "Bing Wang",
            "Shaotian Yan",
            "Chen Shen",
            "kaiyuan liu",
            "Sinan Fan",
            "Ximing Li",
            "Rui Miao",
            "Xiaosong Yuan",
            "Zhanming Shen",
            "Jieping Ye"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19433",
          "abstract_url": "https://arxiv.org/abs/2605.19433",
          "pdf_url": "https://arxiv.org/pdf/2605.19433",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19433",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19433"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19433"
        },
        {
          "title": "CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning",
          "summary": "Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.",
          "authors": [
            "Dachuan Shi",
            "Hanlin Zhu",
            "Xiangchi Yuan",
            "Wanjia Zhao",
            "Kejing Xia",
            "Wen Xiao",
            "Wenke Lee"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.20075",
          "abstract_url": "https://arxiv.org/abs/2605.20075",
          "pdf_url": "https://arxiv.org/pdf/2605.20075",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.20075",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.20075"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.20075"
        },
        {
          "title": "KoRe: Compact Knowledge Representations for Large Language Models",
          "summary": "Modern Large Language Models (LLMs) have shown impressive performances in user-facing tasks such as question answering, as well as consistent improvements in reasoning capabilities. Still, the way these models encode knowledge seems inherently flawed: by design, LLMs encode world-knowledge within their parameters. This way of representing knowledge is inherently opaque, difficult to debug and update, and prone to hallucinations. On the other hand, Knowledge Graphs can provide human-readable and easily editable world knowledge representations, and their application in knowledge-intensive tasks has consistently proven beneficial to downstream performance. Nonetheless, current integration techniques require extensive retraining or finetuning. To overcome this issue, we introduce KoRe, a methodology to encode 1-hop sub-graphs into compact discrete knowledge tokens and inject them into a LLM backbone. We test the proposed approach on three established benchmarks, and report competitive performances coupled with a significant reduction (up to 10x) in token usage. Our results show that compact discrete KG representations can efficiently and effectively be used to ground modern LLMs.",
          "authors": [
            "Davide Cavicchini",
            "Fausto Giunchiglia",
            "Jacopo Staiano"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.20170",
          "abstract_url": "https://arxiv.org/abs/2605.20170",
          "pdf_url": "https://arxiv.org/pdf/2605.20170",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.20170",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.20170"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.20170"
        },
        {
          "title": "LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models",
          "summary": "Group Relative Policy Optimization(GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in foregoing an explicit value-critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this issue, we introduce a novel framework, Lambda Policy Optimization (LambdaPO), that addresses this information-theoretic bottleneck by re-conceptualizing advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optima. Experimental results across challenging math reasoning and question-answering tasks demonstrates that LambdaPO improves performance compared to the baseline methods.",
          "authors": [
            "Zhe Yuan",
            "Yipeng Zhou",
            "Jinghan Li",
            "Xinyuan Chen",
            "Bowen Deng",
            "Zhiqian Chen",
            "Liang Zhao"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19416",
          "abstract_url": "https://arxiv.org/abs/2605.19416",
          "pdf_url": "https://arxiv.org/pdf/2605.19416",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19416",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19416"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"reasoning\"",
            "summary matched \"LLM\"",
            "summary matched \"alignment\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19416"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [
        "《Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models》〔评测 / 方法〕：Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. H…",
        "《OpenComputer: Verifiable Software Worlds for Computer-Use Agents》〔评测 / 应用 / 方法〕：We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four compon…",
        "《Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains》〔应用 / 方法〕：Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative…",
        "《A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents》〔方法〕：Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class…",
        "《Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents》〔应用 / 方法〕：Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Ex…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models",
          "summary": "Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward function design. In addition, we introduce diverse persuasion strategies to enrich the RL action space, which consistently improves the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR, outperforming existing approaches in terms of effectiveness, efficiency, and transferability.",
          "authors": [
            "Zheng Lin",
            "Zhenxing Niu",
            "Haoxuan Ji",
            "Yuzhe Huang",
            "Haichang Gao"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19485",
          "abstract_url": "https://arxiv.org/abs/2605.19485",
          "pdf_url": "https://arxiv.org/pdf/2605.19485",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2605.19485",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19485"
          },
          "relevance_score": 80,
          "match_reasons": [
            "title matched \"jailbreak\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19485"
        },
        {
          "title": "OpenComputer: Verifiable Software Worlds for Computer-Use Agents",
          "summary": "We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.",
          "authors": [
            "Jinbiao Wei",
            "Qianran Ma",
            "Yilun Zhao",
            "Xiao Zhou",
            "Kangqi Ni",
            "Guo Gan",
            "Arman Cohan"
          ],
          "categories": [
            "cs.AI",
            "cs.SE"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19769",
          "abstract_url": "https://arxiv.org/abs/2605.19769",
          "pdf_url": "https://arxiv.org/pdf/2605.19769",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.19769",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19769"
          },
          "relevance_score": 80,
          "match_reasons": [
            "title matched \"computer-use agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19769"
        },
        {
          "title": "Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains",
          "summary": "Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches -- ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation -- primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.",
          "authors": [
            "Rebecca Ramnauth",
            "Drazen Brscic",
            "Brian Scassellati"
          ],
          "categories": [
            "cs.AI",
            "cs.RO"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19940",
          "abstract_url": "https://arxiv.org/abs/2605.19940",
          "pdf_url": "https://arxiv.org/pdf/2605.19940",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Alignment",
            "Guardrail"
          ],
          "doi": null,
          "arxiv_id": "2605.19940",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19940"
          },
          "relevance_score": 80,
          "match_reasons": [
            "title matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19940"
        },
        {
          "title": "A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents",
          "summary": "Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.",
          "authors": [
            "Vasundra Srinivasan"
          ],
          "categories": [
            "cs.AI",
            "cs.SE"
          ],
          "paper_id": "https://arxiv.org/abs/2605.20173",
          "abstract_url": "https://arxiv.org/abs/2605.20173",
          "pdf_url": "https://arxiv.org/pdf/2605.20173",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.20173",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.20173"
          },
          "relevance_score": 58,
          "match_reasons": [
            "summary matched \"agent runtime\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.20173"
        },
        {
          "title": "Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents",
          "summary": "Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.",
          "authors": [
            "Xi Zhang",
            "Meijun Gao",
            "Yuntian Zhao",
            "Xinyu Tan",
            "Yilun Yao",
            "Feiyu Wang",
            "Yanshu Wang",
            "Dingsiyi",
            "Tong Yang"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19604",
          "abstract_url": "https://arxiv.org/abs/2605.19604",
          "pdf_url": "https://arxiv.org/pdf/2605.19604",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19604",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19604"
          },
          "relevance_score": 58,
          "match_reasons": [
            "summary matched \"policy enforcement\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19604"
        },
        {
          "title": "SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents",
          "summary": "A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and intents from production clickstream data; (b) a live-browser agent architecture that combines multimodal perception over visual and browser-structured observations with episodic memory and guardrails to conduct coherent shopping sessions across control and treatment storefronts; and (c) an evaluation protocol that compares simulated outcome shifts with observed shifts in real buyer behavior. We validate SimGym on A/B tests of visually driven UI theme changes from a major e-commerce platform across diverse storefronts and product categories. Empirical results show that SimGym agents achieve strong agreement with observed outcome shifts, attaining 77% directional alignment with add-to-cart shifts observed across interface variants in real-buyer traffic. It reduces experimental cycles from weeks to under an hour, enabling rapid experimentation without exposing real buyers to candidate variants.",
          "authors": [
            "Han Li",
            "Vibhor Malik",
            "Zahra Zanjani Foumani",
            "Alberto Castelo",
            "Shuang Xie",
            "Ailin Fan",
            "Keat Yang Koay",
            "Yuanzheng Zhu",
            "Meysam Feghhi",
            "Ronie Uliana",
            "Zhaoyu Zhang",
            "Angelo Ocana Martins",
            "Mingyu Zhao",
            "Francis Pelland",
            "Jonathan Faerman",
            "Nikolas LeBlanc",
            "Aaron Glazer",
            "Andrew McNamara",
            "Zhong Wu",
            "Lingyun Wang"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19219",
          "abstract_url": "https://arxiv.org/abs/2605.19219",
          "pdf_url": "https://arxiv.org/pdf/2605.19219",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.19219",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19219"
          },
          "relevance_score": 58,
          "match_reasons": [
            "summary matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19219"
        },
        {
          "title": "TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing",
          "summary": "LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.",
          "authors": [
            "Pei Yang",
            "Wanyi Chen",
            "Tongyun Yang",
            "Pengbin Feng",
            "Jiarong Xing",
            "Wentao Guo",
            "Yuhang Yao",
            "Yuhang Han",
            "Hanchen Li",
            "Xu Wang",
            "Zeyu Wang",
            "Jie Xiao",
            "Anjie Yang",
            "Liang Tian",
            "Lynn Ai",
            "Eric Yang",
            "Tianyu Shi"
          ],
          "categories": [
            "cs.LG",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.18859",
          "abstract_url": "https://arxiv.org/abs/2605.18859",
          "pdf_url": "https://arxiv.org/pdf/2605.18859",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.18859",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18859"
          },
          "relevance_score": 58,
          "match_reasons": [
            "summary matched \"computer-use agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18859"
        }
      ]
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [
        "《Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study》〔评测 / 应用 / 方法〕：As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates holding the target codebase fixed. This leaves…",
        "《PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents》〔方法〕：Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocatio…",
        "《RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades》〔评测 / 方法〕：Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. H…",
        "《The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next》〔评测 / 方法〕：Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and at the frontier, thi…",
        "《Toward Training Superintelligent Software Agents through Self-Play SWE-RL》〔评测 / 方法〕：While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study",
          "summary": "As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates holding the target codebase fixed. This leaves a critical question unanswered: does the structural and stylistic quality, or ``cleanliness'' of the underlying code affect an agent's ability to navigate and modify it? To isolate the effect of code cleanliness from agent capability, we introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity. The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one. We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface. Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate. However, it substantially alters the agent's operational footprint: agents working on cleaner code use 7 to 8% fewer tokens and reduce file revisitations by 34%. Our findings suggest that traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents. Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.",
          "authors": [
            "Priyansh Trivedi (SonarSource)",
            "Olivier Schmitt (SonarSource)"
          ],
          "categories": [
            "cs.SE",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.20049",
          "abstract_url": "https://arxiv.org/abs/2605.20049",
          "pdf_url": "https://arxiv.org/pdf/2605.20049",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Agent",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2605.20049",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.20049"
          },
          "relevance_score": 80,
          "match_reasons": [
            "title matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.20049"
        },
        {
          "title": "PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents",
          "summary": "Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.",
          "authors": [
            "Zhuohan Gu",
            "Qizheng Zhang",
            "Omar Khattab",
            "Samuel Madden"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "https://arxiv.org/abs/2605.19932",
          "abstract_url": "https://arxiv.org/abs/2605.19932",
          "pdf_url": "https://arxiv.org/pdf/2605.19932",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.19932",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.19932"
          },
          "relevance_score": 58,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.19932"
        },
        {
          "title": "RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades",
          "summary": "Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation on thirteen frontier models and find that even the strongest, Claude-Opus-4.7, resolves only 39.1% of tasks, while the weakest achieves merely 5.2%, in stark contrast to existing bug-fix benchmarks, suggesting that long-horizon software development remains a largely unsolved problem.",
          "authors": [
            "Xinbo Xu",
            "Ruihan Yang",
            "Haiyang Shen",
            "Wendong Xu",
            "Bofei Gao",
            "Ruoyu Wu",
            "Kean Shi",
            "Weichu Xie",
            "Xuanzhong Chen",
            "Ming Wu",
            "Jason Zeng",
            "Michael Heinrich",
            "Elvis Zhang",
            "Liang Chen",
            "Kuan Li",
            "Baobao Chang"
          ],
          "categories": [
            "cs.SE",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.15846",
          "abstract_url": "https://arxiv.org/abs/2605.15846",
          "pdf_url": "https://arxiv.org/pdf/2605.15846",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.15846",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.15846"
          },
          "relevance_score": 58,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.15846"
        },
        {
          "title": "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next",
          "summary": "Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and at the frontier, this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and per-release residual ($h$-field) that diagnoses capability emphasis and identifies which measurement or stress test is most informative next. Across 34 models from 10 labs (2024--2026), capabilities cooperate ($r = +0.72$, $p < 10^{-6}$), but cooperation varies by lab and over time: DeepSeek reversed from reasoning-rich to coding-first ($h$: $+11.2 \\to -4.7$, 15.9-pp swing); Google maintains consistent reasoning emphasis; Anthropic oscillates between coding excursions and recovery. Cooperation is not static -- it cascades. Six open-weight architectures confirm a second capability transition at 30--72B, and SWE-bench is now saturating while HLE and instruction-following retain discriminatory spread -- signaling the next axis rotation. We provide a three-level playbook (locate, diagnose, rotate), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria for the next 12 months of frontier releases. Per-lab coupling slopes vary $5\\times$ (Google $1.15$ vs. DeepSeek $0.23$), quantifying how efficiently each recipe converts coding gains into reasoning. Five April 2026 releases confirm the diagnostic out of sample ($r$ rises from $+0.72$ to $+0.75$). An interactive dashboard provides phase classification with actionable recommendations, $h$-field diagnostics, per-lab coupling trajectories, ODE-based scaling predictions, benchmark rotation guidance, self-steering demo, and live tracking of all seven predictions: https://zehenlabs.com/cape/.",
          "authors": [
            "Adil Amin"
          ],
          "categories": [
            "cs.LG",
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.18840",
          "abstract_url": "https://arxiv.org/abs/2605.18840",
          "pdf_url": "https://arxiv.org/pdf/2605.18840",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "Reasoning",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.18840",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18840"
          },
          "relevance_score": 58,
          "match_reasons": [
            "summary matched \"SWE-bench\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18840"
        },
        {
          "title": "Toward Training Superintelligent Software Agents through Self-Play SWE-RL",
          "summary": "While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training paradigms for superintelligent software agents. Our approach takes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies, with no need for human-labeled issues or tests. Grounded in these real-world codebases, a single LLM agent is trained via reinforcement learning in a self-play setting to iteratively inject and repair software bugs of increasing complexity, with each bug formally specified by a test patch rather than a natural language issue description. On the SWE-bench Verified and SWE-Bench Pro benchmarks, SSR achieves notable self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human-data baseline over the entire training trajectory, despite being evaluated on natural language issues absent from self-play. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding how systems are constructed, solving novel challenges, and autonomously creating new software from scratch.",
          "authors": [
            "Yuxiang Wei",
            "Zhiqing Sun",
            "Emily McMilin",
            "Jonas Gehring",
            "David Zhang",
            "Gabriel Synnaeve",
            "Daniel Fried",
            "Lingming Zhang",
            "Sida Wang"
          ],
          "categories": [
            "cs.SE",
            "cs.AI",
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "https://arxiv.org/abs/2512.18552",
          "abstract_url": "https://arxiv.org/abs/2512.18552",
          "pdf_url": "https://arxiv.org/pdf/2512.18552",
          "published_at": "2026-05-20T04:00:00+00:00",
          "updated_at": "2026-05-20T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2512.18552",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2512.18552"
          },
          "relevance_score": 58,
          "match_reasons": [
            "summary matched \"SWE-bench\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2512.18552"
        }
      ]
    }
  ]
}