{
  "generated_at": "2026-06-12T13:55:02.783024+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "Topic \"LLM\": 17 papers matched across the LM and Agent Runtime Security feeds; representative papers include \"EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments\" and \"Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents\".",
    "Topic \"Agent\": 16 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments\" and \"Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents\".",
    "Topic \"Language Model\": 6 papers matched across the LM and Terminal and SWE Agents feeds; representative papers include \"Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities\" and \"SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning\".",
    "Topic \"RAG\": 2 papers matched across the LM and Agent Runtime Security feeds; representative papers include \"Operadic consistency: a label-free signal for compositional reasoning failures in LLMs\" and \"ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm\".",
    "Topic \"Coding Agent\": 1 paper matched in the Terminal and SWE Agents feed; the representative paper is \"Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset\"."
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 17,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments",
        "Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents",
        "An LLM System for Autonomous Variational Quantum Circuit Design",
        "Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities",
        "ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages",
        "Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data",
        "Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models",
        "Reward Modeling for Multi-Agent Orchestration",
        "From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent",
        "Operadic consistency: a label-free signal for compositional reasoning failures in LLMs",
        "Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning",
        "AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility",
        "SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents",
        "Automated reproducibility assessments in the social and behavioral sciences using large language models",
        "Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda",
        "Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents",
        "Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior"
      ],
      "key_points": [
        "\"EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments\" [Evaluation / Application / Method]: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast,…",
        "\"Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents\" [Evaluation / Application / Method]: Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execu…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 16,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments",
        "Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents",
        "An LLM System for Autonomous Variational Quantum Circuit Design",
        "SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning",
        "ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages",
        "Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models",
        "Reward Modeling for Multi-Agent Orchestration",
        "From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent",
        "AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility",
        "SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents",
        "Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda",
        "ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm",
        "Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents",
        "Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior",
        "Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset",
        "Recursive Agent Harnesses"
      ],
      "key_points": [
        "\"EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments\" [Evaluation / Application / Method]: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast,…",
        "\"Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents\" [Evaluation / Application / Method]: Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execu…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 6,
      "feed_names": [
        "LM",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities",
        "SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning",
        "Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data",
        "Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning",
        "Automated reproducibility assessments in the social and behavioral sciences using large language models",
        "Recursive Agent Harnesses"
      ],
      "key_points": [
        "\"Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities\" [Evaluation / Data / Application / Method]: Language operates as a mechanism of both marginalization and resistance, especially for minority communities navigating insensitive and harmful speech online.…",
        "\"SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning\" [Evaluation / Method]: Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language mo…"
      ]
    },
    {
      "name": "RAG",
      "paper_count": 2,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "Operadic consistency: a label-free signal for compositional reasoning failures in LLMs",
        "ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm"
      ],
      "key_points": [
        "\"Operadic consistency: a label-free signal for compositional reasoning failures in LLMs\" [Evaluation / Data / Method]: Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency,…",
        "\"ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm\" [Evaluation / Method]: Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long…"
      ]
    },
    {
      "name": "Coding Agent",
      "paper_count": 1,
      "feed_names": [
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset"
      ],
      "key_points": [
        "\"Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset\" [Data / Method]: AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev data…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "\"EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments\" [Evaluation / Application / Method]: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast,…",
        "\"Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents\" [Evaluation / Application / Method]: Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execu…",
        "\"An LLM System for Autonomous Variational Quantum Circuit Design\" [Evaluation / Application / Method]: The design of high performing quantum circuits remains largely dependent on human expertise. We introduce an autonomous agentic framework that employs large la…",
        "\"Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities\" [Evaluation / Data / Application / Method]: Language operates as a mechanism of both marginalization and resistance, especially for minority communities navigating insensitive and harmful speech online.…",
        "\"SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning\" [Evaluation / Method]: Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language mo…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments",
          "summary": "Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.",
          "authors": [
            "Jundong Xu",
            "Qingchuan Li",
            "Jiaying Wu",
            "Yihuai Lan",
            "Shuyue Stella Li",
            "Huichi Zhou",
            "Bowen Jiang",
            "Lei Wang",
            "Jun Wang",
            "Anh Tuan Luu",
            "Caiming Xiong",
            "Hae Won Park",
            "Bryan Hooi",
            "Zhiyuan Hu"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13681v1",
          "abstract_url": "https://arxiv.org/abs/2606.13681v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13681v1",
          "published_at": "2026-06-11T17:59:59+00:00",
          "updated_at": "2026-06-11T17:59:59+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13681",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13681v1"
          },
          "relevance_score": 200,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13681"
        },
        {
          "title": "Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents",
          "summary": "Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. Benchmark is available at https://github.com/StakeBench/SBC.",
          "authors": [
            "Zihao Wang",
            "Yiming Li",
            "Yutong Wu",
            "Zheyu Liu",
            "Kangjie Chen",
            "Fok Kar Wai",
            "Pin-Yu Chen",
            "Vrizlynn L. L. Thing",
            "Bo Li",
            "Dacheng Tao",
            "Tianwei Zhang"
          ],
          "categories": [
            "cs.CR",
            "cs.AI",
            "cs.CY",
            "cs.HC",
            "cs.MM"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13385v1",
          "abstract_url": "https://arxiv.org/abs/2606.13385v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13385v1",
          "published_at": "2026-06-11T14:12:43+00:00",
          "updated_at": "2026-06-11T14:12:43+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13385",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13385v1"
          },
          "relevance_score": 178,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "title matched \"prompt injection\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13385"
        },
        {
          "title": "An LLM System for Autonomous Variational Quantum Circuit Design",
          "summary": "The design of high-performing quantum circuits remains largely dependent on human expertise. We introduce an autonomous agentic framework that employs large language models (LLMs) to conduct iterative quantum circuit designs under explicit design constraints. Our system integrates seven components: Exploration, Generation, Discussion, Validation, Storage, Evaluation, and Review. These components form a closed-loop workflow that combines web-based knowledge acquisition, literature-grounded critique, executable code generation, and experimental feedback. We evaluate the framework on two tasks: quantum feature map construction for quantum machine learning and ansatz generation for variational quantum eigensolver applications in quantum chemistry. In image classification benchmarks, the best generated feature map outperforms representative quantum feature maps and, when scaled to larger qubit counts, surpasses the classical radial basis function kernel. In molecular ground state estimation across seven molecules, the generated ansatz attains competitive accuracy with widely used chemically inspired and hardware-efficient constructions while satisfying the imposed scaling constraints. These results establish LLM-driven agentic systems as a viable paradigm for automated quantum circuit design and illustrate how AI systems can participate in iterative scientific optimization workflows across scientific domains.",
          "authors": [
            "Kenya Sakka",
            "Wataru Mizukami",
            "Kosuke Mitarai"
          ],
          "categories": [
            "quant-ph",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13380v1",
          "abstract_url": "https://arxiv.org/abs/2606.13380v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13380v1",
          "published_at": "2026-06-11T14:08:00+00:00",
          "updated_at": "2026-06-11T14:08:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13380",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13380v1"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"agent\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13380"
        },
        {
          "title": "Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities",
          "summary": "Language operates as a mechanism of both marginalization and resistance, especially for minority communities navigating insensitive and harmful speech online. As content moderation increasingly depends on large language models (LLMs), concerns arise about whether these systems can recognize culturally insensitive speech: language that disregards or marginalizes the cultural and religious perspectives of historically underrepresented communities, often through implicit erasure, misrepresentation, or normative framing, rather than overt hostility. Focusing on Bangladesh's Hindu and Chakma communities -- the country's largest religious and Indigenous ethnic minorities, respectively -- this paper investigates the epistemic limits of LLM-based moderation systems and explores methods for incorporating minority perspectives. We co-created a culturally grounded corpus of insensitive speech with community members and integrated their narratives into moderation pipelines using retrieval augmented generation (RAG). Our tool, Mod-Guide, improves LLM sensitivity to minority viewpoints by leveraging contextual cues derived from lived experience. Through mixed-method evaluations involving both minority and majority participants, we demonstrate that RAG-enhanced moderation responses are more contextually accurate and perceived differently across ethnic lines. This work advances research in human-computer interaction, AI ethics, and social computing by foregrounding restorative justice and hermeneutical inclusion in the design of content moderation systems.",
          "authors": [
            "Dipto Das",
            "Achhiya Sultana",
            "Ankit Singh Chauhan",
            "Saadia Binte Alam",
            "Mohammad Shidujaman",
            "Shion Guha",
            "Sunandan Chakraborty",
            "Syed Ishtiaque Ahmed"
          ],
          "categories": [
            "cs.HC",
            "cs.AI",
            "cs.CY"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13397v1",
          "abstract_url": "https://arxiv.org/abs/2606.13397v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13397v1",
          "published_at": "2026-06-11T14:28:18+00:00",
          "updated_at": "2026-06-11T14:28:18+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Data",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": "10.1145/3811242.3819096",
          "arxiv_id": "2606.13397",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13397v1",
            "doi": "https://doi.org/10.1145/3811242.3819096"
          },
          "relevance_score": 168,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"retrieval augmented generation\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1145/3811242.3819096"
        },
        {
          "title": "SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning",
          "summary": "Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.",
          "authors": [
            "Seokju Cho",
            "Ryo Hachiuma",
            "Abhishek Badki",
            "Hang Su",
            "Byung-Kwan Lee",
            "Chan Hee Song",
            "Sifei Liu",
            "Subhashree Radhakrishnan",
            "Seungryong Kim",
            "Yu-Chiang Frank Wang",
            "Min-Hung Chen"
          ],
          "categories": [
            "cs.CV",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13673v1",
          "abstract_url": "https://arxiv.org/abs/2606.13673v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13673v1",
          "published_at": "2026-06-11T17:59:36+00:00",
          "updated_at": "2026-06-11T17:59:36+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "Agent",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.13673",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13673v1"
          },
          "relevance_score": 164,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13673"
        },
        {
          "title": "ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages",
          "summary": "Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/",
          "authors": [
            "Tanmoy Kanti Halder",
            "Akash Ghosh",
            "Subhadip Baidya",
            "Arijit Roy",
            "Sriparna Saha"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13572v1",
          "abstract_url": "https://arxiv.org/abs/2606.13572v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13572v1",
          "published_at": "2026-06-11T16:59:42+00:00",
          "updated_at": "2026-06-11T16:59:42+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Data",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13572",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13572v1"
          },
          "relevance_score": 163,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13572"
        },
        {
          "title": "Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data",
          "summary": "Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech translation performance. We study how to train an audio-language model to make keep/drop decisions on paired speech directly from audio. To obtain reliable supervision without manual labels, we adopt a scalable two-stage Rank-to-Distill strategy. A lightweight ranker generates keep/drop pseudo-labels from noisy speech pairs, then trains an audio large language model to predict keep/drop directly from raw paired speech. The resulting model jointly captures acoustic fidelity and cross-lingual semantic consistency for the selection of speech-conditioned data. Experiments on CVSS-C and SpeechMatrix show consistent improvements over unfiltered training, yielding up to +1.4 ASR-BLEU for end-to-end S2ST.",
          "authors": [
            "Qixu Chen",
            "Satoshi Nakamura"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13507v1",
          "abstract_url": "https://arxiv.org/abs/2606.13507v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13507v1",
          "published_at": "2026-06-11T15:55:23+00:00",
          "updated_at": "2026-06-11T15:55:23+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.13507",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13507v1"
          },
          "relevance_score": 162,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"RAG\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"alignment\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13507"
        },
        {
          "title": "Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models",
          "summary": "Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.",
          "authors": [
            "Joseph Keshet"
          ],
          "categories": [
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13441v1",
          "abstract_url": "https://arxiv.org/abs/2606.13441v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13441v1",
          "published_at": "2026-06-11T15:03:48+00:00",
          "updated_at": "2026-06-11T15:03:48+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13441",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13441v1"
          },
          "relevance_score": 161,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13441"
        },
        {
          "title": "Reward Modeling for Multi-Agent Orchestration",
          "summary": "Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.",
          "authors": [
            "King Yeung Tsang",
            "Zihao Zhao",
            "Vishal Venkataramani",
            "Haizhou Shi",
            "Zixuan Ke",
            "Semih Yavuz",
            "Shafiq Joty",
            "Hao Wang"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.LG",
            "cs.MA"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13598v1",
          "abstract_url": "https://arxiv.org/abs/2606.13598v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13598v1",
          "published_at": "2026-06-11T17:16:24+00:00",
          "updated_at": "2026-06-11T17:16:24+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13598",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13598v1"
          },
          "relevance_score": 159,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13598"
        },
        {
          "title": "From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent",
          "summary": "Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% relatively. It also attains the highest win rates against baselines in human evaluation.",
          "authors": [
            "Haishuo Fang",
            "Yue Feng",
            "Iryna Gurevych"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13349v1",
          "abstract_url": "https://arxiv.org/abs/2606.13349v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13349v1",
          "published_at": "2026-06-11T13:38:23+00:00",
          "updated_at": "2026-06-11T13:38:23+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13349",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13349v1"
          },
          "relevance_score": 155,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13349"
        },
        {
          "title": "Operadic consistency: a label-free signal for compositional reasoning failures in LLMs",
          "summary": "Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \\in [0.86, 0.94]$, all $p \\leq 0.0004$), and is the only signal we evaluate with $r \\geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \\approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \\leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \\leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.",
          "authors": [
            "Nathaniel Bottman",
            "Yinhong Liu",
            "Kyle Richardson"
          ],
          "categories": [
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13649v1",
          "abstract_url": "https://arxiv.org/abs/2606.13649v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13649v1",
          "published_at": "2026-06-11T17:50:40+00:00",
          "updated_at": "2026-06-11T17:50:40+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "RAG"
          ],
          "doi": null,
          "arxiv_id": "2606.13649",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13649v1"
          },
          "relevance_score": 145,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13649"
        },
        {
          "title": "Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning",
          "summary": "When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types of failures because human reasoning uses principled and abstract world models. We evaluate human participants and 25 LLMs on their ability to engage in common-sense reasoning about a variety of everyday situations and observe similar patterns of errors in both people and models. We then identify the set of attention heads driving LLM responses and find that these heads implement a form of pattern-matching. These attention heads allow us to predict seemingly inexplicable reasoning errors in people caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern-matching than with abstract world models.",
          "authors": [
            "Zach Studdiford",
            "Gary Lupyan"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13607v1",
          "abstract_url": "https://arxiv.org/abs/2606.13607v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13607v1",
          "published_at": "2026-06-11T17:23:10+00:00",
          "updated_at": "2026-06-11T17:23:10+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.13607",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13607v1"
          },
          "relevance_score": 145,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13607"
        },
        {
          "title": "AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility",
          "summary": "Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.",
          "authors": [
            "Xiaoyuan Liu",
            "Jianhong Tu",
            "Yuqi Chen",
            "Siyuan Xie",
            "Sihan Ren",
            "Tianneng Shi",
            "Gal Gantar",
            "Evan Sandoval",
            "Donghyun Lee",
            "Daniel Miao",
            "Peter J. Gilbert",
            "Nick Hynes",
            "Mauro Staver",
            "Warren He",
            "David Marn",
            "Andrew Low",
            "Xi Zhang",
            "Elron Bandel",
            "Michal Shmueli-Scheuer",
            "Siva Reddy",
            "Alexandre Drouin",
            "Alexandre Lacoste",
            "Ramayya Krishnan",
            "Elham Tabassi",
            "Yu Su",
            "Victor Barres",
            "Chenguang Wang",
            "Wenbo Guo",
            "Dawn Song"
          ],
          "categories": [
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13608v1",
          "abstract_url": "https://arxiv.org/abs/2606.13608v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13608v1",
          "published_at": "2026-06-11T17:23:54+00:00",
          "updated_at": "2026-06-11T17:23:54+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13608",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13608v1"
          },
          "relevance_score": 141,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "summary matched \"coding agent\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13608"
        },
        {
          "title": "SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents",
          "summary": "Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.",
          "authors": [
            "Kunfeng Chen",
            "Qihuang Zhong",
            "Juhua Liu",
            "Bo Du"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13317v1",
          "abstract_url": "https://arxiv.org/abs/2606.13317v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13317v1",
          "published_at": "2026-06-11T13:12:10+00:00",
          "updated_at": "2026-06-11T13:12:10+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13317",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13317v1"
          },
          "relevance_score": 141,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13317"
        },
        {
          "title": "Automated reproducibility assessments in the social and behavioral sciences using large language models",
          "summary": "Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.",
          "authors": [
            "Tobias Holtdirk",
            "Pietro Marcolongo",
            "Anna Steinberg Schulten",
            "Felix Henninger",
            "Stefan Rose",
            "Sarah Ball",
            "Bolei Ma",
            "Frauke Kreuter",
            "Markus Weinmann",
            "Stefan Feuerriegel"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13670v1",
          "abstract_url": "https://arxiv.org/abs/2606.13670v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13670v1",
          "published_at": "2026-06-11T17:58:36+00:00",
          "updated_at": "2026-06-11T17:58:36+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.13670",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13670v1"
          },
          "relevance_score": 128,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13670"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [
        "《Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda》〔应用 / 方法〕：LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. We argue that symbolic structures alrea…",
        "《ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm》〔评测 / 方法〕：Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long…",
        "《Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents》〔方法〕：Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session m…",
        "《No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions》〔评测 / 应用 / 方法〕：As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden…",
        "《Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior》〔评测 / 方法〕：As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared foundation models pro…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda",
          "summary": "LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent's decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based monitoring: a structural foundation that prevents control-flow violations, while guardrails remain essential for catching semantic errors. We identify a structured set of neuro-symbolic research challenges on foundational and capability level and show that addressing them jointly enables compliance-by-construction. We call on the neuro-symbolic community to engage with regulated process automation as a high impact research domain.",
          "authors": [
            "Alexander Rombach",
            "Chantale Lauer",
            "Nijat Mehdiyev"
          ],
          "categories": [
            "cs.AI",
            "cs.MA"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13405v1",
          "abstract_url": "https://arxiv.org/abs/2606.13405v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13405v1",
          "published_at": "2026-06-11T14:34:11+00:00",
          "updated_at": "2026-06-11T14:34:11+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13405",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13405v1"
          },
          "relevance_score": 44,
          "match_reasons": [
            "summary matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13405"
        },
        {
          "title": "ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm",
          "summary": "Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-basedapproaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work,we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesisrather than sequential visual control. To validate this paradigm in the most demanding environments, weintroduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Ourexperiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero successunder GUI-based interaction, whereas COM-based execution yields substantial immediate gains. Tobridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, aself-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalableplatform for large-scale training in Windows containers. Extensive experiments show that ComActorachieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon taskswhere baselines collapse, and generalizes to external CAD benchmark.",
          "authors": [
            "Jiaxin Ai",
            "Tao Hu",
            "Xuemeng Yang",
            "Shu Zou",
            "Hairong Zhang",
            "Daocheng Fu",
            "Yu Yang",
            "Hongbin Zhou",
            "Nianchen Deng",
            "Pinlong Cai",
            "Zhongyuan Wang",
            "Botian Shi",
            "Kaipeng Zhang",
            "Licheng Wen"
          ],
          "categories": [
            "cs.SE",
            "cs.AI",
            "cs.CL",
            "cs.CV"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13239v1",
          "abstract_url": "https://arxiv.org/abs/2606.13239v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13239v1",
          "published_at": "2026-06-11T11:53:32+00:00",
          "updated_at": "2026-06-11T11:53:32+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Agent",
            "RAG"
          ],
          "doi": null,
          "arxiv_id": "2606.13239",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13239v1"
          },
          "relevance_score": 41,
          "match_reasons": [
            "summary matched \"computer-use agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13239"
        },
        {
          "title": "Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents",
          "summary": "Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.",
          "authors": [
            "Yujun Zhou",
            "Kehan Guo",
            "Haomin Zhuang",
            "Xiangqi Wang",
            "Yue Huang",
            "Zhenwen Liang",
            "Pin-Yu Chen",
            "Tian Gao",
            "Nuno Moniz",
            "Nitesh V. Chawla",
            "Xiangliang Zhang"
          ],
          "categories": [
            "cs.LG",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13174v1",
          "abstract_url": "https://arxiv.org/abs/2606.13174v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13174v1",
          "published_at": "2026-06-11T10:43:40+00:00",
          "updated_at": "2026-06-11T10:43:40+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13174",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13174v1"
          },
          "relevance_score": 40,
          "match_reasons": [
            "summary matched \"agent runtime\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13174"
        },
        {
          "title": "No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions",
          "summary": "As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also reveal that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.",
          "authors": [
            "Xu Yang",
            "Zhizhou Sha",
            "Junbo Li",
            "Jian Yu",
            "Yifan Sun",
            "Matthew Zhao",
            "Jinrui Fang",
            "Xinyue Guo",
            "Yining Wu",
            "Xu Hu",
            "Yifu Luo",
            "Qiang Liu",
            "Zhangyang Wang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13044v1",
          "abstract_url": "https://arxiv.org/abs/2606.13044v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13044v1",
          "published_at": "2026-06-11T08:30:18+00:00",
          "updated_at": "2026-06-11T08:30:18+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Prompt Injection"
          ],
          "doi": null,
          "arxiv_id": "2606.13044",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13044v1"
          },
          "relevance_score": 38,
          "match_reasons": [
            "summary matched \"prompt injection\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13044"
        },
        {
          "title": "Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior",
          "summary": "As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared foundation models produce correlated forecasts, and recent measurement finds frontier-model errors correlated at r ~ 0.77. We ask whether human cognitive diversity can be recovered from behavior and transferred to LLM agents. Nous extracts a structured eight-dimension behavioral profile from real Polymarket trading activity and injects it into agents through prompts. Our central finding is a dissociation between the two halves of that pipeline. Extraction works, partially: across 100 wallets, 8 of 14 parameters are temporally stable (split-half ICC >= 0.5, bootstrap CI lower bound > 0.3; contrarian score reaches ICC ~ 0.9); wallets are identifiable from their profiles well above chance (top-1 retrieval 17-22% vs. 1% chance); and two of four pre-specified dimensions rank-correlate with future realized profit out-of-sample, though the correlations do not survive behavioral-confound controls. Prompt-level injection does not measurably transmit it: on a semantic embedding metric, structured injection shows no significant advantage over a length-matched control on any model, and the diversity it induces neither reduces ensemble error correlation nor improves Brier score -- a null that persists across exploratory checks on sampling temperature, profile diversity, and question difficulty. Measuring the prompts themselves locates the compression before the model: the structure-to-narrative translator emits near-uniform prompts whose spread does not track profile spread. We position Nous as measuring the cognitive-monoculture problem and the limits of a prompt-level remedy, motivating deeper, below-the-prompt injection (fine-tuning, activation steering). Code, frozen profiles, prompts, and model outputs: https://github.com/WillChienT/nous-paper",
          "authors": [
            "Haowei Qian"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13038v1",
          "abstract_url": "https://arxiv.org/abs/2606.13038v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13038v1",
          "published_at": "2026-06-11T08:18:25+00:00",
          "updated_at": "2026-06-11T08:18:25+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.13038",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13038v1"
          },
          "relevance_score": 38,
          "match_reasons": [
            "summary matched \"prompt injection\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13038"
        }
      ]
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [
        "《Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset》〔数据 / 方法〕：AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev data…",
        "《Recursive Agent Harnesses》〔评测 / 应用 / 方法〕：Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset",
          "summary": "AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41\\% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. Our goal in this paper is to understand the failure modes of AI-agents, an understanding that is crucial for better integrating AI-agents as efficient teammates. In this paper, we conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes. We observe that developers can reject fixes due to fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).",
          "authors": [
            "Mahmoud Abujadallah",
            "Ali Arabat",
            "Mohammed Sayagh"
          ],
          "categories": [
            "cs.SE",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13468v1",
          "abstract_url": "https://arxiv.org/abs/2606.13468v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13468v1",
          "published_at": "2026-06-11T15:19:36+00:00",
          "updated_at": "2026-06-11T15:19:36+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "Agent",
            "Coding Agent"
          ],
          "doi": "10.1145/3793302.3793592",
          "arxiv_id": "2606.13468",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13468v1",
            "doi": "https://doi.org/10.1145/3793302.3793592"
          },
          "relevance_score": 57,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1145/3793302.3793592"
        },
        {
          "title": "Recursive Agent Harnesses",
          "summary": "Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic's dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Harness (RAH) and frame it as harness recursion, the code-first extension to the model recursion of RLMs. A parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks. We provide a controlled evaluation on long-context reasoning. With the backbone held fixed at GPT-5 to match the published Codex and RLM baselines, RAH improves the Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens), a gain attributable to the harness rather than the model. With a stronger backbone, Claude Sonnet 4.5, the same design reaches 89.77%.",
          "authors": [
            "Elias Lumer",
            "Sahil Sen",
            "Kevin Paul",
            "Vamse Kumar Subbiah"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.13643v1",
          "abstract_url": "https://arxiv.org/abs/2606.13643v1",
          "pdf_url": "https://arxiv.org/pdf/2606.13643v1",
          "published_at": "2026-06-11T17:47:30+00:00",
          "updated_at": "2026-06-11T17:47:30+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Agent",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.13643",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.13643v1"
          },
          "relevance_score": 47,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.13643"
        }
      ]
    }
  ]
}