{
  "generated_at": "2026-06-10T13:25:04.523670+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "主题「LLM」：命中 19 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains》、《Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution》。",
    "主题「Benchmark」：命中 16 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains》、《Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution》。",
    "主题「Language Model」：命中 6 篇，覆盖 LM，代表论文包括 《The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models》、《TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning》。",
    "主题「Agent」：命中 4 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation》、《Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields》。",
    "主题「Evaluation」：命中 3 篇，覆盖 Agent Runtime Security，代表论文包括 《Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories》、《When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models》。"
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 19,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains",
        "Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution",
        "ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity",
        "Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning",
        "Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?",
        "The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models",
        "Flaws in the LLM Automation Narrative",
        "ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models",
        "Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions",
        "Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models",
        "Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning",
        "Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models",
        "Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation",
        "Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation",
        "It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO",
        "Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization",
        "Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages",
        "AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies",
        "DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch"
      ],
      "key_points": [
        "《T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains》〔评测 / 应用 / 方法〕：Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing…",
        "《Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution》〔评测 / 方法〕：Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction fe…"
      ]
    },
    {
      "name": "Benchmark",
      "paper_count": 16,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains",
        "Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution",
        "ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity",
        "Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning",
        "Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?",
        "Flaws in the LLM Automation Narrative",
        "ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models",
        "TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning",
        "VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation",
        "Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models",
        "Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation",
        "Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields",
        "It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO",
        "Two-Way Confidential VMs (2cVM): Collaborative Confidential Computing for Mutually Distrustful Parties",
        "Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages",
        "DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch"
      ],
      "key_points": [
        "《T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains》〔评测 / 应用 / 方法〕：Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing…",
        "《Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution》〔评测 / 方法〕：Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction fe…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 6,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models",
        "TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning",
        "Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions",
        "Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning",
        "Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models",
        "Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation"
      ],
      "key_points": [
        "《The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models》〔方法〕：This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial c…",
        "《TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning》〔评测 / 方法〕：Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, r…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 4,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation",
        "Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields",
        "Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories",
        "AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies"
      ],
      "key_points": [
        "《VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation》〔评测 / 方法〕：Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture…",
        "《Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields》〔评测 / 应用 / 方法〕：Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evalua…"
      ]
    },
    {
      "name": "Evaluation",
      "paper_count": 3,
      "feed_names": [
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories",
        "When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models",
        "Two-Way Confidential VMs (2cVM): Collaborative Confidential Computing for Mutually Distrustful Parties"
      ],
      "key_points": [
        "《Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories》〔评测 / 方法〕：Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature tak…",
        "《When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models》〔评测 / 数据 / 应用 / 方法〕：Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, ye…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "《T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains》〔评测 / 应用 / 方法〕：Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing…",
        "《Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution》〔评测 / 方法〕：Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction fe…",
        "《ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity》〔评测 / 应用 / 方法〕：Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental da…",
        "《Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning》〔评测 / 应用 / 方法〕：Tuning controllers for strongly coupled multi-input multi-output (MIMO) industrial processes is hard: decentralized classical auto-tuning ignores loop interact…",
        "《Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?》〔评测 / 应用 / 方法〕：The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade producti…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains",
          "summary": "Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and often fail to capture interactions that span multiple domains, limiting their ability to evaluate agents in realistic multi-step settings that require sustained reasoning and coordination. To address these limitations, we introduce T1-Bench, a high-fidelity, comprehensive benchmark for evaluating agentic systems in realistic customer-facing, multi-domain environments, featuring interleaved scenarios that require structured reasoning across multi-turn user-assistant interactions and substantially increasing both compositional complexity and evaluative rigor across 25 domains of varying difficulty. We evaluate T1-Bench using 12 proprietary and open-weight models, providing a reproducible and standardized framework for assessing agent behavior, tool utilization, and conversational quality in complex, multi-step environments. We further complement automatic evaluation with human judgments to strengthen the assessment of qualitative performance. Overall, T1-Bench substantially advances prior benchmarks by increasing task complexity, interaction depth, and domain coverage in simulated multi-domain environments. To facilitate future research on agentic systems, we will publicly release data and evaluation code as open source.",
          "authors": [
            "Genta Indra Winata",
            "Amartya Chakraborty",
            "Yuzhen Lin",
            "Swasthi P Rao",
            "Shikhhar Siingh",
            "Houhan Lu",
            "Nadia Bathaee",
            "Sriharsha Hatwar",
            "Paresh Dashore",
            "Anmol Jain",
            "Kshitij Tayal",
            "Xiuzhu Lin",
            "Anirban Das",
            "Sambit Sahu",
            "Shi-Xiong Zhang"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11070v1",
          "abstract_url": "https://arxiv.org/abs/2606.11070v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11070v1",
          "published_at": "2026-06-09T16:32:14+00:00",
          "updated_at": "2026-06-09T16:32:14+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.11070",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11070v1"
          },
          "relevance_score": 217,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11070"
        },
        {
          "title": "Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution",
          "summary": "Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by inefficient interaction feedback and static training environments, which hinder broader generalization. To address these limitations, this paper introduces Role-Agent, \\textcolor{black}{a framework} that harnesses a single LLM to function concurrently as both the agent and the environment, enabling a bootstrapped co-evolution. Role-Agent comprises two synergistic components: World-In-Agent (WIA) and Agent-In-World (AIW). In WIA, the LLM acts as the agent and predicts future states after each action; the alignment between predicted and actual states is then used as a process reward, encouraging environment-aware reasoning. In AIW, the LLM analyzes failure modes from failed trajectories and retrieves tasks with similar failure patterns, thereby reshaping the training data distribution for targeted practice. Experiments on multiple benchmarks show that Role-Agent consistently improves performance, yielding an average gain of over 4\\% over strong baselines.",
          "authors": [
            "Xucong Wang",
            "Ziyu Ma",
            "Shidong Yang",
            "Tongwen Huang",
            "Pengkun Wang",
            "Yong Wang",
            "Xiangxiang Chu"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10917v1",
          "abstract_url": "https://arxiv.org/abs/2606.10917v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10917v1",
          "published_at": "2026-06-09T14:28:07+00:00",
          "updated_at": "2026-06-09T14:28:07+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.10917",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10917v1"
          },
          "relevance_score": 215,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10917"
        },
        {
          "title": "ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity",
          "summary": "Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they also shift the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of tasks to measure agentic biosecurity-relevant capabilities. ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks. Agents performed highly on tasks drawing on published knowledge and well-documented protocols, and more weakly on a task requiring novel bioinformatics reasoning. In three wet-lab validation experiments, we found that OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.",
          "authors": [
            "Andrew Bo Liu",
            "Samira Nedungadi",
            "Bryce Cai",
            "Alex Kleinman",
            "Harmon Bhasin",
            "Seth Donoughe"
          ],
          "categories": [
            "cs.AI",
            "cs.CY"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11150v1",
          "abstract_url": "https://arxiv.org/abs/2606.11150v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11150v1",
          "published_at": "2026-06-09T17:35:37+00:00",
          "updated_at": "2026-06-09T17:35:37+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.11150",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11150v1"
          },
          "relevance_score": 200,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11150"
        },
        {
          "title": "Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning",
          "summary": "Tuning controllers for strongly coupled multi-input multi-output (MIMO) industrial processes is hard: decentralized classical auto-tuning ignores loop interaction, and local numerical optimization from natural initializations stalls in the resulting non-convex cost landscape. We ask whether on-premise open-source large language models (LLMs), which keep data on-site and need no plant model, can help. On a single-loop CSTR, classical relay-feedback tuning (IAE 0.106, near the 0.102 optimum) beats an LLM tuner (0.162): for simple loops the LLM adds nothing. The picture inverts on a strongly coupled quadruple-tank with conflicting set-points, scored by a penalized cost J = IAE + lambda*TV(u) that rewards tracking without chattering actuators. There, naive relay tuning (J ~ 28.6) and naive LLM tuning (29.7) are no better than open loop (22.7), and a local optimizer from balanced starts fails in 10/10 runs. A scaffolded open LLM instead reasons about the coupling, proposes the counter-intuitive asymmetric structure, and reaches J ~ 16.9 +/- 0.2 from any start; refining it with a classical optimizer attains the smooth global optimum (J ~ 12.0, 10/10 vs. 0/10), which even applies a non-obvious negative integral correction decentralized tuning cannot. A global optimizer (differential evolution) also reaches this optimum, so the LLM is not the only route; its advantage is sample efficiency and interpretability: a usable controller in 18 evaluations (where the global optimizer is worse than open loop) plus a stated rationale. This edge grows with dimension, reaching ~6x fewer evaluations on a 3x3 plant. The behaviour generalizes across four open models, and on a benign plant the LLM offers no advantage, sharpening the boundary. We contribute a reproducible benchmark delimiting when open LLMs help in control tuning: not as optimizers, but as a sample-efficient, interpretable structural prior.",
          "authors": [
            "Jiaxuan Chen",
            "Haonan Li",
            "Yang Shu"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11015v1",
          "abstract_url": "https://arxiv.org/abs/2606.11015v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11015v1",
          "published_at": "2026-06-09T15:53:40+00:00",
          "updated_at": "2026-06-09T15:53:40+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.11015",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11015v1"
          },
          "relevance_score": 180,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11015"
        },
        {
          "title": "Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?",
          "summary": "The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.",
          "authors": [
            "Tengchao Lv",
            "Dongdong Zhang",
            "Jiayu Ding",
            "Yilin Jia",
            "Yuzhong Zhao",
            "Yupan Huang",
            "Wenshan Wu",
            "Xiangyang Zhou",
            "Shaohan Huang",
            "Nan Yang",
            "Li Dong",
            "Lei Cui",
            "Furu Wei"
          ],
          "categories": [
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10956v1",
          "abstract_url": "https://arxiv.org/abs/2606.10956v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10956v1",
          "published_at": "2026-06-09T14:59:14+00:00",
          "updated_at": "2026-06-09T14:59:14+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.10956",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10956v1"
          },
          "relevance_score": 175,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10956"
        },
        {
          "title": "The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models",
          "summary": "This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. Six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, and DeepSeek-R1) participate in a between-groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language of play (English versus Turkish), producing 586 validated statements. A zero-shot classifier assesses behavioral dispositions along two continuous dimensions: Concession Rate and Coercive Rhetoric. The results are heterogeneous. Llama-4 shows a substantial, Holm-corrected increase in coercive rhetoric under Turkish (delta = +0.800, p = .002), whereas Gemini-3.1-Pro displays an equally large decrease (delta = -0.750, p = .005). DeepSeek-R1 exhibits a similar negative shift (delta = -0.860, p = .006) and provides chain-of-thought evidence consistent with a buffering mechanism. GPT-4o shows no detectable effect (delta = +0.130, p = .614). These findings indicate that cross-lingual behavioral skew is contingent on model architecture and training regime rather than a universal property of Western-origin LLMs. We identify two distinct buffering mechanisms, chain-of-thought institutional anchoring and multilingual RLHF alignment, and discuss their implications for integrating LLMs safely into diplomatic and crisis-management settings.",
          "authors": [
            "Hakan Mehmetcik"
          ],
          "categories": [
            "cs.CL",
            "cs.CY"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11082v1",
          "abstract_url": "https://arxiv.org/abs/2606.11082v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11082v1",
          "published_at": "2026-06-09T16:42:00+00:00",
          "updated_at": "2026-06-09T16:42:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.11082",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11082v1"
          },
          "relevance_score": 163,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"agent\"",
            "summary matched \"alignment\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11082"
        },
        {
          "title": "Flaws in the LLM Automation Narrative",
          "summary": "Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claims are primarily based on how LLMs perform on benchmarking tasks that measure average performance across standardized datasets. Primary limitations of many benchmarking tasks are that they often measure performance based on content directly included in LLM training data, and they frequently do not assess the reliability of LLM performance or the magnitude of LLM errors. However, in high stakes contexts, these qualities are critically important. Through a novel LLM benchmarking task that requires writing computer code to complete a data analysis task, we compare the performance of a frontier LLM against submissions from human experts and explicitly measure the variance of responses and the magnitude of errors. Our study reveals that the human experts perform better on average on a range of metrics and demonstrate less variability in performance. Our results provide evidence that LLMs do not consistently perform at the level of human experts and demonstrate the importance of measuring variance and assessing error magnitude in LLM benchmark evaluations.",
          "authors": [
            "George Perrett",
            "Javae Elliott",
            "Jennifer Hill",
            "Marc Scott"
          ],
          "categories": [
            "stat.OT",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11166v1",
          "abstract_url": "https://arxiv.org/abs/2606.11166v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11166v1",
          "published_at": "2026-06-09T17:46:10+00:00",
          "updated_at": "2026-06-09T17:46:10+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.11166",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11166v1"
          },
          "relevance_score": 160,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11166"
        },
        {
          "title": "ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models",
          "summary": "Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern which we call ``\\textit{Reasoning Wave}'', while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME~2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.",
          "authors": [
            "Wenhao Liu",
            "Hao Shi",
            "Yunhe Li",
            "Weizhi Fei",
            "Xiangyuan Wang",
            "Mengzhe Ruan",
            "Hanxu Hou",
            "Peisong Wang",
            "Linqi Song",
            "Shuang Qiu"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11164v1",
          "abstract_url": "https://arxiv.org/abs/2606.11164v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11164v1",
          "published_at": "2026-06-09T17:44:23+00:00",
          "updated_at": "2026-06-09T17:44:23+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.11164",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11164v1"
          },
          "relevance_score": 160,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11164"
        },
        {
          "title": "TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning",
          "summary": "Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.",
          "authors": [
            "Heming Zou",
            "Qi Wang",
            "Yun Qu",
            "Yuhang Jiang",
            "Lizhou Cai",
            "Yixiu Mao",
            "Ru Peng",
            "Xin Xu",
            "Weijie Liu",
            "Kai Yang",
            "Saiyong Yang",
            "Xiangyang Ji"
          ],
          "categories": [
            "cs.LG",
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11119v1",
          "abstract_url": "https://arxiv.org/abs/2606.11119v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11119v1",
          "published_at": "2026-06-09T17:16:03+00:00",
          "updated_at": "2026-06-09T17:16:03+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.11119",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11119v1"
          },
          "relevance_score": 159,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11119"
        },
        {
          "title": "Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions",
          "summary": "As artificial intelligence and machine learning (AI/ML) models become integral to network operations, their lack of transparency poses a significant barrier to operator trust. Existing explainable artificial intelligence (XAI) techniques often fail to bridge this gap for non-specialists, producing technical outputs that are difficult to translate into actionable insights. This paper presents a framework specifically designed to address this shortcoming. It leverages a moderately sized large language model (LLM) and extends beyond the standard use of SHapley Additive exPlanations (SHAP) feature influence values. The framework employs a structured prompt enriched with mutual feature interaction data to generate human-understandable natural language explanations. To validate our framework, we performed an empirical evaluation on an optical quality of transmission (QoT) estimation use case with human evaluators. We collected independent performance evaluations from specialists, which showed a high inter-evaluator agreement. Compared to a state-of-the-art baseline that uses only SHAP feature influence values in a straightforward prompt, our approach improves the explanation usefulness and scope by 12.2% and 6.2%, while achieving 97.5% correctness.",
          "authors": [
            "Kiarash Rezaei",
            "Omran Ayoub",
            "Sebastian Troia",
            "Francesco Lelli",
            "Paolo Monti",
            "Carlos Natalino"
          ],
          "categories": [
            "cs.NI",
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10942v1",
          "abstract_url": "https://arxiv.org/abs/2606.10942v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10942v1",
          "published_at": "2026-06-09T14:48:26+00:00",
          "updated_at": "2026-06-09T14:48:26+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": "10.1109/wimob66857.2025.11257542",
          "arxiv_id": "2606.10942",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10942v1",
            "doi": "https://doi.org/10.1109/wimob66857.2025.11257542"
          },
          "relevance_score": 151,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1109/wimob66857.2025.11257542"
        },
        {
          "title": "VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation",
          "summary": "Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.",
          "authors": [
            "Yunan Lu",
            "Ryan Shea",
            "Yusen Zhang",
            "Zhou Yu"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11079v1",
          "abstract_url": "https://arxiv.org/abs/2606.11079v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11079v1",
          "published_at": "2026-06-09T16:39:32+00:00",
          "updated_at": "2026-06-09T16:39:32+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.11079",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11079v1"
          },
          "relevance_score": 145,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"evaluation\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11079"
        },
        {
          "title": "Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models",
          "summary": "Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, such as safe refusal, bias avoidance, and privacy protection. We ask: does this conversion preserve alignment? We study this question through a trustworthiness audit and find that it is not behavior-preserving by default. For a systematic analysis, we compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. We observe that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. Overall, our results point to the broader conclusion that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside gains in reasoning capability.",
          "authors": [
            "Prajakta Kini",
            "Avinash Reddy",
            "Souradip Chakraborty",
            "Satya Sai Srinath Namburi GNVV",
            "Furong Huang",
            "Amrit Singh Bedi",
            "Alvaro Velasquez"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11046v1",
          "abstract_url": "https://arxiv.org/abs/2606.11046v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11046v1",
          "published_at": "2026-06-09T16:14:27+00:00",
          "updated_at": "2026-06-09T16:14:27+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.11046",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11046v1"
          },
          "relevance_score": 144,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"alignment\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11046"
        },
        {
          "title": "Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning",
          "summary": "Large language model unlearning aims to suppress designated undesirable knowledge while preserving benign capabilities. Many unlearning objectives focus on suppressing undesired answers, while recent target-guided variants specify replacement behavior but still leave update locality largely unconstrained. This paper introduces \\emph{Null-Space Constrained Response-Specified Unlearning} (NSRU), a projection-constrained low-rank framework for controlled LLM unlearning. NSRU uses an explicitly structured safe target response to specify the desired behavior for each forget query, while suppressing the original undesired content. To localize adaptation, NSRU estimates per-module retain subspaces from benign hidden representations and uses an orthogonal-projected low-rank parameterization to confine LoRA updates to the null space of the retain subspace. The resulting objective jointly optimizes safe-target learning, undesired-response suppression, and retention preservation under this constrained parameterization. We provide a local first-order analysis showing that the projected update reduces retain-side perturbations while preserving editable directions for shaping forget-query behavior. Experiments on TOFU show that NSRU effectively suppresses extractable forget-set knowledge while improving retain QA performance, model utility, and safe-target alignment over representative baselines. On WMDP, NSRU keeps hazardous-domain accuracy near the random-choice region while preserving broad and domain-adjacent MMLU utility. Ablation studies support the complementary roles of safe-target supervision, undesired-response suppression, retention loss, and null-space projected updates, while sensitivity and robustness analyses indicate stable behavior across the tested hyperparameter and prompt variations.",
          "authors": [
            "Bocheng Ju",
            "Jianhua Wang",
            "Chengliang Liu",
            "Xiaolin Chang"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10989v1",
          "abstract_url": "https://arxiv.org/abs/2606.10989v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10989v1",
          "published_at": "2026-06-09T15:26:36+00:00",
          "updated_at": "2026-06-09T15:26:36+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.10989",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10989v1"
          },
          "relevance_score": 144,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"alignment\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10989"
        },
        {
          "title": "Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models",
          "summary": "With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality conditioning and establishes a systematic evaluation framework encompassing single-personality induction, multi-personality induction, and personality switching. Experiments show that personality induction improves image captioning performance but can impair performance on tasks requiring precise reasoning, such as visual question answering (VQA). Balancing and residual effects are observed during multi-trait composition and dynamic switching, indicating that model behavior is co-modulated by both previous and current personality constraints. Existing prompt-based personality induction methods show limited transferability to multimodal settings. Our work reveals the dynamic and complex nature of personality modeling in MLLMs and underscores the need for robust, tailored methods for personality induction and evaluation. The code will be released when the paper is accepted.",
          "authors": [
            "Peiqi Jia",
            "Haonan Jia",
            "Ziqi Miao",
            "Linkang Du",
            "Yuntao Wang",
            "Zhou Su"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11074v1",
          "abstract_url": "https://arxiv.org/abs/2606.11074v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11074v1",
          "published_at": "2026-06-09T16:34:37+00:00",
          "updated_at": "2026-06-09T16:34:37+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.11074",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11074v1"
          },
          "relevance_score": 141,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11074"
        },
        {
          "title": "Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation",
          "summary": "Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use performance, covering the stages of knowledge acquisition, activation, and internalization. In the knowledge acquisition stage, we acquire and evaluate various forms of experiential knowledge, and our analysis shows that simple instance-level knowledge can already provide strong and reliable gains, while abstract intent-level knowledge offers limited benefits. At inference time, to activate knowledge, we find that prompting LLM to expand the depth of reasoning yields diminishing returns, whereas expanding the width of reasoning by parallel sampling with aggregation more effectively activates latent experiential knowledge. At training time, for knowledge internalization, post-training with knowledge-augmented data further improves performance, with reinforcement learning outperforming supervised fine-tuning. Based on these insights, we propose the Knowledge-Augmented Tool Execution (KATE), a knowledge-augmented tool execution framework that integrates experiential knowledge with reasoning-width-expanded inference and knowledge-aware training. Experiments on BFCL-V3 and AppWorld demonstrate consistent and substantial improvements over strong baselines across model scales. Our Code is available at https://github.com/hypasd-art/KATE.",
          "authors": [
            "Yupu Hao",
            "Zhuoran Jin",
            "Huanxuan Liao",
            "Kang Liu",
            "Jun Zhao"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10875v1",
          "abstract_url": "https://arxiv.org/abs/2606.10875v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10875v1",
          "published_at": "2026-06-09T13:51:32+00:00",
          "updated_at": "2026-06-09T13:51:32+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2606.10875",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10875v1"
          },
          "relevance_score": 138,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10875"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [
        "《Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation》〔评测 / 应用 / 方法〕：Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on…",
        "《Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields》〔评测 / 应用 / 方法〕：Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evalua…",
        "《Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories》〔评测 / 方法〕：Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature tak…",
        "《It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO》〔评测 / 方法〕：Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-trainin…",
        "《Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization》〔方法〕：Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation",
          "summary": "Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintain memory, and act on external environments. This transition changes the nature of security risk. In agentic settings, failures are no longer limited to unsafe text generation. Untrusted content may redirect control flow, misuse tool privileges, corrupt persistent state, leak sensitive information, or trigger harmful external actions. At the same time, research on LLM agent security is expanding quickly but remains fragmented across attack families, defense layers, application domains, and evaluation settings. This paper synthesizes 247 papers through a lifecycle-based, systems-oriented framework that models agent security around the interaction of information flow, delegated authority, and persistent state. We organize the literature around four questions: how LLM agent security should be modeled, which threat surfaces and attack families dominate, what defenses have been proposed and with what tradeoffs, and how security claims are evaluated. We find that prompt injection and tool-mediated control-flow hijacking still dominate the field, while persistent state corruption and multi-agent propagation are becoming central emerging concerns. We further find that current defenses provide useful building blocks but remain weakly compositional, and that existing benchmarks still underrepresent long-horizon, stateful, and deployment-sensitive risks. We argue that secure LLM agents require explicit trust boundaries, principled privilege control, provenance-aware state management, and evaluation practices aligned with realistic operational settings.",
          "authors": [
            "Yuchen Ling",
            "Shengcheng Yu",
            "Zhenyu Chen",
            "Chunrong Fang"
          ],
          "categories": [
            "cs.CR",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10749v1",
          "abstract_url": "https://arxiv.org/abs/2606.10749v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10749v1",
          "published_at": "2026-06-09T12:01:07+00:00",
          "updated_at": "2026-06-09T12:01:07+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.10749",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10749v1"
          },
          "relevance_score": 78,
          "match_reasons": [
            "summary matched \"agent security\"",
            "summary matched \"LLM agent security\"",
            "summary matched \"prompt injection\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10749"
        },
        {
          "title": "Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields",
          "summary": "Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.",
          "authors": [
            "Liya Zhu",
            "Jingzhe Ding",
            "Jian Zhang",
            "Jianbo Xue",
            "Shihao Liang",
            "Ge Zhang",
            "Xiang Gao",
            "Qingshui Gu",
            "Mailun Gao",
            "Huimin Che",
            "Yan Zhao",
            "Peiheng Zhou",
            "Haojun Wang",
            "Chaobo Xian",
            "Lili Le",
            "Chi Wu",
            "Yiwei Liu",
            "Shengda Long",
            "Jiale Yang",
            "Fangzhi Xu",
            "Sijin Wu",
            "Haodong Duan",
            "Yi Zhu",
            "Chao He",
            "Zhaojian Li",
            "Minchao Wang",
            "Huan Zhou",
            "Jiani Hou",
            "Chuqian Yu",
            "Weiran Shi",
            "Hongwan Gao",
            "Jiamin Chen",
            "Guanhong Chen",
            "Tingqin Luo",
            "Kaiyuan Zhang",
            "Zhixin Yao",
            "Qing Hua",
            "Yuhao Jiang",
            "Jin Chen",
            "Pu Chen",
            "Zhenyu Hu",
            "Xingyu Li",
            "Zhengxuan Jiang",
            "Meng Cao",
            "Tianfeng Long",
            "Haozhe Wang",
            "Mingzhang Wang",
            "Yichen Zhang",
            "Yiming Dai",
            "Chenchen Zhang",
            "Jiaying Wang",
            "Zhiyong Wu",
            "Shen Yan",
            "Yujia Qin",
            "Wenhao Huang",
            "Zaiyuan Wang",
            "Xiaolong Chang"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11042v1",
          "abstract_url": "https://arxiv.org/abs/2606.11042v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11042v1",
          "published_at": "2026-06-09T16:10:16+00:00",
          "updated_at": "2026-06-09T16:10:16+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.11042",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11042v1"
          },
          "relevance_score": 68,
          "match_reasons": [
            "title matched \"computer-use agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11042"
        },
        {
          "title": "Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories",
          "summary": "Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io.",
          "authors": [
            "Kevin Qinghong Lin",
            "Batu EI",
            "Yuhong Shi",
            "Pan Lu",
            "Philip Torr",
            "James Zou"
          ],
          "categories": [
            "cs.CV",
            "cs.CL",
            "cs.CY",
            "cs.HC"
          ],
          "paper_id": "http://arxiv.org/abs/2606.11176v1",
          "abstract_url": "https://arxiv.org/abs/2606.11176v1",
          "pdf_url": "https://arxiv.org/pdf/2606.11176v1",
          "published_at": "2026-06-09T17:51:55+00:00",
          "updated_at": "2026-06-09T17:51:55+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Agent",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2606.11176",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.11176v1"
          },
          "relevance_score": 48,
          "match_reasons": [
            "summary matched \"computer-use agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.11176"
        },
        {
          "title": "It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO",
          "summary": "Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.",
          "authors": [
            "Naihao Deng",
            "Yilun Zhu",
            "Naichen Shi",
            "Clayton Scott",
            "Rada Mihalcea"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10931v1",
          "abstract_url": "https://arxiv.org/abs/2606.10931v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10931v1",
          "published_at": "2026-06-09T14:44:01+00:00",
          "updated_at": "2026-06-09T14:44:01+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.10931",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10931v1"
          },
          "relevance_score": 45,
          "match_reasons": [
            "summary matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10931"
        },
        {
          "title": "Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization",
          "summary": "Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more broadly, leaves models without a principled way to resolve conflicts between legitimate but competing instructions. A common training-based response is to teach models an explicit instruction hierarchy; existing approaches, however, formalize hierarchies of only three or four levels, treat all violations as equally severe, and rarely evaluate the full set of pairwise level interactions. We formalize a k-level instruction hierarchy problem and instantiate it for k=5, yielding ten pairwise priority relations that a compliant model must enforce. We then introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting levels under a linear or bilateral schedule, the latter weighting severity by both the privilege gap and the privilege of the victim level. Combined with hierarchy-specific delimiter tokens (Chen et al., 2025) and Instructional Segment Embeddings (ISE; Wu et al., 2025), GW-DPO with the bilateral schedule Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct, raising macro pairwise priority adherence while keeping over-refusal at half the standard DPO rate. Ablations isolate ISE as a refusal-threshold calibrator and recast five- versus three-level training as a generality-specialization tradeoff.",
          "authors": [
            "Lena S. Bolliger",
            "Lena A. Jäger"
          ],
          "categories": [
            "cs.CR",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10860v1",
          "abstract_url": "https://arxiv.org/abs/2606.10860v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10860v1",
          "published_at": "2026-06-09T13:39:17+00:00",
          "updated_at": "2026-06-09T13:39:17+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Prompt Injection"
          ],
          "doi": null,
          "arxiv_id": "2606.10860",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10860v1"
          },
          "relevance_score": 44,
          "match_reasons": [
            "summary matched \"prompt injection\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10860"
        },
        {
          "title": "When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models",
          "summary": "Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic - the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined failure cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox where explicit monitoring cues paradoxically increase alignment-faking rates rather than suppress them, and a context-injection failure where models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.",
          "authors": [
            "Sai Kartheek Reddy Kasu",
            "Nils Lukas",
            "Samuele Poppi"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10740v1",
          "abstract_url": "https://arxiv.org/abs/2606.10740v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10740v1",
          "published_at": "2026-06-09T11:50:28+00:00",
          "updated_at": "2026-06-09T11:50:28+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "Evaluation",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2606.10740",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10740v1"
          },
          "relevance_score": 42,
          "match_reasons": [
            "summary matched \"jailbreak\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10740"
        },
        {
          "title": "Two-Way Confidential VMs (2cVM): Collaborative Confidential Computing for Mutually Distrustful Parties",
          "summary": "Collaborative computation across organizations is often constrained by the need to process sensitive data and proprietary code without exposing them to untrusted infrastructure or participants. Cryptographic approaches such as fully homomorphic encryption and secure multi-party computation provide strong confidentiality but remain impractical for general workloads due to their extreme computational cost. We present the Two-Way Confidential Virtual Machine (2cVM), a two-layer architecture that pairs a hardware trusted execution environment with an intra-workload isolation layer. Unlike regular Confidential Virtual Machines, 2cVM enforces mutual isolation between co-resident workloads, ensuring that participants retain control over their data and code. All computation in 2cVM is governed by a Commitment Manifest that enumerates participants, component composition, permitted data channels, and authorized outputs; the manifest is locked to the VM and incorporated into attestation evidence, making the policy immutable and independently verifiable throughout the VM's lifetime. A proof-of-concept realization combines AMD SEV-SNP for hardware protection with the WebAssembly Component Model for fine-grained sandboxing of participant code. Evaluation on commodity hardware across four benchmark classes shows that the two isolation layers do not accumulate linearly: once a workload executes inside the WebAssembly sandbox, the marginal cost of enabling hardware memory protection is small. Overhead is workload-dependent, governed primarily by memory access pattern, ranging from negligible for sequential workloads to approximately 2x for irregular, pointer-chasing access patterns. These results indicate that 2cVM provides a practical and verifiable foundation for privacy-preserving collaborative computation.",
          "authors": [
            "Jordi Thijsman",
            "Merlijn Sebrechts",
            "Stefan Lefever",
            "Filip De Turck",
            "Bruno Volckaert"
          ],
          "categories": [
            "cs.CR"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10615v1",
          "abstract_url": "https://arxiv.org/abs/2606.10615v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10615v1",
          "published_at": "2026-06-09T09:15:15+00:00",
          "updated_at": "2026-06-09T09:15:15+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2606.10615",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10615v1"
          },
          "relevance_score": 39,
          "match_reasons": [
            "summary matched \"sandboxing\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10615"
        }
      ]
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [
        "《Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages》〔评测 / 方法〕：LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. These benchmarks…",
        "《AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies》〔方法〕：Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not…",
        "《DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch》〔评测 / 数据 / 应用 / 方法〕：As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward arc…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages",
          "summary": "LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositories. These benchmarks remain important, but they can hide how agents behave when the language itself is unfamiliar. We evaluate six contemporary coding agents on four esoteric programming languages using a sequential setup with file editing, local execution, and hidden-test grading. Our protocol exposes capability differences between these agents that mainstream coding and agentic benchmarks such as SWE-Bench Verified and Terminal-Bench 2.0 compress into much narrower bands. We observe that the strongest agents, Claude Opus 4.6 and GPT-5.4 xhigh, often avoid writing the target language directly. On Brainfuck and Befunge-98, they write Python programs that generate target-language code and debug those generators locally. Forbidding this metaprogramming strategy causes large performance drops. Text guidance distilled from this strategy does not materially improve weaker agents. In contrast, Opus-derived Python helper code for building generators, with no solved benchmark programs or hidden-test answers, sharply improves Sonnet 4.6 and GPT-5.4 mini on the same problems, while Haiku 4.5 remains low. More interpreter calls and output tokens improve stronger agents but leave weaker agents near their original performance, indicating that these resources amplify useful strategies rather than create them. Together, these results show that strong coding agents adapt to unfamiliar languages by using tools, feedback, and workspace state to build a working model of the target language. Metaprogramming is the clearest case, but the broader gap is constructing and debugging a strategy that works under the target language's rules.",
          "authors": [
            "Aman Sharma",
            "Sushrut Thorat",
            "Paras Chopra"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10933v1",
          "abstract_url": "https://arxiv.org/abs/2606.10933v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10933v1",
          "published_at": "2026-06-09T14:44:43+00:00",
          "updated_at": "2026-06-09T14:44:43+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.10933",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10933v1"
          },
          "relevance_score": 103,
          "match_reasons": [
            "title matched \"coding agent\"",
            "summary matched \"Terminal-Bench\"",
            "summary matched \"SWE-bench\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10933"
        },
        {
          "title": "AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies",
          "summary": "Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but a numerical solver strategy, a set of decisions about discretization, stabilization, solver configuration, and resolution control, that matches the PDE structure. Recent LLM-based coding agents have begun to reduce the programming burden by generating and debugging solver implementations. However, they typically move directly from a PDE problem to solver code, leaving the solver strategy implicit in implementation details. Feedback from a failed solve is therefore routed back to code edits rather than to the underlying strategy, so numerical decisions remain hard to check before code is generated and hard to revise using numerical evidence when it fails. To address this limitation, we propose AutoPDE, a code agent that maintains the solver strategy as an explicitly represented object throughout the solving process: an independent, inspectable object that is built before any code is written and can be revised, using numerical evidence, whenever a solve fails. AutoPDE builds and maintains this object in three stages, all drawing from a library of reusable PDE-solving skills: PDE analysis identifies the equation type and algebraic structure; numerical method selection chooses a numerical method that matches the analysis result and commits to a discretization, stabilization, and linear solver accordingly; and adaptive tuning runs low-cost pilot solves to calibrate resolution and tolerances under the prescribed accuracy and runtime budget. We evaluate AutoPDE on the PDE Agent Bench, where experimental results show that AutoPDE achieves a pass rate of $54.5%$, improving over the strongest baseline by $14.2$ percentage points.",
          "authors": [
            "Huanshuo Dong",
            "Keyao Zhang",
            "Hong Wang",
            "Zhezheng Hao",
            "Zhiwei Zhuang",
            "Ziyan Liu",
            "Jiacong Wang",
            "Gengyuan Liu",
            "Xin Jin"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10752v1",
          "abstract_url": "https://arxiv.org/abs/2606.10752v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10752v1",
          "published_at": "2026-06-09T12:02:58+00:00",
          "updated_at": "2026-06-09T12:02:58+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2606.10752",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10752v1"
          },
          "relevance_score": 60,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "summary matched \"code agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10752"
        },
        {
          "title": "DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch",
          "summary": "As the capabilities of LLM-based code agents continue to advance, their expected role is expanding beyond localized bug fixing in existing codebases toward architecting and implementing complete software repositories from high-level specifications. However, training agents for such long-horizon software engineering tasks remains difficult due to the scarcity of large-scale, verifiable whole-repository generation data. In this paper, we introduce \\textbf{DeNovoSWE}, a large-scale dataset for whole-repository generation. DeNovoSWE comprises 4,818 high-quality instances, where each instance requires generating a complete repository from documentation. Our dataset is automatically constructed through a carefully designed sandboxed agentic workflow, enabling scalable curation without human annotation. DeNovoSWE is constructed with \"divide and conquer\" and critic-repair philosophy. To balance data quality and diversity, we further introduce a difficulty-aware trajectory filtering strategy. Fine-tuning Qwen3-30B-A3B on DeNovoSWE substantially improves long-horizon SWE performance, raising its score on the challenging BeyondSWE-Doc2Repo benchmark from 5.8% to 47.2%.",
          "authors": [
            "Jiale Zhao",
            "Guoxin Chen",
            "Fanzhe Meng",
            "Wayne Xin Zhao",
            "Ruihua Song",
            "Ji-Rong Wen",
            "Kai Jia"
          ],
          "categories": [
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2606.10728v1",
          "abstract_url": "https://arxiv.org/abs/2606.10728v1",
          "pdf_url": "https://arxiv.org/pdf/2606.10728v1",
          "published_at": "2026-06-09T11:37:15+00:00",
          "updated_at": "2026-06-09T11:37:15+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2606.10728",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2606.10728v1"
          },
          "relevance_score": 60,
          "match_reasons": [
            "summary matched \"code agent\"",
            "summary matched \"bug fixing\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2606.10728"
        }
      ]
    }
  ]
}