{
  "generated_at": "2026-05-18T13:13:17.749999+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "主题「LLM」：命中 15 篇，覆盖 LM，代表论文包括 《CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency》、《FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models》。",
    "主题「Language Model」：命中 14 篇，覆盖 LM，代表论文包括 《CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency》、《FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models》。",
    "主题「Evaluation」：命中 1 篇，覆盖 LM，代表论文包括 《Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP》。"
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 15,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency",
        "FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models",
        "MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models",
        "Large Language Models Could Be Rote Learners",
        "Look Before You Leap: Autonomous Exploration for LLM Agents",
        "Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP",
        "DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation",
        "DiscussLLM: Teaching Large Language Models When to Speak",
        "ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models",
        "PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization",
        "SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation",
        "Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language",
        "Calibrating LLMs with Semantic-level Reward",
        "Capability Conditioned Scaffolding for Professional Human LLM Collaboration",
        "DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory"
      ],
      "key_points": [
        "《CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency》〔评测 / 应用 / 方法〕：This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Mo…",
        "《FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models》〔评测 / 数据 / 方法〕：Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and pro…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 14,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency",
        "FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models",
        "MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models",
        "Large Language Models Could Be Rote Learners",
        "Look Before You Leap: Autonomous Exploration for LLM Agents",
        "DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation",
        "DiscussLLM: Teaching Large Language Models When to Speak",
        "ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models",
        "PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization",
        "SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation",
        "Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language",
        "Calibrating LLMs with Semantic-level Reward",
        "Capability Conditioned Scaffolding for Professional Human LLM Collaboration",
        "DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory"
      ],
      "key_points": [
        "《CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency》〔评测 / 应用 / 方法〕：This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Mo…",
        "《FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models》〔评测 / 数据 / 方法〕：Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and pro…"
      ]
    },
    {
      "name": "Evaluation",
      "paper_count": 1,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP"
      ],
      "key_points": [
        "《Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP》〔评测 / 方法〕：Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent se…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "《CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency》〔评测 / 应用 / 方法〕：This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Mo…",
        "《FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models》〔评测 / 数据 / 方法〕：Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and pro…",
        "《MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models》〔评测 / 应用 / 方法〕：Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and h…",
        "《Large Language Models Could Be Rote Learners》〔评测 / 数据 / 方法〕：Benchmark-based evaluation, e.g., multiple-choice questions (MCQs) and open-ended questions (OEQs), is widely used for evaluating Large Language Models (LLMs),…",
        "《Look Before You Leap: Autonomous Exploration for LLM Agents》〔评测 / 方法〕：Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring su…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency",
          "summary": "This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: \\emph{extreme time-sensitivity}, \\emph{a highly adversarial information environment}, and the critical need to synthesize data from \\emph{diverse, specialized sources}, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a \\textit{retrieval-prediction imbalance}, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.",
          "authors": [
            "Jiacheng Guo",
            "Suozhi Huang",
            "Zixin Yao",
            "Yifan Zhang",
            "Yifu Lu",
            "Jiashuo Liu",
            "Zihao Li",
            "Nicholas Deng",
            "Qixin Xiao",
            "Jia Tian",
            "Kanghong Zhan",
            "Tianyi Li",
            "Xiaochen Liu",
            "Jason Ge",
            "Chaoyang He",
            "Kaixuan Huang",
            "Lin Yang",
            "Wenhao Huang",
            "Mengdi Wang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2512.00417",
          "abstract_url": "https://arxiv.org/abs/2512.00417",
          "pdf_url": "https://arxiv.org/pdf/2512.00417",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2512.00417",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2512.00417"
          },
          "relevance_score": 236,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "title matched \"evaluation\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2512.00417"
        },
        {
          "title": "FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models",
          "summary": "Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.",
          "authors": [
            "Dmitry Stanishevskii",
            "Nini Kamkia",
            "Alexey Khoroshilov",
            "Dmitry Zmitrovich",
            "Denis Kokosinskii",
            "Zhirayr Hayrapetyan",
            "Andrei Kalmykov"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.15482",
          "abstract_url": "https://arxiv.org/abs/2605.15482",
          "pdf_url": "https://arxiv.org/pdf/2605.15482",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.15482",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.15482"
          },
          "relevance_score": 232,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"benchmark\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.15482"
        },
        {
          "title": "MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models",
          "summary": "Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.",
          "authors": [
            "Weixin Liu",
            "Congning Ni",
            "Shelagh A. Mulvaney",
            "Susannah L. Rose",
            "Murat Kantarcioglu",
            "Bradley A. Malin",
            "Zhijun Yin"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.15589",
          "abstract_url": "https://arxiv.org/abs/2605.15589",
          "pdf_url": "https://arxiv.org/pdf/2605.15589",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.15589",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.15589"
          },
          "relevance_score": 214,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"benchmark\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.15589"
        },
        {
          "title": "Large Language Models Could Be Rote Learners",
          "summary": "Benchmark-based evaluation, e.g., multiple-choice questions (MCQs) and open-ended questions (OEQs), is widely used for evaluating Large Language Models (LLMs), yet their reliability is undermined by benchmark contamination. When pre-exposed to the testing benchmark during training, less capable LLMs have been found to achieve inflated performance, thereby yielding erroneous results in LLM evaluation. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle and expose genuine capability acquisition from superficial memorization in LLM evaluation. Following this, firstly, by analyzing model performance under different memorization conditions of MCQs, we uncover a counterintuitive trend: LLMs perform worse on memorized benchmarks than on non-memorized ones, indicating the coexistence of two learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework that reformulates MCQs into an alternative knowledge-centric trinity format, reducing memorization while preserving inherent knowledge, enabling the evaluation of genuine capability in the presence of memorization. Extensive experiments validate the effectiveness and robustness of TrinEval in reformulating benchmarks, and the evaluation results further reveal that mainstream LLMs rely on rote memorization for an average of 19.6% of knowledge points across the MMLU and the GSM8K dataset.",
          "authors": [
            "Yuyang Xu",
            "Renjun Hu",
            "Haochao Ying",
            "Jian Wu",
            "Xing Shi",
            "Wei Lin"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2504.08300",
          "abstract_url": "https://arxiv.org/abs/2504.08300",
          "pdf_url": "https://arxiv.org/pdf/2504.08300",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2504.08300",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2504.08300"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2504.08300"
        },
        {
          "title": "Look Before You Leap: Autonomous Exploration for LLM Agents",
          "summary": "Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.",
          "authors": [
            "Ziang Ye",
            "Wentao Shi",
            "Yuxin Liu",
            "Yu Wang",
            "Zhengzhou Cai",
            "Yaorui Shi",
            "Qi Gu",
            "Xunliang Cai",
            "Fuli Feng"
          ],
          "categories": [
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.16143",
          "abstract_url": "https://arxiv.org/abs/2605.16143",
          "pdf_url": "https://arxiv.org/pdf/2605.16143",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.16143",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.16143"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.16143"
        },
        {
          "title": "Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP",
          "summary": "Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4$\\times$ worse mean return while using 1.8-2.7$\\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.",
          "authors": [
            "Igor Bogdanov",
            "Chung-Horng Lung",
            "Thomas Kunz",
            "Jie Gao",
            "Adrian Taylor",
            "Marzia Zaman"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.LG",
            "cs.MA",
            "cs.SY",
            "eess.SY"
          ],
          "paper_id": "https://arxiv.org/abs/2605.16205",
          "abstract_url": "https://arxiv.org/abs/2605.16205",
          "pdf_url": "https://arxiv.org/pdf/2605.16205",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2605.16205",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.16205"
          },
          "relevance_score": 178,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.16205"
        },
        {
          "title": "DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation",
          "summary": "Large language models (LLMs) have achieved unprecedented success due to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate such biases in LLMs, but these methods require additional training resources or domain knowledge to design the framework. Moreover, they may degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free and dynamic query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG leverages self-diagnosed bias contexts relevant to the query through regular retrieval, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are provided as additional fairness constraints for LLM outputs. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.",
          "authors": [
            "Rui Chu",
            "Bingyin Zhao",
            "Thanh Quoc Hung Le",
            "Duy Cao Hoang",
            "Huawei Lin",
            "Ping Li",
            "Weijie Zhao",
            "Khoa D Doan",
            "Yingjie Lao"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.16113",
          "abstract_url": "https://arxiv.org/abs/2605.16113",
          "pdf_url": "https://arxiv.org/pdf/2605.16113",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.16113",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.16113"
          },
          "relevance_score": 178,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"RAG\"",
            "summary matched \"LLM\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.16113"
        },
        {
          "title": "DiscussLLM: Teaching Large Language Models When to Speak",
          "summary": "Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an \"awareness gap,\" limiting their potential as truly collaborative partners in dynamic human discussions. We introduce $\\textit{DiscussLLM}$, a framework designed to bridge this gap by training models to proactively decide not just $\\textit{what}$ to say, but critically, $\\textit{when}$ to speak. Our primary contribution is a scalable two-stage data generation pipeline that synthesizes a large-scale dataset of realistic multi-turn human discussions. Each discussion is annotated with one of five intervention types (e.g., Factual Correction, Concept Definition) and contains an explicit conversational trigger where an AI intervention adds value. By training models to predict a special silent token when no intervention is needed, they learn to remain quiet until a helpful contribution can be made. We explore two architectural baselines: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. We evaluate these models on their ability to accurately time interventions and generate helpful responses, paving the way for more situationally aware and proactive conversational AI.",
          "authors": [
            "Deep Anil Patel",
            "Iain Melvin",
            "Christopher Malon",
            "Martin Renqiang Min"
          ],
          "categories": [
            "cs.CL",
            "cs.HC"
          ],
          "paper_id": "https://arxiv.org/abs/2508.18167",
          "abstract_url": "https://arxiv.org/abs/2508.18167",
          "pdf_url": "https://arxiv.org/pdf/2508.18167",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2508.18167",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2508.18167"
          },
          "relevance_score": 178,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"LLM\"",
            "summary matched \"agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2508.18167"
        },
        {
          "title": "ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models",
          "summary": "Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that ASRU significantly improves unlearning effectiveness (+24.6%) on average and generation quality (5.8x) on average while effectively preserving model utility, using only a small amount of retained supervision data.",
          "authors": [
            "Jiahui Guang",
            "Yingjie Zhu",
            "Cuiyun Gao",
            "Haiyan Wang",
            "Jing Li",
            "Di Shao",
            "Zhaoquan Gu"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.15687",
          "abstract_url": "https://arxiv.org/abs/2605.15687",
          "pdf_url": "https://arxiv.org/pdf/2605.15687",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.15687",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.15687"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.15687"
        },
        {
          "title": "PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization",
          "summary": "Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness or algorithmic problem solving, while realistic systems-level optimization is still underexplored. To address this gap, we introduce PerfCodeBench, an executable benchmark for evaluating LLMs on high-performance code optimization. The tasks require system-level implementation choices, hardware-aware optimization, and careful handling of performance bottlenecks. Each task includes executable correctness checks, a baseline implementation, and a reference optimized solution. This allows us to evaluate both correctness and runtime-oriented efficiency. Our evaluation on a broad set of state-of-the-art LLMs shows a clear gap between model-generated code and expert-optimized implementations. The gap is especially large on tasks involving parallelism and GPU operations. Current models also show weaknesses in cross-language robustness and in consistently reaching expert-level efficiency. These results suggest that performance-aware evaluation are still needed. LLMs should move beyond generating merely correct code toward producing efficient systems software. We submit the benchmark data, evaluation infrastructure, and complete logs of all LLMs-generated code at https://anonymous.4open.science/r/perfcodebench-7CDE.",
          "authors": [
            "Huihao Jing",
            "Wenbin Hu",
            "Haochen Shi",
            "Hanyu Yang",
            "Sirui Zhang",
            "Shaojin Chen",
            "Haoran Li",
            "Yangqiu Song"
          ],
          "categories": [
            "cs.SE",
            "cs.CL",
            "cs.PL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.15222",
          "abstract_url": "https://arxiv.org/abs/2605.15222",
          "pdf_url": "https://arxiv.org/pdf/2605.15222",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.15222",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.15222"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.15222"
        },
        {
          "title": "SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation",
          "summary": "Large Language Models (LLMs) have demonstrated strong capabilities across diverse NLP applications, such as translation, text generation, and question answering. Nevertheless, they remain limited in complex settings that demand deep reasoning and logical inference. Since these models are trained on large-scale text corpora, their generation process may still introduce irrelevant, noisy, or factually inconsistent content. To mitigate this problem, we introduce SGR, a stepwise framework that enhances LLM reasoning through external subgraph generation. SGR builds query-specific subgraphs from external knowledge bases and uses their semantic structure to support multi-step inference. By grounding intermediate reasoning steps in structured external knowledge, the framework helps the model concentrate on relevant entities, relations, and supporting evidence. In particular, SGR first constructs a subgraph tailored to the input question. It then guides the model to reason progressively over the generated structure and combines multiple reasoning trajectories to obtain the final prediction. Experimental results across several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, highlighting its value for improving both reasoning accuracy and factual reliability.",
          "authors": [
            "Xin Zhang",
            "Yang Cao",
            "Baoxing Wu",
            "Kai Song",
            "Siying Li"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.16117",
          "abstract_url": "https://arxiv.org/abs/2605.16117",
          "pdf_url": "https://arxiv.org/pdf/2605.16117",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.16117",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.16117"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.16117"
        },
        {
          "title": "Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language",
          "summary": "Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select an identical algorithm to Python 80% of the time, yet cannot translate it into a working PyLang implementation., and CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA > 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.",
          "authors": [
            "Vinayshekhar Bannihatti Kumar",
            "Disha Makhija",
            "Manoj Ghuhan Arivazhagan",
            "Rashmi Gangadharaiah"
          ],
          "categories": [
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "https://arxiv.org/abs/2605.15607",
          "abstract_url": "https://arxiv.org/abs/2605.15607",
          "pdf_url": "https://arxiv.org/pdf/2605.15607",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.15607",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.15607"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.15607"
        },
        {
          "title": "Calibrating LLMs with Semantic-level Reward",
          "summary": "As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with same semantic meaning. We propose \\textbf{Calibration with Semantic Reward (CSR)}, a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to $40\\%$ and improving AUROC by up to $31\\%$ over verbalized-confidence baselines, with calibration behavior generalizing robustly across all four evaluation settings.",
          "authors": [
            "Fengfei Yu",
            "Ruijia Niu",
            "Dongxia Wu",
            "Yian Ma",
            "Rose Yu"
          ],
          "categories": [
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "https://arxiv.org/abs/2605.15588",
          "abstract_url": "https://arxiv.org/abs/2605.15588",
          "pdf_url": "https://arxiv.org/pdf/2605.15588",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.15588",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.15588"
          },
          "relevance_score": 170,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.15588"
        },
        {
          "title": "Capability Conditioned Scaffolding for Professional Human LLM Collaboration",
          "summary": "Large language model personalization typically adapts outputs to user preferences and style but does not account for differences in user evaluation capacity across domains of expertise. This limitation can encourage Professional Domain Drift, where users rely on AI generated reasoning in domains they cannot reliably evaluate. We introduce Capability Conditioned Scaffolding, a typed framework that partitions expertise into strong, mixed, and weak domains and conditions intervention behavior on structured capability profiles. A pilot evaluation across multiple MMLU subsets and four LLM substrates shows consistent profile conditioned intervention behavior, including categorical inversion under profile swapping and selective activation in mixed domain risk zones. These findings suggest that capability aware scaffolding can support more reliable professional human AI collaboration beyond stylistic personalization.",
          "authors": [
            "Sen Yang",
            "Yinglei Ma"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.15404",
          "abstract_url": "https://arxiv.org/abs/2605.15404",
          "pdf_url": "https://arxiv.org/pdf/2605.15404",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.15404",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.15404"
          },
          "relevance_score": 170,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.15404"
        },
        {
          "title": "DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory",
          "summary": "Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity--efficiency trade-off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure needed for precise recall. We propose \\textbf{DimMem}, a lightweight dimensional memory framework that represents each memory as an atomic, typed, and self-contained unit with explicit fields such as time, location, reason, purpose, and keywords. This representation exposes the structure needed for dimension-aware retrieval, memory update, and selective assistant-context recall without storing full histories in the model context. Across LoCoMo-10 and LongMemEval-S, DimMem achieves \\textbf{81.43\\%} and \\textbf{78.20\\%} overall accuracy, respectively, outperforming existing lightweight memory systems while reducing LoCoMo per-query token cost by \\textbf{24\\%}. We further show that dimensional memory extraction is learnable by compact models: after fine-tuning on the DimMem schema, a Qwen3-4B extractor surpasses LightMem with GPT-4.1-mini on both benchmarks and reaches performance comparable to, or better than, much larger extractors in key settings. These results suggest that explicit dimensional structuring is an effective and efficient foundation for long-term memory in LLM agents. Code is available at https://github.com/ChowRunFa/DimMem.",
          "authors": [
            "Wentao Qiu",
            "Haotian Hu",
            "Fanyi Wang",
            "Jinwei Kong",
            "Yu Zhang"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.15759",
          "abstract_url": "https://arxiv.org/abs/2605.15759",
          "pdf_url": "https://arxiv.org/pdf/2605.15759",
          "published_at": "2026-05-18T04:00:00+00:00",
          "updated_at": "2026-05-18T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.15759",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.15759"
          },
          "relevance_score": 170,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.15759"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [],
      "sort_by": "hybrid",
      "papers": []
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [],
      "sort_by": "hybrid",
      "papers": []
    }
  ]
}