{
  "generated_at": "2026-05-22T13:08:19.299880+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "Topic \"LLM\": 17 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents\" and \"ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning\".",
    "Topic \"Benchmark\": 14 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents\" and \"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety\".",
    "Topic \"Agent\": 8 papers matched across LM, Agent Runtime Security, and other feeds; representative papers include \"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety\" and \"Planning in the LLM Era: Building for Reliability and Efficiency\".",
    "Topic \"Evaluation\": 2 papers matched across Agent Runtime Security and Terminal and SWE Agents; representative papers include \"Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents\" and \"Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study\".",
    "Topic \"Language Model\": 1 paper matched in LM; the representative paper is \"From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment\"."
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 17,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents",
        "ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning",
        "LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance",
        "From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment",
        "Understanding Data Temporality Impact on Large Language Models Pre-training",
        "A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering",
        "LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning",
        "Planning in the LLM Era: Building for Reliability and Efficiency",
        "When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering",
        "Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency",
        "Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles",
        "Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation",
        "Efficient Agentic Reasoning Through Self-Regulated Simulative Planning",
        "From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning",
        "DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback",
        "HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools",
        "\"Refactoring Runaway\": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution"
      ],
      "key_points": [
        "\"Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents\" [Evaluation / Method]: Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challe…",
        "\"ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning\" [Evaluation / Application / Method]: Biomedical knowledge graphs (KGs) treat disease associations as static facts, but temporal information is crucial for clinical reasoning, e.g., a symptom diagn…"
      ]
    },
    {
      "name": "Benchmark",
      "paper_count": 14,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents",
        "Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety",
        "ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning",
        "LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance",
        "Understanding Data Temporality Impact on Large Language Models Pre-training",
        "A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering",
        "LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning",
        "When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering",
        "Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency",
        "Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles",
        "Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation",
        "From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning",
        "DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback",
        "TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks"
      ],
      "key_points": [
        "\"Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents\" [Evaluation / Method]: Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challe…",
        "\"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety\" [Evaluation / Application / Method]: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harm…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 8,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety",
        "Planning in the LLM Era: Building for Reliability and Efficiency",
        "Efficient Agentic Reasoning Through Self-Regulated Simulative Planning",
        "HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools",
        "Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents",
        "\"Refactoring Runaway\": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution",
        "TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks",
        "Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study"
      ],
      "key_points": [
        "\"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety\" [Evaluation / Application / Method]: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harm…",
        "\"Planning in the LLM Era: Building for Reliability and Efficiency\" [Application / Method]: Growing attention to intelligent agents has put a spotlight on one of their central capabilities: planning. Early attempts to leverage large language models (L…"
      ]
    },
    {
      "name": "Evaluation",
      "paper_count": 2,
      "feed_names": [
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents",
        "Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study"
      ],
      "key_points": [
        "\"Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents\" [Evaluation / Application / Method]: Skills are increasingly used to package agent instructions, workflows, scripts, and reference materials. In enterprise settings, however, skills often need to…",
        "\"Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study\" [Evaluation / Application / Method]: AI coding agents increasingly submit pull requests (Agentic-PRs) to open-source repositories, yet their performance is commonly assessed using merge and reject…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 1,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment"
      ],
      "key_points": [
        "\"From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment\" [Evaluation / Method]: Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "\"Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents\" [Evaluation / Method]: Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challe…",
        "\"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety\" [Evaluation / Application / Method]: Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harm…",
        "\"ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning\" [Evaluation / Application / Method]: Biomedical knowledge graphs (KGs) treat disease associations as static facts, but temporal information is crucial for clinical reasoning, e.g., a symptom diagn…",
        "\"LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance\" [Evaluation / Method]: Reinforcement learning has proven effective for enhancing multi-step reasoning in large language models (LLMs), yet its benefits have not fully translated to m…",
        "\"From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment\" [Evaluation / Method]: Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents",
          "summary": "Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilities or imposing static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.",
          "authors": [
            "Asaf Yehudai",
            "Lilach Eden",
            "Michal Shmueli-Scheuer"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22608",
          "abstract_url": "https://arxiv.org/abs/2605.22608",
          "pdf_url": "https://arxiv.org/pdf/2605.22608",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22608",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22608"
          },
          "relevance_score": 196,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "title matched \"evaluation\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22608"
        },
        {
          "title": "Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety",
          "summary": "Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, and evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Recent developments have seen the rise of benchmarks that evaluate large language models as agents. We contribute to this strand of research. Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful multi-turn evaluation: chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe. Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and EU AI Act's Code of Practice on General-Purpose AI (GPAI). Results. Across a nine-model panel, aggregate strict attack success rate (ASR) is 44.4%. Model-level ASR ranges from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with Seed 2.0 Lite also above 80%. Average chain category-level ASR reaches 93.3% for Code of Practice loss-of-control scenarios.",
          "authors": [
            "Piercosma Bisconti",
            "Matteo Prandi",
            "Federico Pierucci",
            "Federico Sartore",
            "Enrico Panai",
            "Laura Caroli",
            "Yue Zhu",
            "Adam Leon Smith",
            "Luca Nannini",
            "Marcello Galisai",
            "Susanna Cifani",
            "Francesco Giarrusso",
            "Marcantonio Bracale Syrnikov",
            "Daniele Nardi"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22643",
          "abstract_url": "https://arxiv.org/abs/2605.22643",
          "pdf_url": "https://arxiv.org/pdf/2605.22643",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.22643",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22643"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22643"
        },
        {
          "title": "ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning",
          "summary": "Biomedical knowledge graphs (KGs) treat disease associations as static facts, but temporal information is crucial for clinical reasoning, e.g., a symptom diagnostic of one disease at age 3 may imply a different disease at age 13. Existing KGs such as PrimeKG, Hetionet, and iKraph do not encode when a finding becomes clinically relevant over the course of a disease. This limits their usefulness for longitudinal clinical reasoning and retrieval augmentation. We introduce ChronoMedKG, a temporal biomedical knowledge graph that contains 460,497 evidence-linked triples (filtered from 13M raw extractions) covering 13,431 diseases. Each association is tied to temporal components like onset window or progression stage, which are backed by PMID-traceable evidence and a multi-signal credibility score. The graph is constructed through a disease-autonomous multi-agent pipeline in which multiple frontier LLMs independently extract knowledge from PubMed and PMC literature. Only those relations are kept that are supported by multi-model consensus, survive credibility filtering, as well as ontology alignment. ChronoMedKG scored 92.7% agreement against Orphadata and adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet-coded rare diseases. We further introduce ChronoTQA, a benchmark of 3,341 questions across eight task types (six temporal plus two static controls), with a 12-question supplementary probe. Frontier LLMs lose roughly 30 points moving from static to temporal questions; ChronoMedKG retrieval rescues 47-65% of their long-tail failures, against 17-29% for HPOA-RAG. As such, ChronoMedKG provides a crucial temporal axis for retrieval-augmented clinical systems that was previously absent.",
          "authors": [
            "Md Shamim Ahmed",
            "Farzaneh Firoozbakht",
            "Lukas Galke Poech",
            "Jan Baumbach",
            "Richard Röttger"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22734",
          "abstract_url": "https://arxiv.org/abs/2605.22734",
          "pdf_url": "https://arxiv.org/pdf/2605.22734",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22734",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22734"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"benchmark\"",
            "summary matched \"LLM\"",
            "summary matched \"agent\"",
            "summary matched \"alignment\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22734"
        },
        {
          "title": "LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance",
          "summary": "Reinforcement learning has proven effective for enhancing multi-step reasoning in large language models (LLMs), yet its benefits have not fully translated to multilingual contexts. Existing methods struggle with a fundamental trade-off: prioritizing input-language consistency severely hampers reasoning quality, while prioritizing reasoning often leads to unintended language drift toward English. We address this challenge with LANG, a novel framework that leverages language-conditioned hints to guide exploration in non-English reasoning tasks. Our method incorporates two key mechanisms to prevent dependency on these hints: a progressive decay schedule that gradually withdraws scaffolding, and a language-adaptive switch that tailors learning horizons to specific language difficulties. Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency. Moreover, we show that our framework generalizes beyond mathematics, fostering more consistent language alignment across model layers.",
          "authors": [
            "Yuchun Fan",
            "Bei Li",
            "Peiguang Li",
            "Yilin Wang",
            "Yongyu Mu",
            "Jian Yang",
            "Xin Chen",
            "Rongxiang Weng",
            "Jingang Wang",
            "Xunliang Cai",
            "Jingbo Zhu",
            "Tong Xiao"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22567",
          "abstract_url": "https://arxiv.org/abs/2605.22567",
          "pdf_url": "https://arxiv.org/pdf/2605.22567",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22567",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22567"
          },
          "relevance_score": 188,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"alignment\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22567"
        },
        {
          "title": "From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment",
          "summary": "Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated processes, our empirical analysis suggests they may be intrinsically coupled. We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating merely 10% of attention heads on 10% of the data, P2D achieves an 8.3 pp performance gain over strong baselines and delivers a 7.0x end-to-end time speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.",
          "authors": [
            "Hao Chen",
            "Qi Zhang",
            "Liyao Li",
            "Zhanming Shen",
            "Wentao Ye",
            "Lirong Gao",
            "Ningtao Wang",
            "Xing Fu",
            "Xiaoyu Shen",
            "Junbo Zhao"
          ],
          "categories": [
            "cs.LG",
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.21558",
          "abstract_url": "https://arxiv.org/abs/2605.21558",
          "pdf_url": "https://arxiv.org/pdf/2605.21558",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.21558",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.21558"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"alignment\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.21558"
        },
        {
          "title": "Understanding Data Temporality Impact on Large Language Models Pre-training",
          "summary": "Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pretrain 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, while shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our code at https://github.com/kyutai-labs/kairos , checkpoints, and datasets at https://huggingface.co/collections/kyutai/kairos provide a foundation for future research on continual learning for LLMs.",
          "authors": [
            "Pilchen Hippolyte",
            "Fabre Romain",
            "Signe Talla Franck",
            "Perez Patrick",
            "Grave Edouard"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22769",
          "abstract_url": "https://arxiv.org/abs/2605.22769",
          "pdf_url": "https://arxiv.org/pdf/2605.22769",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Data",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22769",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22769"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22769"
        },
        {
          "title": "A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering",
          "summary": "Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.",
          "authors": [
            "Sereiwathna Ros",
            "Phannet Pov",
            "Ratanaktepi Chhor",
            "Kimleang Ly",
            "Wan-Sup Cho",
            "Saksonita Khoeurn"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22099",
          "abstract_url": "https://arxiv.org/abs/2605.22099",
          "pdf_url": "https://arxiv.org/pdf/2605.22099",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22099",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22099"
          },
          "relevance_score": 170,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22099"
        },
        {
          "title": "LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning",
          "summary": "Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose \\textbf{LatentOmni}, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct \\textbf{LatentOmni-Instruct-35K}, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.",
          "authors": [
            "Yifan Dai",
            "Zhenhua Wu",
            "Bohan Zeng",
            "Daili Hua",
            "Jialing Liu",
            "Bozhou Li",
            "Yuran Wang",
            "Chengzhuo Tong",
            "Hao Liang",
            "Xiaochen Ma",
            "Junbo Niu",
            "Tianyu Guo",
            "Yang Shi",
            "Yue Ding",
            "Yiyan Ji",
            "Bingyin Mei",
            "Yushuo Guan",
            "Yuanxing Zhang",
            "Pengfei Wan",
            "Fangcheng Fu",
            "Wentao Zhang"
          ],
          "categories": [
            "cs.CL",
            "cs.CV"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22012",
          "abstract_url": "https://arxiv.org/abs/2605.22012",
          "pdf_url": "https://arxiv.org/pdf/2605.22012",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22012",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22012"
          },
          "relevance_score": 170,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22012"
        },
        {
          "title": "Planning in the LLM Era: Building for Reliability and Efficiency",
          "summary": "Growing attention to intelligent agents has put a spotlight on one of their central capabilities: planning. Early attempts to leverage large language models (LLMs) for planning relied on single-shot plan generation, followed by hybrid approaches that coupled LLMs with limited external search. These methods, unsound and incomplete by their very nature, often require substantial resources without yielding better solutions on unseen problems. As the limitations of LLMs become clearer, recent work has shifted toward using them at solution construction time -- generating symbolic solvers for a family of problems that can be verified and then used efficiently at inference time. This trend reflects the growing need for agents that are both reliable and resource-efficient. It also offers a path towards generating maintainable planners with minimal dependence on language models at inference time. In this paper, we argue that this shift reflects a broader realignment of the planning field in the LLM era. We examine three major categories of planner-generation methods, discuss their current limitations, and outline research steps towards a more reliable and efficient LLM-based generation of planners.",
          "authors": [
            "Michael Katz",
            "Harsha Kokel",
            "Kavitha Srinivas",
            "Shirin Sohrabi"
          ],
          "categories": [
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.21902",
          "abstract_url": "https://arxiv.org/abs/2605.21902",
          "pdf_url": "https://arxiv.org/pdf/2605.21902",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.21902",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.21902"
          },
          "relevance_score": 170,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"agent\"",
            "summary matched \"alignment\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.21902"
        },
        {
          "title": "When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering",
          "summary": "Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify best studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care not covered by guidelines. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. Current evaluations test models primarily on recalling and reasoning with this memorized content, often in multiple-choice settings. Given the fundamental importance of evidence-based reasoning in medicine, it is neither feasible nor reliable to depend on memorization in practice. To address this gap, we introduce OGCaReBench, a free-form retrieval-focused benchmark aimed at evaluating LLMs at answering clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. Our experiments reveal that even the best-performing baseline (GPT-5.2) correctly answers only 56% of our benchmark with specialized models only reaching 42%. Augmenting models with retrieved medical articles improves this performance to up to 82% (using GPT-5.2) highlighting the importance of evidence-grounding for real-world medical reasoning tasks. This work thus establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs to produce reliable answers in challenging clinical contexts.",
          "authors": [
            "Doeun Lee",
            "Muge Zhang",
            "Yi Yu",
            "Ashish Manne",
            "Stephen Koesters",
            "Frank Wen",
            "Brady Buchanan",
            "Lynda Villagomez",
            "Oluwatoba Moninuola",
            "James Lim",
            "Kathryn Tobin",
            "Andrew Srisuwananukorn",
            "Ping Zhang",
            "Sachin Kumar"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.21807",
          "abstract_url": "https://arxiv.org/abs/2605.21807",
          "pdf_url": "https://arxiv.org/pdf/2605.21807",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.21807",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.21807"
          },
          "relevance_score": 170,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.21807"
        },
        {
          "title": "Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency",
          "summary": "Although Large Language Models (LLMs) demonstrate strong capabilities across various tasks, they exhibit significant performance discrepancies across languages. While prompting LLMs in English typically yields the highest general performance, it often induces a Western-centric bias, hindering the model's ability to accurately reflect diverse cultural knowledge. We hypothesize that LLMs already possess rich cultural knowledge embedded within local-language representations, but fail to retrieve it when prompted in English. To bridge this cross-lingual knowledge gap, we propose a novel self-supervised framework. Our method leverages multilingual self-consistency to identify the most reliable cultural responses across languages, combined with a self-critique mechanism to transfer this knowledge to the weaker language. Evaluations on the BLEnD benchmark demonstrate that our approach significantly improves cultural alignment-boosting performance on English queries by an average of 5.03%-relying entirely on self-generated data. Ultimately, our work demonstrates that latent cultural knowledge can be successfully surfaced and propagated across languages, enabling more culturally equitable and consistent LLMs.",
          "authors": [
            "Andrew Ivan Soegeng",
            "Patrick Sutanto",
            "Tan Sang Nguyen"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22137",
          "abstract_url": "https://arxiv.org/abs/2605.22137",
          "pdf_url": "https://arxiv.org/pdf/2605.22137",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22137",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22137"
          },
          "relevance_score": 166,
          "match_reasons": [
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"alignment\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22137"
        },
        {
          "title": "Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles",
          "summary": "The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.",
          "authors": [
            "Jinyang Wu",
            "Guocheng Zhai",
            "Ruihan Jin",
            "Yuhao Shen",
            "Zhengxi Lu",
            "Fan Zhang",
            "Haoran Luo",
            "Zheng Lian",
            "Zhengqi Wen",
            "Jianhua Tao"
          ],
          "categories": [
            "cs.LG",
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22177",
          "abstract_url": "https://arxiv.org/abs/2605.22177",
          "pdf_url": "https://arxiv.org/pdf/2605.22177",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22177",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22177"
          },
          "relevance_score": 166,
          "match_reasons": [
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22177"
        },
        {
          "title": "Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation",
          "summary": "Recent advances in Multilingual Large Language Models (MLLMs) have significantly enhanced cross-lingual conversational capabilities, yet modeling culturally nuanced and context-dependent communication remains a critical bottleneck. Specifically, existing state-of-the-art models exhibit a severe pragmatic gap when handling structural variations, regional idioms, and honorific consistencies in low-resource contexts like Bangla. To address this limitation, we introduce a novel, culturally aligned instruction-tuning dataset for \\textbf{BangLa Application and DialoguE generation - BLADE} and benchmarking framework comprising $4,196$ meticulously curated interaction pairs. We leverage this resource to systematically fine-tune and evaluate leading open-weight architectures, including DeepSeek-8B and LLaMA-3.2-3B, utilizing parameter-efficient fine-tuning via LoRA adapters in a 4-bit NormalFloat (NF4) quantization framework. Our empirical evaluations demonstrate that models fine-tuned on our dataset yield substantial improvements in structural fidelity and honorific alignment, providing a rigorous benchmark for bridging pragmatic disparities in low-resource multilingual text generation. Code and dataset: https://github.com/ashuvo25/Bangla_Application_LLM/tree/main",
          "authors": [
            "Md. Asaduzzaman Shuvo",
            "Mahedi Hasan",
            "Md. Tashin Parvez",
            "Azizul Haque Noman",
            "Md. Shafayet Hossain Ovi"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22487",
          "abstract_url": "https://arxiv.org/abs/2605.22487",
          "pdf_url": "https://arxiv.org/pdf/2605.22487",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22487",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22487"
          },
          "relevance_score": 166,
          "match_reasons": [
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"alignment\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22487"
        },
        {
          "title": "Efficient Agentic Reasoning Through Self-Regulated Simulative Planning",
          "summary": "How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.",
          "authors": [
            "Mingkai Deng",
            "Jinyu Hou",
            "Lara S\\'a Neves",
            "Varad Pimpalkhute",
            "Taylor W. Killian",
            "Zhengzhong Liu",
            "Eric P. Xing"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.LG",
            "cs.RO"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22138",
          "abstract_url": "https://arxiv.org/abs/2605.22138",
          "pdf_url": "https://arxiv.org/pdf/2605.22138",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.22138",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22138"
          },
          "relevance_score": 156,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22138"
        },
        {
          "title": "From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning",
          "summary": "Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.",
          "authors": [
            "Xitai Jiang",
            "Zihan Tang",
            "Wenze Lin",
            "Yang Yue",
            "Shenzhi Wang",
            "Gao Huang"
          ],
          "categories": [
            "cs.LG",
            "cs.AI",
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.22074",
          "abstract_url": "https://arxiv.org/abs/2605.22074",
          "pdf_url": "https://arxiv.org/pdf/2605.22074",
          "published_at": "2026-05-22T04:00:00+00:00",
          "updated_at": "2026-05-22T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22074",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22074"
          },
          "relevance_score": 156,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22074"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [
        "《DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback》〔评测 / 应用 / 方法〕：LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollba…",
        "《HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools》〔数据 / 应用 / 方法〕：Every Python function deployed as an LLM tool must today exist in two forms: an HTTP endpoint for human-facing clients and CI pipelines, and an MCP tool regist…",
        "《Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents》〔评测 / 应用 / 方法〕：Skills are increasingly used to package agent instructions, workflows, scripts, and reference materials. In enterprise settings, however, skills often need to…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback",
          "summary": "LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory, contexts, etc.). Existing mechanisms duplicate the entire state, causing hundreds of milliseconds to seconds of latency per C/R, which severely bottlenecks deep search and large-scale fan-outs. This paper observes that subsequent checkpoints in AI agents are highly similar. Therefore, instead of full duplication, a sandbox should only duplicate the changes between consecutive checkpoints (Key Insight). However, it is non-trivial to realize the idea, mainly due to the missing OS supports. This paper proposes a new OS-level abstraction, DeltaState, to enable the change-based transactional C/R for AI agents with two co-designed OS mechanisms. First, DeltaFS enables change-based filesystem C/R by organizing the file states into layers and dynamically freezing the writable layer and inserting a new one during checkpoint, reducing file updates to copy-on-write, and making rollback a simple layer switch. Second, DeltaCR enables change-based process state C/R using incremental dumps, and accelerates rollback by bypassing traditional pipelines to directly fork() from a frozen template process. We then present DeltaBox, a novel agent sandbox achieving millisecond level C/R through the two new mechanisms. Evaluations on SWE-bench and RL micro-benchmarks show DeltaBox completes checkpoint and rollback in millisecond-level latency (14ms and 5ms, respectively), empowering agents to explore substantially more nodes under fixed time budgets.",
          "authors": [
            "Yunpeng Dong",
            "Jingkai He",
            "Yuze Hou",
            "Dong Du",
            "Zhonghu Xu",
            "Si Yu",
            "Yubin Xia",
            "Haibo Chen"
          ],
          "categories": [
            "cs.OS",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.22781v1",
          "abstract_url": "https://arxiv.org/abs/2605.22781v1",
          "pdf_url": "https://arxiv.org/pdf/2605.22781v1",
          "published_at": "2026-05-21T17:36:17+00:00",
          "updated_at": "2026-05-21T17:36:17+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.22781",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22781v1"
          },
          "relevance_score": 48,
          "match_reasons": [
            "summary matched \"agent sandbox\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "summary matched \"SWE-bench\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22781"
        },
        {
          "title": "HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools",
          "summary": "Every Python function deployed as an LLM tool must today exist in two forms: an HTTP endpoint for human-facing clients and CI pipelines, and an MCP tool registration for agent runtimes such as Claude and Cursor. These representations share business logic yet diverge in all the surrounding machinery (routing, validation, serialisation, streaming, and schema maintenance), and they drift apart as the underlying code evolves. We present HarnessAPI, a Python framework that eliminates this duplication by treating a typed skill folder as the single source of truth. From one handler.py plus Pydantic schemas, the framework automatically derives a streaming HTTP endpoint with Server-Sent Events, an interactive OpenAPI/Swagger UI, and a zero-configuration MCP tool, all served from a single process. Dual-mode content negotiation lets the same handler serve SSE-streaming and JSON-returning clients with no handler changes. A dynamic code-generation mechanism ensures Pydantic type annotations propagate correctly to FastMCP's inspection layer, resolving a technical limitation that prevents naive closure-based registration. Measured across six representative skills using cloc, HarnessAPI reduces framework-facing boilerplate by 74% compared with a manually maintained dual-stack implementation (FastAPI server + FastMCP server). HarnessAPI subclasses FastAPI, inheriting its full middleware, dependency-injection, and deployment ecosystem. It is available at https://github.com/edwinjosechittilappilly/harnessapi and on PyPI (pip install harnessapi)",
          "authors": [
            "Edwin Jose"
          ],
          "categories": [
            "cs.AI",
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2605.22733v1",
          "abstract_url": "https://arxiv.org/abs/2605.22733v1",
          "pdf_url": "https://arxiv.org/pdf/2605.22733v1",
          "published_at": "2026-05-21T17:03:44+00:00",
          "updated_at": "2026-05-21T17:03:44+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Data",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.22733",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22733v1"
          },
          "relevance_score": 47,
          "match_reasons": [
            "summary matched \"agent runtime\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22733"
        },
        {
          "title": "Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents",
          "summary": "Skills are increasingly used to package agent instructions, workflows, scripts, and reference materials. In enterprise settings, however, skills often need to express more than task guidance: they must make goals, input boundaries, permissions, evidence requirements, output contracts, quality criteria, verification steps, human approval points, and handoff rules inspectable. This paper proposes contractual skills, a GovernSpec-inspired design framework for organizing SKILL.md files as readable task contracts while preserving lightweight skill discovery and progressive loading. The framework clarifies the boundary between contractual skills, GovernSpec YAML contracts, Model Context Protocol surfaces, tool adapters, runtime guardrails, tracing, and evaluation systems. We evaluate the framework with two offline experiments. A text-generation study covers three enterprise skills, fifteen synthetic tasks, four instruction conditions, and eight generation models, yielding 960 outputs and 1680 cross-judge score records. Contractual skills outperform no-skill and minimal-skill baselines on all tested models. Relative to information-rich plain expanded skills, the gains are small and mixed, suggesting that contractual fields mainly improve checkability and maintainability rather than raw generation quality. A tool-calling challenge covers eight models and 192 simulated tool-call records. Skills usually reduce high-risk tool attempts, but model differences remain and runtime tool guardrails are still required. The results suggest that contractual skills are best understood as a governance layer that makes task intent, boundaries, and acceptance criteria explicit, not as a standalone safety mechanism.",
          "authors": [
            "Ting Liu"
          ],
          "categories": [
            "cs.SE",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.22634v1",
          "abstract_url": "https://arxiv.org/abs/2605.22634v1",
          "pdf_url": "https://arxiv.org/pdf/2605.22634v1",
          "published_at": "2026-05-21T15:40:05+00:00",
          "updated_at": "2026-05-21T15:40:05+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "Agent",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2605.22634",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22634v1"
          },
          "relevance_score": 46,
          "match_reasons": [
            "summary matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22634"
        }
      ]
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [
        "\"Refactoring Runaway\": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution [Method]: Recent advances in coding agents have shown remarkable progress in software issue resolution. In practice, real-world issues are typically bug fixes or feature…",
        "TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks [Evaluation / Application / Method]: We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from \"in-the-wild\" terminal recordings.…",
        "Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study [Evaluation / Application / Method]: AI coding agents increasingly submit pull requests (Agentic-PRs) to open-source repositories, yet their performance is commonly assessed using merge and reject…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "\"Refactoring Runaway\": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution",
          "summary": "Recent advances in coding agents have shown remarkable progress in software issue resolution. In practice, real-world issues are typically bug fixes or feature requests in which human developers naturally incorporate refactoring as part of the resolution process, resulting in tangled refactoring. Since LLMs are trained on large-scale open-source repositories, coding agents may inherit such behaviors. In this paper, we conduct an empirical study on Multi-SWE-bench, analyzing 3,691 valid patches generated by three agent frameworks with 12 LLMs. We find that coding agents introduce tangled refactorings less frequently (21.43% vs. 36.72%) and with lower intensity (0.66 vs. 1.75) than human developers, although they exhibit a broader diversity of refactoring types. Logistic regression analysis further shows that tangled refactorings are strongly associated with reduced compilability, while exhibiting no significant association with functional correctness. Based on these findings, we propose a refactoring-aware refinement approach that assesses the necessity and safety of tangled refactorings and selectively removes or repairs problematic operations. Our approach improves compilability from 19.34% to 38.33%, and additionally resolves 2.79% previously unresolved issues. Overall, this work presents the first step towards understanding tangled refactoring practices in agentic issue resolution and opens up avenues for future work.",
          "authors": [
            "Zhao Tian",
            "Zifan Zhang",
            "Tao Xiao",
            "Dong Wang",
            "Masanari Kondo",
            "Junjie Chen",
            "Yasutaka Kamei"
          ],
          "categories": [
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2605.22526v1",
          "abstract_url": "https://arxiv.org/abs/2605.22526v1",
          "pdf_url": "https://arxiv.org/pdf/2605.22526v1",
          "published_at": "2026-05-21T14:18:29+00:00",
          "updated_at": "2026-05-21T14:18:29+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Method"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.22526",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22526v1"
          },
          "relevance_score": 125,
          "match_reasons": [
            "title matched \"coding agent\"",
            "title matched \"issue resolution\"",
            "summary matched \"SWE-bench\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22526"
        },
        {
          "title": "TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks",
          "summary": "We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from \"in-the-wild\" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.",
          "authors": [
            "Zhaoyang Chu",
            "Jiarui Hu",
            "Xingyu Jiang",
            "Pengyu Zou",
            "Han Li",
            "Chao Peng",
            "Peter O'Hearn",
            "Earl T. Barr",
            "Mark Harman",
            "Federica Sarro",
            "He Ye"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.22535v1",
          "abstract_url": "https://arxiv.org/abs/2605.22535v1",
          "pdf_url": "https://arxiv.org/pdf/2605.22535v1",
          "published_at": "2026-05-21T14:24:43+00:00",
          "updated_at": "2026-05-21T14:24:43+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "Benchmark",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.22535",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22535v1"
          },
          "relevance_score": 45,
          "match_reasons": [
            "summary matched \"Terminal-Bench\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22535"
        },
        {
          "title": "Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study",
          "summary": "AI coding agents increasingly submit pull requests (Agentic-PRs) to open-source repositories, yet their performance is commonly assessed using merge and rejection outcomes alone. We hypothesized that these outcome labels do not reliably reflect agent capability without considering review interactions. To test this, we conducted a decision-oriented analysis of 11,048 closed Agentic Pull Requests, refined to 9,799 human-reviewed PRs, and manually inspected 717 representative cases to recover decision rationale from interaction artifacts. We found that rejection outcomes substantially overstate agent error: only 35.7% of rejected PRs reflected clear agentic failures, while 31.2% were driven by workflow constraints and 33.1% lacked observable decision rationale. Among merged PRs, 15.4% required explicit reviewer involvement through feedback or direct commits, and 5.5% showed no visible interaction trace. We further observed systematic differences across agents, with Copilot and Devin more often embedded in reviewer-mediated workflows, while Codex and Cursor PRs were typically merged with minimal interaction. These results reject the assumption that PR outcomes alone capture agent performance and demonstrate the need for interaction-aware evaluation grounded in review behavior.",
          "authors": [
            "Sien Reeve O. Peralta",
            "Fumika Hoshi",
            "Hironori Washizaki",
            "Naoyasu Ubayashi",
            "Inase Kondo",
            "Yoshiki Higo",
            "Hiroki Mukai",
            "Norihiro Yoshida",
            "Kazuki Kusama",
            "Hidetake Tanaka",
            "Youmei Fan"
          ],
          "categories": [
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2605.22534v1",
          "abstract_url": "https://arxiv.org/abs/2605.22534v1",
          "pdf_url": "https://arxiv.org/pdf/2605.22534v1",
          "published_at": "2026-05-21T14:24:20+00:00",
          "updated_at": "2026-05-21T14:24:20+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "Agent",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2605.22534",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.22534v1"
          },
          "relevance_score": 45,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.22534"
        }
      ]
    }
  ]
}