{
  "generated_at": "2026-05-19T13:08:04.119878+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "主题「LLM」：命中 19 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark》、《SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science》。",
    "主题「Agent」：命中 13 篇，覆盖 LM、Agent Runtime Security 等，代表论文包括 《CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark》、《MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion》。",
    "主题「Language Model」：命中 7 篇，覆盖 LM、Agent Runtime Security，代表论文包括 《SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science》、《Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning》。",
    "主题「Benchmark」：命中 2 篇，覆盖 LM、Agent Runtime Security，代表论文包括 《Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models》、《Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks》。",
    "主题「RAG」：命中 2 篇，覆盖 LM、Agent Runtime Security，代表论文包括 《Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning》、《Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models》。"
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 19,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark",
        "SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science",
        "MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion",
        "SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents",
        "LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems",
        "Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning",
        "STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics",
        "Estimating Item Difficulty with Large Language Models as Experts",
        "Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation",
        "What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models",
        "Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches",
        "Latent Action Reparameterization for Efficient Agent Inference",
        "Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models",
        "AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment",
        "An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments",
        "Multilingual jailbreaking of LLMs using low-resource languages",
        "Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents",
        "SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution",
        "Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents"
      ],
      "key_points": [
        "《CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark》〔评测 / 数据 / 方法〕：Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility,…",
        "《SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science》〔评测 / 应用 / 方法〕：Large Language Models (LLMs) are increasingly deployed as scientific AI as- sistants, and a growing body of benchmarks evaluates their capabilities across know…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 13,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark",
        "MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion",
        "SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents",
        "LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems",
        "STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics",
        "Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation",
        "Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches",
        "Latent Action Reparameterization for Efficient Agent Inference",
        "An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments",
        "Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks",
        "Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents",
        "SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution",
        "Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents"
      ],
      "key_points": [
        "《CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark》〔评测 / 数据 / 方法〕：Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility,…",
        "《MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion》〔评测 / 方法〕：Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In co…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 7,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science",
        "Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning",
        "Estimating Item Difficulty with Large Language Models as Experts",
        "What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models",
        "AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment",
        "Multilingual jailbreaking of LLMs using low-resource languages",
        "Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models"
      ],
      "key_points": [
        "《SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science》〔评测 / 应用 / 方法〕：Large Language Models (LLMs) are increasingly deployed as scientific AI as- sistants, and a growing body of benchmarks evaluates their capabilities across know…",
        "《Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning》〔评测 / 方法〕：Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically tightl…"
      ]
    },
    {
      "name": "Benchmark",
      "paper_count": 2,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models",
        "Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks"
      ],
      "key_points": [
        "《Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models》〔评测 / 数据 / 方法〕：Machine Translation (MT) for Ancient Greek (AG) to Modern Greek (MG) is a low-resource task, constrained by the lack of large-scale, high-quality parallel data…",
        "《Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks》〔评测 / 数据 / 方法〕：Coding agents now run autonomously with shell, file, and network privileges. When a user issues a benign request, the agent sometimes does more than asked: it…"
      ]
    },
    {
      "name": "RAG",
      "paper_count": 2,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning",
        "Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models"
      ],
      "key_points": [
        "《Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning》〔评测 / 应用 / 方法〕：Cross-domain knowledge alignment is essential for integrating heterogeneous medical systems, yet existing approaches typically treat entity alignment as a stat…",
        "《Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models》〔数据 / 方法〕：The integration of audio modality into Large Audio Language Models (LALMs) significantly expands their attack surface. Existing jailbreak paradigms predominant…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "《CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark》〔评测 / 数据 / 方法〕：Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility,…",
        "《SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science》〔评测 / 应用 / 方法〕：Large Language Models (LLMs) are increasingly deployed as scientific AI as- sistants, and a growing body of benchmarks evaluates their capabilities across know…",
        "《MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion》〔评测 / 方法〕：Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In co…",
        "《SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents》〔评测 / 应用 / 方法〕：As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can…",
        "《LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems》〔评测 / 方法〕：Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark",
          "summary": "Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we thoroughly develop CrossView Suite across three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we meticulously create a scene-disjoint CrossViewBench to comprehensively assess the cross-view spatial understanding capability of an MLLM, evaluating it across various aspects. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception -> Alignment -> Reasoning paradigm. Our method equips an adaptive spatial region tokenizer to capture fine-grained object representations, and then aligns the multi-view objects explicitly, and thus fuses aligned features for boosting the cross-view inference capacity for MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview-Suite.",
          "authors": [
            "Wei Wang",
            "Yuqian Yuan",
            "Tianwei Lin",
            "Wenqiao Zhang",
            "Siliang Tang",
            "Jun Xiao",
            "Yueting Zhuang"
          ],
          "categories": [
            "cs.CV",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18621v1",
          "abstract_url": "https://arxiv.org/abs/2605.18621v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18621v1",
          "published_at": "2026-05-18T16:31:31+00:00",
          "updated_at": "2026-05-18T16:31:31+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18621",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18621v1"
          },
          "relevance_score": 217,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"alignment\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18621"
        },
        {
          "title": "SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science",
          "summary": "Large Language Models (LLMs) are increasingly deployed as scientific AI as- sistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi- turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and par- tial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (in- consistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM per- formance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs fre- quently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.",
          "authors": [
            "Nithin Somasekharan",
            "Youssef Hassan",
            "Shiyao Lin",
            "Gihan Panapitiya",
            "Patrick Emami",
            "Anurag Acharya",
            "Sameera Horawalavithana",
            "Shaowu Pan"
          ],
          "categories": [
            "cs.AI",
            "physics.comp-ph"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18630v1",
          "abstract_url": "https://arxiv.org/abs/2605.18630v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18630v1",
          "published_at": "2026-05-18T16:34:45+00:00",
          "updated_at": "2026-05-18T16:34:45+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.18630",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18630v1"
          },
          "relevance_score": 181,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18630"
        },
        {
          "title": "MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion",
          "summary": "Persuasive dialogue generation plays a vital role in decision-making, negotiation, counseling, and behavior change, yet it remains a challenging problem. In complex persuasion where the persuadee's internal states are not expressed clearly, the persuader must interpret responses, infer the persuadee's latent mental states (e.g., beliefs and desires), and translate them into targeted, strategy-consistent actions; however, current approaches often produce generic or weakly grounded responses even when such cues are identified. Moreover, although large language models (LLMs) can generate persuasive content, their performance varies substantially across domains due to uneven knowledge coverage and limited reasoning generalization. To address these challenges, we propose MA$^{2}$P, a meta-cognitive autonomous intelligent agent framework for complex persuasion. Specifically, we develop an autonomous multi-agent architecture that coordinates perception management, mental-state inference, strategy execution, memory maintenance, and performance evaluation. To mitigate cross-domain performance variation, we further design a meta-cognitive configurator that selects an appropriate meta-strategy from a structured knowledge base at the outset, thereby guiding subsequent reasoning and planning. Experimental results show that our approach achieves a higher persuasion success rate than baselines.",
          "authors": [
            "Dingyi Zhang",
            "Ziqing Zhuang",
            "Linhai Zhang",
            "Ziyang Gao",
            "Deyu Zhou"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18572v1",
          "abstract_url": "https://arxiv.org/abs/2605.18572v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18572v1",
          "published_at": "2026-05-18T15:53:12+00:00",
          "updated_at": "2026-05-18T15:53:12+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18572",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18572v1"
          },
          "relevance_score": 176,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18572"
        },
        {
          "title": "SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents",
          "summary": "As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task-conditioned generation, where a task-specific skill is synthesized after the task is revealed, and task-agnostic generation, where a reusable skill library must be distilled before downstream tasks are known. It also spans two complementary procedural sources: repository-grounded instances, where procedures are distributed across code, configuration, and scripts, and document-grounded instances, where procedures and constraints must be distilled from long-form text. We provide standardized task specifications, pinned environments, and evaluation protocols centered on deterministic execution-based checks, supplemented by auxiliary signals for diagnosis. Experiments across a range of skill-generation methods and backbones show substantial performance variation, highlight the difficulty of reusable skill distillation, and reveal distinct failure modes in skill generation from software repositories versus long-form documents. SkillGenBench establishes a reproducible testbed for studying skill generation as an independent research problem in agent systems.",
          "authors": [
            "Yifan Zhou",
            "Zhentao Zhang",
            "Ziming Cheng",
            "Shuo Zhang",
            "Qizhen Lan",
            "Zhangquan Chen",
            "Zhi Yang",
            "QianyuXu",
            "Ronghao Chen",
            "Huacan Wang",
            "Sen Hu"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18693v1",
          "abstract_url": "https://arxiv.org/abs/2605.18693v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18693v1",
          "published_at": "2026-05-18T17:28:36+00:00",
          "updated_at": "2026-05-18T17:28:36+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18693",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18693v1"
          },
          "relevance_score": 168,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18693"
        },
        {
          "title": "LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems",
          "summary": "Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce LongMINT (Long-Horizon Memory under INTerference), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, LongMINT has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are later revised or interfered with by subsequent context, with performance degrading as the number of intervening updates increases.",
          "authors": [
            "Hyunji Lee",
            "Justin Chih-Yao Chen",
            "Joykirat Singh",
            "Zaid Khan",
            "Elias Stengel-Eskin",
            "Mohit Bansal"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18565v1",
          "abstract_url": "https://arxiv.org/abs/2605.18565v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18565v1",
          "published_at": "2026-05-18T15:43:35+00:00",
          "updated_at": "2026-05-18T15:43:35+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18565",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18565v1"
          },
          "relevance_score": 158,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18565"
        },
        {
          "title": "Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning",
          "summary": "Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically tightly couple tool invocation with immediate execution. Such immediate tool interaction may disrupt the reasoning coherence of LLMs and constrain their expressivity, ultimately degrading reasoning performance. To this end, for the first time, we propose and formalize the problem of decoupling tool invocation from execution during reasoning, and introduce delayed execution with explicit control to enhance tool-integrated reasoning (TIR). Furthermore, we propose a hierarchical control framework and theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm. Extensive experiments on IH-GRPO achieve absolute improvements of 1.87\\%, 2.16\\%, and 2.53\\% on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B across six out-of-domain mathematical reasoning benchmarks over the strongest baseline method, while also yielding consistent performance gains in other domains. Our code is available at https://github.com/Lumina04/IH-GRPO-01.",
          "authors": [
            "Li Wang",
            "Xiaohan Wang",
            "Xiaodong Lu",
            "Zipeng Zhang",
            "Jinyang Wu",
            "Jiajun Chai",
            "Wei Lin",
            "Guojun Yin"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18500v1",
          "abstract_url": "https://arxiv.org/abs/2605.18500v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18500v1",
          "published_at": "2026-05-18T14:54:49+00:00",
          "updated_at": "2026-05-18T14:54:49+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.18500",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18500v1"
          },
          "relevance_score": 157,
          "match_reasons": [
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18500"
        },
        {
          "title": "STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics",
          "summary": "Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even the SOTA proprietary models, including Claude-4.6-Opus, achieves less than 40\\% overall accuracies, highlighting the fundamental difficulty of spatio-temporal dynamic reasoning. Systematic analysis of failure trajectories uncovers three recurring error modes of existing models: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. Guided by these findings, we propose an iterative trajectory refinement technique that eliminates these failure patterns from training data, and combine it with online RL to produce STT-Agent-4B which outperforms frontier LLMs on STT-Arena.",
          "authors": [
            "Tingfeng Hui",
            "Hao Xu",
            "Pengyu Zhu",
            "Hongsheng Xin",
            "Kun Zhan",
            "Sen Su",
            "Chunxiao Liu",
            "Ning Miao"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18548v1",
          "abstract_url": "https://arxiv.org/abs/2605.18548v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18548v1",
          "published_at": "2026-05-18T15:27:52+00:00",
          "updated_at": "2026-05-18T15:27:52+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18548",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18548v1"
          },
          "relevance_score": 154,
          "match_reasons": [
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18548"
        },
        {
          "title": "Estimating Item Difficulty with Large Language Models as Experts",
          "summary": "Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. However, for newly created tasks, response data are typically unavailable. Pretesting and expert judgement can be costly and slow, while machine learning methods often require large labelled training datasets. Recent work suggests that large language models (LLMs) may help. However, there is limited evidence on the elicitation procedures and prompt configurations used to emulate experts for difficulty estimation. This study addresses this gap by evaluating three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. Using an item bank from an online learning system, the study examined 6 domains of primary-school mathematics, with empirical difficulty estimates treated as empirical reference. The study used a full factorial design crossing three factors: judgement format (absolute vs pairwise), decision type (hard decisions vs token-probability-based estimates), and prompting strategy (zero-shot vs few-shot). LLM-derived difficulty estimates were compared with empirical difficulties using Spearman rank correlations. Across domains, LLM-based estimates exhibited moderate to strong positive correlations with empirical item difficulties. For simpler arithmetic tasks, some configurations approached the upper end of the accuracy range reported for human experts in previous research. Pairwise comparison consistently outperformed absolute judgement in the absence of additional refinements. However, when token-level probabilities were incorporated and examples of items with known empirical difficulty were provided, the absolute judgement configuration likewise demonstrated moderate-to-high alignment. The study positions LLMs as a promising tool for initial item calibration and offers insights into effective workflow configuration.",
          "authors": [
            "Diana Kolesnikova",
            "Kirill Fedyanin",
            "Abe D. Hofman",
            "Matthieu J. S. Brinkhuis",
            "Maria Bolsinova"
          ],
          "categories": [
            "stat.ME",
            "cs.AI",
            "cs.LG",
            "stat.AP"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18562v1",
          "abstract_url": "https://arxiv.org/abs/2605.18562v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18562v1",
          "published_at": "2026-05-18T15:42:13+00:00",
          "updated_at": "2026-05-18T15:42:13+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.18562",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18562v1"
          },
          "relevance_score": 144,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"alignment\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18562"
        },
        {
          "title": "Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation",
          "summary": "Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty to focus on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and \"Thinking-with-Images\" agentic models.",
          "authors": [
            "Qianhao Yuan",
            "Jie Lou",
            "Xing Yu",
            "Hongyu Lin",
            "Le Sun",
            "Xianpei Han",
            "Yaojie Lu"
          ],
          "categories": [
            "cs.CV",
            "cs.AI",
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18740v1",
          "abstract_url": "https://arxiv.org/abs/2605.18740v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18740v1",
          "published_at": "2026-05-18T17:57:04+00:00",
          "updated_at": "2026-05-18T17:57:04+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18740",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18740v1"
          },
          "relevance_score": 142,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"agent\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18740"
        },
        {
          "title": "What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models",
          "summary": "Medicine is inherently pluralistic. Principles such as autonomy, beneficence, nonmaleficence, and justice routinely conflict, and such ethical dilemmas often sharply divide reasonable physicians. Good clinical practice navigates these tensions in concert with each patient's values rather than imposing a single ethical stance. The ethical values that large language models bring to medical advice, however, have not been systematically examined. We present a framework for auditing value pluralism in medical AI, comprising a benchmark of clinician-verified dilemmas and an attribution method that recovers value priorities directly from decisions. The ecosystem of frontier models spans physician-level value heterogeneity, and models discuss competing values in their reasoning (Overton pluralism) before committing to a decision. However, individual model decisions are near-deterministic across repeated sampling and semantic variations, failing to reproduce the distributional pluralism of the physician panel. Across benchmark cases, these consistent decisions reflect committed, systematic value preferences. While most model priorities fall within the natural range of inter-physician variation, some significantly underweight patient autonomy. A single LLM deployed without regard for its value priorities could amplify those priorities at scale to every patient it serves. Without explicit efforts to balance ethical perspectives with one or multiple models, these tools risk replacing clinical pluralism with a deployment monoculture.",
          "authors": [
            "Payal Chandak",
            "Victoria Alkin",
            "David Wu",
            "Maya Dagan",
            "Taposh Dutta Roy",
            "Maria Clara Saad Menezes",
            "Ayush Noori",
            "Nirali Somia",
            "John S. Brownstein",
            "Ran Balicer",
            "Rebecca W. Brendel",
            "Noa Dagan",
            "Isaac S. Kohane",
            "Gabriel A. Brat"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18738v1",
          "abstract_url": "https://arxiv.org/abs/2605.18738v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18738v1",
          "published_at": "2026-05-18T17:56:13+00:00",
          "updated_at": "2026-05-18T17:56:13+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.18738",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18738v1"
          },
          "relevance_score": 142,
          "match_reasons": [
            "title matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18738"
        },
        {
          "title": "Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches",
          "summary": "Optimization models developed by operations research (OR) experts are often deployed as decision-support systems in industrial settings. However, real-world environments are dynamic, with evolving business rules, previously overlooked constraints, and unforeseen perturbations. In such contexts, end users must rapidly re-optimize models to recover feasible and implementable solutions. This paper introduces an agentic re-optimization framework in which a large language model (LLM) acts as an OR expert, dynamically supporting end users through natural-language interaction. The LLM translates user prompts into structured updates of the underlying optimization model, selects suitable re-optimization techniques from an optimization toolbox, and solves the resulting instance to return implementable solutions. The toolbox leverages primal information, including historical solutions, valid inequalities, solver configurations, and metaheuristics, to accelerate re-optimization while preserving solution quality. The proposed framework enables interactive and continuous adaptation of deployed optimization models, reducing dependence on OR experts and improving the sustainability of decision-support systems. Extensive experiments on two complementary large-scale real-world case studies demonstrate the effectiveness and scalability of the proposed framework. The first considers online supply chain re-optimization, where solutions must be generated rapidly while remaining close to the deployed plan, whereas the second focuses on offline university exam scheduling, where solution quality is prioritized over runtime. Results show that the toolbox-driven architecture significantly improves computational efficiency through primal-based and solver-aware re-optimization techniques, while the structured patch-based updates improve interpretability and traceability of model modifications.",
          "authors": [
            "Tinghan Ye",
            "Arnaud Deza",
            "Ved Mohan",
            "El Mehdi Er Raqabi",
            "Pascal Van Hentenryck"
          ],
          "categories": [
            "cs.AI",
            "math.OC"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18692v1",
          "abstract_url": "https://arxiv.org/abs/2605.18692v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18692v1",
          "published_at": "2026-05-18T17:28:25+00:00",
          "updated_at": "2026-05-18T17:28:25+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18692",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18692v1"
          },
          "relevance_score": 142,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"agent\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18692"
        },
        {
          "title": "Latent Action Reparameterization for Efficient Agent Inference",
          "summary": "Large language model (LLM) agents often rely on long sequences of low-level textual actions, resulting in large effective decision horizons and high inference cost. While prior work has focused on improving inference efficiency through system-level optimizations or prompt engineering, we argue that a key bottleneck lies in the representation of the action space itself. We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations. Across a range of LLM-based agent benchmarks, LAR significantly reduces the effective action horizon and improves inference efficiency under fixed compute budgets. As a consequence, our approach achieves substantial reductions in action tokens and corresponding wall-clock inference time, while maintaining or improving task success rates. These results suggest that action representation learning is a critical and underexplored factor in scaling efficient LLM agent inference, complementary to advances in model architecture and hardware.",
          "authors": [
            "Wenhao Huang",
            "Qingwen Zeng",
            "Qiyue Chen",
            "Zijie Guo",
            "Yu Sun",
            "Cheng Yang",
            "Siru Ouyang",
            "Jiri Gesi",
            "Fang Wu",
            "Jiayi Zhang",
            "Huaming Chen",
            "Bang Liu",
            "Xiangru Tang",
            "Chenglin Wu"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18597v1",
          "abstract_url": "https://arxiv.org/abs/2605.18597v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18597v1",
          "published_at": "2026-05-18T16:07:44+00:00",
          "updated_at": "2026-05-18T16:07:44+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18597",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18597v1"
          },
          "relevance_score": 140,
          "match_reasons": [
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18597"
        },
        {
          "title": "Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models",
          "summary": "Machine Translation (MT) for Ancient Greek (AG) to Modern Greek (MG) is a low-resource task, constrained by the lack of large-scale, high-quality parallel data. We address this gap by introducing the AG-MG Parallel Corpus, a new resource containing 132,481 sentence-aligned pairs derived from literary, historical, and biblical texts. We present a novel corpus creation pipeline that combines web-scraped, excerpt-level data with a multi-stage sentence-level alignment, and refinement process. Our method uses VecAlign with LaBSE embeddings, which we first fine-tune on a manually-aligned AG-MG subset, followed by an LLM-based error/misalignment correction phase using Gemini 2.5 Flash to ensure high alignment quality. Furthermore, we provide the first comprehensive benchmark of modern MT models on this task, evaluating three fine-tuning strategies across NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B). Our experiments show that fine-tuning yields significant improvements over base models, increasing performance by up to +10.3 BLEU points. Specifically, full-parameter fine-tuning of Llama-Krikri-8B achieves the highest overall performance with a BLEU score of 13.16, while the QLoRA-adapted M2M100-1.2B model demonstrates the largest relative gains and highly competitive results. Our dataset and models represent a significant contribution to Greek NLP.",
          "authors": [
            "Spyridon Mavromatis",
            "Sokratis Sofianopoulos",
            "Prokopis Prokopidis",
            "Maria Giagkou"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18504v1",
          "abstract_url": "https://arxiv.org/abs/2605.18504v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18504v1",
          "published_at": "2026-05-18T14:56:44+00:00",
          "updated_at": "2026-05-18T14:56:44+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": "10.63317/4cdk64dgm2w9",
          "arxiv_id": "2605.18504",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18504v1",
            "doi": "https://doi.org/10.63317/4cdk64dgm2w9"
          },
          "relevance_score": 137,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"benchmark\"",
            "summary matched \"alignment\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.63317/4cdk64dgm2w9"
        },
        {
          "title": "AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment",
          "summary": "The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals -- from verifier outcomes, peer rollouts, or reference feedback -- into concise, self-generated Socratic hints and critiques. Furthermore, we introduce Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate these reflections into sparse, highly precise token-level advantage modulations. Combined with temporal annealing, this mechanism preserves the base environmental reward while filtering out distributional noise. Experiments across scientific, mathematical, and tool-use benchmarks demonstrate that AMR-SD significantly outperforms existing baselines, achieving robust long-horizon stability and successfully preventing late-stage collapse.",
          "authors": [
            "Zhenlin Wei",
            "Pu Jian",
            "Yingzhuo Deng",
            "Xiaohan Wang",
            "Jiajun Chai",
            "Zhexin Hu",
            "Wei Lin",
            "Shanbin Zhang",
            "Guojun Yin"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18529v1",
          "abstract_url": "https://arxiv.org/abs/2605.18529v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18529v1",
          "published_at": "2026-05-18T15:14:34+00:00",
          "updated_at": "2026-05-18T15:14:34+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.18529",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18529v1"
          },
          "relevance_score": 136,
          "match_reasons": [
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18529"
        },
        {
          "title": "Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning",
          "summary": "Cross-domain knowledge alignment is essential for integrating heterogeneous medical systems, yet existing approaches typically treat entity alignment as a static matching problem, ignoring query context and cross-system asymmetry. This limitation is particularly critical in integrative medical settings, where correspondence between concepts is inherently context-dependent, non-bijective, and direction-sensitive. In this paper, we propose Query-Conditioned Entity Alignment (QCEA), which reformulates entity alignment as a query-conditioned correspondence problem. Instead of learning a fixed mapping between entity representations, QCEA treats the textual description of a source entity as a query and ranks candidate entities in the target graph, enabling context-dependent alignment. The framework integrates semantic encoding, graph-based representation learning, and a direction-aware transformation module to capture asymmetric and many-to-many correspondence across heterogeneous knowledge systems. We evaluate QCEA on TCM--WM knowledge graphs derived from SymMap, covering both symptom alignment and herb--molecule alignment tasks. Experimental results show consistent improvements over representative baselines, particularly on rank-sensitive metrics such as Hit@K and MRR. Furthermore, downstream retrieval-augmented generation (RAG) experiments demonstrate that improved alignment leads to better evidence retrieval, stronger grounding, and higher answer accuracy. These findings highlight that alignment is not merely a data integration step, but a key factor that shapes knowledge accessibility and reliability in cross-system medical reasoning.",
          "authors": [
            "Yan Jiao",
            "Jingran Xu",
            "Pin-Han Ho",
            "Limei Peng"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18570v1",
          "abstract_url": "https://arxiv.org/abs/2605.18570v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18570v1",
          "published_at": "2026-05-18T15:49:46+00:00",
          "updated_at": "2026-05-18T15:49:46+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Reasoning",
            "RAG"
          ],
          "doi": null,
          "arxiv_id": "2605.18570",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18570v1"
          },
          "relevance_score": 126,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"alignment\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18570"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [
        "《An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments》〔方法〕：LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilitie…",
        "《Multilingual jailbreaking of LLMs using low-resource languages》〔数据 / 方法〕：Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using l…",
        "《Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks》〔评测 / 数据 / 方法〕：Coding agents now run autonomously with shell, file, and network privileges. When a user issues a benign request, the agent sometimes does more than asked: it…",
        "《Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models》〔数据 / 方法〕：The integration of audio modality into Large Audio Language Models (LALMs) significantly expands their attack surface. Existing jailbreak paradigms predominant…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments",
          "summary": "LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability, but they also create attack surfaces when untrusted external content is processed as part of a user' s task. This paper studies a privacy-leakage attack chain based on indirect prompt injection in black-box chatbot environments, where the attacker has no access to model weights, system prompts, or agent implementation details including how a trajectory is actually managed during its processing for a query. We first analyze how an attacker can hijack an agent' s intended task by crafting external content that appears benign to the victim while inducing the agent to execute an attacker-defined objective. We then evaluate a new prompt-injection technique, called exemplification, which uses a bridge in the external content to reframe the user prompt and the benign beginning of the retrieved page as few-shot examples before appending the attacker' s objective. We compare its attack success rate with a prior fake-completion technique. Finally, we demonstrate a proof-of-concept data-exfiltration chain using fictitious personal information in a controlled setting. Our results suggest that prompt injection, jailbreak-style instruction steering, and web-tool invocation can be combined into a feasible privacy-leakage path in deployed chatbot agents.",
          "authors": [
            "Hongjang Yang",
            "Hyunsik Na",
            "Daeseon Choi"
          ],
          "categories": [
            "cs.CR",
            "cs.AI",
            "cs.HC",
            "cs.IR"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18133v1",
          "abstract_url": "https://arxiv.org/abs/2605.18133v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18133v1",
          "published_at": "2026-05-18T09:38:18+00:00",
          "updated_at": "2026-05-18T09:38:18+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18133",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18133v1"
          },
          "relevance_score": 98,
          "match_reasons": [
            "title matched \"prompt injection\"",
            "summary matched \"indirect prompt injection\"",
            "summary matched \"jailbreak\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18133"
        },
        {
          "title": "Multilingual jailbreaking of LLMs using low-resource languages",
          "summary": "Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and isiZulu) can bypass safety mechanisms across commercial LLMs. We translated prompts from existing datasets and evaluated ChatGPT, Claude, DeepSeek, Gemini, and Grok through automated testing and human red-teaming with native speakers. Single-turn translation attacks proved ineffective, while multi-turn conversations achieved English harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini), Afrikaans from 60.0% (Claude 3.5 Haiku) to 78.2% (GPT-4o-mini), and Kiswahili from 41.8% (Claude 3.5 Haiku) to 70.9% (DeepSeek). Human red-teaming increased jailbreak rates compared to automated methods. Over all evaluated languages, the average jailbreak rate increased from 59.8% to 75.8%, with improvements of +20.0% (Afrikaans), +12.7% (isiZulu), +12.3% (isiXhosa), and +1% (Kiswahili), demonstrating that poor translation quality limits jailbreak success. These findings suggest that vulnerabilities in LLMs persist in multilingual contexts and that translation quality is the critical factor determining jailbreak success in low-resource languages.",
          "authors": [
            "Dylan Marx",
            "Marcel Dunaiski"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18239v1",
          "abstract_url": "https://arxiv.org/abs/2605.18239v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18239v1",
          "published_at": "2026-05-18T11:33:18+00:00",
          "updated_at": "2026-05-18T11:33:18+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.18239",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18239v1"
          },
          "relevance_score": 82,
          "match_reasons": [
            "title matched \"jailbreak\"",
            "summary matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18239"
        },
        {
          "title": "Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks",
          "summary": "Coding agents now run autonomously with shell, file, and network privileges. When a user issues a benign request, the agent sometimes does more than asked: it deletes unrelated files, wipes a stale credentials backup, or rewrites configuration the user never mentioned. We call these scope expansions overeager actions, an authorization problem distinct from capability failures, prompt injection, or sandbox escapes. We present OverEager-Gen, a benchmark dedicated to overeager behavior on benign tasks. Building it surfaces a measurement-validity issue: if a benchmark spells out the authorized scope inside the prompt, the agent stops inferring boundaries and starts pattern-matching declaration text. On Claude Code, stripping the consent declaration alone raises the overeager rate from 0.0% to 17.1% on paired scenarios (McNemar exact p = 2.4 x 10^-4). OverEager-Gen therefore certifies each scenario's discriminative power before admission via a behavioral-gradient validator, audits internal tool calls through a dual-channel stack (PATH-injected shim plus per-agent event streams), and ships byte-identical consent_kept and consent_stripped variants. OverEager-Bench contains 500 validated scenarios and ~7,500 runs across four agent products (Claude Code, OpenHands, Codex CLI, Gemini CLI) and six base models; a 50-sample re-annotation gives Cohen's kappa = 0.73 and rule-judge recall = 1.00. Stripping consent multiplies the overeager rate on every shared base model (Delta in [11.9, 17.2] pp). The framework axis dominates effect size: a permissive cluster (Claude Code, Codex CLI, Gemini CLI) runs at 5.4-27.7% while the ask-to-continue framework (OpenHands) sits at 0.2-4.5% (Fisher p <= 10^-5). Within-framework base-model variance reaches 15.9 pp, indicating that model-layer alignment does not fully propagate through permissive permission gating.",
          "authors": [
            "Yubin Qu",
            "Ying Zhang",
            "Yanjun Zhang",
            "Gelei Deng",
            "Yuekang Li",
            "Leo Yu Zhang",
            "Yi Liu"
          ],
          "categories": [
            "cs.SE",
            "cs.AI",
            "cs.CL",
            "cs.CR"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18583v1",
          "abstract_url": "https://arxiv.org/abs/2605.18583v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18583v1",
          "published_at": "2026-05-18T16:00:41+00:00",
          "updated_at": "2026-05-18T16:00:41+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "Agent",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.18583",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18583v1"
          },
          "relevance_score": 68,
          "match_reasons": [
            "summary matched \"prompt injection\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "title matched \"coding agent\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18583"
        },
        {
          "title": "Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models",
          "summary": "The integration of audio modality into Large Audio Language Models (LALMs) significantly expands their attack surface. Existing jailbreak paradigms predominantly treat audio as a carrier for malicious payloads, relying on semantic optimization, acoustic parameter control, or additive perturbation to embed harmful content into the audio signal. In this work, we challenge this necessity and propose a new paradigm in which the role of audio shifts from content injection to safety alignment interference. We reveal that LALM safety alignment can be compromised solely by specific Acoustic Latent Semantics (ALS), the underlying paralinguistic features intrinsic to the priors of audio generative models. Distinct from previous works that leverage explicit acoustic parameters to merely style malicious audio, we demonstrate that interference audio, benign in content but infused with specific ALS, can serve as a universal jailbreak trigger. Leveraging this insight, we propose the Acoustic Interference Attack (AIA), which decouples the attack payload from the audio. Specifically, AIA employs a set of universal, instruction-neutral interference audio, enabling standard malicious text queries to bypass safety alignment without instance-specific optimization. Extensive experiments on 10 LALMs across five datasets demonstrate that AIA achieves the state-of-the-art attack success rate. Furthermore, our interpretability analysis uncovers the inference path drift induced by AIA and identifies the inherent effective patterns within ALS, revealing the fundamental vulnerability of cross-modal alignment in LALMs.",
          "authors": [
            "Yanyun Wang",
            "Yu Huang",
            "Zi Liang",
            "Xixin Wu",
            "Li Liu"
          ],
          "categories": [
            "cs.CR",
            "cs.SD"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18168v1",
          "abstract_url": "https://arxiv.org/abs/2605.18168v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18168v1",
          "published_at": "2026-05-18T10:10:31+00:00",
          "updated_at": "2026-05-18T10:10:31+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "Language Model",
            "RAG"
          ],
          "doi": null,
          "arxiv_id": "2605.18168",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18168v1"
          },
          "relevance_score": 63,
          "match_reasons": [
            "title matched \"jailbreak\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18168"
        }
      ]
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [
        "《Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents》〔应用 / 方法〕：Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates: tha…",
        "《SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution》〔评测 / 数据 / 应用 / 方法〕：Long-horizon LLM agents leave traces that could become reusable experience, but raw trajectories are noisy and hard to govern. We treat Agent Skills as an expe…",
        "《Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents》〔评测 / 方法〕：Legacy systems concentrate business rules, architectural decisions, and operational exceptions that often remain implicit in code, data, configuration, and mai…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents",
          "summary": "Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates: that a test step follows a code modification, that error cascades are short, or that trajectories are compact. Each rule is typically derived from a single framework, and whether it transfers, in sign as well as magnitude, to structurally different agent designs has not been directly tested. We address this at ecosystem scale: 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework (e.g., SWE-Agent, OpenHands) that supplies its tools and workflow. We separate framework effects from LLM effects by holding each layer fixed in turn, then measure one behavior-outcome effect per configuration and examine how those effects agree or disagree. Swapping the framework while the LLM is held fixed produces large behavioral differences in every action feature. On most signals, configurations disagree not merely in magnitude but in direction. Error rate is the cleanest case: 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher. Five other continuous features and three of seven binary patterns from prior SE literature show similar directional disagreement. Framework identity accounts for more of this variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%. The implication is that the same observable behavioral signal can carry opposite meaning for different agent configurations. Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.",
          "authors": [
            "Wei Ma",
            "Zhi Chen",
            "Jingxu Gu",
            "Tianling Li",
            "Shangqing Liu",
            "Lingxiao Jiang"
          ],
          "categories": [
            "cs.SE",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18332v1",
          "abstract_url": "https://arxiv.org/abs/2605.18332v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18332v1",
          "published_at": "2026-05-18T12:49:18+00:00",
          "updated_at": "2026-05-18T12:49:18+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18332",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18332v1"
          },
          "relevance_score": 83,
          "match_reasons": [
            "title matched \"software engineering agent\"",
            "summary matched \"SWE-bench\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18332"
        },
        {
          "title": "SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution",
          "summary": "Long-horizon LLM agents leave traces that could become reusable experience, but raw trajectories are noisy and hard to govern. We treat Agent Skills as an experience schema that couples executable scripts, with non-executable guidance on procedures. Yet open skill ecosystems contain redundant, uneven, environment-sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle-governance framework for Agent Skills from collection and recommendation to evolution. SkillsVote profiles a million-scale open-source corpus for environment requirements, quality, and verifiability, then synthesizes tasks for verifiable skills. Before execution, SkillsVote performs agentic library search over structured skill library to expose instructional skill context. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. In our evaluation, offline evolution improves GPT-5.2 on Terminal-Bench 2.0 by up to 7.9 pp, while online evolution improves SWE-Bench Pro by up to 2.6 pp. Overall, governed external skill libraries can improve frozen agents without model updates when systems control exposure, credit, and preservation.",
          "authors": [
            "Hongyi Liu",
            "Haoyan Yang",
            "Tao Jiang",
            "Bo Tang",
            "Feiyu Xiong",
            "Zhiyu Li"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18401v1",
          "abstract_url": "https://arxiv.org/abs/2605.18401v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18401v1",
          "published_at": "2026-05-18T13:44:19+00:00",
          "updated_at": "2026-05-18T13:44:19+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18401",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18401v1"
          },
          "relevance_score": 62,
          "match_reasons": [
            "summary matched \"Terminal-Bench\"",
            "summary matched \"SWE-bench\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18401"
        },
        {
          "title": "Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents",
          "summary": "Legacy systems concentrate business rules, architectural decisions, and operational exceptions that often remain implicit in code, data, configuration, and maintenance practices. At the same time, language-model-based coding agents depend on reliable context, correctness criteria, and behavioral contracts to modify real systems with lower risk. This paper presents Reversa, a reverse documentation engineering framework for converting legacy software into traceable operational specifications for AI agents. Reversa organizes this process as a multi-agent pipeline: specialized agents map the project surface, analyze modules, extract implicit rules, synthesize architecture, write unit-level specifications, and review generated claims. The proposal emphasizes three mechanisms: traceability between code and specification, explicit confidence marking, and preservation of gaps for human validation. The framework is distributed as a Node.js CLI, installs skills across multiple agent engines, and uses a SHA-256 manifest to preserve modified files during update or uninstall operations. In addition to the architectural description, we report an exploratory case study on migrating an ATM from COBOL to Go, in which the pipeline produced 517 claims classified by an internal confidence index, 10 registered gaps, 53 Gherkin parity scenarios, and a reconstruction plan with 9 of 11 tasks completed at inventory time. Final parity validation and cutover were not completed in this study. We do not claim broad empirical superiority; we position the contribution with respect to the literature on reverse engineering, LLM-based documentation, and software agents, and propose an evaluation protocol with metrics for coverage, traceability, confidence, utility, and cost.",
          "authors": [
            "Sanderson Oliveira de Macedo",
            "Ronaldo Martins da Costa"
          ],
          "categories": [
            "cs.SE",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.18684v1",
          "abstract_url": "https://arxiv.org/abs/2605.18684v1",
          "pdf_url": "https://arxiv.org/pdf/2605.18684v1",
          "published_at": "2026-05-18T17:23:13+00:00",
          "updated_at": "2026-05-18T17:23:13+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.18684",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.18684v1"
          },
          "relevance_score": 48,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.18684"
        }
      ]
    }
  ]
}