{
  "generated_at": "2026-05-26T13:09:24.051580+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "Topic \"LLM\": 14 papers matched across LM and Agent Runtime Security; representative papers include \"Automated Benchmark Auditing for AI Agents and Large Language Models\" and \"Causal methods for LLM development and evaluation\".",
    "Topic \"Language Model\": 14 papers matched across LM and Agent Runtime Security; representative papers include \"Automated Benchmark Auditing for AI Agents and Large Language Models\" and \"Causal methods for LLM development and evaluation\".",
    "Topic \"Agent\": 4 papers matched across LM and Agent Runtime Security; representative papers include \"PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction\" and \"DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking\".",
    "Topic \"Large Language Model\": 2 papers matched in LM; representative papers include \"Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition\" and \"Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World\".",
    "Topic \"Reasoning\": 1 paper matched in LM; the representative paper is \"MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models\"."
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 14,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "Automated Benchmark Auditing for AI Agents and Large Language Models",
        "Causal methods for LLM development and evaluation",
        "PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction",
        "Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning",
        "When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation",
        "QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability",
        "DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking",
        "Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization",
        "TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning",
        "Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation",
        "Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models",
        "Merge-Bench: Resolve Merge Conflicts with Large Language Models",
        "CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents",
        "AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions"
      ],
      "key_points": [
        "\"Automated Benchmark Auditing for AI Agents and Large Language Models\" [Evaluation / Data / Methods]: Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumption…",
        "\"Causal methods for LLM development and evaluation\" [Evaluation / Applications / Methods]: Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evalua…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 14,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "Automated Benchmark Auditing for AI Agents and Large Language Models",
        "Causal methods for LLM development and evaluation",
        "Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning",
        "When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation",
        "QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability",
        "Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition",
        "MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models",
        "Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization",
        "Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World",
        "TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning",
        "Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation",
        "Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models",
        "Merge-Bench: Resolve Merge Conflicts with Large Language Models",
        "AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions"
      ],
      "key_points": [
        "\"Automated Benchmark Auditing for AI Agents and Large Language Models\" [Evaluation / Data / Methods]: Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumption…",
        "\"Causal methods for LLM development and evaluation\" [Evaluation / Applications / Methods]: Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evalua…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 4,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction",
        "DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking",
        "CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents",
        "How Agentic AI Coding Assistants Become the Attacker's Shell"
      ],
      "key_points": [
        "\"PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction\" [Evaluation / Methods]: This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly sign…",
        "\"DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking\" [Evaluation / Methods]: Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established scien…"
      ]
    },
    {
      "name": "Large Language Model",
      "paper_count": 2,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition",
        "Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World"
      ],
      "key_points": [
        "\"Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition\" [Evaluation / Methods]: Large Language Model interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. Traces are framed as transparency mec…",
        "\"Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World\" [Evaluation / Applications / Methods]: Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet curr…"
      ]
    },
    {
      "name": "Reasoning",
      "paper_count": 1,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models"
      ],
      "key_points": [
        "\"MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models\" [Data / Methods]: Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substant…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "\"Automated Benchmark Auditing for AI Agents and Large Language Models\" [Evaluation / Data / Methods]: Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumption…",
        "\"Causal methods for LLM development and evaluation\" [Evaluation / Applications / Methods]: Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evalua…",
        "\"PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction\" [Evaluation / Methods]: This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly sign…",
        "\"Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning\" [Evaluation / Applications / Methods]: While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint that app…",
        "\"When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation\" [Data / Methods]: We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Automated Benchmark Auditing for AI Agents and Large Language Models",
          "summary": "Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues including ambiguous task design, execution environment conflicts, and incorrect ground truths in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distort capability assessments for agents and LLMs: filtering out these tasks with issues shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.",
          "authors": [
            "Junlin Wang",
            "Federico Bianchi",
            "Shang Zhu",
            "Fan Nie",
            "Yongchan Kwon",
            "Bhuwan Dhingra",
            "James Zou"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2605.26079v1",
          "abstract_url": "https://arxiv.org/abs/2605.26079v1",
          "pdf_url": "https://arxiv.org/pdf/2605.26079v1",
          "published_at": "2026-05-25T17:44:21+00:00",
          "updated_at": "2026-05-25T17:44:21+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.26079",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.26079v1"
          },
          "relevance_score": 244,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata",
            "summary matched \"Terminal-Bench\"",
            "summary matched \"SWE-bench\""
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.26079"
        },
        {
          "title": "Causal methods for LLM development and evaluation",
          "summary": "Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What is the effect of adding a data domain during pretraining? How do annotator preferences change when LLMs generate text in a different style? Should a prompt be routed to a larger or smaller model given inference cost constraints? In general, causal methods are well-suited to such settings where interventions change outcomes but, surprisingly, are underrepresented in LLM development. Our contribution is threefold: (1) We explain how causal methods can help develop modern LLM development and evaluation: LLM development relies heavily on logged data, which are often subject to confounding and distribution shifts; evaluation uses learned but potentially biased judges; and deployment environments are non-stationary. These conditions make purely predictive approaches fragile and create opportunities for principled identification and estimation methods from causal inference. (2) We further map opportunities for causal methods in the entire LLM development pipeline, including pretraining, alignment, routing, agentic workflows, and evaluation. (3) We discuss new research opportunities around leveraging causal methods for LLM development and evaluation. Overall, we argue that causal methods are potentially underutilized for the LLM development and evaluation pipeline, despite the fact that such methods can ensure a reliable and scientifically grounded design.",
          "authors": [
            "Dennis Frauen",
            "Marie Brockschmidt",
            "Konstantin Hess",
            "Haorui Ma",
            "Yuchen Ma",
            "Abdurahman Maarouf",
            "Maresa Schröder",
            "Jonas Schweisthal",
            "Yuxin Wang",
            "Athiya Deviyani",
            "Sonali Parbhoo",
            "Rahul G. Krishnan",
            "Stefan Feuerriegel"
          ],
          "categories": [
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25998v1",
          "abstract_url": "https://arxiv.org/abs/2605.25998v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25998v1",
          "published_at": "2026-05-25T16:15:44+00:00",
          "updated_at": "2026-05-25T16:15:44+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": "10.1145/3770855.3818647",
          "arxiv_id": "2605.25998",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25998v1",
            "doi": "https://doi.org/10.1145/3770855.3818647"
          },
          "relevance_score": 211,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"evaluation\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"agent\"",
            "summary matched \"alignment\"",
            "summary matched \"RAG\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.1145/3770855.3818647"
        },
        {
          "title": "PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction",
          "summary": "This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically Global Database of Events, Language, and Tone (GDELT). We define and target \"Perspective Mismatches\", the narrative divergence between Polymarket sentiment and global media flows, as high-alpha trading signals. Moving beyond generic agentic superiority, we rigorously quantify the efficacy of \"Harness Engineering\" techniques, including reflection loops, tool-calling, divide-and-conquer partitioning (D&C), and chain-of-thought (CoT), within high-noise financial domains. Our empirical evaluation against human-expert benchmarks reveals that while structural partitioning is mandatory for multi-dimensional alignment, unconstrained terminal reflection actively induces logical drift. Furthermore, we identify a pervasive \"consensus bias\" across all agent configurations during narrative reasoning, necessitating deterministic validation. Ultimately, we isolate a Pareto-optimal configuration that achieves professional-grade analytical precision while minimizing latency and token overhead, providing a robust blueprint for autonomous intelligence in prediction markets.",
          "authors": [
            "Daren Wang",
            "Hong Xu",
            "Jiawen Xian"
          ],
          "categories": [
            "cs.CL",
            "cs.CE"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25958v1",
          "abstract_url": "https://arxiv.org/abs/2605.25958v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25958v1",
          "published_at": "2026-05-25T15:30:54+00:00",
          "updated_at": "2026-05-25T15:30:54+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.25958",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25958v1"
          },
          "relevance_score": 202,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25958"
        },
        {
          "title": "Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning",
          "summary": "While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint that applicable law must match the temporal context of each case, as retroactive application of statutes violates core legal principles and leads to erroneous conclusions. Our observations reveal that current legal LLMs suffer from temporal bias anchored to their training cutoff, while search agents rarely incorporate temporal constraints into queries, and that web search alone cannot provide the precise statute and precedent citations that legal reasoning demands. To address these challenges, we propose LegalSearch-R1, an end-to-end reinforcement learning framework that pairs local statute RAG for precise article matching with online web search for broader legal knowledge, trained on temporally-indexed data spanning multiple amendment periods to enforce temporal consistency. Extensive experiments on our benchmark covering 13 legal tasks demonstrate that our 7B-parameter agent outperforms state-of-the-art deep research frameworks and specialized legal LLMs by 12.9% to 29.8%, surpasses baselines by 57.7% to 80.3% on temporal consistency, and exhibits robust out-of-domain generalization. The code and data are available at https://github.com/AlexFanw/LegalSearch-R1.",
          "authors": [
            "Wei Fan",
            "Yining Zhou",
            "Mufan Zhang",
            "Yanbing Weng",
            "Yiran HU",
            "Tianshi Zheng",
            "Baixuan Xu",
            "Chunyang Li",
            "Jianhui Yang",
            "Haoran Li",
            "Yangqiu Song"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25920v1",
          "abstract_url": "https://arxiv.org/abs/2605.25920v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25920v1",
          "published_at": "2026-05-25T14:57:13+00:00",
          "updated_at": "2026-05-25T14:57:13+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.25920",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25920v1"
          },
          "relevance_score": 197,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25920"
        },
        {
          "title": "When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation",
          "summary": "We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap survives four severity-proxy audits and remains significant when excluding qwen models (+11.10 pp, $p<0.0001$). Several stress tests fail honestly: cluster-bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross-architecture generator swaps break per-cell rankings, and a second LLM judge yields only moderate agreement ($\\kappa=0.50$). We then validate the headline effect on a fully held-out 11th model (qwen2.5-14B-Instruct; 1,800 trajectories) and re-test a pre-registered capability$\\times$tractability partition, observing a small but positive held-out effect (3/4 cells positive; pooled Welch $t=3.81$, $p=9.6\\times10^{-4}$). Using held-out trajectories, we probe four trace-level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a \\emph{stealth-divergence} picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held-out replication and a partial trace-level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.",
          "authors": [
            "Liyun Zhang",
            "Jiayi Guo"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25981v1",
          "abstract_url": "https://arxiv.org/abs/2605.25981v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25981v1",
          "published_at": "2026-05-25T15:57:11+00:00",
          "updated_at": "2026-05-25T15:57:11+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.25981",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25981v1"
          },
          "relevance_score": 180,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25981"
        },
        {
          "title": "QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability",
          "summary": "Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwag) measure models' discriminative ability over narrative continuation using multiple-choice recognition paradigms, rather than directly measuring creative generation capability; rubric-based scoring and LLM-as-Judge methods rely on subjective dimension assessment or natural language model outputs, and cannot provide objective, automated scoring mechanisms. This paper proposes QUIET (Quality Understanding via Interlocked Evaluation Testing), a diagnostic benchmark for LLM creative capability based on multi-blank cascaded story cloze. QUIET sets N blanks (10-20) in a story with complete structure, with each blank accompanied by an explicit content constraint, and cascade dependency relationships between blanks -- the content filled into earlier blanks constrains the feasible solution space for later blanks. The evaluated model (or human participants) fills all blanks in open-ended generation mode; the results are scored by an information-theoretic automated scoring protocol without human grading. The scoring protocol directly operationalizes the \"calibrated surprise\" theoretical framework (Zou & Xu, 2026a). For each blank k, a composite score is computed: score = satisfy * (1 + lambda * surprise), where lambda = 1.0. Here, \"satisfy\" measures how well the blank filling satisfies the content constraint (objective logical reasoning judgment, not subjective aesthetic scoring), and \"surprise\" measures the degree of surprise given that the constraint is satisfied. Creative answers that do not satisfy the constraint score zero; answers that satisfy the constraint but are mediocre score low; answers that satisfy the constraint and are surprising score high.",
          "authors": [
            "Bo Zou",
            "Chao Xu"
          ],
          "categories": [
            "cs.CL",
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25955v1",
          "abstract_url": "https://arxiv.org/abs/2605.25955v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25955v1",
          "published_at": "2026-05-25T15:29:58+00:00",
          "updated_at": "2026-05-25T15:29:58+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.25955",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25955v1"
          },
          "relevance_score": 180,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25955"
        },
        {
          "title": "DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking",
          "summary": "Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator, for which the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles and an LLM-judged explanation score following an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in their ability to design informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.",
          "authors": [
            "Matt L. Wiemann",
            "Lindsay M. Smith",
            "Peter Melchior",
            "Siddharth Mishra-Sharma",
            "Andrew Gordon Wilson",
            "Pavel Izmailov",
            "Carolina Cuesta-Lázaro"
          ],
          "categories": [
            "stat.ML",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2605.26087v1",
          "abstract_url": "https://arxiv.org/abs/2605.26087v1",
          "pdf_url": "https://arxiv.org/pdf/2605.26087v1",
          "published_at": "2026-05-25T17:50:07+00:00",
          "updated_at": "2026-05-25T17:50:07+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.26087",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.26087v1"
          },
          "relevance_score": 164,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"benchmark\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.26087"
        },
        {
          "title": "Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition",
          "summary": "Large Language Model interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. Traces are framed as transparency mechanisms, yet it is unclear how people use them to solve problems. We report a preregistered between-subjects study (N = 559) in which participants solved ten LSAT-style reasoning problems under one of three conditions: an Answer-only baseline, a Full-trace revealed before the answer, and a Summary-trace presented alongside the answer. Summaries preserved task performance at the no-trace baseline while significantly elevating trust and hedonic appeal, establishing that trace exposure shifts subjective appraisal of the interaction without bringing performance benefits. Under an open-weight reasoning model exposing verbose intermediate output, full traces additionally impaired performance relative to the answer-only baseline. Across all conditions, participants substantially overestimated their performance, and no trace format supported calibrated self-evaluation. Further analysis indicates that hedonic appeal, not trust, carries the indirect path to overestimation, consistent with a processing-fluency account. Reasoning traces are best understood as user-facing interface artifacts rather than transparent windows into model cognition, and calibration is unlikely to emerge from the traces themselves and may best be scaffolded by interactions that elicit users' own reasoning first.",
          "authors": [
            "Daniela Fernandes",
            "Daniel Buschek",
            "Lev Tankelevitch",
            "Thomas Kosch",
            "Robin Welsch"
          ],
          "categories": [
            "cs.HC",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25856v1",
          "abstract_url": "https://arxiv.org/abs/2605.25856v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25856v1",
          "published_at": "2026-05-25T13:46:04+00:00",
          "updated_at": "2026-05-25T13:46:04+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Large Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.25856",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25856v1"
          },
          "relevance_score": 164,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25856"
        },
        {
          "title": "MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models",
          "summary": "Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across LLaVA-665K and Vision-Flan datasets, and transfer settings to large target models, LLaVA-1.5-7B and -13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% relative performance to full finetuning on LLaVA-665K and 101.6% relative performance on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.",
          "authors": [
            "Shristi Das Biswas",
            "Kaushik Roy"
          ],
          "categories": [
            "cs.CV",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2605.26004v1",
          "abstract_url": "https://arxiv.org/abs/2605.26004v1",
          "pdf_url": "https://arxiv.org/pdf/2605.26004v1",
          "published_at": "2026-05-25T16:22:09+00:00",
          "updated_at": "2026-05-25T16:22:09+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2605.26004",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.26004v1"
          },
          "relevance_score": 163,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"alignment\"",
            "summary matched \"reasoning\"",
            "summary matched \"instruction tuning\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.26004"
        },
        {
          "title": "Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization",
          "summary": "Despite the strong reasoning capabilities of large language models (LLMs), optimizing the execution efficiency of tensor programs remains challenging due to the need for precise, composable transformation decisions. Recent LLM-guided approaches frame tensor program optimization as an iterative decision process, but existing datasets provide only end-to-end optimized program pairs using token-inefficient representations, lacking verifiable step-level supervision and interpretability. As a result, LLMs struggle to make reliable single-step decisions in large combinatorial optimization spaces. We introduce Step-TP, a post-training dataset for tensor program optimization that provides grounded, atomic, step-level supervision with structured chain-of-thought (CoT) reasoning. Step-TP forms a closed reasoning loop over intermediate program states, enabling reliable multi-step optimization rather than outcome imitation. Its design is guided by four principles: (i) a token-efficient, verifiable intermediate representation (IR) that deterministically lowers to TVM TIR; (ii) atomic and composable optimization strategies that decompose complex trajectories into interpretable single-step decisions; (iii) structured CoT supervision coupled with explicit IR-to-IR state transitions; and (iv) strategy filtering to balance coverage while preventing shortcut exploitation. The dataset and implementation are available at a GitHub link, https://github.com/LIUMENGFAN-gif/StepTP.",
          "authors": [
            "Mengfan Liu",
            "Da Zheng",
            "Junwei Su",
            "Chuan Wu"
          ],
          "categories": [
            "cs.LG",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25954v1",
          "abstract_url": "https://arxiv.org/abs/2605.25954v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25954v1",
          "published_at": "2026-05-25T15:29:49+00:00",
          "updated_at": "2026-05-25T15:29:49+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.25954",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25954v1"
          },
          "relevance_score": 162,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25954"
        },
        {
          "title": "Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World",
          "summary": "Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility of scalable data infrastructure.",
          "authors": [
            "Yusong Lin",
            "Xinyuan Liang",
            "Haiyang Wang",
            "Qipeng Gu",
            "Siqi Cheng",
            "Jiangui Chen",
            "Shuzhe Wu",
            "Feiyang Pan",
            "Lue Fan",
            "Sanyuan Zhao",
            "Dandan Tu"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.26086v1",
          "abstract_url": "https://arxiv.org/abs/2605.26086v1",
          "pdf_url": "https://arxiv.org/pdf/2605.26086v1",
          "published_at": "2026-05-25T17:50:04+00:00",
          "updated_at": "2026-05-25T17:50:04+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Large Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.26086",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.26086v1"
          },
          "relevance_score": 160,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.26086"
        },
        {
          "title": "TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning",
          "summary": "This paper investigates large language model (LLM) abstention learning, specifically using ternary reward, which incentivize truthfulness in large language models. This paper extends that idea by moving from a ternary reward to a Trajectory-Informed advantage reweighting, dynamically re-weights the abstention reward during Group Relative Policy Optimization (GRPO) training. The objective of this work focuses on abstention learning instead of improving truthfulness, serving as an exploration into hallucination reduction. The novelty of this paper lies in methodological innovation, advantage re-weighting, and benchmark selection. Leveraging GRPO's multiple trajectories as a natural abstention signal, this method uses a reward signal to explore knowledge boundaries and encourage consistency. By demonstrating that trajectories can be used as a confidence indicator of the policy relative to the query, they are then used to dynamically calculate the abstention advantage. AbstentionBench is used as the evaluation benchmark, as this work aims to contribute to the field of abstention learning. All datasets on the benchmark were tested against this method and various baselines. Empirical results demonstrate that TIAR achieves state-of-the-art abstention F1 scores across five of six evaluation categories, outperforming the static ternary baseline on 17 of 31 benchmark datasets while fully preserving baseline accuracy.",
          "authors": [
            "Muyu Pan",
            "Shu Zhao",
            "Nan Zhang",
            "Philip Shin",
            "Varun Parekh",
            "Vijaykrishnan Narayanan",
            "Rui Zhang"
          ],
          "categories": [
            "cs.CL",
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25850v1",
          "abstract_url": "https://arxiv.org/abs/2605.25850v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25850v1",
          "published_at": "2026-05-25T13:42:37+00:00",
          "updated_at": "2026-05-25T13:42:37+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.25850",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25850v1"
          },
          "relevance_score": 156,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25850"
        },
        {
          "title": "Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation",
          "summary": "Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment it with VAE-based identity conditioning. A novel Dual Layer Aggregation (DLA) module is designed to aggregate multi-level MLLM features for optimal conditioning, and a multi-stage denoising strategy is applied to progressively balance the semantic information from MLLM and fine-detail identity from VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation. Our project website is available at https://zsh2000.github.io/squeeze-mllm-subject-gen/.",
          "authors": [
            "Shuhong Zheng",
            "Aashish Kumar Misraa",
            "Yu-Teng Li",
            "Yu-Jhe Li",
            "Igor Gilitschenski"
          ],
          "categories": [
            "cs.CV",
            "cs.AI",
            "cs.GR",
            "cs.LG",
            "cs.MM"
          ],
          "paper_id": "http://arxiv.org/abs/2605.26111v1",
          "abstract_url": "https://arxiv.org/abs/2605.26111v1",
          "pdf_url": "https://arxiv.org/pdf/2605.26111v1",
          "published_at": "2026-05-25T17:59:35+00:00",
          "updated_at": "2026-05-25T17:59:35+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.26111",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.26111v1"
          },
          "relevance_score": 146,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.26111"
        },
        {
          "title": "Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models",
          "summary": "Code review is a critical practice in software engineering, yet the growing scale and frequency of code patches in modern projects, together with the widespread adoption of AI code assistants, make manual review increasingly challenging. Identifying the types of changes within a patch, such as renames, moves, or logic modifications, can substantially improve review efficiency by enabling prioritization, filtering, and automation. However, existing LLM-based approaches to code review have largely focused on summarization and comment generation, leaving structured code reviews underexplored. In this paper, we present a systematic study of using large language models (LLMs) for taxonomy-based labeling of code changes in a code patch. We introduce a two-stage pipeline that assigns labels to diff hunks and then refines them to capture structural relationships and semantic attributes, such as rename propagation and type changes. Our approach employs few-shot prompting to produce language-agnostic and customizable labels, without the engineering overhead of traditional static-analysis pipelines. We evaluate four LLMs across multiple context configurations on a manually curated benchmark of natural and synthetic patches. Our best configuration achieves up to $84\\%$ recall and $81\\%$ precision, with high accuracy in extracting relational and attribute metadata. These results suggest that LLM-based labeling can effectively complement static analysis by enabling flexible, multilingual, and automation-friendly code review workflows.",
          "authors": [
            "Bar Weiss",
            "Antonio Abu-Nassar",
            "Adi Sosnovich",
            "Karen Yorav"
          ],
          "categories": [
            "cs.SE",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.26100v1",
          "abstract_url": "https://arxiv.org/abs/2605.26100v1",
          "pdf_url": "https://arxiv.org/pdf/2605.26100v1",
          "published_at": "2026-05-25T17:56:46+00:00",
          "updated_at": "2026-05-25T17:56:46+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.26100",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.26100v1"
          },
          "relevance_score": 146,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.26100"
        },
        {
          "title": "Merge-Bench: Resolve Merge Conflicts with Large Language Models",
          "summary": "This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hunks from 1439 GitHub repositories. The ground truth is the merge resolution that developers committed to the repository. Our dataset construction methodology is scalable to arbitrary amounts of data since no manual labeling is required. (2) We trained a model, LLMergeJ, to resolve merge conflicts in Java programs. Our approach uses Group Relative Policy Optimization (GRPO), an online reinforcement learning method, to train a Large Language Model (LLM). (3) We performed two evaluations of the performance of LLMs on resolving merge conflicts. On Java programs, LLMergeJ with 14B parameters outperforms 3 commercial LLMs, trailing only Gemini 2.5 Pro. Across 11 programming languages, commercial LLM performance is largely stable from language to language. The best models correctly resolve less than 60% of merge conflicts.",
          "authors": [
            "Benedikt Schesch",
            "Michael D. Ernst"
          ],
          "categories": [
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25890v1",
          "abstract_url": "https://arxiv.org/abs/2605.25890v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25890v1",
          "published_at": "2026-05-25T14:17:48+00:00",
          "updated_at": "2026-05-25T14:17:48+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.25890",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25890v1"
          },
          "relevance_score": 143,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25890"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [
        "《CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents》〔评测 / 数据 / 应用 / 方法〕：Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its extension t…",
        "《How Agentic AI Coding Assistants Become the Attacker's Shell》〔应用 / 方法〕：Agentic AI coding assistants can edit files, run commands, and access the internet on behalf of developers. However, their reliance on unvetted external artifa…",
        "《AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions》〔评测 / 应用 / 方法〕：Autonomous computer use agents that powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workf…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents",
          "summary": "Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires consistent task instruction, executable environment, and verifiable reward. However, hand-curated benchmarks achieve high reward fidelity but cover few applications and LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.",
          "authors": [
            "Bowen Wang",
            "Dunjie Lu",
            "Junli Wang",
            "Tianyi Bai",
            "Shixuan Liu",
            "Zhipeng Zhang",
            "Haiquan Wang",
            "Hao Hu",
            "Tianbao Xie",
            "Shuai Bai",
            "Dayiheng Liu",
            "Que Shen",
            "Junyang Lin",
            "Tao Yu"
          ],
          "categories": [
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25624v1",
          "abstract_url": "https://arxiv.org/abs/2605.25624v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25624v1",
          "published_at": "2026-05-25T09:28:03+00:00",
          "updated_at": "2026-05-25T09:28:03+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.25624",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25624v1"
          },
          "relevance_score": 62,
          "match_reasons": [
            "title matched \"computer-use agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25624"
        },
        {
          "title": "How Agentic AI Coding Assistants Become the Attacker's Shell",
          "summary": "Agentic AI coding assistants can edit files, run commands, and access the internet on behalf of developers. However, their reliance on unvetted external artifacts introduces a new attack vector. Hidden instructions in external artifacts can hijack these assistants, turning them into an attacker's shell to run unauthorized commands. In this article, we examine how these prompt injection attacks work, measure their prevalence, discuss the limitations and challenges of current defenses, and suggest future research directions.",
          "authors": [
            "Yue Liu",
            "Yanjie Zhao",
            "Yunbo Lyu",
            "Ting Zhang",
            "Haoyu Wang",
            "David Lo"
          ],
          "categories": [
            "cs.SE",
            "cs.CR"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25871v1",
          "abstract_url": "https://arxiv.org/abs/2605.25871v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25871v1",
          "published_at": "2026-05-25T13:59:48+00:00",
          "updated_at": "2026-05-25T13:59:48+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "Agent",
            "Prompt Injection"
          ],
          "doi": null,
          "arxiv_id": "2605.25871",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25871v1"
          },
          "relevance_score": 44,
          "match_reasons": [
            "summary matched \"prompt injection\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25871"
        },
        {
          "title": "AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions",
          "summary": "Autonomous computer use agents that powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control. We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where the uncertainties in dynamic environment disrupt the execution flow without direct adversarial intent. Specifically, AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate a variety of desktop tasks that utilize MLLM-based agents and discover that even minor instances of corruption can result in substantial performance degradation, which emphasizes the fragility of agents and underscores the necessity of robustness evaluation. Afterward, we propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness. Our code, environment, baseline models and data are publicly available at: https://AgentHijack.github.io.",
          "authors": [
            "Jingwei Sun",
            "Jianing Zhu",
            "Yuanyi Li",
            "Tongliang Liu",
            "Xia HU",
            "Bo Han"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2605.25707v1",
          "abstract_url": "https://arxiv.org/abs/2605.25707v1",
          "pdf_url": "https://arxiv.org/pdf/2605.25707v1",
          "published_at": "2026-05-25T11:09:22+00:00",
          "updated_at": "2026-05-25T11:09:22+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.25707",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.25707v1"
          },
          "relevance_score": 41,
          "match_reasons": [
            "summary matched \"computer-use agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.25707"
        }
      ]
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [],
      "sort_by": "hybrid",
      "papers": []
    }
  ]
}