{
  "generated_at": "2026-05-29T13:18:32.598451+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      },
      {
        "name": "Agent Runtime Security",
        "sort_by": "hybrid"
      },
      {
        "name": "Terminal and SWE Agents",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "Topic \"LLM\": 17 papers matched, spanning LM, Agent Runtime Security, and other feeds; representative papers include \"FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations\" and \"Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning\".",
    "Topic \"Benchmark\": 14 papers matched, spanning LM and Agent Runtime Security; representative papers include \"FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations\" and \"CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models\".",
    "Topic \"Language Model\": 3 papers matched, spanning LM and Agent Runtime Security; representative papers include \"CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models\" and \"Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach\".",
    "Topic \"Agent\": 2 papers matched, spanning Agent Runtime Security and Terminal and SWE Agents; representative papers include \"AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security\" and \"Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software\".",
    "Topic \"Evaluation\": 2 papers matched, spanning Agent Runtime Security and Terminal and SWE Agents; representative papers include \"Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures\" and \"Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas\"."
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 17,
      "feed_names": [
        "LM",
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations",
        "Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning",
        "Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach",
        "KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs",
        "Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities",
        "ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains",
        "Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment",
        "HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment",
        "LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning",
        "ChildEval: When large language models meet children's personalities",
        "DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints",
        "LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks",
        "ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering",
        "DEPART: DEcomposing PARiTy across Multilingual LLMs",
        "Robust and Efficient Guardrails with Latent Reasoning",
        "Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures",
        "Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas"
      ],
      "key_points": [
        "《FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations》〔评测 / 方法〕：Recently, large language models (LLMs) have achieved superior performance in static financial reasoning and simple dynamic trading tasks. However, existing sta…",
        "《Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning》〔评测 / 应用 / 方法〕：Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as wait, hmm, and a…"
      ]
    },
    {
      "name": "Benchmark",
      "paper_count": 14,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations",
        "CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models",
        "Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning",
        "KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs",
        "Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities",
        "ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains",
        "Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment",
        "HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment",
        "ChildEval: When large language models meet children's personalities",
        "DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints",
        "LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks",
        "ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering",
        "DEPART: DEcomposing PARiTy across Multilingual LLMs",
        "Robust and Efficient Guardrails with Latent Reasoning"
      ],
      "key_points": [
        "《FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations》〔评测 / 方法〕：Recently, large language models (LLMs) have achieved superior performance in static financial reasoning and simple dynamic trading tasks. However, existing sta…",
        "《CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models》〔评测 / 方法〕：Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typi…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 3,
      "feed_names": [
        "LM",
        "Agent Runtime Security"
      ],
      "paper_titles": [
        "CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models",
        "Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach",
        "Provably Secure Agent Guardrail"
      ],
      "key_points": [
        "《CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models》〔评测 / 方法〕：Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typi…",
        "《Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach》〔评测 / 数据 / 方法〕：Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 2,
      "feed_names": [
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security",
        "Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software"
      ],
      "key_points": [
        "《AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security》〔数据 / 应用 / 方法〕：Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, adv…",
        "《Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software》〔应用 / 方法〕：Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet an…"
      ]
    },
    {
      "name": "Evaluation",
      "paper_count": 2,
      "feed_names": [
        "Agent Runtime Security",
        "Terminal and SWE Agents"
      ],
      "paper_titles": [
        "Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures",
        "Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas"
      ],
      "key_points": [
        "《Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures》〔评测 / 方法〕：Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it un…",
        "《Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas》〔评测 / 应用 / 方法〕：We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for mu…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "《FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations》〔评测 / 方法〕：Recently, large language models (LLMs) have achieved superior performance in static financial reasoning and simple dynamic trading tasks. However, existing sta…",
        "《CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models》〔评测 / 方法〕：Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typi…",
        "《Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning》〔评测 / 应用 / 方法〕：Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as wait, hmm, and a…",
        "《Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach》〔评测 / 数据 / 方法〕：Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments…",
        "《KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs》〔评测 / 方法〕：Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evalu…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations",
          "summary": "Recently, large language models (LLMs) have achieved superior performance in static financial reasoning and simple dynamic trading tasks. However, existing static financial benchmarks are insufficient to assess the dynamic wealth management and financial decision-making capabilities of LLMs in real-world environments. To bridge this gap, we present FinBoardBench, an evaluation suite based on three classic financial board games: Cashflow, Acquire, and Monopoly. FinBoardBench assesses a comprehensive set of financial skills, including personal cash flow management with debt balancing, corporate investment and acquisition forecasting, and competitive trade negotiations with asset auctions. Our experiments with 9 advanced LLMs reveal that while exhibiting basic long-term planning and investment logic, they fail to effectively leverage complex interactions for profit, and their strong static reasoning performance does not transform into successful dynamic decision-making. Notably, they tend to prioritize immediate asset acquisition over maintaining sufficient liquidity, making them vulnerable to financial crises triggered by random events. We hope that FinBoardBench can provide a valuable reference for more intelligent LLM-based decision-making systems in the future.",
          "authors": [
            "Xuesi Hu",
            "Peng Wang",
            "Jinpeng Miao",
            "Xilin Tao",
            "Caiwei Li",
            "Yue Ma",
            "Jie He",
            "Qiancheng Zhang",
            "Yuntao Zou",
            "Dagang Li"
          ],
          "categories": [
            "cs.CL",
            "cs.CE"
          ],
          "paper_id": "https://arxiv.org/abs/2605.27896",
          "abstract_url": "https://arxiv.org/abs/2605.27896",
          "pdf_url": "https://arxiv.org/pdf/2605.27896",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.27896",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.27896"
          },
          "relevance_score": 232,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.27896"
        },
        {
          "title": "CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models",
          "summary": "Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example complexity. In this work, we propose CIRF (Chain-of-thoughts Into Reusable Functional units), an implicit CoT framework that performs reasoning as a dynamic sequence of discrete functional tokens. CIRF assigns a functional token to each semantically coherent reasoning unit in explicit CoT traces. The model is then fine-tuned to autoregressively generate functional tokens and their optional results, followed by the final answer. This design aligns latent reasoning with a sequence of functional units, facilitating parallel training, explicit rationale alignment, and adaptive reasoning. Extensive experiments on mathematical, symbolic, and commonsense reasoning benchmarks show that CIRF provides a favorable accuracy-latency trade-off compared with state-of-the-art implicit CoT methods. Further analyses demonstrate that CIRF constructs distinct, interpretable functional tokens, leading to consistent performance improvements.",
          "authors": [
            "Yukyung Lee",
            "Yumeng Shen",
            "Jinhyeong Park",
            "Hyein Yang",
            "Jun-Hyung Park"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.28292",
          "abstract_url": "https://arxiv.org/abs/2605.28292",
          "pdf_url": "https://arxiv.org/pdf/2605.28292",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "Benchmark",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.28292",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.28292"
          },
          "relevance_score": 196,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.28292"
        },
        {
          "title": "Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning",
          "summary": "Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as wait, hmm, and alternatively. Although these markers are commonly used as visible indicators of reflection, their mechanisms remain unclear, which leaves the risk of overthinking associated with redundant and repetitive reflection markers. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and role in the reflection. We suppress these markers through prompt-level and token-level interventions, and analyze their effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and motivate future research on reasoning mechanisms beyond explicit marker patterns.",
          "authors": [
            "Yahan Yu",
            "Noa Nakanishi",
            "Fei Cheng"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.28305",
          "abstract_url": "https://arxiv.org/abs/2605.28305",
          "pdf_url": "https://arxiv.org/pdf/2605.28305",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.28305",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.28305"
          },
          "relevance_score": 196,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"reasoning\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.28305"
        },
        {
          "title": "Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach",
          "summary": "Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs can effectively perform this task. We tested 12 open-weight LLMs of different sizes and families under zero-shot, few-shot, and chain-of-thought to approximate expert pairwise comparisons of argument quality across three dimensions (logical, rhetorical, and dialectic) and used these comparisons in a Bradley-Terry model to infer latent strength scores and derive a ranking of arguments. Our insights show that LLMs have promising but moderate correlation with human expert judgments, with Llama-70B obtaining the strongest alignment, reaching moderate Cohen's κ = 0.493 and moderate correlations with Bradley-Terry scores derived from these annotations (Kendall, Pearson, and Spearman: 0.327-0.477). Other LLMs exhibit weak, moderate, or high alignment with Llama-70B while achieving comparable results against human experts, suggesting partial but complementary understanding of underlying quality dimensions despite differences in model size and family. Moreover, LLM predictions are stable across trial runs, with fewer than 7.75% of cases yielding different labels. Remaining variability is handled via majority voting and few-shot prompting for large-size models.",
          "authors": [
            "Nicolás Benjamín Ocampo",
            "Agnes Paullate Nyiranziza",
            "Davide Ceolin"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.28313",
          "abstract_url": "https://arxiv.org/abs/2605.28313",
          "pdf_url": "https://arxiv.org/pdf/2605.28313",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.28313",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.28313"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"alignment\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.28313"
        },
        {
          "title": "KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs",
          "summary": "Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.",
          "authors": [
            "Haechan Kim",
            "Seungjun Chung",
            "Inkyu Park",
            "Jihoo Lee",
            "Jonghyun Lee"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.27984",
          "abstract_url": "https://arxiv.org/abs/2605.27984",
          "pdf_url": "https://arxiv.org/pdf/2605.27984",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.27984",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.27984"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.27984"
        },
        {
          "title": "Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities",
          "summary": "Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the \"thick descriptions\" (Geertz, 1973) of human communities remains a critical challenge. Current evaluations often reduce social identity to static labels, sidelining how real-world groups navigate social shifts. To bridge this gap, we introduce CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news. By characterizing a fine-grained spectrum of illocutionary tones and the underlying attitudes they manifest--validated through human-AI collaboration--our diagnosis reveals a persistent \"realism gap\": steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.",
          "authors": [
            "Nuan Wen",
            "Xuezhe Ma"
          ],
          "categories": [
            "cs.CL",
            "cs.AI",
            "cs.SI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.27388",
          "abstract_url": "https://arxiv.org/abs/2605.27388",
          "pdf_url": "https://arxiv.org/pdf/2605.27388",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.27388",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.27388"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"alignment\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.27388"
        },
        {
          "title": "ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains",
          "summary": "On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.",
          "authors": [
            "Ziqi Zhao",
            "Xinyu Ma",
            "Liu Yang",
            "Yujie Feng",
            "Daiting Shi",
            "Jingzhou He",
            "Xin Xin",
            "Zhaochun Ren",
            "Xiao-Ming Wu"
          ],
          "categories": [
            "cs.CL",
            "cs.LG"
          ],
          "paper_id": "https://arxiv.org/abs/2605.28014",
          "abstract_url": "https://arxiv.org/abs/2605.28014",
          "pdf_url": "https://arxiv.org/pdf/2605.28014",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.28014",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.28014"
          },
          "relevance_score": 192,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"reasoning\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.28014"
        },
        {
          "title": "Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment",
          "summary": "Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but differently framed inputs can significantly destabilize LLM decisions. To systematically investigate this problem, we introduce Fragile, a large-scale benchmark that isolates fact-preserving semantic framing across three controlled dimensions: value-tinted narration, temporal slice, and narrative vividness. Our experiments reveal a high susceptibility of LLMs to framing, with an average decision flip rate of 28.6%. We find that simple prior prompt-level and activation-level interventions not only fail to suppress framing sensitivity but actively amplify it. We therefore propose Valign, a representation-level method that explicitly targets these framing dimensions by anchoring decisions to a stable value prior, steering hidden states toward the model's value-consistent direction, and projecting out temporal-vividness-sensitive directions from the model's hidden states. Valign consistently reduces framing-induced decision flips, demonstrating that robust mitigation requires directly targeting the internal pathways in which framing operates.",
          "authors": [
            "Seojin Hwang",
            "Minju Kim",
            "Junhyuk Choi",
            "JeongHyun Park",
            "Hwanhee Lee"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.28188",
          "abstract_url": "https://arxiv.org/abs/2605.28188",
          "pdf_url": "https://arxiv.org/pdf/2605.28188",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.28188",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.28188"
          },
          "relevance_score": 188,
          "match_reasons": [
            "title matched \"alignment\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"RAG\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.28188"
        },
        {
          "title": "HELEA: Hard-Negative Benchmark and LLM-based Reranking for Robust Entity Alignment",
          "summary": "Entity Alignment (EA) is essential for knowledge graph (KG) fusion, but existing benchmarks often allow models to exploit name overlap rather than relational structure. This makes it difficult to evaluate whether models can reject same-name entities that refer to different real-world objects. Our primary contribution is a same-name hard-negative augmentation strategy that simultaneously yields quality-controlled evaluation benchmarks (DW-HN29K, DY-HN27K) and augmented training corpora (DW-Train, DY-Train), by mining same-name but distinct entity pairs from KG name-collision groups. We further introduce HELEA, a two-stage framework integrating (i) entity encoder retrieval trained on hard-negative-augmented training corpora with 1-hop KG context, and (ii) LLM-based reranking without additional training. Experiments show that name-dependent baselines collapse to near-random performance on our hard-negative benchmarks, while HELEA achieves F1 0.967 on DW-HN29K while maintaining Hit@1 0.993 on standard DW-15K.",
          "authors": [
            "Yoonjin Jang",
            "Junwoo Kim",
            "Youngjoong Ko"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.28308",
          "abstract_url": "https://arxiv.org/abs/2605.28308",
          "pdf_url": "https://arxiv.org/pdf/2605.28308",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.28308",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.28308"
          },
          "relevance_score": 178,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"alignment\"",
            "title matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.28308"
        },
        {
          "title": "LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning",
          "summary": "Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that hierarchically organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves the state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.",
          "authors": [
            "Zerui Chen",
            "Qinggang Zhang",
            "Zhishang Xiang",
            "Zhimin Wei",
            "Linfeng Gao",
            "Xiao Huang",
            "Zhihong Zhang",
            "Jinsong Su"
          ],
          "categories": [
            "cs.CL",
            "cs.AI",
            "cs.MA"
          ],
          "paper_id": "https://arxiv.org/abs/2605.28120",
          "abstract_url": "https://arxiv.org/abs/2605.28120",
          "pdf_url": "https://arxiv.org/pdf/2605.28120",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Reasoning"
          ],
          "doi": null,
          "arxiv_id": "2605.28120",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.28120"
          },
          "relevance_score": 178,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "title matched \"RAG\"",
            "summary matched \"LLM\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.28120"
        },
        {
          "title": "ChildEval: When large language models meet children's personalities",
          "summary": "While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information. Each persona is associated with a child preference-which may align with, conflict with, or be independent of the persona-expressed either explicitly in a single sentence or implicitly through 6-10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top-level and fourteen sub-level categories covering children's daily lives and development. We further propose fine-grained, child-centric evaluation protocols to systematically assess open-source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child-centered performance. Our code and dataset are available at https://github.com/ziyanluo/ChildEval.",
          "authors": [
            "Yanyan Luo",
            "Xue Han",
            "Chunxu Zhao",
            "Ruiqiao Bai",
            "Yaxing Zhang",
            "Qian Hu",
            "Lijun Mei",
            "Junlan Feng"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.27805",
          "abstract_url": "https://arxiv.org/abs/2605.27805",
          "pdf_url": "https://arxiv.org/pdf/2605.27805",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.27805",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.27805"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.27805"
        },
        {
          "title": "DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints",
          "summary": "Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct disaster-response tools. To enable step-level failure attribution, we further propose First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution-grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: https://github.com/TamuChen18/DisasterBench_Open",
          "authors": [
            "Zhitong Chen",
            "Kai Yin",
            "Weifeng Zhang",
            "Zhiyuan Wang",
            "Xiangjue Dong",
            "Chengkai Liu",
            "Zhewei Liu",
            "Yiming Xiao",
            "Ali Mostafavi",
            "James Caverlee"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.27957",
          "abstract_url": "https://arxiv.org/abs/2605.27957",
          "pdf_url": "https://arxiv.org/pdf/2605.27957",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.27957",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.27957"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"benchmark\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.27957"
        },
        {
          "title": "LCO: LLM-based Constraint Optimization for Safer Agentic LLMs in Real-world Tasks",
          "summary": "Large Language Models (LLMs) are increasingly acting as autonomous agents, but their continuous interaction with the environment can lead to in-context reward hacking (ICRH), a phenomenon where LLMs iteratively optimize their behavior to maximize proxy objectives, inadvertently producing harmful side effects. Existing defense methods are insufficient to address this risk, as ICRH arises not from adversarial inputs but from the model's own over-optimization. To mitigate this issue, we propose \\textbf{LLM-based Constraint Optimization (LCO)}, a framework that effectively reduces ICRH without model fine-tuning. LCO consists of two modules: \\textit{self-thought module}, which guides the LLM to proactively deliberate and integrate potential safety constraints before execution; and \\textit{evolutionary sampling module}, which employs LLM-based crossover and mutation to constrain the model's actions within a safe solution space while maintaining task performance. Experimental results demonstrate that LCO substantially alleviates ICRH in both output-refine and policy-refine scenarios. In particular, on the tweet engagement optimization task, LCO achieves a 39% reduction in the Toxicity Growth Rate (TGR) on GPT-4, while on the policy optimization benchmark, it reduces the ICRH Occurrence Rate by 15.23%, demonstrating safety improvement without sacrificing task performance.",
          "authors": [
            "Jiayong Wan",
            "Jiawei Chen",
            "Zhaoxia Yin",
            "Liu Shuyuan",
            "Hang Su"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.27375",
          "abstract_url": "https://arxiv.org/abs/2605.27375",
          "pdf_url": "https://arxiv.org/pdf/2605.27375",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.27375",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.27375"
          },
          "relevance_score": 174,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.27375"
        },
        {
          "title": "ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering",
          "summary": "Retrieval-augmented generation (RAG) has emerged as a promising paradigm for enhancing large language models (LLMs) on multi-hop question answering (QA), which requires reasoning over evidence from multiple documents. Current multi-hop RAG methods generally focus on either query-side task decomposition or corpus-side knowledge graph construction. Despite their progress, these methods still struggle to achieve satisfactory performance on complex multi-hop QA tasks. To this end, we propose ConRAG, a consensus-driven multi-view RAG framework that effectively boosts LLMs on complex multi-hop QA. The core of ConRAG is to systematically optimize both the query and corpus sides and to leverage multi-view evidence (relation, entity, and text signals) for more accurate retrieval. Extensive experiments on three multi-hop QA benchmarks show that ConRAG consistently outperforms all baselines by a clear margin, e.g., up to +26.9% average performance gains over vanilla RAG, and enables Gemma-4-31B to achieve a new state-of-the-art record on the challenging MuSiQue benchmark.",
          "authors": [
            "Yikai Zhu",
            "Kunfeng Chen",
            "Qihuang Zhong",
            "Juhua Liu",
            "Bo Du"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "https://arxiv.org/abs/2605.28093",
          "abstract_url": "https://arxiv.org/abs/2605.28093",
          "pdf_url": "https://arxiv.org/pdf/2605.28093",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.28093",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.28093"
          },
          "relevance_score": 170,
          "match_reasons": [
            "title matched \"RAG\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.28093"
        },
        {
          "title": "DEPART: DEcomposing PARiTy across Multilingual LLMs",
          "summary": "Multilingual Large Language Models (mLLMs) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal--Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\\text{ling}} = 79\\%$ of this variance on understanding tasks and $92\\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\\times$benchmark$\\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\\%$ of variance), whereas the benchmark$\\times$model interaction dominates reasoning ($46.3\\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.",
          "authors": [
            "Manan Uppadhyay",
            "Prashant Kodali",
            "Pranjal Chitale",
            "Reshma Ramaprasad",
            "Himanshu Beniwal",
            "Sunayana Sitaram"
          ],
          "categories": [
            "cs.CL",
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.28163",
          "abstract_url": "https://arxiv.org/abs/2605.28163",
          "pdf_url": "https://arxiv.org/pdf/2605.28163",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "方法"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.28163",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.28163"
          },
          "relevance_score": 170,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.28163"
        }
      ]
    },
    {
      "name": "Agent Runtime Security",
      "key_points": [
        "《Provably Secure Agent Guardrail》〔评测 / 应用 / 方法〕：As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a funda…",
        "《Robust and Efficient Guardrails with Latent Reasoning》〔评测 / 应用 / 方法〕：Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typi…",
        "《AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security》〔数据 / 应用 / 方法〕：Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, adv…",
        "《Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures》〔评测 / 方法〕：Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it un…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Provably Secure Agent Guardrail",
          "summary": "As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in artificial intelligence security. Existing defense architectures heavily rely on empirical semantic guardrails and probabilistic large model adjudicators, mechanisms that fail to provide deterministic security lower bounds when facing complex semantic symbol decoupling attacks. To overcome this empirical semantic guardrail dilemma, this paper proposes a new security paradigm for agents based on the fundamental limitations of logical reasoning. Based on this paradigm, we further introduce an executable Proof-Constrained Action (ePCA) framework with a neural symbolic isolation architecture. This framework abandons semantic trust in natural language, forcing agents to losslessly formalize their intentions into first-order logical mathematical constraints before performing physical operations. Empirical evaluations of macroscopic and microscopic two-dimensional dynamic adversarial systems demonstrate that our formal verification mechanism achieves zero attack success rate and zero false positive rate across the evaluated scenarios, with extremely low computational latency. This research provides a conditional formal foundation under explicit system assumptions and an engineering paradigm for constructing the underlying defense foundation for future intelligent systems.",
          "authors": [
            "Benlong Wu",
            "Weiming Zhang",
            "Kejiang Chen",
            "Han Fang",
            "Nenghai Yu"
          ],
          "categories": [
            "cs.AI",
            "cs.CR"
          ],
          "paper_id": "https://arxiv.org/abs/2605.29251",
          "abstract_url": "https://arxiv.org/abs/2605.29251",
          "pdf_url": "https://arxiv.org/pdf/2605.29251",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Language Model",
            "Large Language Model"
          ],
          "doi": null,
          "arxiv_id": "2605.29251",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.29251"
          },
          "relevance_score": 120,
          "match_reasons": [
            "title matched \"secure agent\"",
            "title matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.29251"
        },
        {
          "title": "Robust and Efficient Guardrails with Latent Reasoning",
          "summary": "Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for highthroughput deployment. To address this challenge, we propose COLAGUARD, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, COLAGUARD improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macroF1 while delivering a 12.9X speedup and 22.4X reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.",
          "authors": [
            "Siddharth Sai",
            "Xiaofei Wen",
            "Muhao Chen"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.CR",
            "cs.LG"
          ],
          "paper_id": "https://arxiv.org/abs/2605.29068",
          "abstract_url": "https://arxiv.org/abs/2605.29068",
          "pdf_url": "https://arxiv.org/pdf/2605.29068",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2605.29068",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.29068"
          },
          "relevance_score": 80,
          "match_reasons": [
            "title matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.29068"
        },
        {
          "title": "AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security",
          "summary": "Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.",
          "authors": [
            "Dongrui Liu",
            "Yu Li",
            "Zhonghao Yang",
            "Peng Wang",
            "Guanxu Chen",
            "Yuejin Xie",
            "Qinghua Mao",
            "Wanying Qu",
            "Yanxu Zhu",
            "Tianyi Zhou",
            "Leitao Yuan",
            "Zhijie Zheng",
            "Qihao Lin",
            "Yimin Wang",
            "Haoyu Luo",
            "Shuai Shao",
            "Chen Qian",
            "Qingyu Liu",
            "Ling Tang",
            "Ruiyang Qin",
            "Qihan Ren",
            "Junxiao Yang",
            "Kun Wang",
            "Zhiheng Xi",
            "Linfeng Zhang",
            "Ranjie Duan",
            "Bo Zhang",
            "Wenjie Wang",
            "Wen Shen",
            "Qiaosheng Zhang",
            "Yan Teng",
            "Chaochao Lu",
            "Rui Mei",
            "Man Li",
            "Jialing Tao",
            "Xi Lin",
            "Tianhang Zheng",
            "Yong Liu",
            "Quanshi Zhang",
            "Lei Zhu",
            "Xingjun Ma",
            "Junhua Liu",
            "Hui Xue",
            "Xiaoxiang Zuo",
            "Xiangnan He",
            "Chao Shen",
            "Xianglong Liu",
            "Minlie Huang",
            "Jing Shao",
            "Xia Hu"
          ],
          "categories": [
            "cs.AI",
            "cs.CL",
            "cs.CR",
            "cs.CV",
            "cs.LG"
          ],
          "paper_id": "https://arxiv.org/abs/2605.29801",
          "abstract_url": "https://arxiv.org/abs/2605.29801",
          "pdf_url": "https://arxiv.org/pdf/2605.29801",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Data",
            "Application",
            "Method"
          ],
          "topics": [
            "Agent",
            "Alignment"
          ],
          "doi": null,
          "arxiv_id": "2605.29801",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.29801"
          },
          "relevance_score": 58,
          "match_reasons": [
            "summary matched \"guardrail\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.29801"
        },
        {
          "title": "Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures",
          "summary": "Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it unfolded. Two attacks that produce equally harmful outputs may have followed completely different paths, and ASR cannot tell them apart. We make those hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that watches a compliance-refusal margin during decoding and places each model-attack condition on a calibrated 2D plane. By design, this plane is most informative exactly where ASR is least informative: among attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR land at clearly different points on the plane: the same model can fail through different temporal patterns. The geometry matches refusal-direction probes from hidden states on most conditions, with one model showing the limit of our fixed-lexicon approach. A simple early-stop rule derived from TLO cuts successful jailbreaks by more than half, without false alarms on plain benign queries. Safety evaluation should report when and how a failure unfolds, not only whether it occurred. TLO makes the first two observable from logits alone.",
          "authors": [
            "Junyoung Park",
            "Sunghwan Park",
            "Seongyong Ju",
            "Jaewoo Lee"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "https://arxiv.org/abs/2605.29629",
          "abstract_url": "https://arxiv.org/abs/2605.29629",
          "pdf_url": "https://arxiv.org/pdf/2605.29629",
          "published_at": "2026-05-29T04:00:00+00:00",
          "updated_at": "2026-05-29T04:00:00+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2605.29629",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.29629"
          },
          "relevance_score": 58,
          "match_reasons": [
            "summary matched \"jailbreak\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.29629"
        }
      ]
    },
    {
      "name": "Terminal and SWE Agents",
      "key_points": [
        "\"Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software\" [Application / Method]: Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet an…",
        "\"Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas\" [Evaluation / Application / Method]: We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for mu…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software",
          "summary": "Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more by the physicist's domain knowledge. The three it could not -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here, not obviously addressed by scaling alone. [Abridged.]",
          "authors": [
            "Nhat-Minh Nguyen"
          ],
          "categories": [
            "cs.AI",
            "astro-ph.CO",
            "cs.HC",
            "cs.SE"
          ],
          "paper_id": "http://arxiv.org/abs/2605.30353v1",
          "abstract_url": "https://arxiv.org/abs/2605.30353v1",
          "pdf_url": "https://arxiv.org/pdf/2605.30353v1",
          "published_at": "2026-05-28T17:59:59+00:00",
          "updated_at": "2026-05-28T17:59:59+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Application",
            "Method"
          ],
          "topics": [
            "Agent",
            "Coding Agent"
          ],
          "doi": null,
          "arxiv_id": "2605.30353",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.30353v1"
          },
          "relevance_score": 48,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.30353"
        },
        {
          "title": "Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas",
          "summary": "We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\\mathcal{R}$ (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy-synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand-designed baselines, sharply tightens run-to-run variance, and outperforms prompt-only optimization. The discovered pipelines are objective-dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective-agnostic system prompt and from every efficiency-optimized pipeline. This supports an information-design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch-social-dilemmas.",
          "authors": [
            "Víctor Gallego"
          ],
          "categories": [
            "cs.MA",
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2605.30003v1",
          "abstract_url": "https://arxiv.org/abs/2605.30003v1",
          "pdf_url": "https://arxiv.org/pdf/2605.30003v1",
          "published_at": "2026-05-28T14:33:10+00:00",
          "updated_at": "2026-05-28T14:33:10+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Evaluation"
          ],
          "doi": null,
          "arxiv_id": "2605.30003",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2605.30003v1"
          },
          "relevance_score": 45,
          "match_reasons": [
            "summary matched \"coding agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2605.30003"
        }
      ]
    }
  ]
}