{
  "generated_at": "2026-05-01T12:53:56.472428+08:00",
  "timezone": "Asia/Shanghai",
  "lookback_hours": 24,
  "sorting": {
    "default_sort_by": "hybrid",
    "summary": "hybrid (relevance first, published_at tie-break)",
    "weights": {
      "title_match_weight": 40,
      "summary_match_weight": 18,
      "doi_weight": 12,
      "pdf_weight": 8,
      "rich_summary_weight": 6,
      "metadata_weight": 4,
      "multi_source_weight": 10,
      "freshness_weight_cap": 24
    },
    "feeds": [
      {
        "name": "LM",
        "sort_by": "hybrid"
      }
    ]
  },
  "highlights": [
    "Topic \"LLM\": 13 papers matched, covering feed LM; representative papers include \"Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents\" and \"Rethinking Agentic Reinforcement Learning In Large Language Models\".",
    "Topic \"Language Model\": 13 papers matched, covering feed LM; representative papers include \"Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents\" and \"What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design\".",
    "Topic \"Agent\": 2 papers matched, covering feed LM; representative papers include \"Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows\" and \"Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems\".",
    "Topic \"Large Language Model\": 1 paper matched, covering feed LM; representative papers include \"What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design\".",
    "Topic \"Benchmark\": 1 paper matched, covering feed LM; representative papers include \"Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems\"."
  ],
  "focus_items": [],
  "action_items": [],
  "topic_sections": [
    {
      "name": "LLM",
      "paper_count": 13,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents",
        "Rethinking Agentic Reinforcement Learning In Large Language Models",
        "TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering",
        "LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning",
        "Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning",
        "Exploring Interaction Paradigms for LLM Agents in Scientific Visualization",
        "Exploration Hacking: Can LLMs Learn to Resist RL Training?",
        "SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images",
        "Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows",
        "Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception",
        "Design Structure Matrix Modularization with Large Language Models",
        "DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models",
        "Modeling Clinical Concern Trajectories in Language Model Agents"
      ],
      "key_points": [
        "\"Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents\" [Evaluation / Application / Method]: We present Collaborative Agent Reasoning Engineering (CARE), a disciplined methodology for engineering Large Language Model (LLM) agents in scientific domains.…",
        "\"Rethinking Agentic Reinforcement Learning In Large Language Models\" [Method]: Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environmen…"
      ]
    },
    {
      "name": "Language Model",
      "paper_count": 13,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents",
        "What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design",
        "Rethinking Agentic Reinforcement Learning In Large Language Models",
        "TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering",
        "LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning",
        "Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning",
        "Exploring Interaction Paradigms for LLM Agents in Scientific Visualization",
        "Exploration Hacking: Can LLMs Learn to Resist RL Training?",
        "SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images",
        "Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception",
        "Design Structure Matrix Modularization with Large Language Models",
        "DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models",
        "Modeling Clinical Concern Trajectories in Language Model Agents"
      ],
      "key_points": [
        "\"Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents\" [Evaluation / Application / Method]: We present Collaborative Agent Reasoning Engineering (CARE), a disciplined methodology for engineering Large Language Model (LLM) agents in scientific domains.…",
        "\"What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design\" [Evaluation / Application / Method]: Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market…"
      ]
    },
    {
      "name": "Agent",
      "paper_count": 2,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows",
        "Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems"
      ],
      "key_points": [
        "\"Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows\" [Evaluation / Application / Method]: LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a…",
        "\"Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems\" [Evaluation / Application / Method]: Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologie…"
      ]
    },
    {
      "name": "Large Language Model",
      "paper_count": 1,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design"
      ],
      "key_points": [
        "\"What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design\" [Evaluation / Application / Method]: Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market…"
      ]
    },
    {
      "name": "Benchmark",
      "paper_count": 1,
      "feed_names": [
        "LM"
      ],
      "paper_titles": [
        "Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems"
      ],
      "key_points": [
        "\"Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems\" [Evaluation / Application / Method]: Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologie…"
      ]
    }
  ],
  "template": "zh_daily_brief",
  "feeds": [
    {
      "name": "LM",
      "key_points": [
        "\"Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents\" [Evaluation / Application / Method]: We present Collaborative Agent Reasoning Engineering (CARE), a disciplined methodology for engineering Large Language Model (LLM) agents in scientific domains.…",
        "\"What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design\" [Evaluation / Application / Method]: Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market…",
        "\"Rethinking Agentic Reinforcement Learning In Large Language Models\" [Method]: Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environmen…",
        "\"TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering\" [Evaluation / Application / Method]: Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. Howeve…",
        "\"LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning\" [Evaluation / Method]: Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistenci…"
      ],
      "sort_by": "hybrid",
      "papers": [
        {
          "title": "Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents",
          "summary": "We present Collaborative Agent Reasoning Engineering (CARE), a disciplined methodology for engineering Large Language Model (LLM) agents in scientific domains. Unlike ad-hoc trial-and-error approaches, CARE specifies behavior, grounding, tool orchestration, and verification through reusable artifacts and systematic, stage-gated phases. The methodology employs a three-party workflow involving Subject-Matter Experts (SMEs), developers, and LLM-based helper agents. These helper agents function as facilitation infrastructure, transforming informal domain intent into structured, reviewable specifications for human approval at defined gates. CARE addresses the \"jagged technological frontier\", characterized by uneven LLM performance, by bridging the gap between novice and expert analysts regarding domain constraints and verification practices. By generating concrete artifacts, including interaction requirements, reasoning policies, and evaluation criteria, CARE ensures agent behavior is specifiable, testable, and maintainable. Evaluation results from a scientific use case demonstrate that this stage-gated, artifact-driven methodology yields measurable improvements in development efficiency and complex-query performance.",
          "authors": [
            "Rahul Ramachandran",
            "Nidhi Jha",
            "Muthukumaran Ramasubramanian"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.28043v1",
          "abstract_url": "https://arxiv.org/abs/2604.28043v1",
          "pdf_url": "https://arxiv.org/pdf/2604.28043v1",
          "published_at": "2026-04-30T15:54:47+00:00",
          "updated_at": "2026-04-30T15:54:47+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": "10.64631/taxq7736",
          "arxiv_id": "2604.28043",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.28043v1",
            "doi": "https://doi.org/10.64631/taxq7736"
          },
          "relevance_score": 193,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"evaluation\"",
            "has DOI",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "doi:10.64631/taxq7736"
        },
        {
          "title": "What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design",
          "summary": "Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts. They shouldn't. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a large class of common failure modes -- AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong things, and reward-hackable environments -- are predictable consequences of treating task authoring as prompt authoring. We catalog these failure modes, argue that real difficulty is conceptual rather than environmental, and discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable. We hope this serves as a useful reference for benchmark maintainers, task contributors, and researchers using benchmark scores as evidence.",
          "authors": [
            "Ivan Bercovich"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.28093v1",
          "abstract_url": "https://arxiv.org/abs/2604.28093v1",
          "pdf_url": "https://arxiv.org/pdf/2604.28093v1",
          "published_at": "2026-04-30T16:37:37+00:00",
          "updated_at": "2026-04-30T16:37:37+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "Language Model",
            "Large Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.28093",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.28093v1"
          },
          "relevance_score": 185,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "title matched \"evaluation\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.28093"
        },
        {
          "title": "Rethinking Agentic Reinforcement Learning In Large Language Models",
          "summary": "Reinforcement Learning (RL) has traditionally focused on training specialized agents to optimize predefined reward functions within narrowly defined environments. However, the advent of powerful Large Language Models (LLMs) and increasingly complex, open-ended tasks has catalyzed a paradigm shift towards agentic paradigms within RL. This emerging framework extends beyond traditional RL by emphasizing the development of autonomous agents capable of goal-setting, long-term planning, dynamic strategy adaptation, and interactive reasoning in uncertain, real-world environments. Unlike conventional approaches that rely heavily on static objectives and episodic interactions, LLM-based Agentic RL incorporates cognitive-like capabilities such as meta-reasoning, self-reflection, and multi-step decision-making directly into the learning loop. In this paper, we provide a deep insight for looking the conceptual foundations, methodological innovations, and effective designs underlying this trend. Furthermore, we identify critical challenges and outline promising future directions for building LLM-based Agentic RL.",
          "authors": [
            "Fangming Cui",
            "Ruixiao Zhu",
            "Cheng Fang",
            "Sunan Li",
            "Jiahong Li"
          ],
          "categories": [
            "cs.AI",
            "cs.ET"
          ],
          "paper_id": "http://arxiv.org/abs/2604.27859v1",
          "abstract_url": "https://arxiv.org/abs/2604.27859v1",
          "pdf_url": "https://arxiv.org/pdf/2604.27859v1",
          "published_at": "2026-04-30T13:43:25+00:00",
          "updated_at": "2026-04-30T13:43:25+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.27859",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.27859v1"
          },
          "relevance_score": 182,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "title matched \"agent\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.27859"
        },
        {
          "title": "TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering",
          "summary": "Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to just lookups. Deeper analysis identifies that accurate intent disambiguation serves as the prerequisite for leading these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.",
          "authors": [
            "An-Yang Ji",
            "Jun-Peng Jiang",
            "De-Chuan Zhan",
            "Han-Jia Ye"
          ],
          "categories": [
            "cs.CL",
            "cs.AI",
            "cs.LG"
          ],
          "paper_id": "http://arxiv.org/abs/2604.28076v1",
          "abstract_url": "https://arxiv.org/abs/2604.28076v1",
          "pdf_url": "https://arxiv.org/pdf/2604.28076v1",
          "published_at": "2026-04-30T16:22:51+00:00",
          "updated_at": "2026-04-30T16:22:51+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.28076",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.28076v1"
          },
          "relevance_score": 181,
          "match_reasons": [
            "title matched \"reasoning\"",
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"agent\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.28076"
        },
        {
          "title": "LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning",
          "summary": "Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, logical inconsistencies, and sharp performance degradation on high-complexity problems. While neuro-symbolic methods attempt to mitigate these issues by coupling LLMs with symbolic reasoners, existing approaches typically rely on monotonic logics (e.g., SMT) that cannot represent defeasible reasoning -- essential components of human cognition. We present \"LLM+ASP,\" a framework that translates natural language into Answer Set Programming (ASP), a nonmonotonic formalism based on stable model semantics. Unlike prior \"LLM+ASP\" approaches that require manually authored knowledge modules, domain-specific prompts, or evaluation restricted to single problem classes, our framework operates without any per-task engineering and applies uniformly across diverse reasoning tasks. Our system utilizes an automated self-correction loop where structured feedback from the ASP solver enables iterative refinement. Evaluating across six diverse benchmarks, we demonstrate that: (1) stable model semantics allow LLMs to naturally express default rules and exceptions, outperforming SMT-based alternatives by significant margins on nonmonotonic tasks; (2) iterative self-correction is the primary driver of performance, effectively replacing the need for handcrafted domain knowledge; (3) compact in-context reference guides substantially outperform verbose documentation, revealing a \"context rot\" phenomenon where excessive context hinders constraint adherence.",
          "authors": [
            "Adam Ishay",
            "Joohyung Lee"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.27960v1",
          "abstract_url": "https://arxiv.org/abs/2604.27960v1",
          "pdf_url": "https://arxiv.org/pdf/2604.27960v1",
          "published_at": "2026-04-30T14:55:48+00:00",
          "updated_at": "2026-04-30T14:55:48+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.27960",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.27960v1"
          },
          "relevance_score": 180,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.27960"
        },
        {
          "title": "Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning",
          "summary": "Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three approaches have been widely adopted: (i) Proximal policy optimization and advantage actor-critic rely on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. (ii) Group relative policy optimization (GRPO) avoids training a value network by approximating the value function using sample averages. However, GRPO samples a large number of reasoning traces per prompt to achieve accurate value function approximation, making it computationally expensive. (iii) REINFORCE-type algorithms sample only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. In this work, we focus on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.",
          "authors": [
            "Shijin Gong",
            "Kai Ye",
            "Jin Zhu",
            "Xinyu Zhang",
            "Hongyi Zhou",
            "Chengchun Shi"
          ],
          "categories": [
            "cs.LG",
            "stat.ML"
          ],
          "paper_id": "http://arxiv.org/abs/2604.28005v1",
          "abstract_url": "https://arxiv.org/abs/2604.28005v1",
          "pdf_url": "https://arxiv.org/pdf/2604.28005v1",
          "published_at": "2026-04-30T15:27:34+00:00",
          "updated_at": "2026-04-30T15:27:34+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.28005",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.28005v1"
          },
          "relevance_score": 162,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"reasoning\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"RAG\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.28005"
        },
        {
          "title": "Exploring Interaction Paradigms for LLM Agents in Scientific Visualization",
          "summary": "This paper examines how different types of large language model (LLM) agents perform on scientific visualization (SciVis) tasks, where users generate visualization workflows from natural-language instructions. We compare three primary interaction paradigms, including domain-specific agents with structured tool use, computer-use agents, and general-purpose coding agents, by evaluating eight representative agents across 15 benchmark tasks and measuring visualization quality, efficiency, robustness, and computational cost. We further analyze interaction modalities, including code scripts and model context protocol (MCP) or API calls for structured tool use, as well as command-line interfaces (CLI) and graphical user interfaces (GUI) for more general interaction, while additionally studying the effect of persistent memory in selected agents. The results reveal clear tradeoffs across paradigms and modalities. General-purpose coding agents achieve the highest task success rates but are computationally expensive, while domain-specific agents are more efficient and stable but less flexible. Computer-use agents perform well on individual steps but struggle with longer multi-step workflows, indicating that long-horizon planning is their primary limitation. Across both CLI- and GUI-based settings, persistent memory improves performance over repeated trials, although its benefits depend on the underlying interaction mode and the quality of feedback. These findings suggest that no single approach is sufficient, and future SciVis systems should combine structured tool use, interactive capabilities, and adaptive memory mechanisms to balance performance, robustness, and flexibility.",
          "authors": [
            "Jackson Vonderhorst",
            "Kuangshi Ai",
            "Haichao Miao",
            "Shusen Liu",
            "Chaoli Wang"
          ],
          "categories": [
            "cs.AI",
            "cs.GR",
            "cs.HC"
          ],
          "paper_id": "http://arxiv.org/abs/2604.27996v1",
          "abstract_url": "https://arxiv.org/abs/2604.27996v1",
          "pdf_url": "https://arxiv.org/pdf/2604.27996v1",
          "published_at": "2026-04-30T15:22:28+00:00",
          "updated_at": "2026-04-30T15:22:28+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "Evaluation",
            "Application",
            "Method"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.27996",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.27996v1"
          },
          "relevance_score": 162,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.27996"
        },
        {
          "title": "Exploration Hacking: Can LLMs Learn to Resist RL Training?",
          "summary": "Reinforcement learning (RL) has become essential to the post-training of large language models (LLMs) for reasoning, agentic capabilities and alignment. Successful RL relies on sufficient exploration of diverse actions by the model during training, which creates a potential failure mode: a model could strategically alter its exploration during training to influence the subsequent training outcome. In this paper we study this behavior, called exploration hacking. First, we create model organisms of selective RL resistance by fine-tuning LLMs to follow specific underperformance strategies; these models can successfully resist our RL-based capability elicitation in agentic biosecurity and AI R&D environments while maintaining performance on related tasks. We then use our model organisms to evaluate detection and mitigation strategies, including monitoring, weight noising, and SFT-based elicitation. Finally, we show that current frontier models can exhibit explicit reasoning about suppressing their exploration when provided with sufficient information about their training context, with higher rates when this information is acquired indirectly through the environment. Together, our results suggest exploration hacking is a possible failure mode of RL on sufficiently capable LLMs.",
          "authors": [
            "Eyon Jang",
            "Damon Falck",
            "Joschka Braun",
            "Nathalie Kirch",
            "Achu Menon",
            "Perusha Moodley",
            "Scott Emmons",
            "Roland S. Zimmermann",
            "David Lindner"
          ],
          "categories": [
            "cs.LG",
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.28182v1",
          "abstract_url": "https://arxiv.org/abs/2604.28182v1",
          "pdf_url": "https://arxiv.org/pdf/2604.28182v1",
          "published_at": "2026-04-30T17:58:39+00:00",
          "updated_at": "2026-04-30T17:58:39+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.28182",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.28182v1"
          },
          "relevance_score": 161,
          "match_reasons": [
            "title matched \"LLM\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"reasoning\"",
            "summary matched \"agent\"",
            "summary matched \"alignment\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.28182"
        },
        {
          "title": "SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images",
          "summary": "Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-specific characteristics. Here we introduce SpecVQA, a professional scientific-image benchmark for evaluating multimodal models on scientific spectral understanding, covering 7 representative spectrum types with expert-annotated question-answer pairs. The aim comprises two aspects: spectra scientific QA evaluation and corresponding underlying task evaluation. SpecVQA contains 620 figures and 3100 QA pairs curated from peer-reviewed literature, targeting both direct information extraction and domain-specific reasoning. To effectively reduce token length while preserving essential curve characteristics, we propose a spectral data sampling and interpolation reconstruction approach. Ablation studies further confirm that the approach achieves substantial performance improvements on the proposed benchmark. We test the capability of prominent MLLMs in scientific spectral understanding on our benchmark and present a leaderboard. This work represents an essential step toward enhancing spectral understanding in multimodal large models and suggests promising directions for extending visual-language models to broader scientific research and data analysis.",
          "authors": [
            "Jialu Shen",
            "Han Lyu",
            "Suyang Zhong",
            "Hanzheng Li",
            "Haoyi Tao",
            "Nan Wang",
            "Changhong Chen",
            "Xi Fang"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.28039v1",
          "abstract_url": "https://arxiv.org/abs/2604.28039v1",
          "pdf_url": "https://arxiv.org/pdf/2604.28039v1",
          "published_at": "2026-04-30T15:51:10+00:00",
          "updated_at": "2026-04-30T15:51:10+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.28039",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.28039v1"
          },
          "relevance_score": 158,
          "match_reasons": [
            "title matched \"benchmark\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.28039"
        },
        {
          "title": "Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows",
          "summary": "LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.",
          "authors": [
            "Chenxin Li",
            "Zhengyang Tang",
            "Huangxin Lin",
            "Yunlong Lin",
            "Shijue Huang",
            "Shengyuan Liu",
            "Bowen Ye",
            "Rang Li",
            "Lei Li",
            "Benyou Wang",
            "Yixuan Yuan"
          ],
          "categories": [
            "cs.SE",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.28139v1",
          "abstract_url": "https://arxiv.org/abs/2604.28139v1",
          "pdf_url": "https://arxiv.org/pdf/2604.28139v1",
          "published_at": "2026-04-30T17:23:19+00:00",
          "updated_at": "2026-04-30T17:23:19+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Agent"
          ],
          "doi": null,
          "arxiv_id": "2604.28139",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.28139v1"
          },
          "relevance_score": 146,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"benchmark\"",
            "summary matched \"LLM\"",
            "summary matched \"evaluation\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.28139"
        },
        {
          "title": "Agent-Agnostic Evaluation of SQL Accuracy in Production Text-to-SQL Systems",
          "summary": "Text-to-SQL (T2SQL) evaluation in production environments poses fundamental challenges that existing benchmarks do not address. Current evaluation methodologies whether rule-based SQL matching or schema-dependent semantic parsers assume access to ground-truth queries and structured database schema, constraints that are rarely satisfied in real-world deployments. This disconnect leaves production T2SQL agents largely unevaluated beyond developer-time testing, creating silent quality degradation with no feedback mechanism for continuous improvement. We present STEF (Schema-agnostic Text-to-SQL Evaluation Framework), a production-native evaluation system that operates exclusively on natural language inputs the user question, an enriched reformulation, and the generated SQL without requiring database schema or reference queries. STEF extracts semantic specifications from both natural language and SQL representations, performs normalized feature alignment, and produces an interpretable 0 to 100 accuracy score via a composite metric that encompasses filter alignment, semantic verdict, and confidence of the evaluator. Key contributions include: enriched question quality validation as a first-class evaluation signal, configurable application-specific rule injection via prompt templating, and production-robust normalization handling GROUP BY tolerance, ORDER BY defaults, and LIMIT heuristics. Empirical results demonstrate that STEF enables continuous production monitoring and agent improvement feedback loops without schema dependency, making structured query evaluation viable at scale for the first time.",
          "authors": [
            "Taslim Jamal Arif",
            "Kuldeep Singh"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.28049v1",
          "abstract_url": "https://arxiv.org/abs/2604.28049v1",
          "pdf_url": "https://arxiv.org/pdf/2604.28049v1",
          "published_at": "2026-04-30T15:59:28+00:00",
          "updated_at": "2026-04-30T15:59:28+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "评测",
            "应用",
            "方法"
          ],
          "topics": [
            "Agent",
            "Benchmark"
          ],
          "doi": null,
          "arxiv_id": "2604.28049",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.28049v1"
          },
          "relevance_score": 145,
          "match_reasons": [
            "title matched \"agent\"",
            "title matched \"evaluation\"",
            "summary matched \"alignment\"",
            "summary matched \"benchmark\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.28049"
        },
        {
          "title": "Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception",
          "summary": "Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.",
          "authors": [
            "Neemias B da Silva",
            "Rodrigo Minetto",
            "Daniel Silver",
            "Thiago H Silva"
          ],
          "categories": [
            "cs.CL",
            "cs.SI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.28048v1",
          "abstract_url": "https://arxiv.org/abs/2604.28048v1",
          "pdf_url": "https://arxiv.org/pdf/2604.28048v1",
          "published_at": "2026-04-30T15:59:11+00:00",
          "updated_at": "2026-04-30T15:59:11+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "数据",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.28048",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.28048v1"
          },
          "relevance_score": 145,
          "match_reasons": [
            "title matched \"LLM\"",
            "title matched \"agent\"",
            "summary matched \"language model\"",
            "summary matched \"large language model\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.28048"
        },
        {
          "title": "Design Structure Matrix Modularization with Large Language Models",
          "summary": "Design Structure Matrix (DSM) modularization, the task of partitioning system elements into cohesive modules, is a fundamental combinatorial challenge in engineering design. Traditional methods treat modularization as a pure graph optimization, without access to the engineering context embedded in the system. Building on prior work on LLM-based combinatorial optimization for DSM sequencing, this paper extends the method to modularization across five cases and three backbone LLMs. Our method achieves near-reference quality within 30 iterations without requiring specialized optimization code. Counterintuitively, domain knowledge, beneficial in sequencing, consistently impairs performance on more complex DSMs. We attribute this to semantic misalignment between the LLM's functional priors and the purely structural optimization objective, and propose the semantic-alignment hypothesis as a testable condition governing knowledge effectiveness with LLMs. Ablation studies identify the most effective input representation, objective formulation, and solution pool design for practical deployment. These findings offer practical guidance for deploying LLMs in engineering design optimization.",
          "authors": [
            "Shuo Jiang",
            "Jianxi Luo"
          ],
          "categories": [
            "cs.CE",
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.28018v1",
          "abstract_url": "https://arxiv.org/abs/2604.28018v1",
          "pdf_url": "https://arxiv.org/pdf/2604.28018v1",
          "published_at": "2026-04-30T15:38:38+00:00",
          "updated_at": "2026-04-30T15:38:38+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "应用",
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.28018",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.28018v1"
          },
          "relevance_score": 144,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"alignment\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.28018"
        },
        {
          "title": "DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models",
          "summary": "With the widespread adoption of large language models (LLMs), understanding their personality representation mechanisms has become critical. As a novel paradigm in Personality Editing, most existing methods employ neuron-editing to locate and modify LLM neurons, requiring changes to numerous neurons and leading to significant performance degradation. This raises a fundamental question: Are all modified neurons directly related to personality representation? In this work, we investigate and quantify this specificity through assessments of general capability impact and representation-level patterns. We find that: 1) Current methods can change personalities but reduce overall performance. 2) Neurons are multifunctional, connecting personality traits and general knowledge. 3) Opposing personality traits demonstrate distinctly mutually exclusive representation patterns. Motivated by these findings, we propose DPN-LE (Dual Personality Neuron Localization and Editing), which identifies personality-specific neurons by contrasting MLP activations between high-trait and low-trait samples. DPN-LE constructs layer-wise steering vectors and applies dual-criterion filtering based on Cohen's $d$ effect size and activation magnitude to isolate mutually exclusive neuron subsets. Sparse linear intervention on these neurons enables precise personality control at inference time. Using only 1,000 contrastive sample pairs per trait, DPN-LE intervenes on $\\sim$0.5\\% of neurons while achieving competitive personality control and substantially better capability preservation across reasoning tasks. Experiments on LLaMA-3-8B-Instruct and Qwen2.5-7B-Instruct demonstrate the effectiveness and generalizability of our approach.",
          "authors": [
            "Lifan Zheng",
            "Xue Yang",
            "Jiawei Chen",
            "Chenyan Wu",
            "Jingyuan Zhang",
            "Fanheng Kong",
            "Xinyi Zeng",
            "Xiang Chen",
            "Yu Tian"
          ],
          "categories": [
            "cs.CL"
          ],
          "paper_id": "http://arxiv.org/abs/2604.27929v1",
          "abstract_url": "https://arxiv.org/abs/2604.27929v1",
          "pdf_url": "https://arxiv.org/pdf/2604.27929v1",
          "published_at": "2026-04-30T14:31:46+00:00",
          "updated_at": "2026-04-30T14:31:46+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.27929",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.27929v1"
          },
          "relevance_score": 143,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"large language model\"",
            "summary matched \"LLM\"",
            "summary matched \"reasoning\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.27929"
        },
        {
          "title": "Modeling Clinical Concern Trajectories in Language Model Agents",
          "summary": "Large language model (LLM) agents deployed in clinical settings often exhibit abrupt, threshold-driven behavior, offering little visibility into accumulating risk prior to escalation. In real-world care, however, clinicians act on gradually rising concern rather than instantaneous triggers. We study whether explicit state dynamics can expose such pre-escalation signals without delegating clinical authority to the agent. We introduce a lightweight agent architecture in which a memoryless clinical risk encoder is integrated over time using first- and second-order dynamics to produce a continuous escalation pressure signal. Across synthetic ward scenarios, stateless agents exhibit sharp escalation cliffs, while second-order dynamics produce smooth, anticipatory concern trajectories despite similar escalation timing. These trajectories surface sustained unease prior to escalation, enabling human-in-the-loop monitoring and more informed intervention. Our results suggest that explicit state dynamics can make LLM agents more clinically legible by revealing how long concern has been rising, not just when thresholds are crossed.",
          "authors": [
            "Sukesh Subaharan",
            "Venkatesan VS",
            "Murugadasan P",
            "Sivakumar D",
            "Gautham N",
            "Ganeshkumar M"
          ],
          "categories": [
            "cs.AI"
          ],
          "paper_id": "http://arxiv.org/abs/2604.27872v1",
          "abstract_url": "https://arxiv.org/abs/2604.27872v1",
          "pdf_url": "https://arxiv.org/pdf/2604.27872v1",
          "published_at": "2026-04-30T13:53:09+00:00",
          "updated_at": "2026-04-30T13:53:09+00:00",
          "source": "arxiv",
          "date_label": "Published",
          "analysis": null,
          "tags": [
            "方法"
          ],
          "topics": [
            "LLM",
            "Language Model"
          ],
          "doi": null,
          "arxiv_id": "2604.27872",
          "source_variants": [
            "arxiv"
          ],
          "source_urls": {
            "arxiv": "https://arxiv.org/abs/2604.27872v1"
          },
          "relevance_score": 142,
          "match_reasons": [
            "title matched \"language model\"",
            "title matched \"agent\"",
            "summary matched \"large language model\"",
            "summary matched \"LLM\"",
            "has PDF",
            "has rich summary",
            "has complete metadata"
          ],
          "feedback_status": null,
          "feedback_note": null,
          "feedback_next_action": null,
          "feedback_due_date": null,
          "feedback_snoozed_until": null,
          "feedback_review_interval_days": null,
          "canonical_id": "arxiv:2604.27872"
        }
      ]
    }
  ]
}