Feed Subscription

LM Pinned Subscription Page

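Suited to long-term tracking of a single research direction. The page summarizes this feed's performance over the last 7 and 30 days, and keeps each day's matched entries and digest links.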
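As a rough illustration of the 7-day / 30-day summary described above, the sketch below totals daily hit counts over a trailing window. The function name `window_summary`, the dict-based input, and the two reported metrics are assumptions made for this example; the page does not state how its summaries are actually computed.

```python
from datetime import date, timedelta


def window_summary(daily_hits: dict[date, int], today: date, days: int) -> dict[str, int]:
    """Summarize hits over the trailing `days`-day window ending on `today` (inclusive)."""
    start = today - timedelta(days=days - 1)
    in_window = {d: n for d, n in daily_hits.items() if start <= d <= today}
    return {
        "total_hits": sum(in_window.values()),
        # Days in the window on which at least one paper matched.
        "active_days": sum(1 for n in in_window.values() if n > 0),
    }


# A subset of the daily hit counts listed in the history on this page.
hits = {
    date(2026, 6, 26): 15,
    date(2026, 6, 25): 15,
    date(2026, 6, 24): 15,
    date(2026, 6, 23): 15,
    date(2026, 6, 19): 15,
    date(2026, 5, 15): 11,
}

# Assumption: "today" is the last date shown in the trend chart.
last_7 = window_summary(hits, date(2026, 6, 28), 7)
last_30 = window_summary(hits, date(2026, 6, 28), 30)
```

Days with no digest are simply absent from the input, so they contribute nothing to either metric.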


Recent Trend

LM has no new matching papers today.

[Trend chart: 2026-06-15 to 2026-06-28]

Hit History

Review this feed's matched papers day by day. Each day's digest is kept as raw Markdown and JSON output.

2026-06-26

Hits: 15 (LM: 15) · Generated 2026-06-26 13:16:53 (Asia/Shanghai)

"NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models" [Evaluation / Data / Method]: Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but e…

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models · Score 218
title matched "language model"; title matched "large language model"; title matched "benchmark"
Joint Learning of Experiential Rules and Policies for Large Language Model Agents · Score 165
title matched "language model"; title matched "large language model"; title matched "agent"
The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans · Score 165
title matched "language model"; title matched "large language model"; title matched "reasoning"
Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings · Score 163
title matched "language model"; title matched "large language model"; summary matched "LLM"
Semantic Early-Stopping for Iterative LLM Agent Loops · Score 160
title matched "LLM"; title matched "agent"; summary matched "language model"

2026-06-25

Hits: 15 (LM: 15) · Generated 2026-06-25 13:11:21 (Asia/Shanghai)

"InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy" [Evaluation / Application / Method]: Large language models are increasingly deployed as investment res…

InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy · Score 188
title matched "language model"; title matched "large language model"; title matched "reasoning"
Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models · Score 182
title matched "language model"; title matched "large language model"; summary matched "LLM"
How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations · Score 182
title matched "language model"; title matched "reasoning"; summary matched "LLM"
MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction · Score 170
title matched "agent"; summary matched "language model"; summary matched "large language model"
Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz · Score 164
title matched "agent"; summary matched "LLM"; summary matched "reasoning"

2026-06-24

Hits: 15 (LM: 15) · Generated 2026-06-24 13:06:49 (Asia/Shanghai)

"AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning" [Evaluation / Method]: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge.…

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning · Score 199
title matched "reasoning"; title matched "agent"; title matched "benchmark"
AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach · Score 195
title matched "language model"; title matched "large language model"; title matched "RAG"
A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial · Score 181
title matched "language model"; title matched "large language model"; title matched "reasoning"
EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence · Score 177
title matched "benchmark"; summary matched "language model"; summary matched "large language model"
Are We Ready For An Agent-Native Memory System? · Score 177
title matched "agent"; summary matched "language model"; summary matched "large language model"

2026-06-23

Hits: 15 (LM: 15) · Generated 2026-06-23 13:10:02 (Asia/Shanghai)

"AIR: Adaptive Interleaved Reasoning with Code in MLLMs" [Evaluation / Data / Application / Method]: Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has be…

AIR: Adaptive Interleaved Reasoning with Code in MLLMs · Score 200
title matched "LLM"; title matched "reasoning"; summary matched "language model"
TriggerBench: Investigating Prospective Memory for Large Language Models · Score 197
title matched "language model"; title matched "large language model"; summary matched "LLM"
Can LLMs Reliably Self-Report Adversarial Prefills, and How? · Score 160
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Evaluation Awareness Is Not One Capability: Evidence from Open Language Models · Score 145
title matched "language model"; title matched "evaluation"; summary matched "instruction tuning"
POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation · Score 145
title matched "language model"; title matched "large language model"; summary matched "LLM"

2026-06-19

Hits: 15 (LM: 15) · Generated 2026-06-19 14:26:15 (Asia/Shanghai)

"QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation" [Evaluation / Method]: Large Language Models (LLMs) have made significant progress in reasoning, particularly in ded…

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation · Score 221
title matched "language model"; title matched "large language model"; title matched "reasoning"
LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems · Score 201
title matched "LLM"; title matched "agent"; title matched "benchmark"
Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference · Score 191
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems · Score 162
title matched "LLM"; title matched "agent"; summary matched "language model"
Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users · Score 162
title matched "LLM"; title matched "alignment"; summary matched "language model"

2026-06-18

Hits: 15 (LM: 15) · Generated 2026-06-18 14:03:08 (Asia/Shanghai)

"Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play" [Evaluation / Method]: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with exec…

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play · Score 185
title matched "language model"; title matched "large language model"; title matched "agent"
IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages · Score 182
title matched "language model"; title matched "large language model"; title matched "benchmark"
A Technical Taxonomy of LLM Agent Communication Protocols · Score 160
title matched "LLM"; title matched "agent"; summary matched "language model"
Quantifying and Auditing LLM Evaluation via Positive–Unlabeled Learning · Score 159
title matched "LLM"; title matched "evaluation"; summary matched "language model"
Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA · Score 158
title matched "LLM"; summary matched "language model"; summary matched "large language model"

2026-06-17

Hits: 15 (LM: 15) · Generated 2026-06-17 14:22:19 (Asia/Shanghai)

"Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports" [Evaluation / Data / Application / Method]: Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowled…

Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports · Score 176
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews · Score 164
title matched "language model"; title matched "large language model"; title matched "RAG"
The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act · Score 162
title matched "reasoning"; title matched "benchmark"; summary matched "language model"
WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning · Score 162
title matched "reasoning"; title matched "agent"; summary matched "language model"
From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning · Score 161
title matched "language model"; title matched "reasoning"; summary matched "large language model"

2026-06-16

Hits: 15 (LM: 15) · Generated 2026-06-16 14:38:43 (Asia/Shanghai)

"OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models" [Evaluation / Application / Method]: Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems…

OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models · Score 235
title matched "language model"; title matched "large language model"; title matched "agent"
Context-Aware RL for Agentic and Multimodal LLMs · Score 199
title matched "LLM"; title matched "agent"; summary matched "language model"
Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio · Score 185
title matched "LLM"; title matched "agent"; title matched "benchmark"
Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification · Score 184
title matched "language model"; title matched "large language model"; title matched "agent"
Scalable Circuit Learning for Interpreting Large Language Models · Score 162
title matched "language model"; title matched "large language model"; summary matched "LLM"

2026-06-12

Hits: 15 (LM: 15) · Generated 2026-06-12 13:55:02 (Asia/Shanghai)

"EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments" [Evaluation / Application / Method]: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations as…

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments · Score 200
title matched "LLM"; title matched "agent"; summary matched "language model"
Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents · Score 178
title matched "agent"; title matched "benchmark"; summary matched "language model"
An LLM System for Autonomous Variational Quantum Circuit Design · Score 174
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities · Score 168
title matched "LLM"; summary matched "language model"; summary matched "large language model"
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning · Score 164
title matched "reasoning"; title matched "agent"; summary matched "language model"

2026-06-11

Hits: 15 (LM: 15) · Generated 2026-06-11 13:59:12 (Asia/Shanghai)

"Measuring Epistemic Resilience of LLMs Under Misleading Medical Context" [Evaluation / Application / Method]: Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores…

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context · Score 194
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation · Score 182
title matched "LLM"; title matched "benchmark"; title matched "evaluation"
OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models · Score 178
title matched "language model"; title matched "reasoning"; summary matched "alignment"
Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization · Score 159
title matched "reasoning"; summary matched "language model"; summary matched "large language model"
ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing · Score 159
title matched "alignment"; summary matched "language model"; summary matched "large language model"

2026-06-10

Hits: 15 (LM: 15) · Generated 2026-06-10 13:25:04 (Asia/Shanghai)

"T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains" [Evaluation / Application / Method]: Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic sys…

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains · Score 217
title matched "agent"; title matched "benchmark"; summary matched "language model"
Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution · Score 215
title matched "LLM"; title matched "agent"; summary matched "language model"
ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity · Score 200
title matched "agent"; title matched "benchmark"; summary matched "language model"
Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning · Score 180
title matched "LLM"; title matched "reasoning"; summary matched "language model"
Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam? · Score 175
title matched "LLM"; summary matched "language model"; summary matched "large language model"

2026-06-09

Hits: 15 (LM: 15) · Generated 2026-06-09 13:12:49 (Asia/Shanghai)

"SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks" [Evaluation / Method]: Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and op…

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks · Score 238
title matched "reasoning"; title matched "agent"; title matched "benchmark"
Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving · Score 180
title matched "LLM"; title matched "benchmark"; summary matched "language model"
TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs · Score 179
title matched "LLM"; title matched "benchmark"; summary matched "language model"
Gradient-Guided Reward Optimization for Inference-time Alignment · Score 176
title matched "alignment"; summary matched "language model"; summary matched "large language model"
IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking · Score 173
summary matched "language model"; summary matched "large language model"; summary matched "LLM"

2026-06-05

Hits: 15 (LM: 15) · Generated 2026-06-05 13:25:00 (Asia/Shanghai)

"MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models" [Evaluation / Method]: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that…

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models · Score 214
title matched "language model"; title matched "large language model"; title matched "benchmark"
CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments · Score 210
title matched "LLM"; title matched "agent"; summary matched "language model"
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints · Score 196
title matched "language model"; title matched "large language model"; title matched "agent"
The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models · Score 196
title matched "language model"; title matched "large language model"; title matched "reasoning"
Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems · Score 192
title matched "LLM"; title matched "agent"; summary matched "language model"

2026-06-04

Hits: 15 (LM: 15) · Generated 2026-06-04 14:02:06 (Asia/Shanghai)

"A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs" [Evaluation / Application / Method]: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under mult…

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs · Score 191
title matched "LLM"; title matched "evaluation"; summary matched "language model"
Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases · Score 191
title matched "language model"; title matched "large language model"; summary matched "LLM"
Self-Evolving Deep Research via Joint Generation and Evaluation · Score 187
title matched "evaluation"; summary matched "language model"; summary matched "large language model"
Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents · Score 177
title matched "LLM"; title matched "agent"; title matched "benchmark"
Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas · Score 177
title matched "language model"; title matched "large language model"; title matched "alignment"

2026-06-03

Hits: 15 (LM: 15) · Generated 2026-06-03 14:09:56 (Asia/Shanghai)

"Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning" [Evaluation / Method]: Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficie…

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning · Score 213
title matched "LLM"; title matched "reasoning"; title matched "agent"
Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models · Score 213
title matched "language model"; title matched "large language model"; title matched "alignment"
Can Factual Opinions Be Edited (Manipulated) in Large Language Models? · Score 191
title matched "language model"; title matched "large language model"; summary matched "LLM"
Large Language Models Are Overconfident in Their Own Responses · Score 191
title matched "language model"; title matched "large language model"; summary matched "LLM"
Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models · Score 177
title matched "language model"; title matched "large language model"; title matched "alignment"

2026-06-02

Hits: 15 (LM: 15) · Generated 2026-06-02 13:56:35 (Asia/Shanghai)

"POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems" [Evaluation / Application / Method]: Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emerge…

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems · Score 192
title matched "agent"; summary matched "language model"; summary matched "large language model"
MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation · Score 184
title matched "LLM"; title matched "agent"; title matched "benchmark"
Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling · Score 178
title matched "LLM"; summary matched "language model"; summary matched "large language model"
AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design · Score 165
title matched "language model"; title matched "reasoning"; title matched "agent"
Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation · Score 163
title matched "language model"; title matched "large language model"; summary matched "LLM"

2026-05-29

Hits: 15 (LM: 15) · Generated 2026-05-29 13:18:32 (Asia/Shanghai)

"FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations" [Evaluation / Method]: Recently, large language models (LLMs) have achieved superior performance in static f…

FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations · Score 232
title matched "LLM"; title matched "reasoning"; title matched "benchmark"
CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models · Score 196
title matched "language model"; title matched "large language model"; title matched "reasoning"
Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning · Score 196
title matched "language model"; title matched "large language model"; title matched "reasoning"
Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach · Score 192
title matched "language model"; title matched "large language model"; summary matched "LLM"
KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs · Score 192
title matched "agent"; title matched "benchmark"; summary matched "language model"

2026-05-28

Hits: 15 (LM: 15) · Generated 2026-05-28 13:15:52 (Asia/Shanghai)

"MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems" [Evaluation / Method]: Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unr…

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems · Score 199
title matched "language model"; title matched "large language model"; summary matched "LLM"
Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability · Score 184
title matched "LLM"; title matched "reasoning"; title matched "evaluation"
TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning · Score 181
title matched "LLM"; title matched "reasoning"; summary matched "language model"
Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents · Score 180
title matched "LLM"; title matched "agent"; summary matched "language model"
The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic · Score 177
title matched "evaluation"; summary matched "language model"; summary matched "large language model"

2026-05-27

Hits: 15 (LM: 15) · Generated 2026-05-27 13:23:19 (Asia/Shanghai)

"Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry" [Evaluation / Data / Application / Method]: Key knowledge for steel-industry volatile organic compounds (VOCs) governance is s…

Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry · Score 214
title matched "LLM"; title matched "reasoning"; summary matched "language model"
Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation · Score 201
title matched "reasoning"; title matched "agent"; title matched "benchmark"
Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments · Score 176
title matched "agent"; summary matched "language model"; summary matched "large language model"
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions · Score 175
title matched "agent"; summary matched "language model"; summary matched "large language model"
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation · Score 164
title matched "agent"; title matched "evaluation"; summary matched "language model"

2026-05-26

Hits: 15 (LM: 15) · Generated 2026-05-26 13:09:24 (Asia/Shanghai)

"Automated Benchmark Auditing for AI Agents and Large Language Models" [Evaluation / Data / Method]: Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often co…

Automated Benchmark Auditing for AI Agents and Large Language Models · Score 244
title matched "language model"; title matched "large language model"; title matched "agent"
Causal methods for LLM development and evaluation · Score 211
title matched "LLM"; title matched "evaluation"; summary matched "language model"
PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction · Score 202
title matched "LLM"; title matched "reasoning"; title matched "agent"
Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning · Score 197
title matched "LLM"; title matched "agent"; summary matched "language model"
When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation · Score 180
title matched "LLM"; title matched "agent"; summary matched "language model"

2026-05-22

Hits: 15 (LM: 15) · Generated 2026-05-22 13:08:19 (Asia/Shanghai)

"Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents" [Evaluation / Method]: Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses…

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents · Score 196
title matched "LLM"; title matched "agent"; title matched "evaluation"
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety · Score 192
title matched "agent"; title matched "benchmark"; summary matched "language model"
ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning · Score 192
title matched "reasoning"; title matched "benchmark"; summary matched "LLM"
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance · Score 188
title matched "reasoning"; summary matched "language model"; summary matched "large language model"
From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment · Score 174
title matched "LLM"; title matched "alignment"; summary matched "language model"

2026-05-21

Hits: 15 (LM: 15) · Generated 2026-05-21 13:14:24 (Asia/Shanghai)

"Tracing the ongoing emergence of human-like reasoning in Large Language Models" [Method]: Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implyin…

Tracing the ongoing emergence of human-like reasoning in Large Language Models · Score 184
title matched "language model"; title matched "large language model"; title matched "reasoning"
What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema · Score 167
title matched "LLM"; title matched "agent"; title matched "benchmark"
Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution · Score 164
title matched "LLM"; title matched "RAG"; summary matched "language model"
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models · Score 163
title matched "language model"; title matched "large language model"; summary matched "LLM"
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata · Score 160
title matched "benchmark"; summary matched "language model"; summary matched "large language model"

2026-05-20

Hits: 15 (LM: 15) · Generated 2026-05-20 13:10:58 (Asia/Shanghai)

"MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models" [Evaluation / Method]: Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattention…

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models · Score 236
title matched "language model"; title matched "large language model"; title matched "reasoning"
OpenCompass: A Universal Evaluation Platform for Large Language Models · Score 232
title matched "language model"; title matched "large language model"; title matched "evaluation"
Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking · Score 214
title matched "LLM"; title matched "agent"; title matched "evaluation"
SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models · Score 214
title matched "language model"; title matched "large language model"; title matched "evaluation"
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening · Score 196
title matched "LLM"; title matched "reasoning"; title matched "benchmark"

2026-05-19

Hits: 15 (LM: 15) · Generated 2026-05-19 13:08:04 (Asia/Shanghai)

"CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark" [Evaluation / Data / Method]: Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view pe…

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark · Score 217
title matched "LLM"; title matched "benchmark"; summary matched "language model"
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science · Score 181
title matched "LLM"; title matched "benchmark"; summary matched "language model"
MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion · Score 176
title matched "agent"; summary matched "language model"; summary matched "large language model"
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents · Score 168
title matched "LLM"; title matched "agent"; title matched "benchmark"
LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems · Score 158
title matched "agent"; summary matched "LLM"; summary matched "reasoning"

2026-05-18

Hits: 15 (LM: 15) · Generated 2026-05-18 13:13:17 (Asia/Shanghai)

"CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency" [Evaluation / Application / Method]: This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate…

CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency · Score 236
title matched "LLM"; title matched "agent"; title matched "benchmark"
FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models · Score 232
title matched "language model"; title matched "large language model"; title matched "benchmark"
MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models · Score 214
title matched "language model"; title matched "large language model"; title matched "benchmark"
Large Language Models Could Be Rote Learners · Score 192
title matched "language model"; title matched "large language model"; summary matched "LLM"
Look Before You Leap: Autonomous Exploration for LLM Agents · Score 192
title matched "LLM"; title matched "agent"; summary matched "language model"

2026-05-15

11 hits · Generated 2026-05-15 14:57:29 (Asia/Shanghai)

Markdown JSON

LM · 11 papers

"Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks" [Evaluation / Data / Methods]: We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times…

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks · Score 202
title matched "LLM"; title matched "RAG"; title matched "benchmark"
Original source
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning · Score 161
title matched "LLM"; title matched "reasoning"; summary matched "language model"
Original source
Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use · Score 161
title matched "LLM"; title matched "reasoning"; summary matched "language model"
Original source
APWA: A Distributed Architecture for Parallelizable Agentic Workflows · Score 158
title matched "agent"; summary matched "language model"; summary matched "large language model"
Original source
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory · Score 144
title matched "agent"; title matched "evaluation"; summary matched "reasoning"
Original source

2026-05-14

15 hits · Generated 2026-05-14 12:52:54 (Asia/Shanghai)

Markdown JSON

LM · 15 papers

"RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation" [Evaluation / Data / Methods]: Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicia…

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation · Score 200
title matched "LLM"; title matched "agent"; title matched "benchmark"
Original source
MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling · Score 163
title matched "language model"; title matched "large language model"; summary matched "LLM"
Original source
An LLM-Based System for Argument Reconstruction · Score 160
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Original source
OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research · Score 157
title matched "agent"; summary matched "language model"; summary matched "large language model"
Original source
(How) Do Large Language Models Understand High-Level Message Sequence Charts? · Score 145
title matched "language model"; title matched "large language model"; summary matched "LLM"
Original source

2026-05-13

15 hits · Generated 2026-05-13 12:54:34 (Asia/Shanghai)

Markdown JSON

LM · 15 papers

"MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering" [Evaluation / Data / Applications / Methods]: Evaluating large language models (LLMs) in the biomedical domain requi…

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering · Score 225
title matched "LLM"; title matched "reasoning"; title matched "benchmark"
Original source
ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models · Score 222
title matched "language model"; title matched "large language model"; title matched "alignment"
Original source
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models · Score 207
title matched "language model"; title matched "large language model"; title matched "reasoning"
Original source
Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring · Score 182
title matched "language model"; title matched "large language model"; summary matched "LLM"
Original source
Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering · Score 177
title matched "evaluation"; summary matched "language model"; summary matched "large language model"
Original source

2026-05-12

15 hits · Generated 2026-05-12 12:42:08 (Asia/Shanghai)

Markdown JSON

LM · 15 papers

"WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation" [Evaluation / Methods]: Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) ha…

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation · Score 205
title matched "agent"; title matched "benchmark"; title matched "evaluation"
Original source
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox · Score 185
title matched "LLM"; title matched "agent"; title matched "evaluation"
Original source
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents · Score 168
title matched "LLM"; title matched "agent"; title matched "benchmark"
Original source
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments · Score 167
title matched "LLM"; title matched "agent"; title matched "benchmark"
Original source
Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights · Score 163
title matched "language model"; title matched "evaluation"; summary matched "large language model"
Original source

2026-05-08

12 hits · Generated 2026-05-08 14:15:32 (Asia/Shanghai)

Markdown JSON

LM · 12 papers

"LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG" [Evaluation / Data / Applications / Methods]: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question…

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG · Score 231
title matched "reasoning"; title matched "agent"; title matched "RAG"
Original source
MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents · Score 213
title matched "LLM"; title matched "agent"; title matched "benchmark"
Original source
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity · Score 213
title matched "LLM"; title matched "alignment"; title matched "evaluation"
Original source
Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback · Score 213
title matched "LLM"; title matched "agent"; title matched "evaluation"
Original source
BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models · Score 209
title matched "language model"; title matched "large language model"; summary matched "LLM"
Original source

2026-05-07

15 hits · Generated 2026-05-07 12:38:06 (Asia/Shanghai)

Markdown JSON

LM · 15 papers

"Misaligned by Reward: Socially Undesirable Preferences in LLMs" [Evaluation / Data / Methods]: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, exis…

Misaligned by Reward: Socially Undesirable Preferences in LLMs · Score 194
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Original source
SoK: Robustness in Large Language Models against Jailbreak Attacks · Score 181
title matched "language model"; title matched "large language model"; summary matched "LLM"
Original source
Why Expert Alignment Is Hard: Evidence from Subjective Evaluation · Score 161
title matched "alignment"; title matched "evaluation"; summary matched "language model"
Original source
KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels · Score 143
title matched "LLM"; title matched "benchmark"; summary matched "RAG"
Original source
Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction · Score 142
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Original source

2026-05-06

15 hits · Generated 2026-05-06 12:37:23 (Asia/Shanghai)

Markdown JSON

LM · 15 papers

"Safety and accuracy follow different scaling laws in clinical large language models" [Evaluation / Applications / Methods]: Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time comput…

Safety and accuracy follow different scaling laws in clinical large language models · Score 201
title matched "language model"; title matched "large language model"; summary matched "LLM"
Original source
Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones · Score 183
title matched "LLM"; title matched "reasoning"; title matched "agent"
Original source
OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking · Score 161
title matched "LLM"; title matched "benchmark"; summary matched "language model"
Original source
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems · Score 147
title matched "reasoning"; title matched "agent"; summary matched "benchmark"
Original source
Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus · Score 146
title matched "language model"; title matched "large language model"; title matched "benchmark"
Original source

2026-05-05

15 hits · Generated 2026-05-05 12:20:54 (Asia/Shanghai)

Markdown JSON

LM · 15 papers

"StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models" [Evaluation / Data / Methods]: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting especia…

StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models · Score 219
title matched "language model"; title matched "large language model"; title matched "reasoning"
Original source
Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks · Score 211
title matched "LLM"; title matched "benchmark"; summary matched "language model"
Original source
Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models · Score 197
title matched "language model"; title matched "large language model"; title matched "reasoning"
Original source
MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation · Score 197
title matched "language model"; title matched "large language model"; title matched "alignment"
Original source
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice · Score 197
title matched "language model"; title matched "large language model"; title matched "LLM"
Original source

2026-05-01

15 hits · Generated 2026-05-01 12:53:56 (Asia/Shanghai)

Markdown JSON

LM · 15 papers

"Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents" [Evaluation / Applications / Methods]: We present Collabora…

Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents · Score 193
title matched "reasoning"; title matched "agent"; summary matched "language model"
Original source
What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design · Score 185
title matched "agent"; title matched "benchmark"; title matched "evaluation"
Original source
Rethinking Agentic Reinforcement Learning In Large Language Models · Score 182
title matched "language model"; title matched "large language model"; title matched "agent"
Original source
TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering · Score 181
title matched "reasoning"; title matched "benchmark"; summary matched "language model"
Original source
LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning · Score 180
title matched "LLM"; title matched "reasoning"; summary matched "language model"
Original source

2026-04-29

15 hits · Generated 2026-04-29 12:26:28 (Asia/Shanghai)

Markdown JSON

LM · 15 papers

"LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation" [Evaluation / Data / Methods]: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across hetero…

LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation · Score 197
title matched "LLM"; title matched "evaluation"; summary matched "language model"
Original source
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models · Score 178
title matched "language model"; title matched "large language model"; summary matched "LLM"
Original source
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios · Score 165
title matched "agent"; title matched "benchmark"; summary matched "LLM"
Original source
From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling · Score 164
title matched "LLM"; title matched "agent"; summary matched "language model"
Original source
SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing? · Score 158
title matched "agent"; summary matched "language model"; summary matched "large language model"
Original source
