Feed Subscription

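LM Pinned Subscription Page

Suited to long-term tracking of a single research direction. This page summarizes the feed's activity over the last 7 and 30 days and keeps each day's matched entries and digest links.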

Last 7 days: 60 papers, 4 active digests

Last 30 days: 240 papers, 16 active digests

All time: 518 papers, 35 active digests

Recent trend

LM has no new matched papers today.

2026-06-15: 0
2026-06-16: 15
2026-06-17: 15
2026-06-18: 15
2026-06-19: 15
2026-06-20: 0
2026-06-21: 0
2026-06-22: 0
2026-06-23: 15
2026-06-24: 15
2026-06-25: 15
2026-06-26: 15
2026-06-27: 0
2026-06-28: 0
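
The 7-day figures on this page (60 papers, 4 active digests) can be reproduced from the daily counts in the trend list. The sketch below is only an illustration, not the page's own code: the function name `window_summary` is hypothetical, and it assumes the window ends on the last listed day (2026-06-28) and that a digest counts as active when its day has at least one hit.

```python
from datetime import date, timedelta

# Daily hit counts copied from the trend list, 2026-06-15 through 2026-06-28.
counts = [0, 15, 15, 15, 15, 0, 0, 0, 15, 15, 15, 15, 0, 0]
first_day = date(2026, 6, 15)
DAILY_HITS = {first_day + timedelta(days=i): n for i, n in enumerate(counts)}


def window_summary(daily_hits, end_day, days):
    """Return (papers, active_days) for the `days`-day window ending on end_day.

    Assumption: a day is "active" when it has at least one matched paper.
    """
    start_day = end_day - timedelta(days=days - 1)
    in_window = [n for d, n in daily_hits.items() if start_day <= d <= end_day]
    papers = sum(in_window)
    active_days = sum(1 for n in in_window if n > 0)
    return papers, active_days


papers, active_days = window_summary(DAILY_HITS, date(2026, 6, 28), 7)
print(f"Last 7 days: {papers} papers, {active_days} active digests")
```

The 30-day and all-time figures cannot be rechecked this way, because the trend list covers only 14 days.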

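Related keyword pages

If this feed also matches keywords in your configuration, long-term tracking links appear here.

Hit history

Review this feed's matched papers day by day. Each day's digest is kept as raw Markdown / JSON output.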

2026-06-26

15 papers matched · Generated 2026-06-26 13:16:53 (Asia/Shanghai)
LM: 15 papers

"NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models" [Evaluation / Data / Method]: Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but e…

  1. NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models · Score 218
    title matched "language model";title matched "large language model";title matched "benchmark"
  2. Joint Learning of Experiential Rules and Policies for Large Language Model Agents · Score 165
    title matched "language model";title matched "large language model";title matched "agent"
  3. The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans · Score 165
    title matched "language model";title matched "large language model";title matched "reasoning"
  4. Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings · Score 163
    title matched "language model";title matched "large language model";summary matched "LLM"
  5. Semantic Early-Stopping for Iterative LLM Agent Loops · Score 160
    title matched "LLM";title matched "agent";summary matched "language model"
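
The line under each entry lists why the paper matched, as `field matched "keyword"` clauses joined by a full-width semicolon. The sketch below shows one way to split such a line into (field, keyword) pairs. The function name and tuple layout are assumptions for illustration, not the page's own code. It does not try to reproduce the Score values, because the page does not state the scoring rule.

```python
import re

# One clause looks like: title matched "LLM"  or  summary matched "language model".
# Matching each clause directly avoids depending on the separator character.
CLAUSE = re.compile(r'(title|summary) matched "([^"]+)"')


def parse_match_reasons(line):
    """Return a list of (field, keyword) tuples found in a match-reason line."""
    return CLAUSE.findall(line)


reasons = parse_match_reasons(
    'title matched "LLM";title matched "agent";summary matched "language model"'
)
for field, keyword in reasons:
    print(f"{field}: {keyword}")
```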

2026-06-25

15 papers matched · Generated 2026-06-25 13:11:21 (Asia/Shanghai)
LM: 15 papers

"InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy" [Evaluation / Application / Method]: Large language models are increasingly deployed as investment res…

  1. InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy · Score 188
    title matched "language model";title matched "large language model";title matched "reasoning"
  2. Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models · Score 182
    title matched "language model";title matched "large language model";summary matched "LLM"
  3. How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations · Score 182
    title matched "language model";title matched "reasoning";summary matched "LLM"
  4. MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction · Score 170
    title matched "agent";summary matched "language model";summary matched "large language model"
  5. Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz · Score 164
    title matched "agent";summary matched "LLM";summary matched "reasoning"

2026-06-24

15 papers matched · Generated 2026-06-24 13:06:49 (Asia/Shanghai)
LM: 15 papers

"AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning" [Evaluation / Method]: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge.…

  1. AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning · Score 199
    title matched "reasoning";title matched "agent";title matched "benchmark"
  2. AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach · Score 195
    title matched "language model";title matched "large language model";title matched "RAG"
  3. A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial · Score 181
    title matched "language model";title matched "large language model";title matched "reasoning"
  4. EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence · Score 177
    title matched "benchmark";summary matched "language model";summary matched "large language model"
  5. Are We Ready For An Agent-Native Memory System? · Score 177
    title matched "agent";summary matched "language model";summary matched "large language model"

2026-06-23

15 papers matched · Generated 2026-06-23 13:10:02 (Asia/Shanghai)
LM: 15 papers

"AIR: Adaptive Interleaved Reasoning with Code in MLLMs" [Evaluation / Data / Application / Method]: Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has be…

  1. AIR: Adaptive Interleaved Reasoning with Code in MLLMs · Score 200
    title matched "LLM";title matched "reasoning";summary matched "language model"
  2. TriggerBench: Investigating Prospective Memory for Large Language Models · Score 197
    title matched "language model";title matched "large language model";summary matched "LLM"
  3. Can LLMs Reliably Self-Report Adversarial Prefills, and How? · Score 160
    title matched "LLM";summary matched "language model";summary matched "large language model"
  4. Evaluation Awareness Is Not One Capability: Evidence from Open Language Models · Score 145
    title matched "language model";title matched "evaluation";summary matched "instruction tuning"
  5. POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation · Score 145
    title matched "language model";title matched "large language model";summary matched "LLM"

2026-06-19

15 papers matched · Generated 2026-06-19 14:26:15 (Asia/Shanghai)
LM: 15 papers

"QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation" [Evaluation / Method]: Large Language Models (LLMs) have made significant progress in reasoning, particularly in ded…

  1. QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation · Score 221
    title matched "language model";title matched "large language model";title matched "reasoning"
  2. LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems · Score 201
    title matched "LLM";title matched "agent";title matched "benchmark"
  3. Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference · Score 191
    title matched "LLM";summary matched "language model";summary matched "large language model"
  4. Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems · Score 162
    title matched "LLM";title matched "agent";summary matched "language model"
  5. Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users · Score 162
    title matched "LLM";title matched "alignment";summary matched "language model"

2026-06-18

15 papers matched · Generated 2026-06-18 14:03:08 (Asia/Shanghai)
LM: 15 papers

"Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play" [Evaluation / Method]: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with exec…

  1. Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play · Score 185
    title matched "language model";title matched "large language model";title matched "agent"
  2. IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages · Score 182
    title matched "language model";title matched "large language model";title matched "benchmark"
  3. A Technical Taxonomy of LLM Agent Communication Protocols · Score 160
    title matched "LLM";title matched "agent";summary matched "language model"
  4. Quantifying and Auditing LLM Evaluation via Positive--Unlabeled Learning · Score 159
    title matched "LLM";title matched "evaluation";summary matched "language model"
  5. Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA · Score 158
    title matched "LLM";summary matched "language model";summary matched "large language model"

2026-06-17

15 papers matched · Generated 2026-06-17 14:22:19 (Asia/Shanghai)
LM: 15 papers

"Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports" [Evaluation / Data / Application / Method]: Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowled…

  1. Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports · Score 176
    title matched "LLM";summary matched "language model";summary matched "large language model"
  2. Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews · Score 164
    title matched "language model";title matched "large language model";title matched "RAG"
  3. The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act · Score 162
    title matched "reasoning";title matched "benchmark";summary matched "language model"
  4. WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning · Score 162
    title matched "reasoning";title matched "agent";summary matched "language model"
  5. From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning · Score 161
    title matched "language model";title matched "reasoning";summary matched "large language model"

2026-06-16

15 papers matched · Generated 2026-06-16 14:38:43 (Asia/Shanghai)
LM: 15 papers

"OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models" [Evaluation / Application / Method]: Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems…

  1. OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models · Score 235
    title matched "language model";title matched "large language model";title matched "agent"
  2. Context-Aware RL for Agentic and Multimodal LLMs · Score 199
    title matched "LLM";title matched "agent";summary matched "language model"
  3. Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio · Score 185
    title matched "LLM";title matched "agent";title matched "benchmark"
  4. Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification · Score 184
    title matched "language model";title matched "large language model";title matched "agent"
  5. Scalable Circuit Learning for Interpreting Large Language Models · Score 162
    title matched "language model";title matched "large language model";summary matched "LLM"

2026-06-12

15 papers matched · Generated 2026-06-12 13:55:02 (Asia/Shanghai)
LM: 15 papers

"EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments" [Evaluation / Application / Method]: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations as…

  1. EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments · Score 200
    title matched "LLM";title matched "agent";summary matched "language model"
  2. Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents · Score 178
    title matched "agent";title matched "benchmark";summary matched "language model"
  3. An LLM System for Autonomous Variational Quantum Circuit Design · Score 174
    title matched "LLM";summary matched "language model";summary matched "large language model"
  4. Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities · Score 168
    title matched "LLM";summary matched "language model";summary matched "large language model"
  5. SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning · Score 164
    title matched "reasoning";title matched "agent";summary matched "language model"

2026-06-11

15 papers matched · Generated 2026-06-11 13:59:12 (Asia/Shanghai)
LM: 15 papers

"Measuring Epistemic Resilience of LLMs Under Misleading Medical Context" [Evaluation / Application / Method]: Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores…

  1. Measuring Epistemic Resilience of LLMs Under Misleading Medical Context · Score 194
    title matched "LLM";summary matched "language model";summary matched "large language model"
  2. Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation · Score 182
    title matched "LLM";title matched "benchmark";title matched "evaluation"
  3. OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models · Score 178
    title matched "language model";title matched "reasoning";summary matched "alignment"
  4. Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization · Score 159
    title matched "reasoning";summary matched "language model";summary matched "large language model"
  5. ALIGNBEAM : Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing · Score 159
    title matched "alignment";summary matched "language model";summary matched "large language model"

2026-06-10

15 papers matched · Generated 2026-06-10 13:25:04 (Asia/Shanghai)
LM: 15 papers

"T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains" [Evaluation / Application / Method]: Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic sys…

  1. T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains · Score 217
    title matched "agent";title matched "benchmark";summary matched "language model"
  2. Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution · Score 215
    title matched "LLM";title matched "agent";summary matched "language model"
  3. ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity · Score 200
    title matched "agent";title matched "benchmark";summary matched "language model"
  4. Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning · Score 180
    title matched "LLM";title matched "reasoning";summary matched "language model"
  5. Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam? · Score 175
    title matched "LLM";summary matched "language model";summary matched "large language model"

2026-06-09

15 papers matched · Generated 2026-06-09 13:12:49 (Asia/Shanghai)
LM: 15 papers

"SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks" [Evaluation / Method]: Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and op…

  1. SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks · Score 238
    title matched "reasoning";title matched "agent";title matched "benchmark"
  2. Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving · Score 180
    title matched "LLM";title matched "benchmark";summary matched "language model"
  3. TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs · Score 179
    title matched "LLM";title matched "benchmark";summary matched "language model"
  4. Gradient-Guided Reward Optimization for Inference-time Alignment · Score 176
    title matched "alignment";summary matched "language model";summary matched "large language model"
  5. IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking · Score 173
    summary matched "language model";summary matched "large language model";summary matched "LLM"

2026-06-05

15 papers matched · Generated 2026-06-05 13:25:00 (Asia/Shanghai)
LM: 15 papers

"MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models" [Evaluation / Method]: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that…

  1. MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models · Score 214
    title matched "language model";title matched "large language model";title matched "benchmark"
  2. CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments · Score 210
    title matched "LLM";title matched "agent";summary matched "language model"
  3. AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints · Score 196
    title matched "language model";title matched "large language model";title matched "agent"
  4. The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models · Score 196
    title matched "language model";title matched "large language model";title matched "reasoning"
  5. Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems · Score 192
    title matched "LLM";title matched "agent";summary matched "language model"

2026-06-04

15 papers matched · Generated 2026-06-04 14:02:06 (Asia/Shanghai)
LM: 15 papers

"A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs" [Evaluation / Application / Method]: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under mult…

  1. A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs · Score 191
    title matched "LLM";title matched "evaluation";summary matched "language model"
  2. Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases · Score 191
    title matched "language model";title matched "large language model";summary matched "LLM"
  3. Self-Evolving Deep Research via Joint Generation and Evaluation · Score 187
    title matched "evaluation";summary matched "language model";summary matched "large language model"
  4. Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents · Score 177
    title matched "LLM";title matched "agent";title matched "benchmark"
  5. Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas · Score 177
    title matched "language model";title matched "large language model";title matched "alignment"

2026-06-03

15 papers matched · Generated 2026-06-03 14:09:56 (Asia/Shanghai)
LM: 15 papers

"Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning" [Evaluation / Method]: Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficie…

  1. Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning · Score 213
    title matched "LLM";title matched "reasoning";title matched "agent"
  2. Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models · Score 213
    title matched "language model";title matched "large language model";title matched "alignment"
  3. Can Factual Opinions Be Edited (Manipulated) in Large Language Models? · Score 191
    title matched "language model";title matched "large language model";summary matched "LLM"
  4. Large Language Models Are Overconfident in Their Own Responses · Score 191
    title matched "language model";title matched "large language model";summary matched "LLM"
  5. Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models · Score 177
    title matched "language model";title matched "large language model";title matched "alignment"

2026-06-02

15 papers matched · Generated 2026-06-02 13:56:35 (Asia/Shanghai)
LM: 15 papers

"POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems" [Evaluation / Application / Method]: Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emerge…

  1. POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems · Score 192
    title matched "agent";summary matched "language model";summary matched "large language model"
  2. MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation · Score 184
    title matched "LLM";title matched "agent";title matched "benchmark"
  3. Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling · Score 178
    title matched "LLM";summary matched "language model";summary matched "large language model"
  4. AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design · Score 165
    title matched "language model";title matched "reasoning";title matched "agent"
  5. Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation · Score 163
    title matched "language model";title matched "large language model";summary matched "LLM"

2026-05-29

15 papers matched · Generated 2026-05-29 13:18:32 (Asia/Shanghai)
LM: 15 papers

"FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations" [Evaluation / Method]: Recently, large language models (LLMs) have achieved superior performance in static f…

  1. FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations · Score 232
    title matched "LLM";title matched "reasoning";title matched "benchmark"
  2. CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models · Score 196
    title matched "language model";title matched "large language model";title matched "reasoning"
  3. Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning · Score 196
    title matched "language model";title matched "large language model";title matched "reasoning"
  4. Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach · Score 192
    title matched "language model";title matched "large language model";summary matched "LLM"
  5. KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs · Score 192
    title matched "agent";title matched "benchmark";summary matched "language model"

2026-05-28

15 papers matched · Generated 2026-05-28 13:15:52 (Asia/Shanghai)
LM: 15 papers

"MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems" [Evaluation / Method]: Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unr…

  1. MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems · Score 199
    title matched "language model";title matched "large language model";summary matched "LLM"
  2. Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability · Score 184
    title matched "LLM";title matched "reasoning";title matched "evaluation"
  3. TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning · Score 181
    title matched "LLM";title matched "reasoning";summary matched "language model"
  4. Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents · Score 180
    title matched "LLM";title matched "agent";summary matched "language model"
  5. The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic · Score 177
    title matched "evaluation";summary matched "language model";summary matched "large language model"

2026-05-27

15 papers matched · Generated 2026-05-27 13:23:19 (Asia/Shanghai)
LM: 15 papers

"Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry" [Evaluation / Data / Application / Method]: Key knowledge for steel-industry volatile organic compounds (VOCs) governance is s…

  1. Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry · Score 214
    title matched "LLM";title matched "reasoning";summary matched "language model"
  2. Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation · Score 201
    title matched "reasoning";title matched "agent";title matched "benchmark"
  3. Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments · Score 176
    title matched "agent";summary matched "language model";summary matched "large language model"
  4. VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions · Score 175
    title matched "agent";summary matched "language model";summary matched "large language model"
  5. MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation · Score 164
    title matched "agent";title matched "evaluation";summary matched "language model"

2026-05-26

15 papers matched · Generated 2026-05-26 13:09:24 (Asia/Shanghai)
LM: 15 papers

"Automated Benchmark Auditing for AI Agents and Large Language Models" [Evaluation / Data / Method]: Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often co…

  1. Automated Benchmark Auditing for AI Agents and Large Language Models · Score 244
    title matched "language model";title matched "large language model";title matched "agent"
  2. Causal methods for LLM development and evaluation · Score 211
    title matched "LLM";title matched "evaluation";summary matched "language model"
  3. PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction · Score 202
    title matched "LLM";title matched "reasoning";title matched "agent"
  4. Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning · Score 197
    title matched "LLM";title matched "agent";summary matched "language model"
  5. When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation · Score 180
    title matched "LLM";title matched "agent";summary matched "language model"

2026-05-22

15 papers matched · Generated 2026-05-22 13:08:19 (Asia/Shanghai)
LM: 15 papers

"Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents" [Evaluation / Method]: Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses…

  1. Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents · Score 196
    title matched "LLM";title matched "agent";title matched "evaluation"
  2. Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety · Score 192
    title matched "agent";title matched "benchmark";summary matched "language model"
  3. ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning · Score 192
    title matched "reasoning";title matched "benchmark";summary matched "LLM"
  4. LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance · Score 188
    title matched "reasoning";summary matched "language model";summary matched "large language model"
  5. From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment · Score 174
    title matched "LLM";title matched "alignment";summary matched "language model"

2026-05-21

15 papers matched · Generated 2026-05-21 13:14:24 (Asia/Shanghai)
LM: 15 papers

"Tracing the ongoing emergence of human-like reasoning in Large Language Models" [Method]: Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implyin…

  1. Tracing the ongoing emergence of human-like reasoning in Large Language Models · Score 184
    title matched "language model";title matched "large language model";title matched "reasoning"
  2. What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema · Score 167
    title matched "LLM";title matched "agent";title matched "benchmark"
  3. Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution · Score 164
    title matched "LLM";title matched "RAG";summary matched "language model"
  4. LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models · Score 163
    title matched "language model";title matched "large language model";summary matched "LLM"
  5. WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata · Score 160
    title matched "benchmark";summary matched "language model";summary matched "large language model"

2026-05-20

15 papers matched · Generated 2026-05-20 13:10:58 (Asia/Shanghai)
LM: 15 papers

"MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models" [Evaluation / Method]: Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of \emph{inattention…

  1. MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models · Score 236
    title matched "language model";title matched "large language model";title matched "reasoning"
  2. OpenCompass: A Universal Evaluation Platform for Large Language Models · Score 232
    title matched "language model";title matched "large language model";title matched "evaluation"
  3. Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking · Score 214
    title matched "LLM";title matched "agent";title matched "evaluation"
  4. SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models · Score 214
    title matched "language model";title matched "large language model";title matched "evaluation"
  5. LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening · Score 196
    title matched "LLM";title matched "reasoning";title matched "benchmark"

2026-05-19

15 papers matched · Generated 2026-05-19 13:08:04 (Asia/Shanghai)
LM: 15 papers

"CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark" [Evaluation / Data / Method]: Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view pe…

  1. CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark · Score 217
    title matched "LLM";title matched "benchmark";summary matched "language model"
  2. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science · Score 181
    title matched "LLM";title matched "benchmark";summary matched "language model"
  3. MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion · Score 176
    title matched "agent";summary matched "language model";summary matched "large language model"
  4. SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents · Score 168
    title matched "LLM";title matched "agent";title matched "benchmark"
  5. LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems · Score 158
    title matched "agent";summary matched "LLM";summary matched "reasoning"

2026-05-18

15 papers matched · Generated 2026-05-18 13:13:17 (Asia/Shanghai)
LM: 15 papers

"CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency" [Evaluation / Application / Method]: This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate…

  1. CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency · Score 236
    title matched "LLM"; title matched "agent"; title matched "benchmark"
  2. FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models · Score 232
    title matched "language model"; title matched "large language model"; title matched "benchmark"
  3. MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models · Score 214
    title matched "language model"; title matched "large language model"; title matched "benchmark"
  4. Large Language Models Could Be Rote Learners · Score 192
    title matched "language model"; title matched "large language model"; summary matched "LLM"
  5. Look Before You Leap: Autonomous Exploration for LLM Agents · Score 192
    title matched "LLM"; title matched "agent"; summary matched "language model"

2026-05-15

11 papers matched · Generated 2026-05-15 14:57:29 (Asia/Shanghai)
LM: 11 papers

"Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks" [Evaluation / Data / Method]: We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times…

  1. Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks · Score 202
    title matched "LLM"; title matched "RAG"; title matched "benchmark"
  2. SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning · Score 161
    title matched "LLM"; title matched "reasoning"; summary matched "language model"
  3. Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use · Score 161
    title matched "LLM"; title matched "reasoning"; summary matched "language model"
  4. APWA: A Distributed Architecture for Parallelizable Agentic Workflows · Score 158
    title matched "agent"; summary matched "language model"; summary matched "large language model"
  5. MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory · Score 144
    title matched "agent"; title matched "evaluation"; summary matched "reasoning"

2026-05-14

15 papers matched · Generated 2026-05-14 12:52:54 (Asia/Shanghai)
LM: 15 papers

"RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation" [Evaluation / Data / Method]: Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicia…

  1. RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation · Score 200
    title matched "LLM"; title matched "agent"; title matched "benchmark"
  2. MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling · Score 163
    title matched "language model"; title matched "large language model"; summary matched "LLM"
  3. An LLM-Based System for Argument Reconstruction · Score 160
    title matched "LLM"; summary matched "language model"; summary matched "large language model"
  4. OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research · Score 157
    title matched "agent"; summary matched "language model"; summary matched "large language model"
  5. (How) Do Large Language Models Understand High-Level Message Sequence Charts? · Score 145
    title matched "language model"; title matched "large language model"; summary matched "LLM"

2026-05-13

15 papers matched · Generated 2026-05-13 12:54:34 (Asia/Shanghai)
LM: 15 papers

"MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering" [Evaluation / Data / Application / Method]: Evaluating large language models (LLMs) in the biomedical domain requi…

  1. MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering · Score 225
    title matched "LLM"; title matched "reasoning"; title matched "benchmark"
  2. ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models · Score 222
    title matched "language model"; title matched "large language model"; title matched "alignment"
  3. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models · Score 207
    title matched "language model"; title matched "large language model"; title matched "reasoning"
  4. Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring · Score 182
    title matched "language model"; title matched "large language model"; summary matched "LLM"
  5. Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering · Score 177
    title matched "evaluation"; summary matched "language model"; summary matched "large language model"

2026-05-12

15 papers matched · Generated 2026-05-12 12:42:08 (Asia/Shanghai)
LM: 15 papers

"WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation" [Evaluation / Method]: Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) ha…

  1. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation · Score 205
    title matched "agent"; title matched "benchmark"; title matched "evaluation"
  2. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox · Score 185
    title matched "LLM"; title matched "agent"; title matched "evaluation"
  3. AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents · Score 168
    title matched "LLM"; title matched "agent"; title matched "benchmark"
  4. LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments · Score 167
    title matched "LLM"; title matched "agent"; title matched "benchmark"
  5. Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights · Score 163
    title matched "language model"; title matched "evaluation"; summary matched "large language model"

2026-05-08

12 papers matched · Generated 2026-05-08 14:15:32 (Asia/Shanghai)
LM: 12 papers

"LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG" [Evaluation / Data / Application / Method]: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question…

  1. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG · Score 231
    title matched "reasoning"; title matched "agent"; title matched "RAG"
  2. MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents · Score 213
    title matched "LLM"; title matched "agent"; title matched "benchmark"
  3. Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity · Score 213
    title matched "LLM"; title matched "alignment"; title matched "evaluation"
  4. Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback · Score 213
    title matched "LLM"; title matched "agent"; title matched "evaluation"
  5. BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models · Score 209
    title matched "language model"; title matched "large language model"; summary matched "LLM"

2026-05-07

15 papers matched · Generated 2026-05-07 12:38:06 (Asia/Shanghai)
LM: 15 papers

"Misaligned by Reward: Socially Undesirable Preferences in LLMs" [Evaluation / Data / Method]: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, exis…

  1. Misaligned by Reward: Socially Undesirable Preferences in LLMs · Score 194
    title matched "LLM"; summary matched "language model"; summary matched "large language model"
  2. SoK: Robustness in Large Language Models against Jailbreak Attacks · Score 181
    title matched "language model"; title matched "large language model"; summary matched "LLM"
  3. Why Expert Alignment Is Hard: Evidence from Subjective Evaluation · Score 161
    title matched "alignment"; title matched "evaluation"; summary matched "language model"
  4. KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels · Score 143
    title matched "LLM"; title matched "benchmark"; summary matched "RAG"
  5. Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction · Score 142
    title matched "LLM"; summary matched "language model"; summary matched "large language model"

2026-05-06

15 papers matched · Generated 2026-05-06 12:37:23 (Asia/Shanghai)
LM: 15 papers

"Safety and accuracy follow different scaling laws in clinical large language models" [Evaluation / Application / Method]: Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time comput…

  1. Safety and accuracy follow different scaling laws in clinical large language models · Score 201
    title matched "language model"; title matched "large language model"; summary matched "LLM"
  2. Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones · Score 183
    title matched "LLM"; title matched "reasoning"; title matched "agent"
  3. OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking · Score 161
    title matched "LLM"; title matched "benchmark"; summary matched "language model"
  4. Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems · Score 147
    title matched "reasoning"; title matched "agent"; summary matched "benchmark"
  5. Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus · Score 146
    title matched "language model"; title matched "large language model"; title matched "benchmark"

2026-05-05

15 papers matched · Generated 2026-05-05 12:20:54 (Asia/Shanghai)
LM: 15 papers

"StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models" [Evaluation / Data / Method]: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting especia…

  1. StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models · Score 219
    title matched "language model"; title matched "large language model"; title matched "reasoning"
  2. Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks · Score 211
    title matched "LLM"; title matched "benchmark"; summary matched "language model"
  3. Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models · Score 197
    title matched "language model"; title matched "large language model"; title matched "reasoning"
  4. MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation · Score 197
    title matched "language model"; title matched "large language model"; title matched "alignment"
  5. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice · Score 197
    title matched "language model"; title matched "large language model"; title matched "LLM"

2026-05-01

15 papers matched · Generated 2026-05-01 12:53:56 (Asia/Shanghai)
LM: 15 papers

"Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents" [Evaluation / Application / Method]: We present Collabora…

  1. Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents · Score 193
    title matched "reasoning"; title matched "agent"; summary matched "language model"
  2. What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design · Score 185
    title matched "agent"; title matched "benchmark"; title matched "evaluation"
  3. Rethinking Agentic Reinforcement Learning In Large Language Models · Score 182
    title matched "language model"; title matched "large language model"; title matched "agent"
  4. TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering · Score 181
    title matched "reasoning"; title matched "benchmark"; summary matched "language model"
  5. LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning · Score 180
    title matched "LLM"; title matched "reasoning"; summary matched "language model"

2026-04-29

15 papers matched · Generated 2026-04-29 12:26:28 (Asia/Shanghai)
LM: 15 papers

"LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation" [Evaluation / Data / Method]: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across hetero…

  1. LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation · Score 197
    title matched "LLM"; title matched "evaluation"; summary matched "language model"
  2. Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models · Score 178
    title matched "language model"; title matched "large language model"; summary matched "LLM"
  3. DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios · Score 165
    title matched "agent"; title matched "benchmark"; summary matched "LLM"
  4. From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling · Score 164
    title matched "LLM"; title matched "agent"; summary matched "language model"
  5. SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing? · Score 158
    title matched "agent"; summary matched "language model"; summary matched "large language model"