Feed Subscription
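Suited to tracking a single research direction over the long term. This page summarizes the feed's activity over the last 7 and 30 days, and keeps each day's matched entries along with links to the daily digest.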
Last 7 days: 60 papers
Last 30 days: 240 papers
All time: 518 papers
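LM has no new matching papers today.
If this feed also matches keywords in your configuration, a long-term tracking entry point appears here.
Review this feed's matching papers day by day; each day's digest is kept as raw Markdown / JSON output.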
"NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models" [Evaluation / Data / Method]: Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but e…
"InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy" [Evaluation / Application / Method]: Large language models are increasingly deployed as investment res…
"AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning" [Evaluation / Method]: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge.…
"AIR: Adaptive Interleaved Reasoning with Code in MLLMs" [Evaluation / Data / Application / Method]: Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has be…
"QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation" [Evaluation / Method]: Large Language Models (LLMs) have made significant progress in reasoning, particularly in ded…
"Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play" [Evaluation / Method]: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with exec…
"Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports" [Evaluation / Data / Application / Method]: Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowled…
"OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models" [Evaluation / Application / Method]: Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems…
"EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments" [Evaluation / Application / Method]: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations as…
"Measuring Epistemic Resilience of LLMs Under Misleading Medical Context" [Evaluation / Application / Method]: Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores…
"T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains" [Evaluation / Application / Method]: Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic sys…
"SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks" [Evaluation / Method]: Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and op…
"MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models" [Evaluation / Method]: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that…
"A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs" [Evaluation / Application / Method]: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under mult…
"Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning" [Evaluation / Method]: Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficie…
"POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems" [Evaluation / Application / Method]: Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emerge…
"FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations" [Evaluation / Method]: Recently, large language models (LLMs) have achieved superior performance in static f…
"MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems" [Evaluation / Method]: Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unr…
"Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry" [Evaluation / Data / Application / Method]: Key knowledge for steel-industry volatile organic compounds (VOCs) governance is s…
"Automated Benchmark Auditing for AI Agents and Large Language Models" [Evaluation / Data / Method]: Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often co…
"Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents" [Evaluation / Method]: Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses…
"Tracing the ongoing emergence of human-like reasoning in Large Language Models" [Method]: Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implyin…
"MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models" [Evaluation / Method]: Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of \emph{inattention…
"CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark" [Evaluation / Data / Method]: Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view pe…
"CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency" [Evaluation / Application / Method]: This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate…
"Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks" [Evaluation / Data / Method]: We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times…
"RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation" [Evaluation / Data / Method]: Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicia…
"MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering" [Evaluation / Data / Application / Method]: Evaluating large language models (LLMs) in the biomedical domain requi…
"WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation" [Evaluation / Method]: Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) ha…
"LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG" [Evaluation / Data / Application / Method]: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question…
"Misaligned by Reward: Socially Undesirable Preferences in LLMs" [Evaluation / Data / Method]: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, exis…
"Safety and accuracy follow different scaling laws in clinical large language models" [Evaluation / Application / Method]: Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time comput…
"StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models" [Evaluation / Data / Method]: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting especia…
"Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents" [Evaluation / Application / Method]: We present Collabora…
"LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation" [Evaluation / Data / Method]: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across hetero…