Feed Subscription

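LLM Pinned Subscription Page

Suited to tracking a single research direction over the long term. The page summarizes this feed's performance over the last 7 and 30 days, and keeps each day's matched entries along with links to the digests.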

Last 7 days: 60 papers, 4 active digests

Last 30 days: 105 papers, 7 active digests

All time: 105 papers, 7 active digests

Recent trend

"Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents" [Evaluation / Application / Method]: Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization)…

Papers matched per day:

2026-04-09: 0
2026-04-10: 0
2026-04-11: 0
2026-04-12: 0
2026-04-13: 0
2026-04-14: 15
2026-04-15: 15
2026-04-16: 15
2026-04-17: 15
2026-04-18: 0
2026-04-19: 0
2026-04-20: 0
2026-04-21: 15
2026-04-22: 15
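
The window totals at the top of the page can be reproduced from the daily counts. The following is a minimal sketch, not the page's actual implementation. It assumes that a window includes its end day, that an "active digest" is any day with at least one matched paper, and that days absent from the data had no hits. The names `daily_counts` and `window_stats` are hypothetical. The 2026-04-08 value comes from the hit history entry for that day, not from the trend data.

```python
from datetime import date, timedelta

# Daily matched-paper counts from the trend data (2026-04-09 to 2026-04-22),
# plus the 2026-04-08 entry taken from the hit history.
daily_counts = {
    date(2026, 4, 8): 15,
    date(2026, 4, 9): 0,
    date(2026, 4, 10): 0,
    date(2026, 4, 11): 0,
    date(2026, 4, 12): 0,
    date(2026, 4, 13): 0,
    date(2026, 4, 14): 15,
    date(2026, 4, 15): 15,
    date(2026, 4, 16): 15,
    date(2026, 4, 17): 15,
    date(2026, 4, 18): 0,
    date(2026, 4, 19): 0,
    date(2026, 4, 20): 0,
    date(2026, 4, 21): 15,
    date(2026, 4, 22): 15,
}


def window_stats(counts, end, days):
    """Return (total papers, active digests) for the window of `days` days ending on `end`.

    Assumption: the window is inclusive of `end`, and a day counts as an
    active digest when it has at least one matched paper.
    """
    start = end - timedelta(days=days - 1)
    in_window = [n for d, n in counts.items() if start <= d <= end]
    total_papers = sum(in_window)
    active_digests = sum(1 for n in in_window if n > 0)
    return total_papers, active_digests


print(window_stats(daily_counts, date(2026, 4, 22), 7))   # (60, 4)
print(window_stats(daily_counts, date(2026, 4, 22), 30))  # (105, 7)
```

Under these assumptions the results agree with the summary figures: 60 papers across 4 digests for the last 7 days, and 105 papers across 7 digests for the last 30 days.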

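Related keyword pages

If this feed also matches keywords in your configuration, long-term tracking links for them appear here.

Hit history

Review this feed's matched papers day by day; each day's digest is kept as raw Markdown / JSON output.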

2026-04-22

15 papers matched · generated 2026-04-22 11:37:03 (Asia/Shanghai)
LLM: 15 papers

"Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents" [Evaluation / Application / Method]: Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization)…

  1. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents · Score 162
    title matched "agent"; title matched "alignment"; summary matched "reasoning"
  2. Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps · Score 149
    title matched "agent"; title matched "benchmark"; title matched "evaluation"
  3. Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment · Score 145
    title matched "agent"; title matched "alignment"; summary matched "reasoning"
  4. Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views · Score 130
    title matched "reasoning"; title matched "alignment"; summary matched "benchmark"
  5. Revac: A Social Deduction Reasoning Agent · Score 127
    title matched "agent"; title matched "reasoning"; summary matched "evaluation"

2026-04-21

15 papers matched · generated 2026-04-21 11:40:46 (Asia/Shanghai)
LLM: 15 papers

"MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval" [Evaluation / Data / Method]: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing…

  1. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval · Score 112
    title matched "reasoning"; title matched "benchmark"; has PDF
  2. Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion · Score 108
    title matched "benchmark"; summary matched "reasoning"; summary matched "evaluation"
  3. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents · Score 107
    title matched "agent"; summary matched "benchmark"; summary matched "evaluation"
  4. MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation · Score 107
    title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
  5. OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation · Score 106
    title matched "reasoning"; summary matched "agent"; summary matched "benchmark"

2026-04-17

15 papers matched · generated 2026-04-17 11:39:21 (Asia/Shanghai)
LLM: 15 papers

"CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas" [Evaluation / Method]: It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, re…

  1. CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas · Score 130
    title matched "agent"; title matched "benchmark"; summary matched "reasoning"
  2. IE as Cache: Information Extraction Enhanced Agentic Reasoning · Score 124
    title matched "agent"; title matched "reasoning"; summary matched "benchmark"
  3. QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies · Score 123
    title matched "benchmark"; summary matched "agent"; summary matched "alignment"
  4. From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench · Score 122
    title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
  5. An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics · Score 109
    title matched "benchmark"; title matched "evaluation"; has PDF

2026-04-16

15 papers matched · generated 2026-04-16 11:43:00 (Asia/Shanghai)
LLM: 15 papers

"GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis" [Evaluation / Application / Method]: The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift…

  1. GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis · Score 162
    title matched "agent"; title matched "benchmark"; summary matched "reasoning"
  2. HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark · Score 127
    title matched "agent"; title matched "benchmark"; summary matched "evaluation"
  3. Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning · Score 120
    title matched "evaluation"; summary matched "agent"; summary matched "reasoning"
  4. LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning · Score 112
    title matched "reasoning"; title matched "benchmark"; has PDF
  5. Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis · Score 108
    title matched "reasoning"; summary matched "benchmark"; summary matched "evaluation"

2026-04-15

15 papers matched · generated 2026-04-15 11:35:50 (Asia/Shanghai)
LLM: 15 papers

"Parallax: Why AI Agents That Think Must Never Act" [Evaluation / Application / Method]: Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise application…

  1. Parallax: Why AI Agents That Think Must Never Act · Score 107
    title matched "agent"; summary matched "reasoning"; summary matched "evaluation"
  2. Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents · Score 107
    title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
  3. Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss · Score 106
    title matched "benchmark"; summary matched "reasoning"; summary matched "evaluation"
  4. Towards Long-horizon Agentic Multimodal Search · Score 106
    title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
  5. QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence · Score 105
    title matched "agent"; summary matched "benchmark"; summary matched "evaluation"

2026-04-14

15 papers matched · generated 2026-04-14 11:37:06 (Asia/Shanghai)
LLM: 15 papers

"UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents" [Evaluation / Data / Application / Method]: Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems throu…

  1. UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents · Score 145
    title matched "agent"; title matched "evaluation"; summary matched "reasoning"
  2. General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks · Score 130
    title matched "reasoning"; title matched "benchmark"; summary matched "evaluation"
  3. Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games · Score 129
    title matched "agent"; title matched "reasoning"; summary matched "benchmark"
  4. FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning · Score 127
    title matched "agent"; title matched "reasoning"; summary matched "evaluation"
  5. From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python · Score 126
    title matched "agent"; title matched "benchmark"; summary matched "evaluation"

2026-04-08

15 papers matched · generated 2026-04-08 17:10:24 (Asia/Shanghai)