Action Queue

Review Queue

This page collects the papers most worth handling next, helping you turn daily research intelligence into concrete follow-up actions.

Priority

overdue papers

0 due within 3 days

Next step set

papers awaiting follow-through

2 with an action plan

Reading progress

papers being read

0 completed

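Action Queue

Prioritize papers that are overdue, due within 3 days, or have a next step written down but not yet acted on.

Overdue

Papers past their planned handling date; clear this backlog first.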

Review Queue

Starred

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework

The rapid growth of scientific literature has made it increasingly difficult for researchers to efficiently discover, evaluate, and synthesize relevant work. Recent advances in multi-agent large language models (LLMs) h…

Note: Anchor paper for the multi-agent discovery workflow; compare its planner design with newer agent benchmarks.

Next step: compare planner design with newer agent benchmarks

Due by: 2026-04-18

1 day, 1 feed, 1 hit

First seen: 2026-04-08 17:10:24 (UTC+08:00)

Last seen: 2026-04-08 17:10:24 (UTC+08:00)

Review Queue

To follow up

ClinicRealm: Re-evaluating large language models with conventional machine learning for non-generative clinical prediction tasks.

Large Language Models (LLMs) are increasingly deployed in medicine. However, their utility for non-generative clinical prediction is under-evaluated, and they are often assumed to be inferior to specialized models, crea…

Note: Recheck whether ClinicRealm still beats classical clinical baselines under the same task framing.

Next step: recheck benchmark framing against classical baselines

Due by: 2026-04-20

Review cycle: every 14 days

1 day, 1 feed, 1 hit

First seen: 2026-04-09 14:51:56 (UTC+08:00)

Last seen: 2026-04-09 14:51:56 (UTC+08:00)

New and unmarked

Highly relevant papers that appeared in the last 7 days and have not yet received any personal feedback.

Review Queue

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to…

1 day, 1 feed, 1 hit

First seen: 2026-06-19 14:26:15 (UTC+08:00)

Last seen: 2026-06-19 14:26:15 (UTC+08:00)

Review Queue

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but ensuring their reliability in highly technical domains remains a significant challenge. In nuclear engineering, problem…

1 day, 1 feed, 1 hit

First seen: 2026-06-26 13:16:53 (UTC+08:00)

Last seen: 2026-06-26 13:16:53 (UTC+08:00)

Review Queue

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We pre…

1 day, 1 feed, 1 hit

First seen: 2026-06-19 14:26:15 (UTC+08:00)

Last seen: 2026-06-19 14:26:15 (UTC+08:00)

Review Queue

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has become a pivotal research frontier. The existing literature focuses primarily…

1 day, 1 feed, 1 hit

First seen: 2026-06-23 13:10:02 (UTC+08:00)

Last seen: 2026-06-23 13:10:02 (UTC+08:00)

Review Queue

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge. We study archive-grounded reasoning: locating sparse evidence across a large, messy coll…

1 day, 1 feed, 1 hit

First seen: 2026-06-24 13:06:49 (UTC+08:00)

Last seen: 2026-06-24 13:06:49 (UTC+08:00)

Review Queue

TriggerBench: Investigating Prospective Memory for Large Language Models

While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Prospective memory (PM), the critical abilit…

1 day, 1 feed, 1 hit

First seen: 2026-06-23 13:10:02 (UTC+08:00)

Last seen: 2026-06-23 13:10:02 (UTC+08:00)

Review Queue

AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach

The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attrib…

1 day, 1 feed, 1 hit

First seen: 2026-06-24 13:06:49 (UTC+08:00)

Last seen: 2026-06-24 13:06:49 (UTC+08:00)

Review Queue

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

Large language models (LLMs) have achieved strong performance across a wide range of language-based tasks by leveraging both extensive parametric knowledge and in-context learning ability, enabling them to incorporate e…

1 day, 1 feed, 1 hit

First seen: 2026-06-19 14:26:15 (UTC+08:00)

Last seen: 2026-06-19 14:26:15 (UTC+08:00)

Review Queue

InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy

Large language models are increasingly deployed as investment research assistants, yet no benchmark tests whether they can accurately reconstruct and apply the specific procedural decision frameworks of expert investors…

1 day, 1 feed, 1 hit

First seen: 2026-06-25 13:11:21 (UTC+08:00)

Last seen: 2026-06-25 13:11:21 (UTC+08:00)

Review Queue

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focused on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently…

1 day, 1 feed, 1 hit

First seen: 2026-06-25 13:11:21 (UTC+08:00)

Last seen: 2026-06-25 13:11:21 (UTC+08:00)

Review Queue

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by em…

1 day, 1 feed, 1 hit

First seen: 2026-06-25 13:11:21 (UTC+08:00)

Last seen: 2026-06-25 13:11:21 (UTC+08:00)

Review Queue

A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial

Rare diseases affect millions of individuals worldwide, yet timely diagnosis remains a major public health challenge due to scarcity of specialized clinical expertise. While large language models (LLMs) show promise to…

1 day, 1 feed, 1 hit

First seen: 2026-06-24 13:06:49 (UTC+08:00)

Last seen: 2026-06-24 13:06:49 (UTC+08:00)