"NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models" [Evaluation / Data / Method]: Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but e…

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models · Score 218
title matched "language model"; title matched "large language model"; title matched "benchmark"
Joint Learning of Experiential Rules and Policies for Large Language Model Agents · Score 165
title matched "language model"; title matched "large language model"; title matched "agent"
The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans · Score 165
title matched "language model"; title matched "large language model"; title matched "reasoning"
Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings · Score 163
title matched "language model"; title matched "large language model"; summary matched "LLM"
Semantic Early-Stopping for Iterative LLM Agent Loops · Score 160
title matched "LLM"; title matched "agent"; summary matched "language model"

Agent Runtime Security · 4 papers

"Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries" [Evaluation / Application / Method]: With a profusion of jailbreaks for LLMs now widely known, a growing concern is that…

Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries · Score 64
title matched "jailbreak"; has PDF; has rich summary
Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation · Score 59
title matched "guardrail"; has PDF; has rich summary
AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems · Score 41
summary matched "guardrail"; has PDF; has rich summary
MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG · Score 40
summary matched "prompt injection"; has PDF; has rich summary

Terminal and SWE Agents · 7 papers

"Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair" [Evaluation / Method]: Language Models (LLMs) are powerful tools and have been increasingly adopted for complex software engineering tasks…

Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair · Score 108
title matched "program repair"; title matched "automated program repair"; has PDF
To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair · Score 83
title matched "program repair"; summary matched "SWE-bench"; has PDF
How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring · Score 65
title matched "code agent"; has PDF; has rich summary
A Deterministic Control Plane for LLM Coding Agents · Score 64
title matched "coding agent"; has PDF; has rich summary
NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems · Score 47
summary matched "coding agent"; has PDF; has rich summary

2026-06-25

Matched 20 papers · Generated 2026-06-25 13:11:21 (Asia/Shanghai)

LM · 15 papers

"InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy" [Evaluation / Application / Method]: Large language models are increasingly deployed as investment res…

InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy · Score 188
title matched "language model"; title matched "large language model"; title matched "reasoning"
Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models · Score 182
title matched "language model"; title matched "large language model"; summary matched "LLM"
How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations · Score 182
title matched "language model"; title matched "reasoning"; summary matched "LLM"
MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction · Score 170
title matched "agent"; summary matched "language model"; summary matched "large language model"
Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz · Score 164
title matched "agent"; summary matched "LLM"; summary matched "reasoning"

Agent Runtime Security · 3 papers

"How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring" [Method]: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number…

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring · Score 78
title matched "jailbreak"; summary matched "prompt injection"; has PDF
The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems · Score 48
summary matched "guardrail"; has PDF; has rich summary
AI Snitches Get Glitches: Towards Evading Agentic Surveillance · Score 44
summary matched "prompt injection"; has PDF; has rich summary

Terminal and SWE Agents · 2 papers

"Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution" [Evaluation / Application / Method]: Resolving issues with ambiguous and incomplete descriptions, particularly concerning complex bugs, requi…

Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution · Score 78
title matched "issue resolution"; summary matched "SWE-bench"; has PDF
Evaluating LLMs on Real-World Software Performance Optimization · Score 38
summary matched "repository-level"; has PDF; has rich summary

2026-06-24

Matched 26 papers · Generated 2026-06-24 13:06:49 (Asia/Shanghai)

LM · 15 papers

"AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning" [Evaluation / Method]: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge.…

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning · Score 199
title matched "reasoning"; title matched "agent"; title matched "benchmark"
AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach · Score 195
title matched "language model"; title matched "large language model"; title matched "RAG"
A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial · Score 181
title matched "language model"; title matched "large language model"; title matched "reasoning"
EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence · Score 177
title matched "benchmark"; summary matched "language model"; summary matched "large language model"
Are We Ready For An Agent-Native Memory System? · Score 177
title matched "agent"; summary matched "language model"; summary matched "large language model"

Agent Runtime Security · 6 papers

"Burnyard: Future of Malware Analysis" [Method]: Malware analysis is a critical aspect of modern cybersecurity. The prevailing industry practice, sandboxing, involves executing suspicious binaries within isol…; "LLMs Prompted…

Burnyard: Future of Malware Analysis · Score 47
summary matched "sandboxing"; has PDF; has rich summary
LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context · Score 44
summary matched "jailbreak"; has PDF; has rich summary
Red-Teaming the Agentic Red-Team · Score 43
summary matched "guardrail"; has PDF; has rich summary
PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models · Score 41
summary matched "guardrail"; has PDF; has rich summary
Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees · Score 39
summary matched "data exfiltration"; has PDF; has rich summary

Terminal and SWE Agents · 5 papers

"SHERLOC: Structured Diagnostic Localization for Code Repair Agents" [Method]: LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedic…

SHERLOC: Structured Diagnostic Localization for Code Repair Agents · Score 105
title matched "code repair"; summary matched "SWE-bench"; summary matched "repository-level"
NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? · Score 65
title matched "coding agent"; has PDF; has rich summary
Bayesian control for coding agents · Score 64
title matched "coding agent"; has PDF; has rich summary
Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories · Score 63
title matched "coding agent"; has PDF; has rich summary
LemonHarness Technical Report · Score 39
summary matched "Terminal-Bench"; has PDF; has rich summary

2026-06-23

Matched 19 papers · Generated 2026-06-23 13:10:02 (Asia/Shanghai)

LM · 15 papers

"AIR: Adaptive Interleaved Reasoning with Code in MLLMs" [Evaluation / Data / Application / Method]: Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has be…

AIR: Adaptive Interleaved Reasoning with Code in MLLMs · Score 200
title matched "LLM"; title matched "reasoning"; summary matched "language model"
TriggerBench: Investigating Prospective Memory for Large Language Models · Score 197
title matched "language model"; title matched "large language model"; summary matched "LLM"
Can LLMs Reliably Self-Report Adversarial Prefills, and How? · Score 160
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Evaluation Awareness Is Not One Capability: Evidence from Open Language Models · Score 145
title matched "language model"; title matched "evaluation"; summary matched "instruction tuning"
POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation · Score 145
title matched "language model"; title matched "large language model"; summary matched "LLM"

Agent Runtime Security · 3 papers

"Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?" [Evaluation / Application / Method]: Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists. Thi…

Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity? · Score 64
title matched "computer-use agent"; has PDF; has rich summary
TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization · Score 46
summary matched "jailbreak"; has PDF; has rich summary
GIF: Locally Sound Geometric Information Flow Control for LLMs · Score 43
summary matched "prompt injection"; has PDF; has rich summary

Terminal and SWE Agents · 1 paper

"Tmax: A simple recipe for terminal agents" [Evaluation / Data / Application / Method]: Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little acad…

Tmax: A simple recipe for terminal agents · Score 84
title matched "terminal agent"; summary matched "Terminal-Bench"; has PDF

2026-06-22

Matched 0 papers · Generated 2026-06-22 14:36:11 (Asia/Shanghai)

LM · 0 papers

LM: no new matching papers today.

Agent Runtime Security · 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents · 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-06-21

Matched 0 papers · Generated 2026-06-21 14:11:22 (Asia/Shanghai)

LM · 0 papers

LM: no new matching papers today.

Agent Runtime Security · 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents · 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-06-20

Matched 0 papers · Generated 2026-06-20 13:22:23 (Asia/Shanghai)

LM · 0 papers

LM: no new matching papers today.

Agent Runtime Security · 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents · 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-06-19

Matched 22 papers · Generated 2026-06-19 14:26:15 (Asia/Shanghai)

LM · 15 papers

"QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation" [Evaluation / Method]: Large Language Models (LLMs) have made significant progress in reasoning, particularly in ded…

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation · Score 221
title matched "language model"; title matched "large language model"; title matched "reasoning"
LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems · Score 201
title matched "LLM"; title matched "agent"; title matched "benchmark"
Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference · Score 191
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems · Score 162
title matched "LLM"; title matched "agent"; summary matched "language model"
Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users · Score 162
title matched "LLM"; title matched "alignment"; summary matched "language model"

Agent Runtime Security · 4 papers

"What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?" [Method]: Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different type…

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? · Score 46
summary matched "jailbreak"; has PDF; has rich summary
Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems · Score 46
summary matched "jailbreak"; has PDF; has rich summary
RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning · Score 41
summary matched "guardrail"; has PDF; has rich summary
Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services · Score 39
summary matched "sandboxing"; has PDF; has rich summary

Terminal and SWE Agents · 3 papers

"Probe-and-Refine Tuning of Repository Guidance for Coding Agents" [Application / Method]: LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test sui…

Probe-and-Refine Tuning of Repository Guidance for Coding Agents · Score 87
title matched "coding agent"; summary matched "SWE-bench"; has PDF
Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs · Score 83
title matched "issue resolution"; summary matched "SWE-bench"; has PDF
N-Version Programming with Coding Agents · Score 63
title matched "coding agent"; has PDF; has rich summary

2026-06-18

Matched 17 papers · Generated 2026-06-18 14:03:08 (Asia/Shanghai)

LM · 15 papers

"Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play" [Evaluation / Method]: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with exec…

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play · Score 185
title matched "language model"; title matched "large language model"; title matched "agent"
IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages · Score 182
title matched "language model"; title matched "large language model"; title matched "benchmark"
A Technical Taxonomy of LLM Agent Communication Protocols · Score 160
title matched "LLM"; title matched "agent"; summary matched "language model"
Quantifying and Auditing LLM Evaluation via Positive--Unlabeled Learning · Score 159
title matched "LLM"; title matched "evaluation"; summary matched "language model"
Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA · Score 158
title matched "LLM"; summary matched "language model"; summary matched "large language model"

Agent Runtime Security · 1 paper

"CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts" [Method]: Code large language models increasingly retrieve external code context from repositories, documentation, issue threads, and co…

CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts · Score 108
title matched "prompt injection"; title matched "indirect prompt injection"; has PDF

Terminal and SWE Agents · 1 paper

"Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents" [Evaluation / Application / Method]: Production data integration is bottlenecked by repeated, lossy handoffs between data owners, en…

Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents · Score 69
title matched "coding agent"; has PDF; has rich summary

2026-06-17

Matched 23 papers · Generated 2026-06-17 14:22:19 (Asia/Shanghai)

LM · 15 papers

"Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports" [Evaluation / Data / Application / Method]: Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowled…

Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports · Score 176
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews · Score 164
title matched "language model"; title matched "large language model"; title matched "RAG"
The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act · Score 162
title matched "reasoning"; title matched "benchmark"; summary matched "language model"
WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning · Score 162
title matched "reasoning"; title matched "agent"; summary matched "language model"
From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning · Score 161
title matched "language model"; title matched "reasoning"; summary matched "large language model"

Agent Runtime Security · 3 papers

"Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners" [Application / Method]: Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing ski…

Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners · Score 47
summary matched "privilege escalation"; has PDF; has rich summary
A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models · Score 47
summary matched "jailbreak"; has PDF; has rich summary
PreAct: Computer-Using Agents that Get Faster on Repeated Tasks · Score 43
summary matched "guardrail"; has PDF; has rich summary

Terminal and SWE Agents · 5 papers

"All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code" [Method]: Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent…

All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code · Score 46
summary matched "coding agent"; has PDF; has rich summary
LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling · Score 44
summary matched "SWE-bench"; has PDF; has rich summary
VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination · Score 44
summary matched "code generation benchmark"; has PDF; has rich summary
GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? · Score 42
summary matched "coding agent"; has PDF; has rich summary
Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering · Score 40
summary matched "coding agent"; has PDF; has rich summary

2026-06-16

Matched 23 papers · Generated 2026-06-16 14:38:43 (Asia/Shanghai)

LM · 15 papers

"OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models" [Evaluation / Application / Method]: Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems…

OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models · Score 235
title matched "language model"; title matched "large language model"; title matched "agent"
Context-Aware RL for Agentic and Multimodal LLMs · Score 199
title matched "LLM"; title matched "agent"; summary matched "language model"
Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio · Score 185
title matched "LLM"; title matched "agent"; title matched "benchmark"
Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification · Score 184
title matched "language model"; title matched "large language model"; title matched "agent"
Scalable Circuit Learning for Interpreting Large Language Models · Score 162
title matched "language model"; title matched "large language model"; summary matched "LLM"

Agent Runtime Security · 5 papers

"Automated jailbreak attack targeting multiple defense strategies" [Evaluation / Method]: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical c…

Automated jailbreak attack targeting multiple defense strategies · Score 65
title matched "jailbreak"; has PDF; has rich summary
MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents · Score 65
title matched "computer-use agent"; has PDF; has rich summary
DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing · Score 61
title matched "jailbreak"; has PDF; has rich summary
KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing · Score 47
summary matched "prompt injection"; has PDF; has rich summary
Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models · Score 44
summary matched "jailbreak"; has PDF; has rich summary

Terminal and SWE Agents · 3 papers

"Agent trajectories as programs: fingerprinting and programming coding-agent behavior" [Evaluation / Data / Application / Method]: Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introd…

Agent trajectories as programs: fingerprinting and programming coding-agent behavior · Score 64
summary matched "SWE-bench"; summary matched "coding agent"; has PDF
Towards LLM Accelerated Rapid Reviews for Software Tool Discovery -- Case for Log Anomaly Detection · Score 44
summary matched "coding agent"; has PDF; has rich summary
No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages · Score 44
summary matched "code generation benchmark"; has PDF; has rich summary

2026-06-15

Matched 0 papers · Generated 2026-06-15 14:32:43 (Asia/Shanghai)

LM · 0 papers

LM: no new matching papers today.

Agent Runtime Security · 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents · 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-06-14

Matched 0 papers · Generated 2026-06-14 13:59:09 (Asia/Shanghai)

LM · 0 papers

LM: no new matching papers today.

Agent Runtime Security · 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents · 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-06-13

Matched 0 papers · Generated 2026-06-13 13:25:35 (Asia/Shanghai)

LM · 0 papers

LM: no new matching papers today.

Agent Runtime Security · 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents · 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-06-12

Matched 22 papers · Generated 2026-06-12 13:55:02 (Asia/Shanghai)

LM · 15 papers

"EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments" [Evaluation / Application / Method]: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations as…

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments · Score 200
title matched "LLM"; title matched "agent"; summary matched "language model"
Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents · Score 178
title matched "agent"; title matched "benchmark"; summary matched "language model"
An LLM System for Autonomous Variational Quantum Circuit Design · Score 174
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities · Score 168
title matched "LLM"; summary matched "language model"; summary matched "large language model"
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning · Score 164
title matched "reasoning"; title matched "agent"; summary matched "language model"

Agent Runtime Security · 5 papers

"Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda" [Application / Method]: LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. W…

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda · Score 44
summary matched "guardrail"; has PDF; has rich summary
ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm · Score 41
summary matched "computer-use agent"; has PDF; has rich summary
Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents · Score 40
summary matched "agent runtime"; has PDF; has rich summary
No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions · Score 38
summary matched "prompt injection"; has PDF; has rich summary
Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior · Score 38
summary matched "prompt injection"; has PDF; has rich summary

Terminal and SWE Agents · 2 papers

"Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset" [Data / Method]: AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in sof…

Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset · Score 57
summary matched "coding agent"; has DOI; has PDF
Recursive Agent Harnesses · Score 47
summary matched "coding agent"; has PDF; has rich summary

2026-06-11

Matched 22 papers · Generated 2026-06-11 13:59:12 (Asia/Shanghai)

LM · 15 papers

"Measuring Epistemic Resilience of LLMs Under Misleading Medical Context" [Evaluation / Application / Method]: Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores…

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context · Score 194
title matched "LLM"; summary matched "language model"; summary matched "large language model"
Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation · Score 182
title matched "LLM"; title matched "benchmark"; title matched "evaluation"
OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models · Score 178
title matched "language model"; title matched "reasoning"; summary matched "alignment"
Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization · Score 159
title matched "reasoning"; summary matched "language model"; summary matched "large language model"
ALIGNBEAM : Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing · Score 159
title matched "alignment"; summary matched "language model"; summary matched "large language model"

Agent Runtime Security · 4 papers

"Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code" [Evaluation / Method]: Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce mali…

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code · Score 60
title matched "jailbreak"; has PDF; has rich summary
OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents · Score 47
summary matched "jailbreak"; has PDF; has rich summary
Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers · Score 41
summary matched "jailbreak"; has PDF; has rich summary
External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs · Score 38
summary matched "prompt injection"; has PDF; has rich summary

Terminal and SWE Agents · 3 papers

"PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents" [Application / Method]: AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet th…

PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents · Score 69
title matched "coding agent"; has PDF; has rich summary
Exploration Structure in LLM Agents for Multi-File Change Localization · Score 59
summary matched "SWE-bench"; summary matched "SWE bench"; has PDF
Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production · Score 39
summary matched "code agent"; has PDF; has rich summary

2026-06-10

Matched 25 papers · Generated 2026-06-10 13:25:04 (Asia/Shanghai)

LM · 15 papers

"T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains" [Evaluation / Application / Method]: Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic sys…

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains · Score 217
title matched "agent"; title matched "benchmark"; summary matched "language model"
Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution · Score 215
title matched "LLM"; title matched "agent"; summary matched "language model"
ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity · Score 200
title matched "agent"; title matched "benchmark"; summary matched "language model"
Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning · Score 180
title matched "LLM"; title matched "reasoning"; summary matched "language model"
Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam? · Score 175
title matched "LLM"; summary matched "language model"; summary matched "large language model"

Agent Runtime Security · 7 papers

"Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation" [Evaluation / Application / Method]: Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke t…

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation · Score 78
summary matched "agent security"; summary matched "LLM agent security"; summary matched "prompt injection"
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields · Score 68
title matched "computer-use agent"; has PDF; has rich summary
Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories · Score 48
summary matched "computer-use agent"; has PDF; has rich summary
It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO · Score 45
summary matched "guardrail"; has PDF; has rich summary
Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization · Score 44
summary matched "prompt injection"; has PDF; has rich summary

Terminal and SWE Agents · 3 papers

"Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages" [Evaluation / Method]: LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and…

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages · Score 103
title matched "coding agent"; summary matched "Terminal-Bench"; summary matched "SWE-bench"
AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies · Score 60
summary matched "coding agent"; summary matched "code agent"; has PDF
DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch · Score 60
summary matched "code agent"; summary matched "bug fixing"; has PDF

2026-06-09

22 hits; generated 2026-06-09 13:12:49 (Asia/Shanghai)

LM: 15 papers

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks [Evaluation / Method]: Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and op…

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks · Score 238
title matched "reasoning"; title matched "agent"; title matched "benchmark"
Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving · Score 180
title matched "LLM"; title matched "benchmark"; summary matched "language model"
TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs · Score 179
title matched "LLM"; title matched "benchmark"; summary matched "language model"
Gradient-Guided Reward Optimization for Inference-time Alignment · Score 176
title matched "alignment"; summary matched "language model"; summary matched "large language model"
IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking · Score 173
summary matched "language model"; summary matched "large language model"; summary matched "LLM"

Agent Runtime Security: 4 papers

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces [Evaluation / Method]: Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line ex…

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces · Score 83
title matched "computer-use agent"; summary matched "agent runtime"; has PDF
Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents · Score 63
title matched "prompt injection"; has PDF; has rich summary
What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks · Score 47
summary matched "guardrail"; has PDF; has rich summary
PRISM: Recovering Instruction Sets from Language Model Activations · Score 45
summary matched "prompt injection"; has PDF; has rich summary

Terminal and SWE Agents: 3 papers

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation [Method]: Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them ca…

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation · Score 48
summary matched "coding agent"; has PDF; has rich summary
From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design · Score 46
summary matched "SWE-bench"; has PDF; has rich summary
Self-Harness: Harnesses That Improve Themselves · Score 44
summary matched "Terminal-Bench"; has PDF; has rich summary

2026-06-08

0 hits; generated 2026-06-08 14:00:23 (Asia/Shanghai)

LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-06-07

0 hits; generated 2026-06-07 13:45:34 (Asia/Shanghai)

LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-06-06

0 hits; generated 2026-06-06 12:59:49 (Asia/Shanghai)

LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-06-05

30 hits; generated 2026-06-05 13:25:00 (Asia/Shanghai)

LM: 15 papers

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models [Evaluation / Method]: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that…

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models · Score 214
title matched "language model"; title matched "large language model"; title matched "benchmark"
CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments · Score 210
title matched "LLM"; title matched "agent"; summary matched "language model"
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints · Score 196
title matched "language model"; title matched "large language model"; title matched "agent"
The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models · Score 196
title matched "language model"; title matched "large language model"; title matched "reasoning"
Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems · Score 192
title matched "LLM"; title matched "agent"; summary matched "language model"

Agent Runtime Security: 5 papers

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection [Evaluation / Data / Application / Method]: Large Language Models (LLMs) have transformed natural language processing, but they remai…

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection · Score 138
title matched "prompt injection"; title matched "jailbreak"; summary matched "guardrail"
From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents · Score 80
title matched "guardrail"; has PDF; has rich summary
Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack · Score 76
summary matched "jailbreak"; summary matched "guardrail"; has PDF
Beyond Similarity: Trustworthy Memory Search for Personal AI Agents · Score 58
summary matched "jailbreak"; has PDF; has rich summary
The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models · Score 58
summary matched "guardrail"; has PDF; has rich summary

Terminal and SWE Agents: 10 papers

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer [Evaluation / Method]: The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any…

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer · Score 94
summary matched "Terminal-Bench"; summary matched "SWE-bench"; summary matched "coding agent"
Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement · Score 80
title matched "code agent"; has PDF; has rich summary
Knowledge Matters: Injecting Project and Testing Knowledge into LLM-based Unit Test Generation · Score 80
title matched "test generation"; has PDF; has rich summary
SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks · Score 80
title matched "code agent"; has PDF; has rich summary
From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws · Score 76
summary matched "Terminal-Bench"; summary matched "SWE-bench"; has PDF

2026-06-04

27 hits; generated 2026-06-04 14:02:06 (Asia/Shanghai)

LM: 15 papers

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs [Evaluation / Application / Method]: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under mult…

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs · Score 191
title matched "LLM"; title matched "evaluation"; summary matched "language model"
Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases · Score 191
title matched "language model"; title matched "large language model"; summary matched "LLM"
Self-Evolving Deep Research via Joint Generation and Evaluation · Score 187
title matched "evaluation"; summary matched "language model"; summary matched "large language model"
Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents · Score 177
title matched "LLM"; title matched "agent"; title matched "benchmark"
Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas · Score 177
title matched "language model"; title matched "large language model"; title matched "alignment"

Agent Runtime Security: 6 papers

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models [Evaluation / Application / Method]: Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences unde…

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models · Score 79
title matched "jailbreak"; has PDF; has rich summary
What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems · Score 79
title matched "prompt injection"; has PDF; has rich summary
Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents · Score 75
summary matched "prompt injection"; summary matched "indirect prompt injection"; has PDF
AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning · Score 57
summary matched "agent runtime"; has PDF; has rich summary
From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents · Score 57
summary matched "prompt injection"; has PDF; has rich summary

Terminal and SWE Agents: 6 papers

Latent Anchor-Driven Test Generation for Deep Neural Networks [Data / Application / Method]: Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous test…

Latent Anchor-Driven Test Generation for Deep Neural Networks · Score 79
title matched "test generation"; has PDF; has rich summary
Can Generalist Agents Automate Data Curation? · Score 57
summary matched "coding agent"; has PDF; has rich summary
Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation · Score 57
summary matched "SWE-bench"; has PDF; has rich summary
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? · Score 57
summary matched "code agent"; has PDF; has rich summary
The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents · Score 57
summary matched "SWE-bench"; has PDF; has rich summary

2026-06-03

32 hits; generated 2026-06-03 14:09:56 (Asia/Shanghai)

LM: 15 papers

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning [Evaluation / Method]: Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficie…

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning · Score 213
title matched "LLM"; title matched "reasoning"; title matched "agent"
Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models · Score 213
title matched "language model"; title matched "large language model"; title matched "alignment"
Can Factual Opinions Be Edited (Manipulated) in Large Language Models? · Score 191
title matched "language model"; title matched "large language model"; summary matched "LLM"
Large Language Models Are Overconfident in Their Own Responses · Score 191
title matched "language model"; title matched "large language model"; summary matched "LLM"
Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models · Score 177
title matched "language model"; title matched "large language model"; title matched "alignment"

Agent Runtime Security: 8 papers

D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting [Evaluation / Data / Method]: Multi-turn jailbreak attacks pose a growing threat to large language model (LLM) safety because they exploit feedback…

D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting · Score 79
title matched "jailbreak"; has PDF; has rich summary
MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents · Score 79
title matched "computer-use agent"; has PDF; has rich summary
MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety · Score 79
title matched "jailbreak"; has PDF; has rich summary
From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework · Score 75
summary matched "prompt injection"; summary matched "malicious tool"; has PDF
Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems · Score 57
summary matched "guardrail"; has PDF; has rich summary

Terminal and SWE Agents: 9 papers

What Makes Interaction Trajectories Effective for Training Terminal Agents? [Method]: Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from…

What Makes Interaction Trajectories Effective for Training Terminal Agents? · Score 115
title matched "terminal agent"; summary matched "Terminal-Bench"; summary matched "code agent"
Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing · Score 97
title matched "code agent"; summary matched "coding agent"; has PDF
Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment · Score 97
title matched "repository-level"; summary matched "repository level"; has PDF
Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks · Score 79
title matched "coding agent"; has PDF; has rich summary
VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection · Score 79
title matched "repository-level"; has PDF; has rich summary

2026-06-02

22 hits; generated 2026-06-02 13:56:35 (Asia/Shanghai)

LM: 15 papers

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems [Evaluation / Application / Method]: Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emerge…

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems · Score 192
title matched "agent"; summary matched "language model"; summary matched "large language model"
MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation · Score 184
title matched "LLM"; title matched "agent"; title matched "benchmark"
Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling · Score 178
title matched "LLM"; summary matched "language model"; summary matched "large language model"
AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design · Score 165
title matched "language model"; title matched "reasoning"; title matched "agent"
Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation · Score 163
title matched "language model"; title matched "large language model"; summary matched "LLM"

Agent Runtime Security: 6 papers

Jailbreaking Multimodal Large Language Models using Multi-Clip Video [Data / Application / Method]: As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for mal…

Jailbreaking Multimodal Large Language Models using Multi-Clip Video · Score 63
title matched "jailbreak"; has PDF; has rich summary
SentGuard: Sentence-Level Streaming Guardrails for Large Language Models · Score 62
title matched "guardrail"; has PDF; has rich summary
AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations · Score 61
summary matched "prompt injection"; summary matched "indirect prompt injection"; has PDF
SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning · Score 61
title matched "agent defense"; has PDF; has rich summary
SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents · Score 44
summary matched "agent security"; has PDF; has rich summary

Terminal and SWE Agents: 1 paper

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction [Evaluation / Application / Method]: Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, re…

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction · Score 47
summary matched "coding agent"; has PDF; has rich summary

2026-06-01

0 hits; generated 2026-06-01 14:11:42 (Asia/Shanghai)

LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-05-31

0 hits; generated 2026-05-31 13:24:50 (Asia/Shanghai)

LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-05-30

0 hits; generated 2026-05-30 12:56:38 (Asia/Shanghai)

LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-05-29

21 hits; generated 2026-05-29 13:18:32 (Asia/Shanghai)

LM: 15 papers

FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations [Evaluation / Method]: Recently, large language models (LLMs) have achieved superior performance in static f…

FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations · Score 232
title matched "LLM"; title matched "reasoning"; title matched "benchmark"
CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models · Score 196
title matched "language model"; title matched "large language model"; title matched "reasoning"
Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning · Score 196
title matched "language model"; title matched "large language model"; title matched "reasoning"
Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach · Score 192
title matched "language model"; title matched "large language model"; summary matched "LLM"
KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs · Score 192
title matched "agent"; title matched "benchmark"; summary matched "language model"

Agent Runtime Security: 4 papers

Provably Secure Agent Guardrail [Evaluation / Application / Method]: As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a funda…; Robust an…

Provably Secure Agent Guardrail · Score 120
title matched "secure agent"; title matched "guardrail"; has PDF
Robust and Efficient Guardrails with Latent Reasoning · Score 80
title matched "guardrail"; has PDF; has rich summary
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security · Score 58
summary matched "guardrail"; has PDF; has rich summary
Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures · Score 58
summary matched "jailbreak"; has PDF; has rich summary

Terminal and SWE Agents: 2 papers

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software [Application / Method]: Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist sup…

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software · Score 48
summary matched "coding agent"; has PDF; has rich summary
Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas · Score 45
summary matched "coding agent"; has PDF; has rich summary

2026-05-28

21 hits; generated 2026-05-28 13:15:52 (Asia/Shanghai)

LM: 15 papers

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems [Evaluation / Method]: Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unr…

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems · Score 199
title matched "language model"; title matched "large language model"; summary matched "LLM"
Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability · Score 184
title matched "LLM"; title matched "reasoning"; title matched "evaluation"
TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning · Score 181
title matched "LLM"; title matched "reasoning"; summary matched "language model"
Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents · Score 180
title matched "LLM"; title matched "agent"; summary matched "language model"
The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic · Score 177
title matched "evaluation"; summary matched "language model"; summary matched "large language model"

Agent Runtime Security: 5 papers

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents [Data / Method]: Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software…

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents · Score 70
title matched "computer-use agent"; has PDF; has rich summary
Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests · Score 47
summary matched "jailbreak"; has PDF; has rich summary
The Ethics of LLM Sandbox and Persona Dynamics · Score 46
summary matched "guardrail"; has PDF; has rich summary
LACUNA: Safe Agents as Recursive Program Holes · Score 46
summary matched "prompt injection"; has PDF; has rich summary
Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem · Score 45
summary matched "data exfiltration"; has PDF; has rich summary

Terminal and SWE Agents: 1 paper

Calibrating Conservatism for Scalable Oversight [Method]: Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful…

Calibrating Conservatism for Scalable Oversight · Score 48
summary matched "SWE-bench"; has PDF; has rich summary

2026-05-27

22 hits; generated 2026-05-27 13:23:19 (Asia/Shanghai)

LM: 15 papers

Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry [Evaluation / Data / Application / Method]: Key knowledge for steel-industry volatile organic compounds (VOCs) governance is s…

Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry · Score 214
title matched "LLM"; title matched "reasoning"; summary matched "language model"
Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation · Score 201
title matched "reasoning"; title matched "agent"; title matched "benchmark"
Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments · Score 176
title matched "agent"; summary matched "language model"; summary matched "large language model"
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions · Score 175
title matched "agent"; summary matched "language model"; summary matched "large language model"
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation · Score 164
title matched "agent"; title matched "evaluation"; summary matched "language model"

Agent Runtime Security: 7 papers

EviACT: An Evidence-to-Action Framework for Agentic Program Repair [Evaluation / Method]: LLM-based agents have moved automated program repair (APR) from fixed-context patch generation to interactive repository-level repair. Howeve…

EviACT: An Evidence-to-Action Framework for Agentic Program Repair · Score 122
summary matched "guardrail"; has PDF; has rich summary
Governed Evolution of Agent Runtimes through Executable Operational Cognition · Score 70
title matched "agent runtime"; has PDF; has rich summary
Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals · Score 65
title matched "prompt injection"; has PDF; has rich summary
BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning · Score 45
summary matched "jailbreak"; has PDF; has rich summary
AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian · Score 43
summary matched "guardrail"; has PDF; has rich summary

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-05-26

18 hits; generated 2026-05-26 13:09:24 (Asia/Shanghai)

LM: 15 papers

Automated Benchmark Auditing for AI Agents and Large Language Models [Evaluation / Data / Method]: Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often co…

Automated Benchmark Auditing for AI Agents and Large Language Models · Score 244
title matched "language model"; title matched "large language model"; title matched "agent"
Causal methods for LLM development and evaluation · Score 211
title matched "LLM"; title matched "evaluation"; summary matched "language model"
PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction · Score 202
title matched "LLM"; title matched "reasoning"; title matched "agent"
Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning · Score 197
title matched "LLM"; title matched "agent"; summary matched "language model"
When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation · Score 180
title matched "LLM"; title matched "agent"; summary matched "language model"

Agent Runtime Security: 3 papers

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents [Evaluation / Data / Application / Method]: Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use,…

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents · Score 62
title matched "computer-use agent"; has PDF; has rich summary
How Agentic AI Coding Assistants Become the Attacker's Shell · Score 44
summary matched "prompt injection"; has PDF; has rich summary
AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions · Score 41
summary matched "computer-use agent"; has PDF; has rich summary

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-05-25

0 hits; generated 2026-05-25 13:26:33 (Asia/Shanghai)

LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-05-24

0 hits; generated 2026-05-24 13:07:58 (Asia/Shanghai)

LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-05-23

0 hits; generated 2026-05-23 12:48:29 (Asia/Shanghai)

LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: 0 papers

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents: 0 papers

Terminal and SWE Agents: no new matching papers today.

2026-05-22

21 hits; generated 2026-05-22 13:08:19 (Asia/Shanghai)

LM: 15 papers

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents [Evaluation / Method]: Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses…

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents · Score 196
title matched "LLM"; title matched "agent"; title matched "evaluation"
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety · Score 192
title matched "agent"; title matched "benchmark"; summary matched "language model"
ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning · Score 192
title matched "reasoning"; title matched "benchmark"; summary matched "LLM"
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance · Score 188
title matched "reasoning"; summary matched "language model"; summary matched "large language model"
From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment · Score 174
title matched "LLM"; title matched "alignment"; summary matched "language model"

Agent Runtime Security: 3 papers

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback [Evaluation / Application / Method]: LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learn…

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback · Score 48
summary matched "agent sandbox"; has PDF; has rich summary
HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools · Score 47
summary matched "agent runtime"; has PDF; has rich summary
Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents · Score 46
summary matched "guardrail"; has PDF; has rich summary

Terminal and SWE Agents: 3 papers

"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution [Method]: Recent advances in coding agents have shown remarkable progress in software issue resolution. In pract…

"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution · Score 125
title matched "coding agent"; title matched "issue resolution"; summary matched "SWE-bench"
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks · Score 45
summary matched "Terminal-Bench"; has PDF; has rich summary
Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study · Score 45
summary matched "coding agent"; has PDF; has rich summary

2026-05-21

17 hits; generated 2026-05-21 13:14:24 (Asia/Shanghai)

LM: 15 papers

Tracing the ongoing emergence of human-like reasoning in Large Language Models [Method]: Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implyin…

Tracing the ongoing emergence of human-like reasoning in Large Language Models · Score 184
title matched "language model"; title matched "large language model"; title matched "reasoning"
What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema · Score 167
title matched "LLM"; title matched "agent"; title matched "benchmark"
Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution · Score 164
title matched "LLM"; title matched "RAG"; summary matched "language model"
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models · Score 163
title matched "language model"; title matched "large language model"; summary matched "LLM"
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata · Score 160
title matched "benchmark"; summary matched "language model"; summary matched "large language model"

Agent Runtime Security (1 paper)

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling [Application / Method]: Computer-use agents (CUA) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by gene…

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling · Score 48
summary matched "computer-use agent"; has PDF; has rich summary

Terminal and SWE Agents (1 paper)

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents [Evaluation / Method]: As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test s…

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents · Score 69
title matched "coding agent"; has PDF; has rich summary

2026-05-20

27 papers matched · Generated 2026-05-20 13:10:58 (Asia/Shanghai)

LM (15 papers)

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models [Evaluation / Method]: Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of \emph{inattention…

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models · Score 236
title matched "language model"; title matched "large language model"; title matched "reasoning"
OpenCompass: A Universal Evaluation Platform for Large Language Models · Score 232
title matched "language model"; title matched "large language model"; title matched "evaluation"
Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking · Score 214
title matched "LLM"; title matched "agent"; title matched "evaluation"
SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models · Score 214
title matched "language model"; title matched "large language model"; title matched "evaluation"
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening · Score 196
title matched "LLM"; title matched "reasoning"; title matched "benchmark"

Agent Runtime Security (7 papers)

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models [Evaluation / Method]: Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generat…

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models · Score 80
title matched "jailbreak"; has PDF; has rich summary
OpenComputer: Verifiable Software Worlds for Computer-Use Agents · Score 80
title matched "computer-use agent"; has PDF; has rich summary
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains · Score 80
title matched "guardrail"; has PDF; has rich summary
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents · Score 58
summary matched "agent runtime"; has PDF; has rich summary
Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents · Score 58
summary matched "policy enforcement"; has PDF; has rich summary

Terminal and SWE Agents (5 papers)

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study [Evaluation / Application / Method]: As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates holding the tar…

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study · Score 80
title matched "coding agent"; has PDF; has rich summary
PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents · Score 58
summary matched "coding agent"; has PDF; has rich summary
RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades · Score 58
summary matched "coding agent"; has PDF; has rich summary
The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next · Score 58
summary matched "SWE-bench"; has PDF; has rich summary
Toward Training Superintelligent Software Agents through Self-Play SWE-RL · Score 58
summary matched "SWE-bench"; has PDF; has rich summary

2026-05-19

22 papers matched · Generated 2026-05-19 13:08:04 (Asia/Shanghai)

LM (15 papers)

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark [Evaluation / Data / Method]: Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view pe…

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark · Score 217
title matched "LLM"; title matched "benchmark"; summary matched "language model"
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science · Score 181
title matched "LLM"; title matched "benchmark"; summary matched "language model"
MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion · Score 176
title matched "agent"; summary matched "language model"; summary matched "large language model"
SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents · Score 168
title matched "LLM"; title matched "agent"; title matched "benchmark"
LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems · Score 158
title matched "agent"; summary matched "LLM"; summary matched "reasoning"

Agent Runtime Security (4 papers)

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments [Method]: LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external…

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments · Score 98
title matched "prompt injection"; summary matched "indirect prompt injection"; summary matched "jailbreak"
Multilingual jailbreaking of LLMs using low-resource languages · Score 82
title matched "jailbreak"; summary matched "guardrail"; has PDF
Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks · Score 68
summary matched "prompt injection"; has PDF; has rich summary
Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models · Score 63
title matched "jailbreak"; has PDF; has rich summary

Terminal and SWE Agents (3 papers)

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents [Application / Method]: Behavioral studies of LLM-based software engineering agents extract operational rules about which traject…

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents · Score 83
title matched "software engineering agent"; summary matched "SWE-bench"; has PDF
SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution · Score 62
summary matched "Terminal-Bench"; summary matched "SWE-bench"; has PDF
Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents · Score 48
summary matched "coding agent"; has PDF; has rich summary

2026-05-18

15 papers matched · Generated 2026-05-18 13:13:17 (Asia/Shanghai)

LM (15 papers)

CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency [Evaluation / Application / Method]: This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate…

CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency · Score 236
title matched "LLM"; title matched "agent"; title matched "benchmark"
FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models · Score 232
title matched "language model"; title matched "large language model"; title matched "benchmark"
MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models · Score 214
title matched "language model"; title matched "large language model"; title matched "benchmark"
Large Language Models Could Be Rote Learners · Score 192
title matched "language model"; title matched "large language model"; summary matched "LLM"
Look Before You Leap: Autonomous Exploration for LLM Agents · Score 192
title matched "LLM"; title matched "agent"; summary matched "language model"

Agent Runtime Security (0 papers)

Agent Runtime Security has no new matching papers today.

Terminal and SWE Agents (0 papers)

Terminal and SWE Agents has no new matching papers today.

2026-05-17

0 papers matched · Generated 2026-05-17 12:57:14 (Asia/Shanghai)

LM (0 papers)

LM has no new matching papers today.

Agent Runtime Security (0 papers)

Agent Runtime Security has no new matching papers today.

Terminal and SWE Agents (0 papers)

Terminal and SWE Agents has no new matching papers today.

2026-05-16

0 papers matched · Generated 2026-05-16 12:33:27 (Asia/Shanghai)

LM (0 papers)

LM has no new matching papers today.

Agent Runtime Security (0 papers)

Agent Runtime Security has no new matching papers today.

Terminal and SWE Agents (0 papers)

Terminal and SWE Agents has no new matching papers today.

2026-05-15

17 papers matched · Generated 2026-05-15 14:57:29 (Asia/Shanghai)

LM (11 papers)

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks [Evaluation / Data / Method]: We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times…

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks · Score 202
title matched "LLM"; title matched "RAG"; title matched "benchmark"
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning · Score 161
title matched "LLM"; title matched "reasoning"; summary matched "language model"
Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use · Score 161
title matched "LLM"; title matched "reasoning"; summary matched "language model"
APWA: A Distributed Architecture for Parallelizable Agentic Workflows · Score 158
title matched "agent"; summary matched "language model"; summary matched "large language model"
MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory · Score 144
title matched "agent"; title matched "evaluation"; summary matched "reasoning"

Agent Runtime Security (0 papers)

Agent Runtime Security has no new matching papers today.

Terminal and SWE Agents (6 papers)

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing [Evaluation / Method]: Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking che…

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing · Score 115
title matched "code agent"; summary matched "Terminal-Bench"; summary matched "SWE-bench"
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation · Score 97
title matched "repository-level"; summary matched "coding agent"; has PDF
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades · Score 97
title matched "coding agent"; summary matched "issue resolution"; has PDF
Documentation-Guided Agentic Codebase Migration from C to Rust · Score 75
summary matched "coding agent"; summary matched "repository-level"; has PDF
Comparing Developer and LLM Biases in Code Evaluation · Score 57
summary matched "code editing"; has PDF; has rich summary

2026-05-14

17 papers matched · Generated 2026-05-14 12:52:54 (Asia/Shanghai)

LM (15 papers)

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation [Evaluation / Data / Method]: Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicia…

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation · Score 200
title matched "LLM"; title matched "agent"; title matched "benchmark"
MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling · Score 163
title matched "language model"; title matched "large language model"; summary matched "LLM"
An LLM-Based System for Argument Reconstruction · Score 160
title matched "LLM"; summary matched "language model"; summary matched "large language model"
OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research · Score 157
title matched "agent"; summary matched "language model"; summary matched "large language model"
(How) Do Large Language Models Understand High-Level Message Sequence Charts? · Score 145
title matched "language model"; title matched "large language model"; summary matched "LLM"

Agent Runtime Security (2 papers)

Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents [Evaluation / Application / Method]: Always-on AI agents (OpenClaw, Hermes Agent) run as a single persistent process under the owner's iden…

Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents · Score 66
title matched "prompt injection"; has PDF; has rich summary
LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs · Score 63
title matched "guardrail"; has PDF; has rich summary

2026-05-13

17 papers matched · Generated 2026-05-13 12:54:34 (Asia/Shanghai)

LM (15 papers)

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering [Evaluation / Data / Application / Method]: Evaluating large language models (LLMs) in the biomedical domain requi…

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering · Score 225
title matched "LLM"; title matched "reasoning"; title matched "benchmark"
ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models · Score 222
title matched "language model"; title matched "large language model"; title matched "alignment"
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models · Score 207
title matched "language model"; title matched "large language model"; title matched "reasoning"
Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring · Score 182
title matched "language model"; title matched "large language model"; summary matched "LLM"
Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering · Score 177
title matched "evaluation"; summary matched "language model"; summary matched "large language model"

Agent Runtime Security (2 papers)

Metaphor Is Not All Attention Needs [Application / Method]: Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post…; A microser…

Metaphor Is Not All Attention Needs · Score 44
summary matched "jailbreak"; has PDF; has rich summary
A microservices-based endpoint monitoring platform with predictive NLP models for real-time security and hate-speech risk alerting · Score 42
summary matched "data exfiltration"; has PDF; has rich summary

2026-05-12

21 papers matched · Generated 2026-05-12 12:42:08 (Asia/Shanghai)

LM (15 papers)

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation [Evaluation / Method]: Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) ha…

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation · Score 205
title matched "agent"; title matched "benchmark"; title matched "evaluation"
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox · Score 185
title matched "LLM"; title matched "agent"; title matched "evaluation"
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents · Score 168
title matched "LLM"; title matched "agent"; title matched "benchmark"
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments · Score 167
title matched "LLM"; title matched "agent"; title matched "benchmark"
Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights · Score 163
title matched "language model"; title matched "evaluation"; summary matched "large language model"

Agent Runtime Security (6 papers)

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization [Evaluation / Method]: Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-mod…

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization · Score 69
title matched "jailbreak"; has PDF; has rich summary
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs · Score 67
title matched "guardrail"; has PDF; has rich summary
Re-Triggering Safeguards within LLMs for Jailbreak Detection · Score 67
title matched "jailbreak"; has PDF; has rich summary
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing · Score 67
title matched "jailbreak"; has PDF; has rich summary
RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems · Score 48
summary matched "prompt injection"; has PDF; has rich summary

2026-05-11

0 papers matched · Generated 2026-05-11 13:03:07 (Asia/Shanghai)

LM (0 papers)

LM has no new matching papers today.

Agent Runtime Security (0 papers)

Agent Runtime Security has no new matching papers today.

2026-05-10

0 papers matched · Generated 2026-05-10 12:50:04 (Asia/Shanghai)

LM (0 papers)

LM has no new matching papers today.

Agent Runtime Security (0 papers)

Agent Runtime Security has no new matching papers today.

2026-05-09

0 papers matched · Generated 2026-05-09 12:29:32 (Asia/Shanghai)

LM (0 papers)

LM has no new matching papers today.

Agent Runtime Security (0 papers)

Agent Runtime Security has no new matching papers today.

2026-05-08

13 papers matched · Generated 2026-05-08 14:15:32 (Asia/Shanghai)

LM (12 papers)

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG [Evaluation / Data / Application / Method]: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question…

LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG · Score 231
title matched "reasoning"; title matched "agent"; title matched "RAG"
MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents · Score 213
title matched "LLM"; title matched "agent"; title matched "benchmark"
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity · Score 213
title matched "LLM"; title matched "alignment"; title matched "evaluation"
Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback · Score 213
title matched "LLM"; title matched "agent"; title matched "evaluation"
BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models · Score 209
title matched "language model"; title matched "large language model"; summary matched "LLM"

Agent Runtime Security (1 paper)

Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation [Evaluation / Application / Method]: Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct acce…

Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation · Score 102
title matched "computer-use agent"; summary matched "prompt injection"; summary matched "indirect prompt injection"

2026-05-07

15 papers matched · Generated 2026-05-07 12:38:06 (Asia/Shanghai)

LM (15 papers)

Misaligned by Reward: Socially Undesirable Preferences in LLMs [Evaluation / Data / Method]: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, exis…

Misaligned by Reward: Socially Undesirable Preferences in LLMs · Score 194
title matched "LLM"; summary matched "language model"; summary matched "large language model"
SoK: Robustness in Large Language Models against Jailbreak Attacks · Score 181
title matched "language model"; title matched "large language model"; summary matched "LLM"
Why Expert Alignment Is Hard: Evidence from Subjective Evaluation · Score 161
title matched "alignment"; title matched "evaluation"; summary matched "language model"
KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels · Score 143
title matched "LLM"; title matched "benchmark"; summary matched "RAG"
Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction · Score 142
title matched "LLM"; summary matched "language model"; summary matched "large language model"

2026-05-06

15 papers matched · Generated 2026-05-06 12:37:23 (Asia/Shanghai)

LM (15 papers)

Safety and accuracy follow different scaling laws in clinical large language models [Evaluation / Application / Method]: Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time comput…

Safety and accuracy follow different scaling laws in clinical large language models · Score 201
title matched "language model"; title matched "large language model"; summary matched "LLM"
Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones · Score 183
title matched "LLM"; title matched "reasoning"; title matched "agent"
OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking · Score 161
title matched "LLM"; title matched "benchmark"; summary matched "language model"
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems · Score 147
title matched "reasoning"; title matched "agent"; summary matched "benchmark"
Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus · Score 146
title matched "language model"; title matched "large language model"; title matched "benchmark"

2026-05-05

15 papers matched · Generated 2026-05-05 12:20:54 (Asia/Shanghai)

LM (15 papers)

StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models [Evaluation / Data / Method]: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting especia…

StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models · Score 219
title matched "language model"; title matched "large language model"; title matched "reasoning"
Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks · Score 211
title matched "LLM"; title matched "benchmark"; summary matched "language model"
Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models · Score 197
title matched "language model"; title matched "large language model"; title matched "reasoning"
MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation · Score 197
title matched "language model"; title matched "large language model"; title matched "alignment"
OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice · Score 197
title matched "language model"; title matched "large language model"; title matched "LLM"

2026-05-04

0 papers matched · Generated 2026-05-04 12:44:55 (Asia/Shanghai)

LM (0 papers)

LM has no new matching papers today.

2026-05-03

0 papers matched · Generated 2026-05-03 12:44:59 (Asia/Shanghai)

LM (0 papers)

LM has no new matching papers today.

2026-05-02

0 papers matched · Generated 2026-05-02 12:22:26 (Asia/Shanghai)

LM (0 papers)

LM has no new matching papers today.

2026-05-01

15 papers matched · Generated 2026-05-01 12:53:56 (Asia/Shanghai)

LM (15 papers)

Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents [Evaluation / Application / Method]: We present Collabora…

Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents · Score 193
title matched "reasoning"; title matched "agent"; summary matched "language model"
What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design · Score 185
title matched "agent"; title matched "benchmark"; title matched "evaluation"
Rethinking Agentic Reinforcement Learning In Large Language Models · Score 182
title matched "language model"; title matched "large language model"; title matched "agent"
TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering · Score 181
title matched "reasoning"; title matched "benchmark"; summary matched "language model"
LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning · Score 180
title matched "LLM"; title matched "reasoning"; summary matched "language model"

2026-04-30

0 papers matched · Generated 2026-04-30 14:35:40 (Asia/Shanghai)

LM (0 papers)

LM has no new matching papers today.

2026-04-29

15 papers matched · Generated 2026-04-29 12:26:28 (Asia/Shanghai)

LM (15 papers)

LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation [Evaluation / Data / Method]: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across hetero…

LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation · Score 197
title matched "LLM"; title matched "evaluation"; summary matched "language model"
Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models · Score 178
title matched "language model"; title matched "large language model"; summary matched "LLM"
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios · Score 165
title matched "agent"; title matched "benchmark"; summary matched "LLM"
From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling · Score 164
title matched "LLM"; title matched "agent"; summary matched "language model"
SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing? · Score 158
title matched "agent"; summary matched "language model"; summary matched "large language model"

2026-04-28

0 papers matched · Generated 2026-04-28 15:39:05 (Asia/Shanghai)

LM (0 papers)

LM has no new matching papers today.

2026-04-27

1 paper matched · Generated 2026-04-27 11:55:55 (Asia/Shanghai)

LLM (0 papers)

LLM has no new matching papers today.

Vision (0 papers)

Vision has no new matching papers today.

PubMed AI (0 papers)

PubMed AI has no new matching papers today.

OpenAlex AI (1 paper)

AI-Driven Multi-Agent System for Autonomous Mining Operation Centers [Method]: International audience

AI-Driven Multi-Agent System for Autonomous Mining Operation Centers · Score 64
title matched "agent"; has complete metadata

2026-04-26

1 paper matched · Generated 2026-04-26 11:52:13 (Asia/Shanghai)

LLM (0 papers)

LLM has no new matching papers today.

Vision (0 papers)

Vision has no new matching papers today.

PubMed AI (0 papers)

PubMed AI has no new matching papers today.

OpenAlex AI (1 paper)

SOS::LM Sequence Initializer: Semantic Process Architecture for Controlled, Traceable, and Structured Language Model Outputs [Evaluation / Application / Method]: SOS::LM (Schloemer-Notation ::) defines a semantic process architecture for la…

SOS::LM Sequence Initializer: Semantic Process Architecture for Controlled, Traceable, and Structured Language Model Outputs · Score 100
title matched "language model"; summary matched "agent"; has DOI

2026-04-25

5 papers matched · Generated 2026-04-25 11:28:34 (Asia/Shanghai)

LLM (0 papers)

LLM has no new matching papers today.

Vision (0 papers)

Vision has no new matching papers today.

PubMed AI (5 papers)

Establishing Clinically Significant Change Benchmarks for the Moral Injury Outcome Scale in VA Behavioral Health Settings. [Evaluation / Method]: This study aimed to establish benchmarks for clinically significant change for the Mo…

Establishing Clinically Significant Change Benchmarks for the Moral Injury Outcome Scale in VA Behavioral Health Settings. · Score 102
title matched "benchmark"; title matched "clinical"; has DOI
Generalist large language models in a specialized world: Evidence from the Italian national medical education pathway. · Score 90
title matched "language model"; summary matched "clinical"; has DOI
Standardization of clinical trials subject ID schematics: A portfolio-wide model to enhance data integrity and regulatory compliance. · Score 81
title matched "clinical"; summary matched "benchmark"; has DOI
Considerations about the proliferation of large language model chatbots and youth mental health. · Score 80
title matched "language model"; summary matched "clinical"; has DOI
The application of large language models in meteorology graduate research: current status, impact, and prospects. · Score 72
title matched "language model"; has DOI; has rich summary

OpenAlex AI (0 papers)

OpenAlex AI has no new matching papers today.

2026-04-24

30 papers matched · Generated 2026-04-24 11:46:20 (Asia/Shanghai)

LLM (15 papers)

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows [Evaluation / Application / Method]: The Model Context Protocol (MCP) has become a common interface…

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows · Score 106
title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems · Score 106
title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use · Score 102
title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability · Score 90
title matched "evaluation"; summary matched "benchmark"; has PDF
Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models · Score 90
title matched "agent"; summary matched "reasoning"; has PDF

Vision (10 papers)

Pre-process for segmentation task with nonlinear diffusion filters [Method]: This paper deals with the case of using nonlinear diffusion filters to obtain piecewise constant images as a previous process for segmentation tec…

Pre-process for segmentation task with nonlinear diffusion filters · Score 102
title matched "diffusion"; title matched "segmentation"; has PDF
KD-CVG: A Knowledge-Driven Approach for Creative Video Generation · Score 79
title matched "video generation"; summary matched "multimodal"; has PDF
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation · Score 77
title matched "video generation"; summary matched "diffusion"; has PDF
Seeing Fast and Slow: Learning the Flow of Time in Videos · Score 68
summary matched "video generation"; summary matched "multimodal"; has PDF
DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion · Score 66
title matched "diffusion"; has PDF; has rich summary

PubMed AI (5 papers)

Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models. [Data / Application / Method]: Prompt learning has emerged as one of the most effective paradigms for adapting pre-trained vision language models (VLMs) to…

Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models. · Score 93
title matched "language model"; summary matched "clinical"; has DOI
Clinical Knowledge-Guided PET/CT Lesion Segmentation with Interpretable Fusion of Metabolic and Structural Cues. · Score 93
title matched "clinical"; summary matched "benchmark"; has DOI
Accelerating real-world data collection using large language models in rare neoplasms: a bone sarcoma example. · Score 81
title matched "language model"; summary matched "clinical"; has DOI
GATE: Graph and Text Exchange for Zero-Shot ECG Classification with LLM Prompts. · Score 71
summary matched "language model"; summary matched "clinical"; has DOI
Learning from Prototypes: Contrastive Learning with Prior-Aware Multi-Label Chest X-ray Classification. · Score 71
summary matched "language model"; summary matched "clinical"; has DOI

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.

2026-04-23

29 papers matched; generated 2026-04-23 11:42:13 (Asia/Shanghai)

LLM: 15 papers

"OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model" [Evaluation / Method]: Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Neverth…

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model · Score 129
title matched "reasoning"; title matched "benchmark"; summary matched "evaluation"
V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization · Score 125
title matched "reasoning"; summary matched "alignment"; summary matched "benchmark"
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence · Score 124
title matched "benchmark"; summary matched "reasoning"; summary matched "alignment"
Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation · Score 106
title matched "reasoning"; summary matched "alignment"; summary matched "benchmark"
Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows · Score 105
title matched "agent"; summary matched "reasoning"; summary matched "benchmark"

Vision: 9 papers

"LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model" [Method]: We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal underst…

LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model · Score 111
title matched "diffusion"; title matched "multimodal"; has PDF
Hallucination Early Detection in Diffusion Models · Score 75
title matched "diffusion"; has DOI; has PDF
ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control · Score 72
title matched "diffusion"; has PDF; has rich summary
Amodal SAM: A Unified Amodal Segmentation Framework with Generalization · Score 70
title matched "segmentation"; has PDF; has rich summary
GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers · Score 70
title matched "diffusion"; has PDF; has rich summary

PubMed AI: 5 papers

"Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports." [Evaluation / Application / Method]: OBJECTIVES: Coronary computed tomography angiography (CCTA) has become…

Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports. · Score 87
title matched "language model"; summary matched "clinical"; has DOI
Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions. · Score 81
title matched "clinical"; summary matched "benchmark"; has DOI
Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes. · Score 81
title matched "clinical"; summary matched "benchmark"; has DOI
Diagnostic Modalities and Nodal Staging in High-Risk Cutaneous Squamous Cell Carcinoma. · Score 65
summary matched "benchmark"; summary matched "clinical"; has DOI
Defining the learning curve of multi-vessel MIDCAB using CUSUM analysis. · Score 62
summary matched "benchmark"; summary matched "clinical"; has DOI

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.

2026-04-22

30 papers matched; generated 2026-04-22 11:37:03 (Asia/Shanghai)

LLM: 15 papers

"Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents" [Evaluation / Application / Method]: Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization)…

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents · Score 162
title matched "agent"; title matched "alignment"; summary matched "reasoning"
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps · Score 149
title matched "agent"; title matched "benchmark"; title matched "evaluation"
Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment · Score 145
title matched "agent"; title matched "alignment"; summary matched "reasoning"
Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views · Score 130
title matched "reasoning"; title matched "alignment"; summary matched "benchmark"
Revac: A Social Deduction Reasoning Agent · Score 127
title matched "agent"; title matched "reasoning"; summary matched "evaluation"

Vision: 10 papers

"PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving" [Evaluation / Method]: This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segme…

PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving · Score 106
title matched "multimodal"; title matched "segmentation"; has PDF
Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval · Score 100
title matched "diffusion"; title matched "multimodal"; has PDF
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis · Score 90
title matched "video generation"; summary matched "diffusion"; has PDF
MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation · Score 89
title matched "video generation"; summary matched "diffusion"; has PDF
MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention · Score 89
title matched "segmentation"; summary matched "diffusion"; has PDF

PubMed AI: 5 papers

"Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study." [Evaluation / Application / Method]: BACKGROUND: The American Society of Anesthesiologists Phy…

Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study. · Score 111
title matched "language model"; summary matched "benchmark"; summary matched "clinical"
Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study. · Score 109
title matched "language model"; title matched "clinical"; has DOI
Clinical Model Autophagy: The Risk of Interpretative Drift in Recursive Medical AI. · Score 93
title matched "clinical"; summary matched "language model"; has DOI
APSevLM: Acute Pancreatitis Severity Language Model. · Score 90
title matched "language model"; summary matched "clinical"; has DOI
Comparing Clinical Outcomes in Cardiac Surgical Patients Who Receive Sugammadex Versus Placebo: A Prospective Randomized Blinded Controlled Trial. · Score 88
title matched "clinical"; summary matched "benchmark"; has DOI

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.

2026-04-21

30 papers matched; generated 2026-04-21 11:40:46 (Asia/Shanghai)

LLM: 15 papers

"MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval" [Evaluation / Data / Method]: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing…

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval · Score 112
title matched "reasoning"; title matched "benchmark"; has PDF
Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion · Score 108
title matched "benchmark"; summary matched "reasoning"; summary matched "evaluation"
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents · Score 107
title matched "agent"; summary matched "benchmark"; summary matched "evaluation"
MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation · Score 107
title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation · Score 106
title matched "reasoning"; summary matched "agent"; summary matched "benchmark"

Vision: 10 papers

"AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation" [Application / Method]: Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing…

AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation · Score 87
title matched "video generation"; summary matched "diffusion"; has PDF
DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery · Score 85
title matched "diffusion"; summary matched "segmentation"; has PDF
Weakly-Supervised Referring Video Object Segmentation through Text Supervision · Score 76
title matched "segmentation"; summary matched "multimodal"; has PDF
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation · Score 72
title matched "segmentation"; has PDF; has rich summary
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models · Score 71
title matched "diffusion"; has PDF; has rich summary

PubMed AI: 5 papers

"Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients." [Evaluation / Data / Application / Method]: BACKGROUND: Clinical trial e…

Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients. · Score 101
title matched "clinical"; summary matched "language model"; summary matched "benchmark"
Developing and evaluating definitions of real-world clinical endpoints for patients with early-stage triple-negative breast cancer using a United States of America secondary database. · Score 83
title matched "clinical"; summary matched "benchmark"; has DOI
Investigating fine-tuning versus zero-shot learning for general large language models when predicting cancer survival from initial oncology consultation documents. · Score 83
title matched "language model"; summary matched "clinical"; has DOI
A Comparative Evaluation of Three Large Language Models for Parent-Centered Questions About Anorexia Nervosa. · Score 83
title matched "language model"; summary matched "clinical"; has DOI
Impacts of Multidisciplinary Lung Cancer Meeting Presentation in a Clinical Quality Registry. · Score 66
title matched "clinical"; has DOI; has rich summary

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.

2026-04-20

3 papers matched; generated 2026-04-20 11:48:52 (Asia/Shanghai)

LLM: 0 papers

No new matching papers for LLM today.

Vision: 0 papers

No new matching papers for Vision today.

PubMed AI: 1 paper

"Medic Training at Military-Civilian Partnerships-A Narrative Review." [Evaluation / Application / Method]: INTRODUCTION: Military-Civilian Partnerships (MCP) were developed to mitigate degradation of combat medical readiness during peacetime…

Medic Training at Military-Civilian Partnerships-A Narrative Review. · Score 62
summary matched "benchmark"; summary matched "clinical"; has DOI

OpenAlex AI: 2 papers

"Artificial Intelligence And The Transformation of Labor Markets" [Method]: The rapid advancement of artificial intelligence (AI) technologies, particularly generative AI and large language models, has reignited debates about…

Artificial Intelligence And The Transformation of Labor Markets · Score 60
summary matched "language model"; has DOI; has rich summary
Artificial Intelligence And The Transformation of Labor Markets · Score 60
summary matched "language model"; has DOI; has rich summary

2026-04-19

0 papers matched; generated 2026-04-19 11:46:32 (Asia/Shanghai)

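LLM: 0 papers

No new matching papers for LLM today.

Vision: 0 papers

No new matching papers for Vision today.

PubMed AI: 0 papers

No new matching papers for PubMed AI today.

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.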

2026-04-18

5 papers matched; generated 2026-04-18 11:26:55 (Asia/Shanghai)

LLM: 0 papers

No new matching papers for LLM today.

Vision: 0 papers

No new matching papers for Vision today.

PubMed AI: 5 papers

"Pretraining effective T5 generative models for clinical and biomedical applications." [Evaluation / Data / Application / Method]: This paper presents a study of the impact of corpus selection and vocabulary design on the performance of T5-base…

Pretraining effective T5 generative models for clinical and biomedical applications. · Score 108
title matched "clinical"; summary matched "language model"; summary matched "benchmark"
MILU: a consensus ensemble benchmark for multimodal medical imaging lecture understanding. · Score 82
title matched "benchmark"; summary matched "language model"; has DOI
Comparative performance of large language models and Drugs.com versus Lexicomp for antiseizure medication drug-drug interactions: A cross-sectional study with iterative prompting analysis. · Score 82
title matched "language model"; summary matched "clinical"; has DOI
Weakly Supervised Composed Object Re-Identification With Large Models. · Score 68
summary matched "language model"; summary matched "benchmark"; has DOI
An explainable multi-head attention network for healthcare IoT threat detection based on the MedDefender-MHAN framework. · Score 68
summary matched "benchmark"; summary matched "clinical"; has DOI

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.

2026-04-17

29 papers matched; generated 2026-04-17 11:39:21 (Asia/Shanghai)

LLM: 15 papers

"CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas" [Evaluation / Method]: It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, re…

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas · Score 130
title matched "agent"; title matched "benchmark"; summary matched "reasoning"
IE as Cache: Information Extraction Enhanced Agentic Reasoning · Score 124
title matched "agent"; title matched "reasoning"; summary matched "benchmark"
QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies · Score 123
title matched "benchmark"; summary matched "agent"; summary matched "alignment"
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench · Score 122
title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics · Score 109
title matched "benchmark"; title matched "evaluation"; has PDF

Vision: 9 papers

"SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation" [Application / Method]: Reliable uncertainty estimation is critical for medical image segmentation, where automated contours…

SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation · Score 72
title matched "segmentation"; has PDF; has rich summary
Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization · Score 70
title matched "segmentation"; has PDF; has rich summary
Boundary-Centric Active Learning for Temporal Action Segmentation · Score 70
title matched "segmentation"; has PDF; has rich summary
An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation · Score 70
title matched "diffusion"; has PDF; has rich summary
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework · Score 68
summary matched "diffusion"; summary matched "multimodal"; has PDF

PubMed AI: 5 papers

"Applying natural language processing and large language models to clinical notes for phenotyping and diagnosing rare diseases: a systematic review." [Evaluation / Data / Application / Method]: OBJECTIVES: Patients with rare diseases often face…

Applying natural language processing and large language models to clinical notes for phenotyping and diagnosing rare diseases: a systematic review. · Score 113
title matched "language model"; title matched "clinical"; has DOI
Evaluation of large language models with clinical guidance for vetting outpatient magnetic resonance imaging lumbar spine referrals. · Score 107
title matched "language model"; title matched "clinical"; has DOI
From Image to Pixels: towards Fine-Grained Medical Vision-Language Models. · Score 106
title matched "language model"; summary matched "benchmark"; summary matched "clinical"
Targeted use of large language models for EHR-based computable phenotyping. · Score 93
title matched "language model"; summary matched "clinical"; has DOI
Dual perspectives on large language models in rheumatology: physician-rated quality and patient-centered usability of GPT-4o versus DeepSeek-V3. · Score 85
title matched "language model"; summary matched "clinical"; has DOI

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.

2026-04-16

30 papers matched; generated 2026-04-16 11:43:00 (Asia/Shanghai)

LLM: 15 papers

"GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis" [Evaluation / Application / Method]: The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift…

GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis · Score 162
title matched "agent"; title matched "benchmark"; summary matched "reasoning"
HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark · Score 127
title matched "agent"; title matched "benchmark"; summary matched "evaluation"
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning · Score 120
title matched "evaluation"; summary matched "agent"; summary matched "reasoning"
LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning · Score 112
title matched "reasoning"; title matched "benchmark"; has PDF
Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis · Score 108
title matched "reasoning"; summary matched "benchmark"; summary matched "evaluation"

Vision: 10 papers

"ROSE: Retrieval-Oriented Segmentation Enhancement" [Evaluation / Method]: Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inab…

ROSE: Retrieval-Oriented Segmentation Enhancement · Score 90
title matched "segmentation"; summary matched "multimodal"; has PDF
Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models · Score 88
title matched "multimodal"; summary matched "segmentation"; has PDF
Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding · Score 78
title matched "multimodal"; summary matched "diffusion"; has PDF
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer · Score 78
title matched "diffusion"; summary matched "video generation"; has PDF
Seedance 2.0: Advancing Video Generation for World Complexity · Score 72
title matched "video generation"; has PDF; has rich summary

PubMed AI: 5 papers

"Augmenting Large Language Model With Prompt Engineering and Supervised Fine-Tuning in Non-Small Cell Lung Cancer Tumor-Node-Metastasis Staging: Framework Development and Validation." [Evaluation / Data / Application / Method]: BACKGROUND: Accu…

Augmenting Large Language Model With Prompt Engineering and Supervised Fine-Tuning in Non-Small Cell Lung Cancer Tumor-Node-Metastasis Staging: Framework Development and Validation. · Score 107
title matched "language model"; summary matched "benchmark"; summary matched "clinical"
PKFAR: psychiatry knowledge-fused augmented reasoning with large language models. · Score 98
title matched "language model"; summary matched "benchmark"; summary matched "clinical"
Fact-Checking Large Language Model Responses to a Health Care Prompt: Comparative Study. · Score 91
title matched "language model"; summary matched "clinical"; has DOI
Fine-Tuned Large Language Models for Automated Radiology Impression Generation: A Multicenter Evaluation. · Score 86
title matched "language model"; summary matched "clinical"; has DOI
A Multi-AI Agent Framework for Interactive Neurosurgical Education and Evaluation: From Vignettes to Virtual Conversations. · Score 76
summary matched "language model"; summary matched "benchmark"; summary matched "clinical"

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.

2026-04-15

30 papers matched; generated 2026-04-15 11:35:50 (Asia/Shanghai)

LLM: 15 papers

"Parallax: Why AI Agents That Think Must Never Act" [Evaluation / Application / Method]: Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise application…

Parallax: Why AI Agents That Think Must Never Act · Score 107
title matched "agent"; summary matched "reasoning"; summary matched "evaluation"
Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents · Score 107
title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss · Score 106
title matched "benchmark"; summary matched "reasoning"; summary matched "evaluation"
Towards Long-horizon Agentic Multimodal Search · Score 106
title matched "agent"; summary matched "reasoning"; summary matched "benchmark"
QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence · Score 105
title matched "agent"; summary matched "benchmark"; summary matched "evaluation"

Vision: 9 papers

"RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation" [Evaluation / Method]: Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveragin…

RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation · Score 100
title matched "multimodal"; title matched "segmentation"; has PDF
All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding · Score 78
title matched "multimodal"; summary matched "segmentation"; has PDF
Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation · Score 71
title matched "multimodal"; has PDF; has rich summary
AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation · Score 71
title matched "diffusion"; has PDF; has rich summary
Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation · Score 70
title matched "segmentation"; has PDF; has rich summary

PubMed AI: 5 papers

"VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model." [Evaluation / Data / Method]: The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artifi…

VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model. · Score 111
title matched "language model"; title matched "benchmark"; has DOI
Multimodal large language models in brain tumor imaging: clinical applications and future perspectives. · Score 109
title matched "language model"; title matched "clinical"; has DOI
Bridging the Modality Gap in Medical Vision-Language Models: A Hybrid Contrastive-Optimal Transport Framework for Enhanced Cross-Modal Alignment. · Score 107
title matched "language model"; summary matched "benchmark"; summary matched "clinical"
User Experience and Early Clinical Outcomes of a Mental Wellness Chatbot for Depression and Anxiety: Pilot Evaluation Mixed Methods Study. · Score 93
title matched "clinical"; summary matched "language model"; has DOI
Comparison of AI-based Chatbot Performance in Analyzing Clinical Scenarios versus Medical Residents: A Novel Approach in Chest Diseases Education. · Score 82
title matched "clinical"; summary matched "language model"; has DOI

OpenAlex AI: 1 paper

"Demystifying Attitudes and Effects of Usage of Large-Language Models Among College-Aged Students" [Method]: In compiling literature for my senior seminar on combating hallucinations present within responses from large-langua…

Demystifying Attitudes and Effects of Usage of Large-Language Models Among College-Aged Students · Score 70
title matched "language model"; has rich summary; has complete metadata

2026-04-14

31 papers matched; generated 2026-04-14 11:37:06 (Asia/Shanghai)

LLM: 15 papers

"UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents" [Evaluation / Data / Application / Method]: Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems throu…

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents · Score 145
title matched "agent"; title matched "evaluation"; summary matched "reasoning"
General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks · Score 130
title matched "reasoning"; title matched "benchmark"; summary matched "evaluation"
Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games · Score 129
title matched "agent"; title matched "reasoning"; summary matched "benchmark"
FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning · Score 127
title matched "agent"; title matched "reasoning"; summary matched "evaluation"
From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python · Score 126
title matched "agent"; title matched "benchmark"; summary matched "evaluation"

Vision: 10 papers

"OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation" [Evaluation / Data / Application / Method]: In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality…

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation · Score 112
title matched "video generation"; title matched "multimodal"; has PDF
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation · Score 90
title matched "segmentation"; summary matched "multimodal"; has PDF
GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth · Score 87
title matched "segmentation"; summary matched "multimodal"; has PDF
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model · Score 86
title matched "multimodal"; summary matched "diffusion"; has PDF
GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays · Score 78
summary matched "diffusion"; summary matched "multimodal"; has DOI

PubMed AI: 5 papers

"Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: Quantitative Pilot Feasibility Study." [Evaluation / Application / Method]: BACKGROUND: Translation of medical consulta…

Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: Quantitative Pilot Feasibility Study. · Score 107
title matched "language model"; summary matched "benchmark"; summary matched "clinical"
Toward Sustainable Clinical Analysis: Benchmarking Plastic Use in LC-MS Sample Preparation - Exemplified by Ketamine Analogues in Whole Blood. · Score 107
title matched "benchmark"; title matched "clinical"; has DOI
Text4Seg++: Advancing Image Segmentation via Generative Language Modeling. · Score 89
title matched "language model"; summary matched "benchmark"; has DOI
Diversity in clinical Trials: The example of systemic lupus erythematosus. · Score 82
title matched "clinical"; summary matched "benchmark"; has DOI
Comparative Performance of Gemini 3 Pro and GPT-5 Family Models on Ophthalmology Board-Style Questions. · Score 78
summary matched "language model"; summary matched "benchmark"; summary matched "clinical"

OpenAlex AI: 1 paper

"ECO-Charge: Multi-Agent Smart-Charging for Electric Vehicles" [Method]: International audience

ECO-Charge: Multi-Agent Smart-Charging for Electric Vehicles · Score 64
title matched "agent"; has complete metadata

2026-04-13

0 papers matched; generated 2026-04-13 16:13:43 (Asia/Shanghai)

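LLM: 0 papers

No new matching papers for LLM today.

Vision: 0 papers

No new matching papers for Vision today.

PubMed AI: 0 papers

No new matching papers for PubMed AI today.

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.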

2026-04-12

1 paper matched; generated 2026-04-12 22:15:33 (Asia/Shanghai)

LLM: 0 papers

No new matching papers for LLM today.

Vision: 0 papers

No new matching papers for Vision today.

PubMed AI: 1 paper

"Combining structural modeling and deep learning to calculate the E. coli protein interactome and functional networks." [Data / Method]: We report on the integration of three methods that predict, on a proteome-wide scale, whet…

Combining structural modeling and deep learning to calculate the E. coli protein interactome and functional networks. · Score 48
summary matched "language model"; has DOI; has rich summary

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.

2026-04-11

9 papers matched; generated 2026-04-11 23:09:08 (Asia/Shanghai)

LLM: 0 papers

No new matching papers for LLM today.

Vision: 0 papers

No new matching papers for Vision today.

PubMed AI: 4 papers

"Factors influencing large language model adoption among dental students: a cross-sectional study." [Application / Method]: This research evaluates the factors influencing the behavioural intention (BI) to adopt large language models…

Evaluating the clinical decision-making performance of large language models in clinically oriented thoracic anatomy scenarios: a comparative evaluation study. · Score 104
title matched "language model"; title matched "clinical"; has DOI
Exploratory study of large language models in surgical decision-making for lumbar disc herniation: a multicenter analysis based on multisource clinical information. · Score 104
title matched "language model"; title matched "clinical"; has DOI
A hybrid large language model framework for structured data entry from code-switched persian clinical speech. · Score 104
title matched "language model"; title matched "clinical"; has DOI
Factors influencing large language model adoption among dental students: a cross-sectional study. · Score 88
title matched "language model"; summary matched "clinical"; has DOI

OpenAlex AI: 5 papers

"Coalition Drift: When Agents Drift Together Why multi-agent systems don't just drift individually — they drift as a group, and why that matters more than any single-agent failure mode." [Method]: Most AI governance framework…

Coalition Drift: When Agents Drift Together Why multi-agent systems don't just drift individually — they drift as a group, and why that matters more than any single-agent failure mode. · Score 70
title matched "agent"; has DOI; has rich summary
Coalition Drift: When Agents Drift Together Why multi-agent systems don't just drift individually — they drift as a group, and why that matters more than any single-agent failure mode. · Score 70
title matched "agent"; has DOI; has rich summary
Coalition Formation Events: How Multi-Agent Systems Create Temporary Actors · Score 70
title matched "agent"; has DOI; has rich summary
Coalition Formation Events: How Multi-Agent Systems Create Temporary Actors · Score 70
title matched "agent"; has DOI; has rich summary
U-P Duality in Multi-Agent Systems: A Seven-Space Algorithm for Complex Nonlinear AI (Corrected Version) · Score 70
title matched "agent"; has DOI; has rich summary

2026-04-10

0 papers matched; generated 2026-04-10 18:14:08 (Asia/Shanghai)

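LLM: 0 papers

No new matching papers for LLM today.

Vision: 0 papers

No new matching papers for Vision today.

PubMed AI: 0 papers

No new matching papers for PubMed AI today.

OpenAlex AI: 0 papers

No new matching papers for OpenAlex AI today.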

2026-04-09

5 papers matched; generated 2026-04-09 14:51:56 (Asia/Shanghai)

LLM: 0 papers

No new matching papers for LLM today.

Vision: 0 papers

No new matching papers for Vision today.

PubMed AI: 5 papers

"Subcategory vs category fluency: Items and networks in healthy young adults and simulation with a large language model." [Evaluation / Application / Method]: Category fluency tasks involve producing words constrained by a semantic field (ani…

2026-04-08

25 papers matched; generated 2026-04-08 17:10:24 (Asia/Shanghai)

LLM: 15 papers

15 papers included; highlights include "Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework" and "Topological Characterization of Churn Flow and Unsupervised Correction to the Wu Flow-Regime Map in Small-Diameter Vertic…

Vision: 10 papers

10 papers included; highlights include "Action Images: End-to-End Policy Learning via Multiview Video Generation" and "DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models".