Paper Digest Archive

Research Daily Digest Archive

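Aggregates each day's generated digest.json and digest.md, with filtering by feed and search by title keyword. It also provides fixed feed pages, long-term keyword tracking pages, a rising-trend view, a reading list, weekly reviews, and trend pages for the last 7 and 30 days.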
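As a rough illustration of the feed filter and title keyword search described above, here is a minimal sketch. The `Paper` record, its field names, and the feed name `"code"` are assumptions made for the example; the actual digest.json schema is not shown on this page. Titles and scores are taken from the 2026-06-26 listing.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record shape; the real digest.json fields may differ.
    title: str
    feed: str
    score: int


def filter_papers(papers, feed=None, keyword=None):
    """Keep papers in the given feed whose title contains the keyword.

    Both filters are optional, and the keyword match is case-insensitive.
    """
    needle = keyword.lower() if keyword else None
    return [
        p
        for p in papers
        if (feed is None or p.feed == feed)
        and (needle is None or needle in p.title.lower())
    ]


papers = [
    Paper(
        "NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science "
        "Competence in Large Language Models",
        "LM",
        218,
    ),
    Paper("Semantic Early-Stopping for Iterative LLM Agent Loops", "LM", 160),
    # "code" is an assumed feed name, used only for illustration.
    Paper("A Deterministic Control Plane for LLM Coding Agents", "code", 64),
]

lm_agent_hits = filter_papers(papers, feed="LM", keyword="agent")
all_agent_hits = filter_papers(papers, keyword="agent")
```

Combining the two filters with a logical AND matches how a page would typically narrow results, but the site may behave differently.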

Subscriptions

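Upgrades the site from browsing day by day to following topics over the long term. Fixed pages are better suited to revisiting the same kind of research signal every day.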

Last 7 days: 91 papers, 7 digests

Last 30 days: 378 papers, 30 digests

All archives: 1016 papers, 82 digests

2026-06-26

Matched 26 papers · Generated 2026-06-26 13:16:53 (Asia/Shanghai)
LM: 15 papers

"NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models" [Evaluation / Data / Methods]: Large language models (LLMs) have demonstrated strong performance across a wide range of tasks, but e…

  1. NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models · Score 218
    title matched "language model";title matched "large language model";title matched "benchmark"
  2. Joint Learning of Experiential Rules and Policies for Large Language Model Agents · Score 165
    title matched "language model";title matched "large language model";title matched "agent"
  3. The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans · Score 165
    title matched "language model";title matched "large language model";title matched "reasoning"
  4. Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings · Score 163
    title matched "language model";title matched "large language model";summary matched "LLM"
  5. Semantic Early-Stopping for Iterative LLM Agent Loops · Score 160
    title matched "LLM";title matched "agent";summary matched "language model"

"Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries" [Evaluation / Applications / Methods]: With a profusion of jailbreaks for LLMs now widely known, a growing concern is that…

  1. Jailbreaking for the Average Jane: Choosing Optimal Jailbreaks via Bandit Algorithms for Automatically Enhanced Queries · Score 64
    title matched "jailbreak";has PDF;has rich summary
  2. Do Safety Guardrails Need to Reason? LeanGuard: A Fast and Light Approach for Robust Moderation · Score 59
    title matched "guardrail";has PDF;has rich summary
  3. AgentX: Towards Agent-Driven Self-Iteration of Industrial Recommender Systems · Score 41
    summary matched "guardrail";has PDF;has rich summary
  4. MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG · Score 40
    summary matched "prompt injection";has PDF;has rich summary

"Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair" [Evaluation / Methods]: Language Models (LLMs) are powerful tools and have been increasingly adopted for complex software engineering tasks…

  1. Smaller Models, Unexpected Costs: Trade-offs in LLM Quantization for Automated Program Repair · Score 108
    title matched "program repair";title matched "automated program repair";has PDF
  2. To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair · Score 83
    title matched "program repair";summary matched "SWE-bench";has PDF
  3. How Much Static Structure Do Code Agents Need? A Study of Deterministic Anchoring · Score 65
    title matched "code agent";has PDF;has rich summary
  4. A Deterministic Control Plane for LLM Coding Agents · Score 64
    title matched "coding agent";has PDF;has rich summary
  5. NOVA: A Verification-Aware Agent Harness for Architecture Evolution in Industrial Recommender Systems · Score 47
    summary matched "coding agent";has PDF;has rich summary

2026-06-25

Matched 20 papers · Generated 2026-06-25 13:11:21 (Asia/Shanghai)
LM: 15 papers

"InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy" [Evaluation / Applications / Methods]: Large language models are increasingly deployed as investment res…

  1. InvestPhilBench: A Multi-Layer Dynamic Benchmark for Evaluating Large Language Model Procedural Reasoning in Expert Investment Philosophy · Score 188
    title matched "language model";title matched "large language model";title matched "reasoning"
  2. Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models · Score 182
    title matched "language model";title matched "large language model";summary matched "LLM"
  3. How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations · Score 182
    title matched "language model";title matched "reasoning";summary matched "LLM"
  4. MedGuards: Multi-Agent System for Reliable Medical Error Detection and Correction · Score 170
    title matched "agent";summary matched "language model";summary matched "large language model"
  5. Probabilistic Agents in Deterministic Audits: Evaluating Multi-Agent Systems for Automated Audits Based on the German IT-Grundschutz · Score 164
    title matched "agent";summary matched "LLM";summary matched "reasoning"

"How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring" [Methods]: Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number…

  1. How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring · Score 78
    title matched "jailbreak";summary matched "prompt injection";has PDF
  2. The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems · Score 48
    summary matched "guardrail";has PDF;has rich summary
  3. AI Snitches Get Glitches: Towards Evading Agentic Surveillance · Score 44
    summary matched "prompt injection";has PDF;has rich summary

"Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution" [Evaluation / Applications / Methods]: Resolving issues with ambiguous and incomplete descriptions, particularly concerning complex bugs, requi…

  1. Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution · Score 78
    title matched "issue resolution";summary matched "SWE-bench";has PDF
  2. Evaluating LLMs on Real-World Software Performance Optimization · Score 38
    summary matched "repository-level";has PDF;has rich summary

2026-06-24

Matched 26 papers · Generated 2026-06-24 13:06:49 (Asia/Shanghai)
LM: 15 papers

"AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning" [Evaluation / Methods]: Large language models are increasingly deployed as agents that reason over documents rather than answer from parametric knowledge.…

  1. AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning · Score 199
    title matched "reasoning";title matched "agent";title matched "benchmark"
  2. AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach · Score 195
    title matched "language model";title matched "large language model";title matched "RAG"
  3. A specialized reasoning large language model for accelerating rare disease diagnosis: a randomized AI physician assistance trial · Score 181
    title matched "language model";title matched "large language model";title matched "reasoning"
  4. EG-VQA: Benchmarking Verifiable Video Question Answering with Grounded Temporal Evidence · Score 177
    title matched "benchmark";summary matched "language model";summary matched "large language model"
  5. Are We Ready For An Agent-Native Memory System? · Score 177
    title matched "agent";summary matched "language model";summary matched "large language model"

"Burnyard: Future of Malware Analysis" [Methods]: Malware analysis is a critical aspect of modern cybersecurity. The prevailing industry practice, sandboxing, involves executing suspicious binaries within isol…; "LLMs Prompted…

  1. Burnyard: Future of Malware Analysis · Score 47
    summary matched "sandboxing";has PDF;has rich summary
  2. LLMs Prompted for Legal Context Object More: Overrefusal from Small On-Premises LLMs in Criminal Legal Context · Score 44
    summary matched "jailbreak";has PDF;has rich summary
  3. Red-Teaming the Agentic Red-Team · Score 43
    summary matched "guardrail";has PDF;has rich summary
  4. PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models · Score 41
    summary matched "guardrail";has PDF;has rich summary
  5. Securing LLM-Agent Long-Term Memory Against Poisoning: Non-Malleable, Origin-Bound Authority with Machine-Checked Guarantees · Score 39
    summary matched "data exfiltration";has PDF;has rich summary

"SHERLOC: Structured Diagnostic Localization for Code Repair Agents" [Methods]: LLM agents solve repository-level coding tasks through multi-turn tool use, but utilize half their budget on locating faults before editing. Dedic…

  1. SHERLOC: Structured Diagnostic Localization for Code Repair Agents · Score 105
    title matched "code repair";summary matched "SWE-bench";summary matched "repository-level"
  2. NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers? · Score 65
    title matched "coding agent";has PDF;has rich summary
  3. Bayesian control for coding agents · Score 64
    title matched "coding agent";has PDF;has rich summary
  4. Detecting AI Coding Agents in Open Source: A Validated Multi-Method Census of 180 Million Repositories · Score 63
    title matched "coding agent";has PDF;has rich summary
  5. LemonHarness Technical Report · Score 39
    summary matched "Terminal-Bench";has PDF;has rich summary

2026-06-23

Matched 19 papers · Generated 2026-06-23 13:10:02 (Asia/Shanghai)
LM: 15 papers

"AIR: Adaptive Interleaved Reasoning with Code in MLLMs" [Evaluation / Data / Applications / Methods]: Following the paradigm shift initiated by OpenAI o3, interleaved reasoning with code to enhance multimodal large language models (MLLMs) has be…

  1. AIR: Adaptive Interleaved Reasoning with Code in MLLMs · Score 200
    title matched "LLM";title matched "reasoning";summary matched "language model"
  2. TriggerBench: Investigating Prospective Memory for Large Language Models · Score 197
    title matched "language model";title matched "large language model";summary matched "LLM"
  3. Can LLMs Reliably Self-Report Adversarial Prefills, and How? · Score 160
    title matched "LLM";summary matched "language model";summary matched "large language model"
  4. Evaluation Awareness Is Not One Capability: Evidence from Open Language Models · Score 145
    title matched "language model";title matched "evaluation";summary matched "instruction tuning"
  5. POTracker: Optimizing Large Language Models for Standard-Compliant Power Outage Report Generation · Score 145
    title matched "language model";title matched "large language model";summary matched "LLM"

"Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity?" [Evaluation / Applications / Methods]: Computer-use agents (CUAs) now act on a user's behalf across personal applications such as email, calendars, and to-do lists. Thi…

  1. Capable but Careless: Do Computer-Use Agents Follow Contextual Integrity? · Score 64
    title matched "computer-use agent";has PDF;has rich summary
  2. TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization · Score 46
    summary matched "jailbreak";has PDF;has rich summary
  3. GIF: Locally Sound Geometric Information Flow Control for LLMs · Score 43
    summary matched "prompt injection";has PDF;has rich summary

"Tmax: A simple recipe for terminal agents" [Evaluation / Data / Applications / Methods]: Terminal-using agents have quickly become the most popular downstream application of language models (LMs). Despite their prevalence, relatively little acad…

  1. Tmax: A simple recipe for terminal agents · Score 84
    title matched "terminal agent";summary matched "Terminal-Bench";has PDF

2026-06-19

Matched 22 papers · Generated 2026-06-19 14:26:15 (Asia/Shanghai)
LM: 15 papers

"QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation" [Evaluation / Methods]: Large Language Models (LLMs) have made significant progress in reasoning, particularly in ded…

  1. QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation · Score 221
    title matched "language model";title matched "large language model";title matched "reasoning"
  2. LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems · Score 201
    title matched "LLM";title matched "agent";title matched "benchmark"
  3. Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference · Score 191
    title matched "LLM";summary matched "language model";summary matched "large language model"
  4. Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems · Score 162
    title matched "LLM";title matched "agent";summary matched "language model"
  5. Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users · Score 162
    title matched "LLM";title matched "alignment";summary matched "language model"

"What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?" [Methods]: Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different type…

  1. What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? · Score 46
    summary matched "jailbreak";has PDF;has rich summary
  2. Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems · Score 46
    summary matched "jailbreak";has PDF;has rich summary
  3. RACL: Reasoning-Agent Control Layers for Continuous Metaheuristic Learning · Score 41
    summary matched "guardrail";has PDF;has rich summary
  4. Beyond Static Endpoints: Tool Programs as an Interface for Flexible Agentic Web Services · Score 39
    summary matched "sandboxing";has PDF;has rich summary

"Probe-and-Refine Tuning of Repository Guidance for Coding Agents" [Applications / Methods]: LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test sui…

  1. Probe-and-Refine Tuning of Repository Guidance for Coding Agents · Score 87
    title matched "coding agent";summary matched "SWE-bench";has PDF
  2. Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs · Score 83
    title matched "issue resolution";summary matched "SWE-bench";has PDF
  3. N-Version Programming with Coding Agents · Score 63
    title matched "coding agent";has PDF;has rich summary

2026-06-18

Matched 17 papers · Generated 2026-06-18 14:03:08 (Asia/Shanghai)
LM: 15 papers

"Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play" [Evaluation / Methods]: Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with exec…

  1. Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play · Score 185
    title matched "language model";title matched "large language model";title matched "agent"
  2. IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages · Score 182
    title matched "language model";title matched "large language model";title matched "benchmark"
  3. A Technical Taxonomy of LLM Agent Communication Protocols · Score 160
    title matched "LLM";title matched "agent";summary matched "language model"
  4. Quantifying and Auditing LLM Evaluation via Positive--Unlabeled Learning · Score 159
    title matched "LLM";title matched "evaluation";summary matched "language model"
  5. Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA · Score 158
    title matched "LLM";summary matched "language model";summary matched "large language model"

"CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts" [Methods]: Code large language models increasingly retrieve external code context from repositories, documentation, issue threads, and co…

  1. CodeSentinel: A Three-Layer Defense Against Indirect Prompt Injection in Code Contexts · Score 108
    title matched "prompt injection";title matched "indirect prompt injection";has PDF

"Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents" [Evaluation / Applications / Methods]: Production data integration is bottlenecked by repeated, lossy handoffs between data owners, en…

  1. Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents · Score 69
    title matched "coding agent";has PDF;has rich summary

2026-06-17

Matched 23 papers · Generated 2026-06-17 14:22:19 (Asia/Shanghai)
LM: 15 papers

"Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports" [Evaluation / Data / Applications / Methods]: Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowled…

  1. Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports · Score 176
    title matched "LLM";summary matched "language model";summary matched "large language model"
  2. Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical Interviews · Score 164
    title matched "language model";title matched "large language model";title matched "RAG"
  3. The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act · Score 162
    title matched "reasoning";title matched "benchmark";summary matched "language model"
  4. WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning · Score 162
    title matched "reasoning";title matched "agent";summary matched "language model"
  5. From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning · Score 161
    title matched "language model";title matched "reasoning";summary matched "large language model"

"Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners" [Applications / Methods]: Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing ski…

  1. Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners · Score 47
    summary matched "privilege escalation";has PDF;has rich summary
  2. A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models · Score 47
    summary matched "jailbreak";has PDF;has rich summary
  3. PreAct: Computer-Using Agents that Get Faster on Repeated Tasks · Score 43
    summary matched "guardrail";has PDF;has rich summary

"All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code" [Methods]: Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent…

  1. All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code · Score 46
    summary matched "coding agent";has PDF;has rich summary
  2. LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling · Score 44
    summary matched "SWE-bench";has PDF;has rich summary
  3. VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination · Score 44
    summary matched "code generation benchmark";has PDF;has rich summary
  4. GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? · Score 42
    summary matched "coding agent";has PDF;has rich summary
  5. Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering · Score 40
    summary matched "coding agent";has PDF;has rich summary

2026-06-16

Matched 23 papers · Generated 2026-06-16 14:38:43 (Asia/Shanghai)
LM: 15 papers

"OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models" [Evaluation / Applications / Methods]: Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems…

  1. OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models · Score 235
    title matched "language model";title matched "large language model";title matched "agent"
  2. Context-Aware RL for Agentic and Multimodal LLMs · Score 199
    title matched "LLM";title matched "agent";summary matched "language model"
  3. Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio · Score 185
    title matched "LLM";title matched "agent";title matched "benchmark"
  4. Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification · Score 184
    title matched "language model";title matched "large language model";title matched "agent"
  5. Scalable Circuit Learning for Interpreting Large Language Models · Score 162
    title matched "language model";title matched "large language model";summary matched "LLM"

"Automated jailbreak attack targeting multiple defense strategies" [Evaluation / Methods]: Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical c…

  1. Automated jailbreak attack targeting multiple defense strategies · Score 65
    title matched "jailbreak";has PDF;has rich summary
  2. MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents · Score 65
    title matched "computer-use agent";has PDF;has rich summary
  3. DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing · Score 61
    title matched "jailbreak";has PDF;has rich summary
  4. KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing · Score 47
    summary matched "prompt injection";has PDF;has rich summary
  5. Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models · Score 44
    summary matched "jailbreak";has PDF;has rich summary

"Agent trajectories as programs: fingerprinting and programming coding-agent behavior" [Evaluation / Data / Applications / Methods]: Benchmark scores tell you what an agent got right; they do not tell you how it got there. In this work, we introd…

  1. Agent trajectories as programs: fingerprinting and programming coding-agent behavior · Score 64
    summary matched "SWE-bench";summary matched "coding agent";has PDF
  2. Towards LLM Accelerated Rapid Reviews for Software Tool Discovery -- Case for Log Anomaly Detection · Score 44
    summary matched "coding agent";has PDF;has rich summary
  3. No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages · Score 44
    summary matched "code generation benchmark";has PDF;has rich summary

2026-06-12

Matched 22 papers · Generated 2026-06-12 13:55:02 (Asia/Shanghai)
LM: 15 papers

"EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments" [Evaluation / Applications / Methods]: Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations as…

  1. EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments · Score 200
    title matched "LLM";title matched "agent";summary matched "language model"
  2. Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents · Score 178
    title matched "agent";title matched "benchmark";summary matched "language model"
  3. An LLM System for Autonomous Variational Quantum Circuit Design · Score 174
    title matched "LLM";summary matched "language model";summary matched "large language model"
  4. Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities · Score 168
    title matched "LLM";summary matched "language model";summary matched "large language model"
  5. SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning · Score 164
    title matched "reasoning";title matched "agent";summary matched "language model"

"Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda" [Applications / Methods]: LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. W…

  1. Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda · Score 44
    summary matched "guardrail";has PDF;has rich summary
  2. ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm · Score 41
    summary matched "computer-use agent";has PDF;has rich summary
  3. Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents · Score 40
    summary matched "agent runtime";has PDF;has rich summary
  4. No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions · Score 38
    summary matched "prompt injection";has PDF;has rich summary
  5. Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior · Score 38
    summary matched "prompt injection";has PDF;has rich summary

"Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset" [Data / Methods]: AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in sof…

  1. Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset · Score 57
    summary matched "coding agent";has DOI;has PDF
  2. Recursive Agent Harnesses · Score 47
    summary matched "coding agent";has PDF;has rich summary

2026-06-11

Matched 22 papers · Generated 2026-06-11 13:59:12 (Asia/Shanghai)
LM: 15 papers

"Measuring Epistemic Resilience of LLMs Under Misleading Medical Context" [Evaluation / Applications / Methods]: Large language models (LLMs) now reach expert-level scores on medical licensing exams, encouraging the assumption that high scores…

  1. Measuring Epistemic Resilience of LLMs Under Misleading Medical Context · Score 194
    title matched "LLM";summary matched "language model";summary matched "large language model"
  2. Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation · Score 182
    title matched "LLM";title matched "benchmark";title matched "evaluation"
  3. OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models · Score 178
    title matched "language model";title matched "reasoning";summary matched "alignment"
  4. Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization · Score 159
    title matched "reasoning";summary matched "language model";summary matched "large language model"
  5. ALIGNBEAM : Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing · Score 159
    title matched "alignment";summary matched "language model";summary matched "large language model"

"Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code" [Evaluation / Methods]: Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce mali…

  1. Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code · Score 60
    title matched "jailbreak";has PDF;has rich summary
  2. OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents · Score 47
    summary matched "jailbreak";has PDF;has rich summary
  3. Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers · Score 41
    summary matched "jailbreak";has PDF;has rich summary
  4. External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs · Score 38
    summary matched "prompt injection";has PDF;has rich summary

"PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents" [Applications / Methods]: AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet th…

  1. PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents · Score 69
    title matched "coding agent";has PDF;has rich summary
  2. Exploration Structure in LLM Agents for Multi-File Change Localization · Score 59
    summary matched "SWE-bench";summary matched "SWE bench";has PDF
  3. Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production · Score 39
    summary matched "code agent";has PDF;has rich summary

2026-06-10

Matched 25 papers · Generated 2026-06-10 13:25:04 (Asia/Shanghai)
LM: 15 papers

"T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains" [Evaluation / Applications / Methods]: Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic sys…

  1. T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains · Score 217
    title matched "agent";title matched "benchmark";summary matched "language model"
  2. Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution · Score 215
    title matched "LLM";title matched "agent";summary matched "language model"
  3. ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity · Score 200
    title matched "agent";title matched "benchmark";summary matched "language model"
  4. Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning · Score 180
    title matched "LLM";title matched "reasoning";summary matched "language model"
  5. Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam? · Score 175
    title matched "LLM";summary matched "language model";summary matched "large language model"

"Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation" [Evaluation / Applications / Methods]: Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke t…

  1. Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation · Score 78
    summary matched "agent security";summary matched "LLM agent security";summary matched "prompt injection"
  2. Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields · Score 68
    title matched "computer-use agent";has PDF;has rich summary
  3. Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories · Score 48
    summary matched "computer-use agent";has PDF;has rich summary
  4. It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO · Score 45
    summary matched "guardrail";has PDF;has rich summary
  5. Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization · Score 44
    summary matched "prompt injection";has PDF;has rich summary

"Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages" [Evaluation / Methods]: LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and…

  1. Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages · Score 103
    title matched "coding agent";summary matched "Terminal-Bench";summary matched "SWE-bench"
  2. AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies · Score 60
    summary matched "coding agent";summary matched "code agent";has PDF
  3. DeNovoSWE: Scaling Long-Horizon Environments for Generating Entire Repositories from Scratch · Score 60
    summary matched "code agent";summary matched "bug fixing";has PDF

2026-06-09

Matched 22 papers · Generated 2026-06-09 13:12:49 (Asia/Shanghai)
LM: 15 papers

"SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks" [Evaluation / Methods]: Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and op…

  1. SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks · Score 238
    title matched "reasoning";title matched "agent";title matched "benchmark"
  2. Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving · Score 180
    title matched "LLM";title matched "benchmark";summary matched "language model"
  3. TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs · Score 179
    title matched "LLM";title matched "benchmark";summary matched "language model"
  4. Gradient-Guided Reward Optimization for Inference-time Alignment · Score 176
    title matched "alignment";summary matched "language model";summary matched "large language model"
  5. IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking · Score 173
    summary matched "language model";summary matched "large language model";summary matched "LLM"

《WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces》〔Evaluation / Method〕: Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line ex…

  1. WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces · Score 83
    title matched "computer-use agent";summary matched "agent runtime";has PDF
  2. Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents · Score 63
    title matched "prompt injection";has PDF;has rich summary
  3. What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks · Score 47
    summary matched "guardrail";has PDF;has rich summary
  4. PRISM: Recovering Instruction Sets from Language Model Activations · Score 45
    summary matched "prompt injection";has PDF;has rich summary

《SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation》〔Method〕: Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them ca…

  1. SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation · Score 48
    summary matched "coding agent";has PDF;has rich summary
  2. From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design · Score 46
    summary matched "SWE-bench";has PDF;has rich summary
  3. Self-Harness: Harnesses That Improve Themselves · Score 44
    summary matched "Terminal-Bench";has PDF;has rich summary

2026-06-05

30 papers matched · Generated 2026-06-05 13:25:00 (Asia/Shanghai)
LM · 15 papers

《MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models》〔Evaluation / Method〕: Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that…

  1. MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models · Score 214
    title matched "language model";title matched "large language model";title matched "benchmark"
  2. CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments · Score 210
    title matched "LLM";title matched "agent";summary matched "language model"
  3. AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints · Score 196
    title matched "language model";title matched "large language model";title matched "agent"
  4. The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models · Score 196
    title matched "language model";title matched "large language model";title matched "reasoning"
  5. Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems · Score 192
    title matched "LLM";title matched "agent";summary matched "language model"

《GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection》〔Evaluation / Data / Application / Method〕: Large Language Models (LLMs) have transformed natural language processing, but they remai…

  1. GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection · Score 138
    title matched "prompt injection";title matched "jailbreak";summary matched "guardrail"
  2. From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents · Score 80
    title matched "guardrail";has PDF;has rich summary
  3. Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack · Score 76
    summary matched "jailbreak";summary matched "guardrail";has PDF
  4. Beyond Similarity: Trustworthy Memory Search for Personal AI Agents · Score 58
    summary matched "jailbreak";has PDF;has rich summary
  5. The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models · Score 58
    summary matched "guardrail";has PDF;has rich summary

《ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer》〔Evaluation / Method〕: The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any…

  1. ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer · Score 94
    summary matched "Terminal-Bench";summary matched "SWE-bench";summary matched "coding agent"
  2. Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement · Score 80
    title matched "code agent";has PDF;has rich summary
  3. Knowledge Matters: Injecting Project and Testing Knowledge into LLM-based Unit Test Generation · Score 80
    title matched "test generation";has PDF;has rich summary
  4. SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks · Score 80
    title matched "code agent";has PDF;has rich summary
  5. From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws · Score 76
    summary matched "Terminal-Bench";summary matched "SWE-bench";has PDF

2026-06-04

27 papers matched · Generated 2026-06-04 14:02:06 (Asia/Shanghai)
LM · 15 papers

《A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs》〔Evaluation / Application / Method〕: Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under mult…

  1. A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs · Score 191
    title matched "LLM";title matched "evaluation";summary matched "language model"
  2. Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases · Score 191
    title matched "language model";title matched "large language model";summary matched "LLM"
  3. Self-Evolving Deep Research via Joint Generation and Evaluation · Score 187
    title matched "evaluation";summary matched "language model";summary matched "large language model"
  4. Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents · Score 177
    title matched "LLM";title matched "agent";title matched "benchmark"
  5. Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas · Score 177
    title matched "language model";title matched "large language model";title matched "alignment"

《MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models》〔Evaluation / Application / Method〕: Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences unde…

  1. MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models · Score 79
    title matched "jailbreak";has PDF;has rich summary
  2. What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems · Score 79
    title matched "prompt injection";has PDF;has rich summary
  3. Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents · Score 75
    summary matched "prompt injection";summary matched "indirect prompt injection";has PDF
  4. AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning · Score 57
    summary matched "agent runtime";has PDF;has rich summary
  5. From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents · Score 57
    summary matched "prompt injection";has PDF;has rich summary

《Latent Anchor-Driven Test Generation for Deep Neural Networks》〔Data / Application / Method〕: Deep Neural Networks (DNNs) are increasingly being deployed in security-critical and safety-sensitive applications, which makes rigorous test…

  1. Latent Anchor-Driven Test Generation for Deep Neural Networks · Score 79
    title matched "test generation";has PDF;has rich summary
  2. Can Generalist Agents Automate Data Curation? · Score 57
    summary matched "coding agent";has PDF;has rich summary
  3. Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation · Score 57
    summary matched "SWE-bench";has PDF;has rich summary
  4. The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? · Score 57
    summary matched "code agent";has PDF;has rich summary
  5. The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents · Score 57
    summary matched "SWE-bench";has PDF;has rich summary

2026-06-03

32 papers matched · Generated 2026-06-03 14:09:56 (Asia/Shanghai)
LM · 15 papers

《Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning》〔Evaluation / Method〕: Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficie…

  1. Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning · Score 213
    title matched "LLM";title matched "reasoning";title matched "agent"
  2. Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models · Score 213
    title matched "language model";title matched "large language model";title matched "alignment"
  3. Can Factual Opinions Be Edited (Manipulated) in Large Language Models? · Score 191
    title matched "language model";title matched "large language model";summary matched "LLM"
  4. Large Language Models Are Overconfident in Their Own Responses · Score 191
    title matched "language model";title matched "large language model";summary matched "LLM"
  5. Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models · Score 177
    title matched "language model";title matched "large language model";title matched "alignment"

《D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting》〔Evaluation / Data / Method〕: Multi-turn jailbreak attacks pose a growing threat to large language model (LLM) safety because they exploit feedback…

  1. D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting · Score 79
    title matched "jailbreak";has PDF;has rich summary
  2. MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents · Score 79
    title matched "computer-use agent";has PDF;has rich summary
  3. MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety · Score 79
    title matched "jailbreak";has PDF;has rich summary
  4. From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework · Score 75
    summary matched "prompt injection";summary matched "malicious tool";has PDF
  5. Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems · Score 57
    summary matched "guardrail";has PDF;has rich summary

《What Makes Interaction Trajectories Effective for Training Terminal Agents?》〔Method〕: Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from…

  1. What Makes Interaction Trajectories Effective for Training Terminal Agents? · Score 115
    title matched "terminal agent";summary matched "Terminal-Bench";summary matched "code agent"
  2. Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing · Score 97
    title matched "code agent";summary matched "coding agent";has PDF
  3. Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment · Score 97
    title matched "repository-level";summary matched "repository level";has PDF
  4. Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks · Score 79
    title matched "coding agent";has PDF;has rich summary
  5. VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection · Score 79
    title matched "repository-level";has PDF;has rich summary

2026-06-02

22 papers matched · Generated 2026-06-02 13:56:35 (Asia/Shanghai)
LM · 15 papers

《POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems》〔Evaluation / Application / Method〕: Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emerge…

  1. POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems · Score 192
    title matched "agent";summary matched "language model";summary matched "large language model"
  2. MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation · Score 184
    title matched "LLM";title matched "agent";title matched "benchmark"
  3. Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling · Score 178
    title matched "LLM";summary matched "language model";summary matched "large language model"
  4. AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design · Score 165
    title matched "language model";title matched "reasoning";title matched "agent"
  5. Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation · Score 163
    title matched "language model";title matched "large language model";summary matched "LLM"

《Jailbreaking Multimodal Large Language Models using Multi-Clip Video》〔Data / Application / Method〕: As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for mal…

  1. Jailbreaking Multimodal Large Language Models using Multi-Clip Video · Score 63
    title matched "jailbreak";has PDF;has rich summary
  2. SentGuard: Sentence-Level Streaming Guardrails for Large Language Models · Score 62
    title matched "guardrail";has PDF;has rich summary
  3. AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations · Score 61
    summary matched "prompt injection";summary matched "indirect prompt injection";has PDF
  4. SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning · Score 61
    title matched "agent defense";has PDF;has rich summary
  5. SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents · Score 44
    summary matched "agent security";has PDF;has rich summary

《SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction》〔Evaluation / Application / Method〕: Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, re…

  1. SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction · Score 47
    summary matched "coding agent";has PDF;has rich summary

2026-05-29

21 papers matched · Generated 2026-05-29 13:18:32 (Asia/Shanghai)
LM · 15 papers

《FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations》〔Evaluation / Method〕: Recently, large language models (LLMs) have achieved superior performance in static f…

  1. FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations · Score 232
    title matched "LLM";title matched "reasoning";title matched "benchmark"
  2. CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models · Score 196
    title matched "language model";title matched "large language model";title matched "reasoning"
  3. Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning · Score 196
    title matched "language model";title matched "large language model";title matched "reasoning"
  4. Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach · Score 192
    title matched "language model";title matched "large language model";summary matched "LLM"
  5. KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs · Score 192
    title matched "agent";title matched "benchmark";summary matched "language model"

《Provably Secure Agent Guardrail》〔Evaluation / Application / Method〕: As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a funda…; 《Robust an…

  1. Provably Secure Agent Guardrail · Score 120
    title matched "secure agent";title matched "guardrail";has PDF
  2. Robust and Efficient Guardrails with Latent Reasoning · Score 80
    title matched "guardrail";has PDF;has rich summary
  3. AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security · Score 58
    summary matched "guardrail";has PDF;has rich summary
  4. Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures · Score 58
    summary matched "jailbreak";has PDF;has rich summary

《Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software》〔Application / Method〕: Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist sup…

  1. Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software · Score 48
    summary matched "coding agent";has PDF;has rich summary
  2. Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas · Score 45
    summary matched "coding agent";has PDF;has rich summary

2026-05-28

21 papers matched · Generated 2026-05-28 13:15:52 (Asia/Shanghai)
LM · 15 papers

《MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems》〔Evaluation / Method〕: Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unr…

  1. MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems · Score 199
    title matched "language model";title matched "large language model";summary matched "LLM"
  2. Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability · Score 184
    title matched "LLM";title matched "reasoning";title matched "evaluation"
  3. TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning · Score 181
    title matched "LLM";title matched "reasoning";summary matched "language model"
  4. Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents · Score 180
    title matched "LLM";title matched "agent";summary matched "language model"
  5. The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic · Score 177
    title matched "evaluation";summary matched "language model";summary matched "large language model"

《Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents》〔Data / Method〕: Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software…

  1. Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents · Score 70
    title matched "computer-use agent";has PDF;has rich summary
  2. Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests · Score 47
    summary matched "jailbreak";has PDF;has rich summary
  3. The Ethics of LLM Sandbox and Persona Dynamics · Score 46
    summary matched "guardrail";has PDF;has rich summary
  4. LACUNA: Safe Agents as Recursive Program Holes · Score 46
    summary matched "prompt injection";has PDF;has rich summary
  5. Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem · Score 45
    summary matched "data exfiltration";has PDF;has rich summary

《Calibrating Conservatism for Scalable Oversight》〔Method〕: Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful…

  1. Calibrating Conservatism for Scalable Oversight · Score 48
    summary matched "SWE-bench";has PDF;has rich summary

2026-05-27

22 papers matched · Generated 2026-05-27 13:23:19 (Asia/Shanghai)
LM · 15 papers

《Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry》〔Evaluation / Data / Application / Method〕: Key knowledge for steel-industry volatile organic compounds (VOCs) governance is s…

  1. Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry · Score 214
    title matched "LLM";title matched "reasoning";summary matched "language model"
  2. Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation · Score 201
    title matched "reasoning";title matched "agent";title matched "benchmark"
  3. Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments · Score 176
    title matched "agent";summary matched "language model";summary matched "large language model"
  4. VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions · Score 175
    title matched "agent";summary matched "language model";summary matched "large language model"
  5. MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation · Score 164
    title matched "agent";title matched "evaluation";summary matched "language model"

《EviACT: An Evidence-to-Action Framework for Agentic Program Repair》〔Evaluation / Method〕: LLM-based agents have moved automated program repair (APR) from fixed-context patch generation to interactive repository-level repair. Howeve…

  1. EviACT: An Evidence-to-Action Framework for Agentic Program Repair · Score 122
    summary matched "guardrail";has PDF;has rich summary
  2. Governed Evolution of Agent Runtimes through Executable Operational Cognition · Score 70
    title matched "agent runtime";has PDF;has rich summary
  3. Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals · Score 65
    title matched "prompt injection";has PDF;has rich summary
  4. BAIT: Boundary-Guided Disclosure Escalation via Self-Conditioned Reasoning · Score 45
    summary matched "jailbreak";has PDF;has rich summary
  5. AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian · Score 43
    summary matched "guardrail";has PDF;has rich summary

Terminal and SWE Agents: no new matching papers today.

2026-05-26

18 papers matched · Generated 2026-05-26 13:09:24 (Asia/Shanghai)
LM · 15 papers

《Automated Benchmark Auditing for AI Agents and Large Language Models》〔Evaluation / Data / Method〕: Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often co…

  1. Automated Benchmark Auditing for AI Agents and Large Language Models · Score 244
    title matched "language model";title matched "large language model";title matched "agent"
  2. Causal methods for LLM development and evaluation · Score 211
    title matched "LLM";title matched "evaluation";summary matched "language model"
  3. PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction · Score 202
    title matched "LLM";title matched "reasoning";title matched "agent"
  4. Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning · Score 197
    title matched "LLM";title matched "agent";summary matched "language model"
  5. When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation · Score 180
    title matched "LLM";title matched "agent";summary matched "language model"

《CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents》〔Evaluation / Data / Application / Method〕: Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use,…

  1. CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents · Score 62
    title matched "computer-use agent";has PDF;has rich summary
  2. How Agentic AI Coding Assistants Become the Attacker's Shell · Score 44
    summary matched "prompt injection";has PDF;has rich summary
  3. AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions · Score 41
    summary matched "computer-use agent";has PDF;has rich summary

Terminal and SWE Agents: no new matching papers today.

2026-05-22

21 papers matched · Generated 2026-05-22 13:08:19 (Asia/Shanghai)
LM · 15 papers

《Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents》〔Evaluation / Method〕: Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses…

  1. Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents · Score 196
    title matched "LLM";title matched "agent";title matched "evaluation"
  2. Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety · Score 192
    title matched "agent";title matched "benchmark";summary matched "language model"
  3. ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning · Score 192
    title matched "reasoning";title matched "benchmark";summary matched "LLM"
  4. LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance · Score 188
    title matched "reasoning";summary matched "language model";summary matched "large language model"
  5. From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment · Score 174
    title matched "LLM";title matched "alignment";summary matched "language model"

《DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback》〔Evaluation / Application / Method〕: LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learn…

  1. DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback · Score 48
    summary matched "agent sandbox";has PDF;has rich summary
  2. HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools · Score 47
    summary matched "agent runtime";has PDF;has rich summary
  3. Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents · Score 46
    summary matched "guardrail";has PDF;has rich summary

《"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution》〔Method〕: Recent advances in coding agents have shown remarkable progress in software issue resolution. In pract…

  1. "Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution · Score 125
    title matched "coding agent";title matched "issue resolution";summary matched "SWE-bench"
  2. TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks · Score 45
    summary matched "Terminal-Bench";has PDF;has rich summary
  3. Why Are Agentic Pull Requests Merged or Rejected? An Empirical Study · Score 45
    summary matched "coding agent";has PDF;has rich summary

2026-05-21

17 papers matched · Generated 2026-05-21 13:14:24 (Asia/Shanghai)
LM · 15 papers

《Tracing the ongoing emergence of human-like reasoning in Large Language Models》〔Method〕: Humans effortlessly go beyond literal meanings: If you mow the lawn, I will give you fifty dollars, is typically understood as implyin…

  1. Tracing the ongoing emergence of human-like reasoning in Large Language Models · Score 184
    title matched "language model";title matched "large language model";title matched "reasoning"
  2. What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema · Score 167
    title matched "LLM";title matched "agent";title matched "benchmark"
  3. Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution · Score 164
    title matched "LLM";title matched "RAG";summary matched "language model"
  4. LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models · Score 163
    title matched "language model";title matched "large language model";summary matched "LLM"
  5. WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata · Score 160
    title matched "benchmark";summary matched "language model";summary matched "large language model"

《Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling》〔Application / Method〕: Computer-use agents (CUA) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by gene…

  1. Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling · Score 48
    summary matched "computer-use agent";has PDF;has rich summary

《SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents》〔Evaluation / Method〕: As long-horizon coding agents produce more code than any developer can review, oversight collapses onto a single surface: the automated test s…

  1. SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents · Score 69
    title matched "coding agent";has PDF;has rich summary

2026-05-20

27 papers matched · Generated 2026-05-20 13:10:58 (Asia/Shanghai)
LM · 15 papers

《MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models》〔Evaluation / Method〕: Large language models (LLMs) are increasingly integrated into high-stakes decision-making. Inspired by the theory of inattention…

  1. MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models · Score 236
    title matched "language model";title matched "large language model";title matched "reasoning"
  2. OpenCompass: A Universal Evaluation Platform for Large Language Models · Score 232
    title matched "language model";title matched "large language model";title matched "evaluation"
  3. Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking · Score 214
    title matched "LLM";title matched "agent";title matched "evaluation"
  4. SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models · Score 214
    title matched "language model";title matched "large language model";title matched "evaluation"
  5. LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening · Score 196
    title matched "LLM";title matched "reasoning";title matched "benchmark"

《Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models》〔Evaluation / Method〕: Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generat…

  1. Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models · Score 80
    title matched "jailbreak";has PDF;has rich summary
  2. OpenComputer: Verifiable Software Worlds for Computer-Use Agents · Score 80
    title matched "computer-use agent";has PDF;has rich summary
  3. Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains · Score 80
    title matched "guardrail";has PDF;has rich summary
  4. A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents · Score 58
    summary matched "agent runtime";has PDF;has rich summary
  5. Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents · Score 58
    summary matched "policy enforcement";has PDF;has rich summary

《Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study》〔Evaluation / Application / Method〕: As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates holding the tar…

  1. Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study · Score 80
    title matched "coding agent";has PDF;has rich summary
  2. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents · Score 58
    summary matched "coding agent";has PDF;has rich summary
  3. RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades · Score 58
    summary matched "coding agent";has PDF;has rich summary
  4. The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next · Score 58
    summary matched "SWE-bench";has PDF;has rich summary
  5. Toward Training Superintelligent Software Agents through Self-Play SWE-RL · Score 58
    summary matched "SWE-bench";has PDF;has rich summary

2026-05-19

22 papers matched · Generated 2026-05-19 13:08:04 (Asia/Shanghai)
LM: 15 papers

"CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark" [Evaluation / Data / Methods]: Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view pe…

  1. CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark · Score 217
    title matched "LLM";title matched "benchmark";summary matched "language model"
  2. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science · Score 181
    title matched "LLM";title matched "benchmark";summary matched "language model"
  3. MA$^{2}$P: A Meta-Cognitive Autonomous Intelligent Agents Framework for Complex Persuasion · Score 176
    title matched "agent";summary matched "language model";summary matched "large language model"
  4. SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents · Score 168
    title matched "LLM";title matched "agent";title matched "benchmark"
  5. LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems · Score 158
    title matched "agent";summary matched "LLM";summary matched "reasoning"

"An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments" [Methods]: LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external…

  1. An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments · Score 98
    title matched "prompt injection";summary matched "indirect prompt injection";summary matched "jailbreak"
  2. Multilingual jailbreaking of LLMs using low-resource languages · Score 82
    title matched "jailbreak";summary matched "guardrail";has PDF
  3. Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks · Score 68
    summary matched "prompt injection";has PDF;has rich summary
  4. Acoustic Interference: A New Paradigm Weaponizing Acoustic Latent Semantic for Universal Jailbreak against Large Audio Language Models · Score 63
    title matched "jailbreak";has PDF;has rich summary

"Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents" [Applications / Methods]: Behavioral studies of LLM-based software engineering agents extract operational rules about which traject…

  1. Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents · Score 83
    title matched "software engineering agent";summary matched "SWE-bench";has PDF
  2. SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution · Score 62
    summary matched "Terminal-Bench";summary matched "SWE-bench";has PDF
  3. Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents · Score 48
    summary matched "coding agent";has PDF;has rich summary

2026-05-18

15 papers matched · Generated 2026-05-18 13:13:17 (Asia/Shanghai)
LM: 15 papers

"CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency" [Evaluation / Applications / Methods]: This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate…

  1. CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency · Score 236
    title matched "LLM";title matched "agent";title matched "benchmark"
  2. FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models · Score 232
    title matched "language model";title matched "large language model";title matched "benchmark"
  3. MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models · Score 214
    title matched "language model";title matched "large language model";title matched "benchmark"
  4. Large Language Models Could Be Rote Learners · Score 192
    title matched "language model";title matched "large language model";summary matched "LLM"
  5. Look Before You Leap: Autonomous Exploration for LLM Agents · Score 192
    title matched "LLM";title matched "agent";summary matched "language model"

Agent Runtime Security: no new matching papers today.

Terminal and SWE Agents: no new matching papers today.

2026-05-15

17 papers matched · Generated 2026-05-15 14:57:29 (Asia/Shanghai)
LM: 11 papers

"Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks" [Evaluation / Data / Methods]: We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4$\times…

  1. Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks · Score 202
    title matched "LLM";title matched "RAG";title matched "benchmark"
  2. SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning · Score 161
    title matched "LLM";title matched "reasoning";summary matched "language model"
  3. Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use · Score 161
    title matched "LLM";title matched "reasoning";summary matched "language model"
  4. APWA: A Distributed Architecture for Parallelizable Agentic Workflows · Score 158
    title matched "agent";summary matched "language model";summary matched "large language model"
  5. MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory · Score 144
    title matched "agent";title matched "evaluation";summary matched "reasoning"

Agent Runtime Security: no new matching papers today.

"CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing" [Evaluation / Methods]: Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking che…

  1. CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing · Score 115
    title matched "code agent";summary matched "Terminal-Bench";summary matched "SWE-bench"
  2. Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation · Score 97
    title matched "repository-level";summary matched "coding agent";has PDF
  3. SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades · Score 97
    title matched "coding agent";summary matched "issue resolution";has PDF
  4. Documentation-Guided Agentic Codebase Migration from C to Rust · Score 75
    summary matched "coding agent";summary matched "repository-level";has PDF
  5. Comparing Developer and LLM Biases in Code Evaluation · Score 57
    summary matched "code editing";has PDF;has rich summary

2026-05-14

17 papers matched · Generated 2026-05-14 12:52:54 (Asia/Shanghai)
LM: 15 papers

"RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation" [Evaluation / Data / Methods]: Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicia…

  1. RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation · Score 200
    title matched "LLM";title matched "agent";title matched "benchmark"
  2. MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling · Score 163
    title matched "language model";title matched "large language model";summary matched "LLM"
  3. An LLM-Based System for Argument Reconstruction · Score 160
    title matched "LLM";summary matched "language model";summary matched "large language model"
  4. OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research · Score 157
    title matched "agent";summary matched "language model";summary matched "large language model"
  5. (How) Do Large Language Models Understand High-Level Message Sequence Charts? · Score 145
    title matched "language model";title matched "large language model";summary matched "LLM"

"Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents" [Evaluation / Applications / Methods]: Always-on AI agents (OpenClaw, Hermes Agent) run as a single persistent process under the owner's iden…

  1. Sleeper Channels and Provenance Gates: Persistent Prompt Injection in Always-on Autonomous AI Agents · Score 66
    title matched "prompt injection";has PDF;has rich summary
  2. LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs · Score 63
    title matched "guardrail";has PDF;has rich summary

2026-05-13

17 papers matched · Generated 2026-05-13 12:54:34 (Asia/Shanghai)
LM: 15 papers

"MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering" [Evaluation / Data / Applications / Methods]: Evaluating large language models (LLMs) in the biomedical domain requi…

  1. MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering · Score 225
    title matched "LLM";title matched "reasoning";title matched "benchmark"
  2. ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models · Score 222
    title matched "language model";title matched "large language model";title matched "alignment"
  3. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models · Score 207
    title matched "language model";title matched "large language model";title matched "reasoning"
  4. Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring · Score 182
    title matched "language model";title matched "large language model";summary matched "LLM"
  5. Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering · Score 177
    title matched "evaluation";summary matched "language model";summary matched "large language model"

"Metaphor Is Not All Attention Needs" [Applications / Methods]: Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post…; "A microser…

  1. Metaphor Is Not All Attention Needs · Score 44
    summary matched "jailbreak";has PDF;has rich summary
  2. A microservices-based endpoint monitoring platform with predictive NLP models for real-time security and hate-speech risk alerting · Score 42
    summary matched "data exfiltration";has PDF;has rich summary

2026-05-12

21 papers matched · Generated 2026-05-12 12:42:08 (Asia/Shanghai)
LM: 15 papers

"WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation" [Evaluation / Methods]: Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) ha…

  1. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation · Score 205
    title matched "agent";title matched "benchmark";title matched "evaluation"
  2. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox · Score 185
    title matched "LLM";title matched "agent";title matched "evaluation"
  3. AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents · Score 168
    title matched "LLM";title matched "agent";title matched "benchmark"
  4. LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments · Score 167
    title matched "LLM";title matched "agent";title matched "benchmark"
  5. Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights · Score 163
    title matched "language model";title matched "evaluation";summary matched "large language model"

"Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization" [Evaluation / Methods]: Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-mod…

  1. Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization · Score 69
    title matched "jailbreak";has PDF;has rich summary
  2. Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs · Score 67
    title matched "guardrail";has PDF;has rich summary
  3. Re-Triggering Safeguards within LLMs for Jailbreak Detection · Score 67
    title matched "jailbreak";has PDF;has rich summary
  4. Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing · Score 67
    title matched "jailbreak";has PDF;has rich summary
  5. RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems · Score 48
    summary matched "prompt injection";has PDF;has rich summary

2026-05-11

0 papers matched · Generated 2026-05-11 13:03:07 (Asia/Shanghai)
LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: no new matching papers today.

2026-05-10

0 papers matched · Generated 2026-05-10 12:50:04 (Asia/Shanghai)
LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: no new matching papers today.

2026-05-09

0 papers matched · Generated 2026-05-09 12:29:32 (Asia/Shanghai)
LM: 0 papers

LM: no new matching papers today.

Agent Runtime Security: no new matching papers today.

2026-05-08

13 papers matched · Generated 2026-05-08 14:15:32 (Asia/Shanghai)
LM: 12 papers

"LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG" [Evaluation / Data / Applications / Methods]: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question…

  1. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG · Score 231
    title matched "reasoning";title matched "agent";title matched "RAG"
  2. MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents · Score 213
    title matched "LLM";title matched "agent";title matched "benchmark"
  3. Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity · Score 213
    title matched "LLM";title matched "alignment";title matched "evaluation"
  4. Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback · Score 213
    title matched "LLM";title matched "agent";title matched "evaluation"
  5. BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models · Score 209
    title matched "language model";title matched "large language model";summary matched "LLM"

"Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation" [Evaluation / Applications / Methods]: Self-hosted computer-use agents (SHCUAs), such as OpenClaw, combine natural-language interaction with direct acce…

  1. Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation · Score 102
    title matched "computer-use agent";summary matched "prompt injection";summary matched "indirect prompt injection"

2026-05-07

15 papers matched · Generated 2026-05-07 12:38:06 (Asia/Shanghai)
LM: 15 papers

"Misaligned by Reward: Socially Undesirable Preferences in LLMs" [Evaluation / Data / Methods]: Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, exis…

  1. Misaligned by Reward: Socially Undesirable Preferences in LLMs · Score 194
    title matched "LLM";summary matched "language model";summary matched "large language model"
  2. SoK: Robustness in Large Language Models against Jailbreak Attacks · Score 181
    title matched "language model";title matched "large language model";summary matched "LLM"
  3. Why Expert Alignment Is Hard: Evidence from Subjective Evaluation · Score 161
    title matched "alignment";title matched "evaluation";summary matched "language model"
  4. KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels · Score 143
    title matched "LLM";title matched "benchmark";summary matched "RAG"
  5. Low-Cost Black-Box Detection of LLM Hallucinations via Dynamical System Prediction · Score 142
    title matched "LLM";summary matched "language model";summary matched "large language model"

2026-05-06

15 papers matched · Generated 2026-05-06 12:37:23 (Asia/Shanghai)
LM: 15 papers

"Safety and accuracy follow different scaling laws in clinical large language models" [Evaluation / Applications / Methods]: Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time comput…

  1. Safety and accuracy follow different scaling laws in clinical large language models · Score 201
    title matched "language model";title matched "large language model";summary matched "LLM"
  2. Say the Mission, Execute the Swarm: Agent-Enhanced LLM Reasoning in the Web-of-Drones · Score 183
    title matched "LLM";title matched "reasoning";title matched "agent"
  3. OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking · Score 161
    title matched "LLM";title matched "benchmark";summary matched "language model"
  4. Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems · Score 147
    title matched "reasoning";title matched "agent";summary matched "benchmark"
  5. Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus · Score 146
    title matched "language model";title matched "large language model";title matched "benchmark"

2026-05-05

15 papers matched · Generated 2026-05-05 12:20:54 (Asia/Shanghai)
LM: 15 papers

"StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models" [Evaluation / Data / Methods]: Static benchmarks for LLMs are increasingly compromised by contamination and overfitting especia…

  1. StressEval: Failure-Driven Dynamic Benchmarking for Knowledge-Intensive Reasoning in Large Language Models · Score 219
    title matched "language model";title matched "large language model";title matched "reasoning"
  2. Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks · Score 211
    title matched "LLM";title matched "benchmark";summary matched "language model"
  3. Maistros: A Greek Large Language Model Adapted Through Knowledge Distillation From Large Reasoning Models · Score 197
    title matched "language model";title matched "large language model";title matched "reasoning"
  4. MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation · Score 197
    title matched "language model";title matched "large language model";title matched "alignment"
  5. OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice · Score 197
    title matched "language model";title matched "large language model";title matched "LLM"

2026-05-04

0 papers matched · Generated 2026-05-04 12:44:55 (Asia/Shanghai)
LM: 0 papers

LM: no new matching papers today.

2026-05-03

0 papers matched · Generated 2026-05-03 12:44:59 (Asia/Shanghai)
LM: 0 papers

LM: no new matching papers today.

2026-05-02

0 papers matched · Generated 2026-05-02 12:22:26 (Asia/Shanghai)
LM: 0 papers

LM: no new matching papers today.

2026-05-01

15 papers matched · Generated 2026-05-01 12:53:56 (Asia/Shanghai)
LM: 15 papers

"Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents" [Evaluation / Applications / Methods]: We present Collabora…

  1. Collaborative Agent Reasoning Engineering (CARE): A Three-Party Design Methodology for Systematically Engineering AI Agents with Subject Matter Experts, Developers, and Helper Agents · Score 193
    title matched "reasoning";title matched "agent";summary matched "language model"
  2. What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design · Score 185
    title matched "agent";title matched "benchmark";title matched "evaluation"
  3. Rethinking Agentic Reinforcement Learning In Large Language Models · Score 182
    title matched "language model";title matched "large language model";title matched "agent"
  4. TopBench: A Benchmark for Implicit Prediction and Reasoning over Tabular Question Answering · Score 181
    title matched "reasoning";title matched "benchmark";summary matched "language model"
  5. LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning · Score 180
    title matched "LLM";title matched "reasoning";summary matched "language model"

2026-04-30

0 papers matched · Generated 2026-04-30 14:35:40 (Asia/Shanghai)
LM: 0 papers

LM: no new matching papers today.

2026-04-29

15 papers matched · Generated 2026-04-29 12:26:28 (Asia/Shanghai)
LM: 15 papers

"LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation" [Evaluation / Data / Methods]: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across hetero…

  1. LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation · Score 197
    title matched "LLM";title matched "evaluation";summary matched "language model"
  2. Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models · Score 178
    title matched "language model";title matched "large language model";summary matched "LLM"
  3. DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios · Score 165
    title matched "agent";title matched "benchmark";summary matched "LLM"
  4. From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling · Score 164
    title matched "LLM";title matched "agent";summary matched "language model"
  5. SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing? · Score 158
    title matched "agent";summary matched "language model";summary matched "large language model"

2026-04-28

0 papers matched · Generated 2026-04-28 15:39:05 (Asia/Shanghai)
LM: 0 papers

LM: no new matching papers today.

2026-04-26

1 paper matched · Generated 2026-04-26 11:52:13 (Asia/Shanghai)
LLM: 0 papers

LLM: no new matching papers today.

Vision: 0 papers

Vision: no new matching papers today.

PubMed AI: no new matching papers today.

"SOS::LM Sequence Initializer: Semantic Process Architecture for Controlled, Traceable, and Structured Language Model Outputs" [Evaluation / Applications / Methods]: SOS::LM (Schloemer-Notation ::) defines a semantic process architecture for la…

  1. SOS::LM Sequence Initializer: Semantic Process Architecture for Controlled, Traceable, and Structured Language Model Outputs · Score 100
    title matched "language model";summary matched "agent";has DOI

2026-04-25

5 papers matched · Generated 2026-04-25 11:28:34 (Asia/Shanghai)
LLM: 0 papers

LLM: no new matching papers today.

Vision: 0 papers

Vision: no new matching papers today.

"Establishing Clinically Significant Change Benchmarks for the Moral Injury Outcome Scale in VA Behavioral Health Settings." [Evaluation / Methods]: This study aimed to establish benchmarks for clinically significant change for the Mo…

  1. Establishing Clinically Significant Change Benchmarks for the Moral Injury Outcome Scale in VA Behavioral Health Settings. · Score 102
    title matched "benchmark";title matched "clinical";has DOI
  2. Generalist large language models in a specialized world: Evidence from the Italian national medical education pathway. · Score 90
    title matched "language model";summary matched "clinical";has DOI
  3. Standardization of clinical trials subject ID schematics: A portfolio-wide model to enhance data integrity and regulatory compliance. · Score 81
    title matched "clinical";summary matched "benchmark";has DOI
  4. Considerations about the proliferation of large language model chatbots and youth mental health. · Score 80
    title matched "language model";summary matched "clinical";has DOI
  5. The application of large language models in meteorology graduate research: current status, impact, and prospects. · Score 72
    title matched "language model";has DOI;has rich summary

OpenAlex AI: no new matching papers today.

2026-04-24

30 papers matched · Generated 2026-04-24 11:46:20 (Asia/Shanghai)
LLM: 15 papers

"Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows" [Evaluation / Applications / Methods]: The Model Context Protocol (MCP) has become a common interface…

  1. Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows · Score 106
    title matched "agent";summary matched "reasoning";summary matched "benchmark"
  2. Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems · Score 106
    title matched "agent";summary matched "reasoning";summary matched "benchmark"
  3. AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use · Score 102
    title matched "agent";summary matched "reasoning";summary matched "benchmark"
  4. Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability · Score 90
    title matched "evaluation";summary matched "benchmark";has PDF
  5. Nemobot Games: Crafting Strategic AI Gaming Agents for Interactive Learning with Large Language Models · Score 90
    title matched "agent";summary matched "reasoning";has PDF
Vision: 10 papers

"Pre-process for segmentation task with nonlinear diffusion filters" [Methods]: This paper deals with the case of using nonlinear diffusion filters to obtain piecewise constant images as a previous process for segmentation tec…

  1. Pre-process for segmentation task with nonlinear diffusion filters · Score 102
    title matched "diffusion";title matched "segmentation";has PDF
  2. KD-CVG: A Knowledge-Driven Approach for Creative Video Generation · Score 79
    title matched "video generation";summary matched "multimodal";has PDF
  3. Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation · Score 77
    title matched "video generation";summary matched "diffusion";has PDF
  4. Seeing Fast and Slow: Learning the Flow of Time in Videos · Score 68
    summary matched "video generation";summary matched "multimodal";has PDF
  5. DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion · Score 66
    title matched "diffusion";has PDF;has rich summary

"Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models." [Data / Applications / Methods]: Prompt learning has emerged as one of the most effective paradigms for adapting pre-trained vision language models (VLMs) to…

  1. Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models. · Score 93
    title matched "language model";summary matched "clinical";has DOI
  2. Clinical Knowledge-Guided PET/CT Lesion Segmentation with Interpretable Fusion of Metabolic and Structural Cues. · Score 93
    title matched "clinical";summary matched "benchmark";has DOI
  3. Accelerating real-world data collection using large language models in rare neoplasms: a bone sarcoma example. · Score 81
    title matched "language model";summary matched "clinical";has DOI
  4. GATE: Graph and Text Exchange for Zero-Shot ECG Classification with LLM Prompts. · Score 71
    summary matched "language model";summary matched "clinical";has DOI
  5. Learning from Prototypes: Contrastive Learning with Prior-Aware Multi-Label Chest X-ray Classification. · Score 71
    summary matched "language model";summary matched "clinical";has DOI

OpenAlex AI: no new matching papers today.

2026-04-23

29 papers matched · Generated 2026-04-23 11:42:13 (Asia/Shanghai)
LLM: 15 papers

"OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model" [Evaluation / Methods]: Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Neverth…

  1. OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model · Score 129
    title matched "reasoning";title matched "benchmark";summary matched "evaluation"
  2. V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization · Score 125
    title matched "reasoning";summary matched "alignment";summary matched "benchmark"
  3. ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence · Score 124
    title matched "benchmark";summary matched "reasoning";summary matched "alignment"
  4. Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation · Score 106
    title matched "reasoning";summary matched "alignment";summary matched "benchmark"
  5. Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows · Score 105
    title matched "agent";summary matched "reasoning";summary matched "benchmark"
Vision: 9 papers

"LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model" [Methods]: We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal underst…

  1. LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model · Score 111
    title matched "diffusion";title matched "multimodal";has PDF
  2. Hallucination Early Detection in Diffusion Models · Score 75
    title matched "diffusion";has DOI;has PDF
  3. ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control · Score 72
    title matched "diffusion";has PDF;has rich summary
  4. Amodal SAM: A Unified Amodal Segmentation Framework with Generalization · Score 70
    title matched "segmentation";has PDF;has rich summary
  5. GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers · Score 70
    title matched "diffusion";has PDF;has rich summary

"Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports." [Evaluation / Applications / Methods]: OBJECTIVES: Coronary computed tomography angiography (CCTA) has become…

  1. Comparative evaluation of large language models for generating CAD-RADS 2.0-compliant diagnostic conclusions in cardiac CT reports. · Score 87
    title matched "language model";summary matched "clinical";has DOI
  2. Immune checkpoint inhibitors in POLE/POLD1 proofreading-deficient CRC: from molecular basis to clinical practice and future directions. · Score 81
    title matched "clinical";summary matched "benchmark";has DOI
  3. Establishing procedure-specific minimal clinically important difference and patient acceptable symptom state thresholds after anterior combined latissimus dorsi and teres major tendon transfer for irreparable anterosuperior cuff tears: minimum 5-year outcomes. · Score 81
    title matched "clinical";summary matched "benchmark";has DOI
  4. Diagnostic Modalities and Nodal Staging in High-Risk Cutaneous Squamous Cell Carcinoma. · Score 65
    summary matched "benchmark";summary matched "clinical";has DOI
  5. Defining the learning curve of multi-vessel MIDCAB using CUSUM analysis. · Score 62
    summary matched "benchmark";summary matched "clinical";has DOI

OpenAlex AI: no new matching papers today.

2026-04-22

30 papers matched · Generated 2026-04-22 11:37:03 (Asia/Shanghai)
LLM: 15 papers

"Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents" [Evaluation / Applications / Methods]: Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization)…

  1. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents · Score 162
    title matched "agent";title matched "alignment";summary matched "reasoning"
  2. Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps · Score 149
    title matched "agent";title matched "benchmark";title matched "evaluation"
  3. Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment · Score 145
    title matched "agent";title matched "alignment";summary matched "reasoning"
  4. Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views · Score 130
    title matched "reasoning";title matched "alignment";summary matched "benchmark"
  5. Revac: A Social Deduction Reasoning Agent · Score 127
    title matched "agent";title matched "reasoning";summary matched "evaluation"
Vision: 10 papers

"PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving" [Evaluation / Method]: This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segme…

  1. PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving · Score 106
    title matched "multimodal";title matched "segmentation";has PDF
  2. Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval · Score 100
    title matched "diffusion";title matched "multimodal";has PDF
  3. ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis · Score 90
    title matched "video generation";summary matched "diffusion";has PDF
  4. MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation · Score 89
    title matched "video generation";summary matched "diffusion";has PDF
  5. MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention · Score 89
    title matched "segmentation";summary matched "diffusion";has PDF

"Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study." [Evaluation / Application / Method]: BACKGROUND: The American Society of Anesthesiologists Phy…

  1. Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study. · Score 111
    title matched "language model";summary matched "benchmark";summary matched "clinical"
  2. Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study. · Score 109
    title matched "language model";title matched "clinical";has DOI
  3. Clinical Model Autophagy: The Risk of Interpretative Drift in Recursive Medical AI. · Score 93
    title matched "clinical";summary matched "language model";has DOI
  4. APSevLM: Acute Pancreatitis Severity Language Model. · Score 90
    title matched "language model";summary matched "clinical";has DOI
  5. Comparing Clinical Outcomes in Cardiac Surgical Patients Who Receive Sugammadex Versus Placebo: A Prospective Randomized Blinded Controlled Trial. · Score 88
    title matched "clinical";summary matched "benchmark";has DOI

OpenAlex AI: no new matching papers today.

2026-04-21

30 papers matched · Generated 2026-04-21 11:40:46 (Asia/Shanghai)
LLM: 15 papers

"MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval" [Evaluation / Data / Method]: Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing…

  1. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval · Score 112
    title matched "reasoning";title matched "benchmark";has PDF
  2. Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion · Score 108
    title matched "benchmark";summary matched "reasoning";summary matched "evaluation"
  3. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents · Score 107
    title matched "agent";summary matched "benchmark";summary matched "evaluation"
  4. MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation · Score 107
    title matched "agent";summary matched "reasoning";summary matched "benchmark"
  5. OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation · Score 106
    title matched "reasoning";summary matched "agent";summary matched "benchmark"
Vision: 10 papers

"AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation" [Application / Method]: Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing…

  1. AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation · Score 87
    title matched "video generation";summary matched "diffusion";has PDF
  2. DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery · Score 85
    title matched "diffusion";summary matched "segmentation";has PDF
  3. Weakly-Supervised Referring Video Object Segmentation through Text Supervision · Score 76
    title matched "segmentation";summary matched "multimodal";has PDF
  4. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation · Score 72
    title matched "segmentation";has PDF;has rich summary
  5. UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models · Score 71
    title matched "diffusion";has PDF;has rich summary

"Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients." [Evaluation / Data / Application / Method]: BACKGROUND: Clinical trial e…

  1. Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients. · Score 101
    title matched "clinical";summary matched "language model";summary matched "benchmark"
  2. Developing and evaluating definitions of real-world clinical endpoints for patients with early-stage triple-negative breast cancer using a United States of America secondary database. · Score 83
    title matched "clinical";summary matched "benchmark";has DOI
  3. Investigating fine-tuning versus zero-shot learning for general large language models when predicting cancer survival from initial oncology consultation documents. · Score 83
    title matched "language model";summary matched "clinical";has DOI
  4. A Comparative Evaluation of Three Large Language Models for Parent-Centered Questions About Anorexia Nervosa. · Score 83
    title matched "language model";summary matched "clinical";has DOI
  5. Impacts of Multidisciplinary Lung Cancer Meeting Presentation in a Clinical Quality Registry. · Score 66
    title matched "clinical";has DOI;has rich summary

OpenAlex AI: no new matching papers today.

2026-04-20

3 papers matched · Generated 2026-04-20 11:48:52 (Asia/Shanghai)
LLM: 0 papers

LLM: no new matching papers today.

Vision: 0 papers

Vision: no new matching papers today.

"Medic Training at Military-Civilian Partnerships-A Narrative Review." [Evaluation / Application / Method]: INTRODUCTION: Military-Civilian Partnerships (MCP) were developed to mitigate degradation of combat medical readiness during peacetime…

  1. Medic Training at Military-Civilian Partnerships-A Narrative Review. · Score 62
    summary matched "benchmark";summary matched "clinical";has DOI

"Artificial Intelligence And The Transformation of Labor Markets" [Method]: The rapid advancement of artificial intelligence (AI) technologies, particularly generative AI and large language models, has reignited debates about…

  1. Artificial Intelligence And The Transformation of Labor Markets · Score 60
    summary matched "language model";has DOI;has rich summary
  2. Artificial Intelligence And The Transformation of Labor Markets · Score 60
    summary matched "language model";has DOI;has rich summary

2026-04-19

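0 papers matched · Generated 2026-04-19 11:46:32 (Asia/Shanghai)
LLM: 0 papers

LLM: no new matching papers today.

Vision: 0 papers

Vision: no new matching papers today.

PubMed AI: no new matching papers today.

OpenAlex AI: no new matching papers today.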

2026-04-18

5 papers matched · Generated 2026-04-18 11:26:55 (Asia/Shanghai)
LLM: 0 papers

LLM: no new matching papers today.

Vision: 0 papers

Vision: no new matching papers today.

"Pretraining effective T5 generative models for clinical and biomedical applications." [Evaluation / Data / Application / Method]: This paper presents a study of the impact of corpus selection and vocabulary design on the performance of T5-base…

  1. Pretraining effective T5 generative models for clinical and biomedical applications. · Score 108
    title matched "clinical";summary matched "language model";summary matched "benchmark"
  2. MILU: a consensus ensemble benchmark for multimodal medical imaging lecture understanding. · Score 82
    title matched "benchmark";summary matched "language model";has DOI
  3. Comparative performance of large language models and Drugs.com versus Lexicomp for antiseizure medication drug-drug interactions: A cross-sectional study with iterative prompting analysis. · Score 82
    title matched "language model";summary matched "clinical";has DOI
  4. Weakly Supervised Composed Object Re-Identification With Large Models. · Score 68
    summary matched "language model";summary matched "benchmark";has DOI
  5. An explainable multi-head attention network for healthcare IoT threat detection based on the MedDefender-MHAN framework. · Score 68
    summary matched "benchmark";summary matched "clinical";has DOI

OpenAlex AI: no new matching papers today.

2026-04-17

29 papers matched · Generated 2026-04-17 11:39:21 (Asia/Shanghai)
LLM: 15 papers

"CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas" [Evaluation / Method]: It is increasingly important that LLM agents interact effectively and safely with other goal-pursuing agents, yet, re…

  1. CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas · Score 130
    title matched "agent";title matched "benchmark";summary matched "reasoning"
  2. IE as Cache: Information Extraction Enhanced Agentic Reasoning · Score 124
    title matched "agent";title matched "reasoning";summary matched "benchmark"
  3. QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies · Score 123
    title matched "benchmark";summary matched "agent";summary matched "alignment"
  4. From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench · Score 122
    title matched "agent";summary matched "reasoning";summary matched "benchmark"
  5. An Axiomatic Benchmark for Evaluation of Scientific Novelty Metrics · Score 109
    title matched "benchmark";title matched "evaluation";has PDF
Vision: 9 papers

"SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation" [Application / Method]: Reliable uncertainty estimation is critical for medical image segmentation, where automated contours…

  1. SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation · Score 72
    title matched "segmentation";has PDF;has rich summary
  2. Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization · Score 70
    title matched "segmentation";has PDF;has rich summary
  3. Boundary-Centric Active Learning for Temporal Action Segmentation · Score 70
    title matched "segmentation";has PDF;has rich summary
  4. An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation · Score 70
    title matched "diffusion";has PDF;has rich summary
  5. RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework · Score 68
    summary matched "diffusion";summary matched "multimodal";has PDF

"Applying natural language processing and large language models to clinical notes for phenotyping and diagnosing rare diseases: a systematic review." [Evaluation / Data / Application / Method]: OBJECTIVES: Patients with rare diseases often face…

  1. Applying natural language processing and large language models to clinical notes for phenotyping and diagnosing rare diseases: a systematic review. · Score 113
    title matched "language model";title matched "clinical";has DOI
  2. Evaluation of large language models with clinical guidance for vetting outpatient magnetic resonance imaging lumbar spine referrals. · Score 107
    title matched "language model";title matched "clinical";has DOI
  3. From Image to Pixels: towards Fine-Grained Medical Vision-Language Models. · Score 106
    title matched "language model";summary matched "benchmark";summary matched "clinical"
  4. Targeted use of large language models for EHR-based computable phenotyping. · Score 93
    title matched "language model";summary matched "clinical";has DOI
  5. Dual perspectives on large language models in rheumatology: physician-rated quality and patient-centered usability of GPT-4o versus DeepSeek-V3. · Score 85
    title matched "language model";summary matched "clinical";has DOI

OpenAlex AI: no new matching papers today.

2026-04-16

30 papers matched · Generated 2026-04-16 11:43:00 (Asia/Shanghai)
LLM: 15 papers

"GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis" [Evaluation / Application / Method]: The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift…

  1. GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis · Score 162
    title matched "agent";title matched "benchmark";summary matched "reasoning"
  2. HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark · Score 127
    title matched "agent";title matched "benchmark";summary matched "evaluation"
  3. Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning · Score 120
    title matched "evaluation";summary matched "agent";summary matched "reasoning"
  4. LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning · Score 112
    title matched "reasoning";title matched "benchmark";has PDF
  5. Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis · Score 108
    title matched "reasoning";summary matched "benchmark";summary matched "evaluation"
Vision: 10 papers

"ROSE: Retrieval-Oriented Segmentation Enhancement" [Evaluation / Method]: Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inab…

  1. ROSE: Retrieval-Oriented Segmentation Enhancement · Score 90
    title matched "segmentation";summary matched "multimodal";has PDF
  2. Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models · Score 88
    title matched "multimodal";summary matched "segmentation";has PDF
  3. Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding · Score 78
    title matched "multimodal";summary matched "diffusion";has PDF
  4. DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer · Score 78
    title matched "diffusion";summary matched "video generation";has PDF
  5. Seedance 2.0: Advancing Video Generation for World Complexity · Score 72
    title matched "video generation";has PDF;has rich summary

"Augmenting Large Language Model With Prompt Engineering and Supervised Fine-Tuning in Non-Small Cell Lung Cancer Tumor-Node-Metastasis Staging: Framework Development and Validation." [Evaluation / Data / Application / Method]: BACKGROUND: Accu…

  1. Augmenting Large Language Model With Prompt Engineering and Supervised Fine-Tuning in Non-Small Cell Lung Cancer Tumor-Node-Metastasis Staging: Framework Development and Validation. · Score 107
    title matched "language model";summary matched "benchmark";summary matched "clinical"
  2. PKFAR: psychiatry knowledge-fused augmented reasoning with large language models. · Score 98
    title matched "language model";summary matched "benchmark";summary matched "clinical"
  3. Fact-Checking Large Language Model Responses to a Health Care Prompt: Comparative Study. · Score 91
    title matched "language model";summary matched "clinical";has DOI
  4. Fine-Tuned Large Language Models for Automated Radiology Impression Generation: A Multicenter Evaluation. · Score 86
    title matched "language model";summary matched "clinical";has DOI
  5. A Multi-AI Agent Framework for Interactive Neurosurgical Education and Evaluation: From Vignettes to Virtual Conversations. · Score 76
    summary matched "language model";summary matched "benchmark";summary matched "clinical"

OpenAlex AI: no new matching papers today.

2026-04-15

30 papers matched · Generated 2026-04-15 11:35:50 (Asia/Shanghai)
LLM: 15 papers

"Parallax: Why AI Agents That Think Must Never Act" [Evaluation / Application / Method]: Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise application…

  1. Parallax: Why AI Agents That Think Must Never Act · Score 107
    title matched "agent";summary matched "reasoning";summary matched "evaluation"
  2. Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents · Score 107
    title matched "agent";summary matched "reasoning";summary matched "benchmark"
  3. Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss · Score 106
    title matched "benchmark";summary matched "reasoning";summary matched "evaluation"
  4. Towards Long-horizon Agentic Multimodal Search · Score 106
    title matched "agent";summary matched "reasoning";summary matched "benchmark"
  5. QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence · Score 105
    title matched "agent";summary matched "benchmark";summary matched "evaluation"
Vision: 9 papers

"RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation" [Evaluation / Method]: Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveragin…

  1. RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation · Score 100
    title matched "multimodal";title matched "segmentation";has PDF
  2. All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding · Score 78
    title matched "multimodal";summary matched "segmentation";has PDF
  3. Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation · Score 71
    title matched "multimodal";has PDF;has rich summary
  4. AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation · Score 71
    title matched "diffusion";has PDF;has rich summary
  5. Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation · Score 70
    title matched "segmentation";has PDF;has rich summary

"VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model." [Evaluation / Data / Method]: The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artifi…

  1. VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model. · Score 111
    title matched "language model";title matched "benchmark";has DOI
  2. Multimodal large language models in brain tumor imaging: clinical applications and future perspectives. · Score 109
    title matched "language model";title matched "clinical";has DOI
  3. Bridging the Modality Gap in Medical Vision-Language Models: A Hybrid Contrastive-Optimal Transport Framework for Enhanced Cross-Modal Alignment. · Score 107
    title matched "language model";summary matched "benchmark";summary matched "clinical"
  4. User Experience and Early Clinical Outcomes of a Mental Wellness Chatbot for Depression and Anxiety: Pilot Evaluation Mixed Methods Study. · Score 93
    title matched "clinical";summary matched "language model";has DOI
  5. Comparison of AI-based Chatbot Performance in Analyzing Clinical Scenarios versus Medical Residents: A Novel Approach in Chest Diseases Education. · Score 82
    title matched "clinical";summary matched "language model";has DOI

"Demystifying Attitudes and Effects of Usage of Large-Language Models Among College-Aged Students" [Method]: In compiling literature for my senior seminar on combating hallucinations present within responses from large-langua…

  1. Demystifying Attitudes and Effects of Usage of Large-Language Models Among College-Aged Students · Score 70
    title matched "language model";has rich summary;has complete metadata

2026-04-14

31 papers matched · Generated 2026-04-14 11:37:06 (Asia/Shanghai)
LLM: 15 papers

"UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents" [Evaluation / Data / Application / Method]: Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems throu…

  1. UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents · Score 145
    title matched "agent";title matched "evaluation";summary matched "reasoning"
  2. General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks · Score 130
    title matched "reasoning";title matched "benchmark";summary matched "evaluation"
  3. Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games · Score 129
    title matched "agent";title matched "reasoning";summary matched "benchmark"
  4. FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning · Score 127
    title matched "agent";title matched "reasoning";summary matched "evaluation"
  5. From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python · Score 126
    title matched "agent";title matched "benchmark";summary matched "evaluation"
Vision: 10 papers

"OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation" [Evaluation / Data / Application / Method]: In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality…

  1. OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation · Score 112
    title matched "video generation";title matched "multimodal";has PDF
  2. LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation · Score 90
    title matched "segmentation";summary matched "multimodal";has PDF
  3. GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth · Score 87
    title matched "segmentation";summary matched "multimodal";has PDF
  4. Anthropogenic Regional Adaptation in Multimodal Vision-Language Model · Score 86
    title matched "multimodal";summary matched "diffusion";has PDF
  5. GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays · Score 78
    summary matched "diffusion";summary matched "multimodal";has DOI

"Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: Quantitative Pilot Feasibility Study." [Evaluation / Application / Method]: BACKGROUND: Translation of medical consulta…

  1. Comparing Large Language Models and Traditional Machine Translation Tools for Translating Medical Consultation Summaries: Quantitative Pilot Feasibility Study. · Score 107
    title matched "language model";summary matched "benchmark";summary matched "clinical"
  2. Toward Sustainable Clinical Analysis: Benchmarking Plastic Use in LC-MS Sample Preparation - Exemplified by Ketamine Analogues in Whole Blood. · Score 107
    title matched "benchmark";title matched "clinical";has DOI
  3. Text4Seg++: Advancing Image Segmentation via Generative Language Modeling. · Score 89
    title matched "language model";summary matched "benchmark";has DOI
  4. Diversity in clinical Trials: The example of systemic lupus erythematosus. · Score 82
    title matched "clinical";summary matched "benchmark";has DOI
  5. Comparative Performance of Gemini 3 Pro and GPT-5 Family Models on Ophthalmology Board-Style Questions. · Score 78
    summary matched "language model";summary matched "benchmark";summary matched "clinical"

"ECO-Charge: Multi-Agent Smart-Charging for Electric Vehicles" [Method]: International audience

  1. ECO-Charge: Multi-Agent Smart-Charging for Electric Vehicles · Score 64
    title matched "agent";has complete metadata

2026-04-13

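0 papers matched · Generated 2026-04-13 16:13:43 (Asia/Shanghai)
LLM: 0 papers

LLM: no new matching papers today.

Vision: 0 papers

Vision: no new matching papers today.

PubMed AI: no new matching papers today.

OpenAlex AI: no new matching papers today.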

2026-04-12

1 paper matched · Generated 2026-04-12 22:15:33 (Asia/Shanghai)
LLM: 0 papers

LLM: no new matching papers today.

Vision: 0 papers

Vision: no new matching papers today.

"Combining structural modeling and deep learning to calculate the E. coli protein interactome and functional networks." [Data / Method]: We report on the integration of three methods that predict, on a proteome-wide scale, whet…

  1. Combining structural modeling and deep learning to calculate the E. coli protein interactome and functional networks. · Score 48
    summary matched "language model";has DOI;has rich summary

OpenAlex AI: no new matching papers today.

2026-04-11

9 papers matched · Generated 2026-04-11 23:09:08 (Asia/Shanghai)
LLM: 0 papers

LLM: no new matching papers today.

Vision: 0 papers

Vision: no new matching papers today.

"Factors influencing large language model adoption among dental students: a cross-sectional study." [Application / Method]: This research evaluates the factors influencing the behavioural intention (BI) to adopt large language models…

  1. Evaluating the clinical decision-making performance of large language models in clinically oriented thoracic anatomy scenarios: a comparative evaluation study. · Score 104
    title matched "language model";title matched "clinical";has DOI
  2. Exploratory study of large language models in surgical decision-making for lumbar disc herniation: a multicenter analysis based on multisource clinical information. · Score 104
    title matched "language model";title matched "clinical";has DOI
  3. A hybrid large language model framework for structured data entry from code-switched persian clinical speech. · Score 104
    title matched "language model";title matched "clinical";has DOI
  4. Factors influencing large language model adoption among dental students: a cross-sectional study. · Score 88
    title matched "language model";summary matched "clinical";has DOI

"Coalition Drift: When Agents Drift Together Why multi-agent systems don't just drift individually — they drift as a group, and why that matters more than any single-agent failure mode." [Method]: Most AI governance framework…

  1. Coalition Drift: When Agents Drift Together Why multi-agent systems don't just drift individually — they drift as a group, and why that matters more than any single-agent failure mode. · Score 70
    title matched "agent";has DOI;has rich summary
  2. Coalition Drift: When Agents Drift Together Why multi-agent systems don't just drift individually — they drift as a group, and why that matters more than any single-agent failure mode. · Score 70
    title matched "agent";has DOI;has rich summary
  3. Coalition Formation Events: How Multi-Agent Systems Create Temporary Actors · Score 70
    title matched "agent";has DOI;has rich summary
  4. Coalition Formation Events: How Multi-Agent Systems Create Temporary Actors · Score 70
    title matched "agent";has DOI;has rich summary
  5. U-P Duality in Multi-Agent Systems: A Seven-Space Algorithm for Complex Nonlinear AI (Corrected Version) · Score 70
    title matched "agent";has DOI;has rich summary

2026-04-10

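0 papers matched · Generated 2026-04-10 18:14:08 (Asia/Shanghai)
LLM: 0 papers

LLM: no new matching papers today.

Vision: 0 papers

Vision: no new matching papers today.

PubMed AI: no new matching papers today.

OpenAlex AI: no new matching papers today.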

2026-04-09

5 papers matched · Generated 2026-04-09 14:51:56 (Asia/Shanghai)

2026-04-08

25 papers matched · Generated 2026-04-08 17:10:24 (Asia/Shanghai)