<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>language model Topic Archive</title>
<link>language-model.html</link>
<description>Long-term tracking RSS feed for the keyword "language model", aggregating all historically matched literature.</description>
<language>zh-CN</language>
<lastBuildDate>Wed, 22 Apr 2026 03:37:20 +0000</lastBuildDate>
<item>
<title>Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps</title>
<link>../papers/arxiv-66f4fae6bbd8.html</link>
<guid>https://arxiv.org/abs/2604.19533v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&amp;CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each…</description>
</item>
<item>
<title>Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment</title>
<link>../papers/arxiv-3ca660d54bb4.html</link>
<guid>https://arxiv.org/abs/2604.19548v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting…</description>
</item>
<item>
<title>Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views</title>
<link>../papers/arxiv-0d90d26515bd.html</link>
<guid>https://arxiv.org/abs/2604.19716v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that ar…</description>
</item>
<item>
<title>Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews</title>
<link>../papers/arxiv-dcac53916c57.html</link>
<guid>https://arxiv.org/abs/2604.19502v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification--its arguments, questions, and critique--rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulnes…</description>
</item>
<item>
<title>A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding</title>
<link>../papers/arxiv-5fe8f705aa06.html</link>
<guid>https://arxiv.org/abs/2604.19689v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-…</description>
</item>
<item>
<title>Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic</title>
<link>../papers/arxiv-424f40f3b425.html</link>
<guid>https://arxiv.org/abs/2604.19567v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy &quot;king&quot;-&quot;man&quot;+&quot;woman&quot; = &quot;queen&quot; illustrates relational reasoning, yet replacing text with images of &quot;king&quot; and &quot;man&quot; significantly reduces performance because it requires commonsense knowledge and the extraction of…</description>
</item>
<item>
<title>SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models</title>
<link>../papers/arxiv-6f4a587095d1.html</link>
<guid>https://arxiv.org/abs/2604.19638v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gem…</description>
</item>
<item>
<title>Time Series Augmented Generation for Financial Applications</title>
<link>../papers/arxiv-a14f6e5fa3da.html</link>
<guid>https://arxiv.org/abs/2604.19633v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent&#x27;s core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent&#x27;s reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our…</description>
</item>
<item>
<title>Lost in Translation: Do LVLM Judges Generalize Across Languages?</title>
<link>../papers/arxiv-542a2e2a02e6.html</link>
<guid>https://arxiv.org/abs/2604.19405v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K…</description>
</item>
<item>
<title>Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language</title>
<link>../papers/arxiv-db59ef9531cc.html</link>
<guid>https://arxiv.org/abs/2604.19667v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models ca…</description>
</item>
<item>
<title>EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation</title>
<link>../papers/arxiv-3a3bbc2e6e3a.html</link>
<guid>https://arxiv.org/abs/2604.19105v1#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natur…</description>
</item>
<item>
<title>Classifying American Society of Anesthesiologists Physical Status With a Low-Rank-Adapted Large Language Model: Development and Validation Study.</title>
<link>../papers/doi-8b199115e87e.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42013456/#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>BACKGROUND: The American Society of Anesthesiologists Physical Status (ASA-PS) classification is integral to preoperative risk assessment; yet, assignment remains subjective and labor-intensive. Recent large language models (LLMs) process free-text electronic health records (EHRs), but few studies have evaluated parameter-efficient adaptations that both predict ASA-PS and provide clinician-readable rationales. Low-rank adaptation (LoRA) is a parameter-efficient technique that updates only a sma…</description>
</item>
<item>
<title>Enhancing large language model clinical support information with machine learning risk and explainability: a feasibility study.</title>
<link>../papers/doi-eefd4e77621d.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42012584/#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>BACKGROUND: Current machine learning (ML) prediction models offer limited guidance for individualized actionable management. Large language models (LLMs) can transform ML model-predicted risk estimates with Shapley Additive Explanations (SHAP) into clinically meaningful support information, yet the added value of incorporating ML-derived data and the relative performance of different LLMs remain uncertain. To address these gaps, we used our previously developed IMPACT framework to evaluate the…</description>
</item>
<item>
<title>Clinical Model Autophagy: The Risk of Interpretative Drift in Recursive Medical AI.</title>
<link>../papers/doi-637d5e47b283.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42013455/#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>The rapid integration of large language models into electronic medical record systems introduces a critical theoretical vulnerability. Drawing on foundational computer science proofs of &quot;model collapse,&quot; this viewpoint introduces the concept of &quot;Clinical Model Autophagy&quot;, a systemic degradation of diagnostic integrity that occurs when clinical artificial intelligence (AI) models are recursively trained on unverified, AI-generated synthetic data. As these recursive models may progressively regres…</description>
</item>
<item>
<title>APSevLM: Acute Pancreatitis Severity Language Model.</title>
<link>../papers/doi-e00fc28ccec0.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42013267/#2026-04-22#language-model</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Approximately one-fifth of patients with acute pancreatitis (AP) develop severe forms, which are associated with high mortality rates, making early prediction of severity crucial for effective patient management. In this study, we present APSevLM (Acute Pancreatitis Severity Language Model), a large language model (LLM)-based approach that integrates admission-time clinical data, imaging reports, and expert knowledge to predict AP severity at an early stage. Through a comprehensive evaluation u…</description>
</item>
<item>
<title>Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion</title>
<link>../papers/arxiv-df421e6da9eb.html</link>
<guid>https://arxiv.org/abs/2604.18566v1#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best…</description>
</item>
<item>
<title>MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation</title>
<link>../papers/arxiv-d54216ff47bf.html</link>
<guid>https://arxiv.org/abs/2604.18509v1#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose MASS-RAG, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for…</description>
</item>
<item>
<title>StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning</title>
<link>../papers/arxiv-4d1ad4b081bb.html</link>
<guid>https://arxiv.org/abs/2604.18401v1#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reas…</description>
</item>
<item>
<title>HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents</title>
<link>../papers/arxiv-5cc1d83ffee2.html</link>
<guid>https://arxiv.org/abs/2604.18349v1#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Long-term conversational large language model (LLM) agents require memory systems that can recover relevant evidence from historical interactions without overwhelming the answer stage with irrelevant context. However, existing memory systems, including hierarchical ones, still often rely solely on vector similarity for retrieval. This tends to produce bloated evidence sets: adding many superficially similar dialogue turns yields little additional recall, but lowers retrieval precision, increases…</description>
</item>
<item>
<title>Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling</title>
<link>../papers/arxiv-4797a7249e58.html</link>
<guid>https://arxiv.org/abs/2604.18464v1#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Semantic Tube Prediction (STP) leverages representation geometry to regularize LLM hidden-state trajectories toward locally linear geodesics during fine-tuning, thereby greatly improving data efficiency. The original STP recipe samples random token sub-spans, which is compatible with the base large language model (LLM) training architecture. Inspired by STP, we investigate whether the sampling position can further enhance the semantic structure of multi-step reasoning, and he…</description>
</item>
<item>
<title>Training and Agentic Inference Strategies for LLM-based Manim Animation Generation</title>
<link>../papers/arxiv-993f63372808.html</link>
<guid>https://arxiv.org/abs/2604.18364v1#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinf…</description>
</item>
<item>
<title>Multilingual Training and Evaluation Resources for Vision-Language Models</title>
<link>../papers/arxiv-bb0f0a1b4a2e.html</link>
<guid>https://arxiv.org/abs/2604.18347v1#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English,…</description>
</item>
<item>
<title>Weakly-Supervised Referring Video Object Segmentation through Text Supervision</title>
<link>../papers/arxiv-ccd0dd55c2f1.html</link>
<guid>https://arxiv.org/abs/2604.17797v1#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred by a text expression. Conventional approaches mostly rely on supervised learning, requiring expensive pixel-level mask annotations. To tackle this, weakly-supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which, however, are still costly and labor-intensive. In this paper, we design a novel weakly-supervised RVOS method, namely WSRVOS, to train the mode…</description>
</item>
<item>
<title>Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models</title>
<link>../papers/arxiv-c7fa0d917c8c.html</link>
<guid>https://arxiv.org/abs/2604.18429v1#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3…</description>
</item>
<item>
<title>Transforming oncology clinical trial matching through neuro-symbolic, multi-agent AI and an oncology-specific knowledge graph: a prospective evaluation in 3804 patients.</title>
<link>../papers/doi-a39ecce65f3a.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42004487/#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>BACKGROUND: Clinical trial enrollment in oncology remains critically low, with fewer than 5% of eligible adults participating, in large part due to the complexity and labor intensity of eligibility screening. We prospectively evaluated a neuro-symbolic, multi-agent artificial intelligence (AI) platform integrating domain-specific large language model (LLM) agents, an oncology-specific knowledge graph, a real-time recommendation engine, and human-in-the-loop review to determine whether automated…</description>
</item>
<item>
<title>Investigating fine-tuning versus zero-shot learning for general large language models when predicting cancer survival from initial oncology consultation documents.</title>
<link>../papers/doi-eebfc182eb48.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42004490/#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>BACKGROUND: Unstructured oncology consultation notes contain rich clinical information that may support survival prediction. Open-weight large language models (LLMs) can utilize these notes with zero-shot inference or fine-tuning, but their relative value for this setting remains unclear. The objective of this study is to evaluate open-weight LLMs for predicting 60-month survival from initial oncology consultation notes, comparing (i) zero-shot performance, (ii) performance after fine-tuning, a…</description>
</item>
<item>
<title>A Comparative Evaluation of Three Large Language Models for Parent-Centered Questions About Anorexia Nervosa.</title>
<link>../papers/doi-db4a2a7daf35.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/42003757/#2026-04-21#language-model</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>BACKGROUND: Large language models (LLMs) are increasingly used to obtain health information, including guidance on child and adolescent mental health. In anorexia nervosa (AN), where early recognition and timely intervention are critical, the accuracy of AI-generated information available to parents may have important clinical implications. This study evaluated the performance of LLMs in responding to parent-oriented questions about AN. METHODS: A comparative model evaluation was conducted usin…</description>
</item>
<item>
<title>Artificial Intelligence And The Transformation of Labor Markets</title>
<link>../papers/doi-8e5c28ce273b.html</link>
<guid>https://doi.org/10.5281/zenodo.19641429#2026-04-20#language-model</guid>
<pubDate>Mon, 20 Apr 2026 11:48:52 +0800</pubDate>
<description>The rapid advancement of artificial intelligence (AI) technologies, particularly generative AI and large language models, has reignited debates about the future of work and the potential for widespread labor market disruption. This article examines the socioeconomic implications of AI-driven automation through the lens of political economy and labor sociology. Drawing on recent empirical studies, industry reports, and historical analyses of technological transitions, the article evaluates compe…</description>
</item>
<item>
<title>Artificial Intelligence And The Transformation of Labor Markets</title>
<link>../papers/doi-0b4fe06c6a1d.html</link>
<guid>https://doi.org/10.5281/zenodo.19641430#2026-04-20#language-model</guid>
<pubDate>Mon, 20 Apr 2026 11:48:52 +0800</pubDate>
<description>The rapid advancement of artificial intelligence (AI) technologies, particularly generative AI and large language models, has reignited debates about the future of work and the potential for widespread labor market disruption. This article examines the socioeconomic implications of AI-driven automation through the lens of political economy and labor sociology. Drawing on recent empirical studies, industry reports, and historical analyses of technological transitions, the article evaluates compe…</description>
</item>
<item>
<title>Pretraining effective T5 generative models for clinical and biomedical applications.</title>
<link>../papers/doi-d4977a45ef49.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41996418/#2026-04-18#language-model</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>This paper presents a study of the impact of corpus selection and vocabulary design on the performance of T5-based language models in clinical and biomedical domains. We introduce five different T5-EHR models, each pretrained from scratch using different combinations of clinical and biomedical corpora alongside domain-specific vocabularies. We evaluated these models across a variety of clinical and biomedical tasks to quantify the impact of pretraining data and vocabulary tokenization choices o…</description>
</item>
<item>
<title>MILU: a consensus ensemble benchmark for multimodal medical imaging lecture understanding.</title>
<link>../papers/doi-04f076dfee40.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41994492/#2026-04-18#language-model</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>PURPOSE: Vision-language models (VLMs) are increasingly used to interpret multimodal educational materials, yet their reliability on diagram-, equation-, and text-dense scientific lecture slides remains poorly understood. This work introduces Medical Imaging Lecture Understanding (MILU), a large-scale benchmark designed to characterize cross-model variability in structured understanding of real medical imaging lectures. APPROACH: MILU includes 23 lecture sets with 1117 slides. LLaVA-OneVision,…</description>
</item>
<item>
<title>Comparative performance of large language models and Drugs.com versus Lexicomp for antiseizure medication drug-drug interactions: A cross-sectional study with iterative prompting analysis.</title>
<link>../papers/doi-b257aeab2d15.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41994367/#2026-04-18#language-model</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>BACKGROUND: Antiseizure medications (ASMs) are frequently co-prescribed and are associated with a high risk of clinically significant drug-drug interactions (DDIs). Large language models (LLMs) are increasingly used for clinical queries, yet their performance in detecting ASM-related DDIs compared with established drug interaction databases remains uncertain. METHODS: A cross-sectional comparative study evaluated 186 ASM-comedication pairs (126 classified as major/moderate by Lexicomp) using Ch…</description>
</item>
<item>
<title>Weakly Supervised Composed Object Re-Identification With Large Models.</title>
<link>../papers/doi-4950fa4bce35.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41996440/#2026-04-18#language-model</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>Existing object re-identification (re-ID) and composed image retrieval (CIR) methods capture different aspects of real-world retrieval requirements; re-ID preserves identity but cannot specify desired appearance changes, whereas CIR supports attribute-guided retrieval but does not enforce identity consistency. To bridge this gap, we introduce composed object re-identification (CORI), a new task that requires the retrieved target to simultaneously satisfy identity preservation and text-guided at…</description>
</item>
<item>
<title>QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies</title>
<link>../papers/arxiv-60286bc4afdd.html</link>
<guid>https://arxiv.org/abs/2604.15151v1#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Large language models have demonstrated strong performance on general-purpose programming tasks, yet their ability to generate executable algorithmic trading strategies remains underexplored. Unlike standard code benchmarks, trading-strategy generation requires simultaneous mastery of domain-specific financial logic, knowledge of a specialized API, and the ability to produce code that is not only syntactically correct but also leads to actual trades on historical data. In this work, we present…</description>
</item>
<item>
<title>Autonomous Evolution of EDA Tools: Multi-Agent Self-Evolved ABC</title>
<link>../papers/arxiv-c894f3778ac6.html</link>
<guid>https://arxiv.org/abs/2604.15082v1#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>This paper introduces the first self-evolving logic synthesis framework, which leverages Large Language Model (LLM) agents to autonomously improve the source code of ABC, the widely adopted logic synthesis system. Our framework operates on the entire integrated ABC codebase, and the output repository preserves its single-binary execution model and command interface. In the initial evolution cycle, we bootstrap the system using existing prior open-source synthesis componen…</description>
</item>
<item>
<title>ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints</title>
<link>../papers/arxiv-148ce4c33832.html</link>
<guid>https://arxiv.org/abs/2604.14902v1#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Intelligent embodied agents should not simply follow instructions, as real-world environments often involve unexpected conditions and exceptions. However, existing methods usually focus on directly executing instructions, without considering whether the target objects can actually be manipulated, meaning they fail to assess available affordances. To address this limitation, we introduce DynAfford, a benchmark that evaluates embodied agents in dynamic environments where object affordances may ch…</description>
</item>
<item>
<title>From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning</title>
<link>../papers/arxiv-671db056cec2.html</link>
<guid>https://arxiv.org/abs/2604.15244v1#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using on…</description>
</item>
<item>
<title>RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography</title>
<link>../papers/arxiv-d12df90e00da.html</link>
<guid>https://arxiv.org/abs/2604.15231v1#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by…</description>
</item>
<item>
<title>RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models</title>
<link>../papers/arxiv-4a4068542625.html</link>
<guid>https://arxiv.org/abs/2604.14951v1#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world s…</description>
</item>
<item>
<title>Applying natural language processing and large language models to clinical notes for phenotyping and diagnosing rare diseases: a systematic review.</title>
<link>../papers/doi-caeec9f876b5.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41990239/#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>OBJECTIVES: Patients with rare diseases often face long delays before receiving a diagnosis. Using electronic health records for automated phenotyping and diagnosis of rare diseases is a promising approach but can be challenging because critical information is often recorded in unstructured notes rather than structured fields. This systematic review synthesizes the current literature applying natural language processing (NLP) and large language models (LLMs) for rare disease phenotyping and dia…</description>
</item>
<item>
<title>Evaluation of large language models with clinical guidance for vetting outpatient magnetic resonance imaging lumbar spine referrals.</title>
<link>../papers/doi-2fe134b4d7bc.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41989203/#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Objectives: Accurate triage of lumbar spine magnetic resonance imaging (MRI) referrals for sciatica is important for patient assessment, diagnosis and surgical planning. This study evaluates the accuracy and speed of large language models (LLMs) in automatically vetting lumbar spine MRI referrals from general practice. Methods: Three LLMs (GPT-4, Claude Opus, Gemini) were tasked with assigning an outcome (Accept - Routine, Accept - Urgent, Reject) and flagging MRI contraindications for lumbar spine…</description>
</item>
<item>
<title>From Image to Pixels: towards Fine-Grained Medical Vision-Language Models.</title>
<link>../papers/doi-71303bb82f13.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41989909/#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understanding and basic textual queries, falling short of the fine-grained reasoning required in clinical contexts. In this work, we present a comprehensive solution spanning data, model, and training innovations to advance pixel-level multimodal intelligence in biomedicine. First, we construct MeCoVQA, a new visual-language benchmark that spans eigh…</description>
</item>
<item>
<title>Targeted use of large language models for EHR-based computable phenotyping.</title>
<link>../papers/doi-d44eb8c5ebfc.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41990328/#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>OBJECTIVE: Computable phenotypes derived from electronic health records (EHRs) are central to clinical research and quality reporting. Although large language models (LLMs) can extract clinically rich information from unstructured notes, routine application to all patients is computationally expensive. We evaluated whether uncertainty-guided selective use of LLMs can improve phenotyping accuracy while preserving scalability. MATERIALS AND METHODS: We developed a selective augmentation framework…</description>
</item>
<item>
<title>Dual perspectives on large language models in rheumatology: physician-rated quality and patient-centered usability of GPT-4o versus DeepSeek-V3.</title>
<link>../papers/doi-fa629176d611.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41989204/#2026-04-17#language-model</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>OBJECTIVES: This study conducted an informatics system evaluation of two LLMs (GPT-4o and DeepSeek-V3) for patient education, combining clinician-rated quality with patient-perceived usability across thematically stratified queries. MATERIALS AND METHODS: In a blinded, within-subject design, 16 frequently asked questions about biologic therapies were categorized into three domains: treatment/drug selection, safety/adverse effects, and special conditions/daily life. Responses were standardized,…</description>
</item>
<item>
<title>GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis</title>
<link>../papers/arxiv-283874153373.html</link>
<guid>https://arxiv.org/abs/2604.13888v1#2026-04-16#language-model</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and i…</description>
</item>
<item>
<title>Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning</title>
<link>../papers/arxiv-82411c54ef00.html</link>
<guid>https://arxiv.org/abs/2604.13804v1#2026-04-16#language-model</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge…</description>
</item>
<item>
<title>LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning</title>
<link>../papers/arxiv-c517a8dff3b8.html</link>
<guid>https://arxiv.org/abs/2604.14140v1#2026-04-16#language-model</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Probl…</description>
</item>
<item>
<title>TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration</title>
<link>../papers/arxiv-7436d1e41f94.html</link>
<guid>https://arxiv.org/abs/2604.14116v1#2026-04-16#language-model</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules, the Researcher and the Executor, the system seamlessly performs requirement analysis, open-domain literature and data research, formul…</description>
</item>
<item>
<title>The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents</title>
<link>../papers/arxiv-7da57578b1cc.html</link>
<guid>https://arxiv.org/abs/2604.13759v1#2026-04-16#language-model</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Large language model (LLM) agents on multi-step tasks suffer reasoning degradation (looping, drift, stuck states) at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM-as-judge monitoring (10-15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4…</description>
</item>
<item>
<title>MAny: Merge Anything for Multimodal Continual Instruction Tuning</title>
<link>../papers/arxiv-b488936a3be9.html</link>
<guid>https://arxiv.org/abs/2604.14016v1#2026-04-16#language-model</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present MAny (…</description>
</item>
</channel>
</rss>
