<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>multimodal Topic Archive</title>
<link>multimodal.html</link>
<description>Long-running RSS feed tracking the keyword "multimodal", aggregating all historically matched papers.</description>
<language>en-US</language>
<lastBuildDate>Wed, 22 Apr 2026 03:37:20 +0000</lastBuildDate>
<item>
<title>A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding</title>
<link>../papers/arxiv-5fe8f705aa06.html</link>
<guid>https://arxiv.org/abs/2604.19689v1#2026-04-22#multimodal</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-…</description>
</item>
<item>
<title>SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models</title>
<link>../papers/arxiv-6f4a587095d1.html</link>
<guid>https://arxiv.org/abs/2604.19638v1#2026-04-22#multimodal</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gem…</description>
</item>
<item>
<title>Lost in Translation: Do LVLM Judges Generalize Across Languages?</title>
<link>../papers/arxiv-542a2e2a02e6.html</link>
<guid>https://arxiv.org/abs/2604.19405v1#2026-04-22#multimodal</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K…</description>
</item>
<item>
<title>PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving</title>
<link>../papers/arxiv-523377f05ec5.html</link>
<guid>https://arxiv.org/abs/2604.19379v1#2026-04-22#multimodal</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing supervised mm-3DPS methods rely heavily on strong cross-modal co…</description>
</item>
<item>
<title>Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval</title>
<link>../papers/arxiv-39272031a7a0.html</link>
<guid>https://arxiv.org/abs/2604.19135v1#2026-04-22#multimodal</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage…</description>
</item>
<item>
<title>How Far Are Video Models from True Multimodal Reasoning?</title>
<link>../papers/arxiv-f1cd701c6156.html</link>
<guid>https://arxiv.org/abs/2604.19193v1#2026-04-22#multimodal</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models&#x27; zero-shot reasoning capabili…</description>
</item>
<item>
<title>EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation</title>
<link>../papers/arxiv-3a3bbc2e6e3a.html</link>
<guid>https://arxiv.org/abs/2604.19105v1#2026-04-22#multimodal</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natur…</description>
</item>
<item>
<title>MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval</title>
<link>../papers/arxiv-520299161763.html</link>
<guid>https://arxiv.org/abs/2604.18584v1#2026-04-21#multimodal</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, an…</description>
</item>
<item>
<title>Multilingual Training and Evaluation Resources for Vision-Language Models</title>
<link>../papers/arxiv-bb0f0a1b4a2e.html</link>
<guid>https://arxiv.org/abs/2604.18347v1#2026-04-21#multimodal</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English,…</description>
</item>
<item>
<title>Weakly-Supervised Referring Video Object Segmentation through Text Supervision</title>
<link>../papers/arxiv-ccd0dd55c2f1.html</link>
<guid>https://arxiv.org/abs/2604.17797v1#2026-04-21#multimodal</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred to by a text expression. Conventional approaches mostly rely on supervised learning, requiring expensive pixel-level mask annotations. To tackle this, weakly-supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which are however still costly and labor-intensive. In this paper, we design a novel weakly-supervised RVOS method, namely WSRVOS, to train the mode…</description>
</item>
<item>
<title>Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models</title>
<link>../papers/arxiv-c7fa0d917c8c.html</link>
<guid>https://arxiv.org/abs/2604.18429v1#2026-04-21#multimodal</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3…</description>
</item>
<item>
<title>MILU: a consensus ensemble benchmark for multimodal medical imaging lecture understanding.</title>
<link>../papers/doi-04f076dfee40.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41994492/#2026-04-18#multimodal</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>PURPOSE: Vision-language models (VLMs) are increasingly used to interpret multimodal educational materials, yet their reliability on diagram-, equation-, and text-dense scientific lecture slides remains poorly understood. This work introduces Medical Imaging Lecture Understanding (MILU), a large-scale benchmark designed to characterize cross-model variability in structured understanding of real medical imaging lectures. APPROACH: MILU includes 23 lecture sets with 1117 slides. LLaVA-OneVision,…</description>
</item>
<item>
<title>Weakly Supervised Composed Object Re-Identification With Large Models.</title>
<link>../papers/doi-4950fa4bce35.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41996440/#2026-04-18#multimodal</guid>
<pubDate>Sat, 18 Apr 2026 11:26:55 +0800</pubDate>
<description>Existing object re-identification (re-ID) and composed image retrieval (CIR) methods capture different aspects of real-world retrieval requirements; re-ID preserves identity but cannot specify desired appearance changes, whereas CIR supports attribute-guided retrieval but does not enforce identity consistency. To bridge this gap, we introduce composed object re-identification (CORI), a new task that requires the retrieved target to simultaneously satisfy identity preservation and text-guided at…</description>
</item>
<item>
<title>From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench</title>
<link>../papers/arxiv-913915b00c96.html</link>
<guid>https://arxiv.org/abs/2604.15037v1#2026-04-17#multimodal</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,18…</description>
</item>
<item>
<title>MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation</title>
<link>../papers/arxiv-9f9995d5a903.html</link>
<guid>https://arxiv.org/abs/2604.15309v1#2026-04-17#multimodal</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage genera…</description>
</item>
<item>
<title>RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework</title>
<link>../papers/arxiv-27b12af34ce1.html</link>
<guid>https://arxiv.org/abs/2604.15308v1#2026-04-17#multimodal</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop plann…</description>
</item>
<item>
<title>RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models</title>
<link>../papers/arxiv-4a4068542625.html</link>
<guid>https://arxiv.org/abs/2604.14951v1#2026-04-17#multimodal</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources, such as APIs, computational utilities, and specialized models, to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world s…</description>
</item>
<item>
<title>From Image to Pixels: towards Fine-Grained Medical Vision-Language Models.</title>
<link>../papers/doi-71303bb82f13.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41989909/#2026-04-17#multimodal</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understanding and basic textual queries, falling short of the fine-grained reasoning required in clinical contexts. In this work, we present a comprehensive solution spanning data, model, and training innovations to advance pixel-level multimodal intelligence in biomedicine. First, we construct MeCoVQA, a new visual-language benchmark that spans eigh…</description>
</item>
<item>
<title>GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis</title>
<link>../papers/arxiv-283874153373.html</link>
<guid>https://arxiv.org/abs/2604.13888v1#2026-04-16#multimodal</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and i…</description>
</item>
<item>
<title>Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning</title>
<link>../papers/arxiv-82411c54ef00.html</link>
<guid>https://arxiv.org/abs/2604.13804v1#2026-04-16#multimodal</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge…</description>
</item>
<item>
<title>MAny: Merge Anything for Multimodal Continual Instruction Tuning</title>
<link>../papers/arxiv-b488936a3be9.html</link>
<guid>https://arxiv.org/abs/2604.14016v1#2026-04-16#multimodal</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present MAny (…</description>
</item>
<item>
<title>MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging</title>
<link>../papers/arxiv-309351a1c9e5.html</link>
<guid>https://arxiv.org/abs/2604.13756v1#2026-04-16#multimodal</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>The potential of Multimodal Large Language Models (MLLMs) in the domain of medical imaging raises the demand for systematic and rigorous evaluation frameworks that are aligned with real-world medical imaging practice. Existing practices that report single or coarse-grained metrics lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained and in-d…</description>
</item>
<item>
<title>ROSE: Retrieval-Oriented Segmentation Enhancement</title>
<link>../papers/arxiv-e008501b0fb5.html</link>
<guid>https://arxiv.org/abs/2604.14147v1#2026-04-16#multimodal</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model&#x27;s knowledge but demand up-to-date externa…</description>
</item>
<item>
<title>Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models</title>
<link>../papers/arxiv-edb7485d7898.html</link>
<guid>https://arxiv.org/abs/2604.14044v1#2026-04-16#multimodal</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental &quot;temporal blindness&quot;. Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and…</description>
</item>
<item>
<title>Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding</title>
<link>../papers/arxiv-60b6a5d36d13.html</link>
<guid>https://arxiv.org/abs/2604.13540v1#2026-04-16#multimodal</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model&#x27;s rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human &quot;Thinking-While-Drawing&quot; paradigm, where…</description>
</item>
<item>
<title>POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch</title>
<link>../papers/arxiv-cb64e4ef9ea1.html</link>
<guid>https://arxiv.org/abs/2604.14029v1#2026-04-16#multimodal</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch.…</description>
</item>
<item>
<title>Towards Long-horizon Agentic Multimodal Search</title>
<link>../papers/arxiv-c584099374f8.html</link>
<guid>https://arxiv.org/abs/2604.12890v1#2026-04-15#multimodal</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on…</description>
</item>
<item>
<title>Don&#x27;t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs</title>
<link>../papers/arxiv-d7c2dcff959d.html</link>
<guid>https://arxiv.org/abs/2604.12896v1#2026-04-15#multimodal</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that,…</description>
</item>
<item>
<title>RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation</title>
<link>../papers/arxiv-3027c60dff03.html</link>
<guid>https://arxiv.org/abs/2604.12319v1#2026-04-15#multimodal</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality…</description>
</item>
<item>
<title>All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding</title>
<link>../papers/arxiv-ba711ee91078.html</link>
<guid>https://arxiv.org/abs/2604.12335v1#2026-04-15#multimodal</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and d…</description>
</item>
<item>
<title>Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation</title>
<link>../papers/arxiv-b480ff0cabeb.html</link>
<guid>https://arxiv.org/abs/2604.12970v1#2026-04-15#multimodal</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises from modality heterogeneity: many clinical sites possess only a subset of modalities due to resource constraints or workflow variations. Existing approaches address this through feature imputation networks that synthesize missing modality representations, yet these methods produce point estimates without reliability measures, forcing downs…</description>
</item>
<item>
<title>Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks</title>
<link>../papers/arxiv-df5064c793f1.html</link>
<guid>https://arxiv.org/abs/2604.12833v1#2026-04-15#multimodal</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Suc…</description>
</item>
<item>
<title>Multimodal large language models in brain tumor imaging: clinical applications and future perspectives.</title>
<link>../papers/doi-fb5d26b2eb57.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41979660/#2026-04-15#multimodal</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>The use of multimodal data is essential for the precise diagnosis and treatment of brain tumors. In this context, multimodal data encompass multisequence magnetic resonance imaging, computed tomography, positron emission tomography, histopathological images, molecular and genomic profiles, structured clinical variables, and radiological reports. With the rapid advancement of artificial intelligence, integrating these heterogeneous data sources has become a central research direction for improvi…</description>
</item>
<item>
<title>Bridging the Modality Gap in Medical Vision-Language Models: A Hybrid Contrastive-Optimal Transport Framework for Enhanced Cross-Modal Alignment.</title>
<link>../papers/doi-48f3f7f35ec5.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41979955/#2026-04-15#multimodal</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Vision-language models in healthcare face a critical limitation, i.e., the modality gap, where image and text embeddings occupy distantly separated regions in shared representation space. This is reinforced by traditional contrastive learning objectives, and manifests itself through fundamental constraints in cross-modal understanding and downstream task performance. Existing approaches focus on addressing input-level requirements; however, the geometric constraints imposed by multimodal contra…</description>
</item>
<item>
<title>Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games</title>
<link>../papers/arxiv-c0dfb20b6ba0.html</link>
<guid>https://arxiv.org/abs/2604.11741v1#2026-04-14#multimodal</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesiz…</description>
</item>
<item>
<title>Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving</title>
<link>../papers/arxiv-9e98a0ea67e8.html</link>
<guid>https://arxiv.org/abs/2604.11734v1#2026-04-14#multimodal</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion p…</description>
</item>
<item>
<title>OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation</title>
<link>../papers/arxiv-c9ca67ecc0bf.html</link>
<guid>https://arxiv.org/abs/2604.11804v1#2026-04-14#multimodal</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an…</description>
</item>
<item>
<title>LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation</title>
<link>../papers/arxiv-be45283d75a9.html</link>
<guid>https://arxiv.org/abs/2604.11789v1#2026-04-14#multimodal</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled fram…</description>
</item>
<item>
<title>GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth</title>
<link>../papers/arxiv-a58e7d937629.html</link>
<guid>https://arxiv.org/abs/2604.11585v1#2026-04-14#multimodal</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present GeomPrompt, a lightweight cross-modal adaptation module that synthesizes a task-driven geometric prompt from RGB alone for the fourth channel of a frozen RGB-D semantic segmentation model, without depth supervision. We further introduce GeomPrompt-Recovery, an adaptation module that compensates for degraded depth by pr…</description>
</item>
<item>
<title>Anthropogenic Regional Adaptation in Multimodal Vision-Language Model</title>
<link>../papers/arxiv-bb16b976d1a8.html</link>
<guid>https://arxiv.org/abs/2604.11490v1#2026-04-14#multimodal</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalizat…</description>
</item>
<item>
<title>GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays</title>
<link>../papers/doi-797bb9dad901.html</link>
<guid>https://arxiv.org/abs/2604.11653v1#2026-04-14#multimodal</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion-based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density m…</description>
</item>
<item>
<title>Text4Seg++: Advancing Image Segmentation via Generative Language Modeling.</title>
<link>../papers/doi-b67edb02c604.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41973591/#2026-04-14#multimodal</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of s…</description>
</item>
<item>
<title>Comparative Performance of Gemini 3 Pro and GPT-5 Family Models on Ophthalmology Board-Style Questions.</title>
<link>../papers/doi-a326948aeb7e.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41970036/#2026-04-14#multimodal</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>OBJECTIVE: To compare the performance of state-of-the-art Gemini and GPT models on ophthalmology board-style questions and examine variation by subspecialty, cognitive complexity, and question type. DESIGN: A cross-sectional evaluation of 12 distinct large language model (LLM) configurations using a standardized ophthalmology question set. SUBJECTS: Five hundred multiple-choice questions (250 from the American Academy of Ophthalmology&#x27;s Basic and Clinical Science Course [BCSC]; 250 StatPearls).…</description>
</item>
<item>
<title>MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control</title>
<link>../papers/arxiv-da7fd45377bf.html</link>
<guid>https://arxiv.org/abs/2604.06156v1#2026-04-08#multimodal</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental challenges. First, structural misalignment between instance-level reasoning and pairwise contrastive supervision may lead to shortcut behavior, where the model merely learns the superficial format of reasoning. Second, reasoning is not universally beneficial for e…</description>
</item>
<item>
<title>Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents</title>
<link>../papers/arxiv-ad262155f5ef.html</link>
<guid>https://arxiv.org/abs/2604.06132v1#2026-04-08#multimodal</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verifie…</description>
</item>
<item>
<title>Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery</title>
<link>../papers/arxiv-f5eea61bb28e.html</link>
<guid>https://arxiv.org/abs/2604.06124v1#2026-04-08#multimodal</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models…</description>
</item>
<item>
<title>Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning</title>
<link>../papers/arxiv-1f39f1fd7a01.html</link>
<guid>https://arxiv.org/abs/2604.06079v1#2026-04-08#multimodal</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While TikZ is the de facto standard for scientific schematics due to its programmatic flexibility, its requirement for rigorous spatial precision presents a significant challenge for Multimodal Large Language Models. Progress is currently stifled by two primary gaps: (1) Data Quality Gap: existing image-TikZ corpora often lac…</description>
</item>
<item>
<title>CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics</title>
<link>../papers/arxiv-244e7ca07a93.html</link>
<guid>https://arxiv.org/abs/2604.06036v1#2026-04-08#multimodal</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or…</description>
</item>
</channel>
</rss>
