Keyword Tracking

Keyword tracking: multimodal

This page tracks the keywords set in your configuration over the long term and archives the matching papers by date.

Recent trend

The most recent hit came from the LLM feed: A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

[Trend chart: date axis from 2026-04-09 to 2026-04-22]

Hit details

Review, by date, the titles of the papers that matched this keyword. The source feed of each hit is kept.
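As a rough illustration of the grouping described above, here is a minimal sketch. It assumes that a hit means the keyword appears, case-insensitively, in a paper's title or summary; the page itself does not state the matching rule. The `Paper` record, the `group_hits_by_date` function, and the "An unrelated title" sample row are hypothetical names and data introduced only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    # Hypothetical record shape; the real archive schema is not shown on this page.
    date: str     # ISO date, e.g. "2026-04-22"
    feed: str     # source feed, e.g. "LLM", "Vision", "PubMed AI"
    title: str
    summary: str


def group_hits_by_date(papers, keyword):
    """Return {date: [Paper, ...]} for papers whose title or summary contains keyword."""
    needle = keyword.casefold()
    hits = defaultdict(list)
    for paper in papers:
        # Assumption: matching is a case-insensitive substring test on title + summary.
        haystack = f"{paper.title} {paper.summary}".casefold()
        if needle in haystack:
            hits[paper.date].append(paper)
    # ISO dates sort lexicographically, so reverse order puts the newest date first.
    return dict(sorted(hits.items(), reverse=True))


papers = [
    Paper("2026-04-17", "Vision",
          "RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework",
          "High-level autonomous driving requires motion planners capable of "
          "modeling multimodal future uncertainties"),
    Paper("2026-04-22", "LLM",
          "A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding",
          ""),
    # Hypothetical non-matching row, included only to show filtering.
    Paper("2026-04-17", "Vision", "An unrelated title", "No match here."),
]

hits = group_hits_by_date(papers, "multimodal")
for date, matched in hits.items():
    for paper in matched:
        print(date, paper.feed, paper.title)
```

Each hit keeps its `feed` field, so a paper picked up by two feeds would be listed once per feed, as with the repeated entries under 2026-04-08.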

2026-04-22

2026-04-22 11:37:03 (Asia/Shanghai)

LLM

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promis…

LLM

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insuffi…

LLM

Lost in Translation: Do LVLM Judges Generalize Across Languages?

Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these eva…

Vision

PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving

This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts…

Vision

Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval

This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval metho…

Vision

How Far Are Video Models from True Multimodal Reasoning?

Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existin…

Vision

EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation

Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advan…

2026-04-21

2026-04-21 11:40:46 (Asia/Shanghai)

LLM

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and…

LLM

Multilingual Training and Evaluation Resources for Vision-Language Models

Vision Language Models (VLMs) achieved rapid progress in the recent years. However, despite their growth, VLMs development is heavily grounded on English, leading to two main limi…

Vision

Weakly-Supervised Referring Video Object Segmentation through Text Supervision

Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred by a text expression. Conventional approaches are mostly supervised learning, r…

Vision

Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models

Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images.…

2026-04-18

2026-04-18 11:26:55 (Asia/Shanghai)

PubMed AI

MILU: a consensus ensemble benchmark for multimodal medical imaging lecture understanding.

PURPOSE: Vision-language models (VLMs) are increasingly used to interpret multimodal educational materials, yet their reliability on diagram-, equation-, and text-dense scientific…

PubMed AI

Weakly Supervised Composed Object Re-Identification With Large Models.

Existing object re-identification (re-ID) and composed image retrieval (CIR) methods capture different aspects of real-world retrieval requirements; re-ID preserves identity but c…

2026-04-17

2026-04-17 11:39:21 (Asia/Shanghai)

LLM

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus…

LLM

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flex…

Vision

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-b…

Vision

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to…

PubMed AI

From Image to Pixels: towards Fine-Grained Medical Vision-Language Models.

Multimodal large language models (MLLMs) offer immense potential for biomedical AI, yet current applications remain limited to coarse-grained image understanding and basic textual…

2026-04-16

2026-04-16 11:43:00 (Asia/Shanghai)

LLM

GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis

The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-…

LLM

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character…

LLM

MAny: Merge Anything for Multimodal Continual Instruction Tuning

Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic f…

LLM

MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

The potential of Multimodal Large Language Models (MLLMs) in domain of medical imaging raise the demands of systematic and rigorous evaluation frameworks that are aligned with the…

Vision

ROSE: Retrieval-Oriented Segmentation Enhancement

Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate…

Vision

Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "tempo…

Vision

Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where…

Vision

POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these bound…

2026-04-15

2026-04-15 11:35:50 (Asia/Shanghai)

LLM

Towards Long-horizon Agentic Multimodal Search

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous inform…

LLM

Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool…

Vision

RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g…

Vision

All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding

Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, a…

Vision

Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation

Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises from modality heterog…

Vision

Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the…

PubMed AI

Multimodal large language models in brain tumor imaging: clinical applications and future perspectives.

The use of multimodal data is essential for the precise diagnosis and treatment of brain tumors. In this context, multimodal data encompass multisequence magnetic resonance imagin…

PubMed AI

Bridging the Modality Gap in Medical Vision-Language Models: A Hybrid Contrastive-Optimal Transport Framework for Enhanced Cross-Modal Alignment.

Vision-language models in healthcare face a critical limitation, i.e., the modality gap, where image and text embeddings occupy distantly separated regions in shared representatio…

2026-04-14

2026-04-14 11:37:06 (Asia/Shanghai)

LLM

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfec…

LLM

Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving

Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion pla…

Vision

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference imag…

Vision

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision--language understanding, yet they remain limited in tasks requiring precise object-level…

Vision

GeomPrompt: Geometric Prompt Learning for RGB-D Semantic Segmentation Under Missing and Degraded Depth

Multimodal perception systems for robotics and embodied AI often assume reliable RGB-D sensing, but in practice, depth is frequently missing, noisy, or corrupted. We thus present…

Vision

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedi…

Vision

GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays

We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings fr…

PubMed AI

Text4Seg++: Advancing Image Segmentation via Generative Language Modeling.

Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remain…

PubMed AI

Comparative Performance of Gemini 3 Pro and GPT-5 Family Models on Ophthalmology Board-Style Questions.

OBJECTIVE: To compare the performance of state-of-the-art Gemini and GPT models on ophthalmology board-style questions and examine variation by subspecialty, cognitive complexity,…

2026-04-08

2026-04-08 17:10:24 (Asia/Shanghai)

LLM

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reaso…

LLM

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer…

LLM

Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its p…

Vision

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reaso…

Vision

Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its p…

Vision

Scientific Graphics Program Synthesis via Dual Self-Consistency Reinforcement Learning

Graphics Program Synthesis is pivotal for interpreting and editing visual data, effectively facilitating the reverse-engineering of static visuals into editable TikZ code. While T…

Vision

CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics

Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cos…