multimodal language model Topic Archive

multimodal-language-model.html
Long-term tracking RSS feed for the keyword "multimodal language model", aggregating previously matched papers.
zh-CN
Sun, 28 Jun 2026 05:24:06 +0000

Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
../papers/arxiv-d7c2dcff959d.html
https://arxiv.org/abs/2604.12896v1#2026-04-15#multimodal-language-model
Wed, 15 Apr 2026 11:35:50 +0800

Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that,…