<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>multimodal language model Topic Archive</title>
<link>multimodal-language-model.html</link>
<description>Long-term tracking RSS feed for the keyword "multimodal language model", collecting all previously matched papers.</description>
<language>en</language>
<lastBuildDate>Sun, 28 Jun 2026 05:24:06 +0000</lastBuildDate>
<item>
<title>Don&#x27;t Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs</title>
<link>../papers/arxiv-d7c2dcff959d.html</link>
<guid>https://arxiv.org/abs/2604.12896v1#2026-04-15#multimodal-language-model</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that,…</description>
</item>
</channel>
</rss>
