<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Vision Feed Archive</title>
<link>vision.html</link>
<description>Long-term subscription RSS feed for Vision, aggregating recently matched papers and archived items.</description>
<language>en</language>
<lastBuildDate>Wed, 22 Apr 2026 03:37:20 +0000</lastBuildDate>
<item>
<title>PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving</title>
<link>../papers/arxiv-523377f05ec5.html</link>
<guid>https://arxiv.org/abs/2604.19379v1#2026-04-22#vision</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing supervised mm-3DPS methods rely heavily on strong cross-modal co…</description>
</item>
<item>
<title>Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval</title>
<link>../papers/arxiv-39272031a7a0.html</link>
<guid>https://arxiv.org/abs/2604.19135v1#2026-04-22#vision</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage…</description>
</item>
<item>
<title>ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis</title>
<link>../papers/arxiv-9db4212c18cd.html</link>
<guid>https://arxiv.org/abs/2604.19720v1#2026-04-22#vision</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal cons…</description>
</item>
<item>
<title>MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation</title>
<link>../papers/arxiv-2fdaceb58972.html</link>
<guid>https://arxiv.org/abs/2604.19679v1#2026-04-22#vision</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control, which limits comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMCon…</description>
</item>
<item>
<title>MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention</title>
<link>../papers/arxiv-7e457ae682db.html</link>
<guid>https://arxiv.org/abs/2604.19675v1#2026-04-22#vision</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient deterministic generation without relying on stochastic diffusion processes. While generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing approaches are predominantly built upon diffusion models, which incur substantial computational overhead due to iterative sampling and are…</description>
</item>
<item>
<title>RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation</title>
<link>../papers/arxiv-ab2855fef1f1.html</link>
<guid>https://arxiv.org/abs/2604.19570v1#2026-04-22#vision</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusi…</description>
</item>
<item>
<title>How Far Are Video Models from True Multimodal Reasoning?</title>
<link>../papers/arxiv-f1cd701c6156.html</link>
<guid>https://arxiv.org/abs/2604.19193v1#2026-04-22#vision</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models&#x27; zero-shot reasoning capabili…</description>
</item>
<item>
<title>EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation</title>
<link>../papers/arxiv-3a3bbc2e6e3a.html</link>
<guid>https://arxiv.org/abs/2604.19105v1#2026-04-22#vision</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natur…</description>
</item>
<item>
<title>AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model</title>
<link>../papers/arxiv-16e9de02e970.html</link>
<guid>https://arxiv.org/abs/2604.19747v1#2026-04-22#vision</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves e…</description>
</item>
<item>
<title>CityRAG: Stepping Into a City via Spatially-Grounded Video Generation</title>
<link>../papers/arxiv-1b780a279c11.html</link>
<guid>https://arxiv.org/abs/2604.19741v1#2026-04-22#vision</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we pr…</description>
</item>
<item>
<title>AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation</title>
<link>../papers/arxiv-1d3e8a90a79b.html</link>
<guid>https://arxiv.org/abs/2604.18348v1#2026-04-21#vision</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity-preserving clustering meth…</description>
</item>
<item>
<title>DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery</title>
<link>../papers/arxiv-18b224b59dda.html</link>
<guid>https://arxiv.org/abs/2604.18201v1#2026-04-21#vision</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Diffusion models have emerged as powerful tools for a wide range of vision tasks, including text-guided image generation and editing. In this work, we explore their potential for object grounding in remote sensing imagery. We propose a hybrid pipeline that integrates diffusion-based localization cues with state-of-the-art segmentation models such as RemoteSAM and SAM3 to obtain more accurate bounding boxes. By leveraging the complementary strengths of generative diffusion models and foundationa…</description>
</item>
<item>
<title>Weakly-Supervised Referring Video Object Segmentation through Text Supervision</title>
<link>../papers/arxiv-ccd0dd55c2f1.html</link>
<guid>https://arxiv.org/abs/2604.17797v1#2026-04-21#vision</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Referring video object segmentation (RVOS) aims to segment the target instance in a video referred to by a text expression. Conventional approaches mostly rely on supervised learning, requiring expensive pixel-level mask annotations. To reduce this cost, weakly-supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which are, however, still costly and labor-intensive. In this paper, we design a novel weakly-supervised RVOS method, namely WSRVOS, to train the mode…</description>
</item>
<item>
<title>AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation</title>
<link>../papers/arxiv-28c2e2bc1523.html</link>
<guid>https://arxiv.org/abs/2604.18562v1#2026-04-21#vision</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Reasoning segmentation requires models to ground complex, implicit textual queries into precise pixel-level masks. Existing approaches rely on a single segmentation token &lt;SEG&gt;, whose hidden state implicitly encodes both semantic reasoning and spatial localization, limiting the model&#x27;s ability to explicitly disentangle what to segment from where to segment. We introduce AnchorSeg, which reformulates reasoning segmentation as a structured conditional generation process over image toke…</description>
</item>
<item>
<title>UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models</title>
<link>../papers/arxiv-e5254900d751.html</link>
<guid>https://arxiv.org/abs/2604.18518v1#2026-04-21#vision</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accura…</description>
</item>
<item>
<title>Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models</title>
<link>../papers/arxiv-c7fa0d917c8c.html</link>
<guid>https://arxiv.org/abs/2604.18429v1#2026-04-21#vision</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3…</description>
</item>
<item>
<title>One-Step Diffusion with Inverse Residual Fields for Unsupervised Industrial Anomaly Detection</title>
<link>../papers/arxiv-51a89c3cb173.html</link>
<guid>https://arxiv.org/abs/2604.18393v1#2026-04-21#vision</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Diffusion models have achieved outstanding performance in unsupervised industrial anomaly detection (uIAD) by learning a manifold of normal data under the common assumption that off-manifold anomalies are harder to generate, resulting in larger reconstruction errors in data space or lower probability densities in the tractable latent space. However, their iterative denoising and noising nature leads to slow inference. In this paper, we propose OSD-IRF, a novel one-step diffusion with inverse re…</description>
</item>
<item>
<title>DSA-CycleGAN: A Domain Shift Aware CycleGAN for Robust Multi-Stain Glomeruli Segmentation</title>
<link>../papers/arxiv-d4a5ecc75743.html</link>
<guid>https://arxiv.org/abs/2604.18368v1#2026-04-21#vision</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>A key challenge in digital histopathology segmentation is inter- and intra-stain variation, which reduces model performance. Labelling each stain is expensive and time-consuming, so methods using stain transfer via CycleGAN have been developed for training multi-stain segmentation models using labels from a single stain. Nevertheless, CycleGAN tends to introduce noise during translation because of the one-to-many nature of some stain pairs, which conflicts with its cycle consistency loss. To…</description>
</item>
<item>
<title>OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation</title>
<link>../papers/arxiv-1f905a979d75.html</link>
<guid>https://arxiv.org/abs/2604.18326v1#2026-04-21#vision</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individu…</description>
</item>
<item>
<title>Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection</title>
<link>../papers/arxiv-d700189c3374.html</link>
<guid>https://arxiv.org/abs/2604.18313v1#2026-04-21#vision</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video contents, inevitably introducing semantic noise and misleading cross-modal alignment. To address this challenge,…</description>
</item>
<item>
<title>SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation</title>
<link>../papers/arxiv-23036fba0e62.html</link>
<guid>https://arxiv.org/abs/2604.15271v1#2026-04-17#vision</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present SegWithU, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty…</description>
</item>
<item>
<title>Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization</title>
<link>../papers/arxiv-5879454db7c6.html</link>
<guid>https://arxiv.org/abs/2604.15196v1#2026-04-17#vision</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily…</description>
</item>
<item>
<title>Boundary-Centric Active Learning for Temporal Action Segmentation</title>
<link>../papers/arxiv-d84f4cfc7217.html</link>
<guid>https://arxiv.org/abs/2604.15173v1#2026-04-17#vision</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Temporal action segmentation (TAS) demands dense temporal supervision, yet most of the annotation cost in untrimmed videos is spent identifying and refining action transitions, where segmentation errors concentrate and small temporal shifts disproportionately degrade segmental metrics. We introduce B-ACT, a clip-budgeted active learning framework that explicitly allocates supervision to these high-leverage boundary regions. B-ACT operates in a hierarchical two-stage loop: (i) it ranks and queri…</description>
</item>
<item>
<title>An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation</title>
<link>../papers/arxiv-fb4972e283bf.html</link>
<guid>https://arxiv.org/abs/2604.15171v1#2026-04-17#vision</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker-Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as…</description>
</item>
<item>
<title>RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework</title>
<link>../papers/arxiv-27b12af34ce1.html</link>
<guid>https://arxiv.org/abs/2604.15308v1#2026-04-17#vision</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop plann…</description>
</item>
<item>
<title>Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation</title>
<link>../papers/arxiv-93a3ec1f4d6b.html</link>
<guid>https://arxiv.org/abs/2604.15003v1#2026-04-17#vision</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present Flow…</description>
</item>
<item>
<title>RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models</title>
<link>../papers/arxiv-4a4068542625.html</link>
<guid>https://arxiv.org/abs/2604.14951v1#2026-04-17#vision</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world s…</description>
</item>
<item>
<title>Efficient Search of Implantable Adaptive Cells for Medical Image Segmentation</title>
<link>../papers/arxiv-88d2221df05a.html</link>
<guid>https://arxiv.org/abs/2604.14849v1#2026-04-17#vision</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Purpose: Adaptive skip modules can improve medical image segmentation, but searching for them is computationally costly. Implantable Adaptive Cells (IACs) are compact NAS modules inserted into U-Net skip connections, reducing the search space compared with full-network NAS. However, the original IAC framework still requires a 200-epoch differentiable search for each backbone and dataset. Methods: We analyzed the temporal behavior of operations and edges within IAC cells during differentiable se…</description>
</item>
<item>
<title>From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation</title>
<link>../papers/arxiv-0f86f7993414.html</link>
<guid>https://arxiv.org/abs/2604.14805v1#2026-04-17#vision</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Grain-edge segmentation (GES) and lithology semantic segmentation (LSS) are two pivotal tasks for quantifying rock fabric and composition. However, these two tasks are often treated separately, and segmentation quality remains unsatisfactory even though expensive, time-consuming, expert-annotated datasets have been used. Recently, foundation models, especially the Segment Anything Model (SAM), have demonstrated impressive robustness for boundary alignment. However, directly adapting SAM to joint GES…</description>
</item>
<item>
<title>ROSE: Retrieval-Oriented Segmentation Enhancement</title>
<link>../papers/arxiv-e008501b0fb5.html</link>
<guid>https://arxiv.org/abs/2604.14147v1#2026-04-16#vision</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model&#x27;s knowledge but demand up-to-date externa…</description>
</item>
<item>
<title>Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models</title>
<link>../papers/arxiv-edb7485d7898.html</link>
<guid>https://arxiv.org/abs/2604.14044v1#2026-04-16#vision</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental &quot;temporal blindness&quot;. Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and…</description>
</item>
<item>
<title>Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding</title>
<link>../papers/arxiv-60b6a5d36d13.html</link>
<guid>https://arxiv.org/abs/2604.13540v1#2026-04-16#vision</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation capability. This mismatch indicates that the model&#x27;s rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human &quot;Thinking-While-Drawing&quot; paradigm, where…</description>
</item>
<item>
<title>DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer</title>
<link>../papers/arxiv-dd5cdeb4155f.html</link>
<guid>https://arxiv.org/abs/2604.13509v1#2026-04-16#vision</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Recent advances in video generation models have significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to app…</description>
</item>
<item>
<title>Seedance 2.0: Advancing Video Generation for World Complexity</title>
<link>../papers/arxiv-fb1144e05ca6.html</link>
<guid>https://arxiv.org/abs/2604.14148v1#2026-04-16#vision</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. This allows it to support four input modalities: text, image, audio, and video, by integrating one of the most comprehensive suites of multi-modal content reference and editing capabilities ava…</description>
</item>
<item>
<title>POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch</title>
<link>../papers/arxiv-cb64e4ef9ea1.html</link>
<guid>https://arxiv.org/abs/2604.14029v1#2026-04-16#vision</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch.…</description>
</item>
<item>
<title>Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework</title>
<link>../papers/arxiv-93e6775d3e71.html</link>
<guid>https://arxiv.org/abs/2604.13994v1#2026-04-16#vision</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Generative diffusion priors have recently achieved state-of-the-art performance in natural image super-resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to remote sensing image super-resolution (RSISR) reveals significant shortcomings. Unlike natural images, remote sensing images exhibit a unique texture distribution where ground objects are globally stochastic yet locally clustered, leading to highly imbalanced textures. Thi…</description>
</item>
<item>
<title>Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model</title>
<link>../papers/arxiv-78d126f125fb.html</link>
<guid>https://arxiv.org/abs/2604.13906v1#2026-04-16#vision</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Bitstream-corrupted video recovery aims to restore realistic content degraded during video storage or transmission. Existing methods typically assume that predefined masks of corrupted regions are available, but manually annotating these masks is labor-intensive and impractical in real-world scenarios. To address this limitation, we introduce a new blind video recovery setting that removes the reliance on predefined masks. This setting presents two major challenges: accurately identifying corru…</description>
</item>
<item>
<title>PBE-UNet: A Lightweight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation</title>
<link>../papers/arxiv-f0b69cb6a500.html</link>
<guid>https://arxiv.org/abs/2604.13791v1#2026-04-16#vision</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Accurate lesion segmentation in ultrasound images is essential for preventive screening and clinical diagnosis, yet remains challenging due to low contrast, blurry boundaries, and significant scale variations. Although existing deep learning-based methods have achieved remarkable performance, these methods still struggle with scale variations and indistinct tumor boundaries. To address these challenges, we propose a progressive boundary-enhanced U-Net (PBE-UNet). Specifically, we first introduce a…</description>
</item>
<item>
<title>Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation</title>
<link>../papers/arxiv-1315c3054cdc.html</link>
<guid>https://arxiv.org/abs/2604.13761v1#2026-04-16#vision</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed-forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine-grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser,…</description>
</item>
<item>
<title>RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation</title>
<link>../papers/arxiv-3027c60dff03.html</link>
<guid>https://arxiv.org/abs/2604.12319v1#2026-04-15#vision</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Multimodal semantic segmentation has emerged as a powerful paradigm for enhancing scene understanding by leveraging complementary information from multiple sensing modalities (e.g., RGB, depth, and thermal). However, existing cross-modal fusion methods often implicitly assume that all modalities are equally reliable, which can lead to feature degradation when auxiliary modalities are noisy, misaligned, or incomplete. In this paper, we revisit cross-modal fusion from the perspective of modality…</description>
</item>
<item>
<title>All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding</title>
<link>../papers/arxiv-ba711ee91078.html</link>
<guid>https://arxiv.org/abs/2604.12335v1#2026-04-15#vision</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Training multimodal large language models (MLLMs) for video understanding requires large-scale annotated data spanning diverse tasks such as object counting, question answering, and segmentation. However, collecting and annotating multimodal video data in the real world is costly, slow, and inherently limited in diversity and coverage. To address this challenge, we propose a unified synthetic data generation pipeline capable of automatically producing unlimited multimodal video data with rich and d…</description>
</item>
<item>
<title>Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation</title>
<link>../papers/arxiv-b480ff0cabeb.html</link>
<guid>https://arxiv.org/abs/2604.12970v1#2026-04-15#vision</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises from modality heterogeneity: many clinical sites possess only a subset of modalities due to resource constraints or workflow variations. Existing approaches address this through feature imputation networks that synthesize missing modality representations, yet these methods produce point estimates without reliability measures, forcing downs…</description>
</item>
<item>
<title>AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation</title>
<link>../papers/arxiv-7f4fd8c173f5.html</link>
<guid>https://arxiv.org/abs/2604.12969v1#2026-04-15#vision</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Computational phantoms are widely used in medical imaging research, yet current systems for generating controlled, clinically meaningful anatomical variations remain limited. We present AbdomenGen, a sequential volume-conditioned diffusion framework for controllable abdominal anatomy generation. We introduce the Volume Control Scalar (VCS), a standardized residual that decouples organ size from body habitus, enabling interpretable volume modulation. Organ masks are synthesized sequentiall…</description>
</item>
<item>
<title>Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation</title>
<link>../papers/arxiv-f59d4565b4bc.html</link>
<guid>https://arxiv.org/abs/2604.12918v1#2026-04-15#vision</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Bird&#x27;s-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while s…</description>
</item>
<item>
<title>Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks</title>
<link>../papers/arxiv-df5064c793f1.html</link>
<guid>https://arxiv.org/abs/2604.12833v1#2026-04-15#vision</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Suc…</description>
</item>
<item>
<title>Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models</title>
<link>../papers/arxiv-3b9fc3f09edf.html</link>
<guid>https://arxiv.org/abs/2604.12832v1#2026-04-15#vision</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label er…</description>
</item>
<item>
<title>Generative Refinement Networks for Visual Synthesis</title>
<link>../papers/arxiv-3ad8f789dcec.html</link>
<guid>https://arxiv.org/abs/2604.13030v1#2026-04-15#vision</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>While diffusion models dominate the field of visual generation, they are computationally inefficient, applying uniform computational effort regardless of sample complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these…</description>
</item>
<item>
<title>Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images</title>
<link>../papers/arxiv-9e7d5e45d310.html</link>
<guid>https://arxiv.org/abs/2604.12781v1#2026-04-15#vision</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zer…</description>
</item>
<item>
<title>OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation</title>
<link>../papers/arxiv-c9ca67ecc0bf.html</link>
<guid>https://arxiv.org/abs/2604.11804v1#2026-04-14#vision</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an…</description>
</item>
<item>
<title>LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation</title>
<link>../papers/arxiv-be45283d75a9.html</link>
<guid>https://arxiv.org/abs/2604.11789v1#2026-04-14#vision</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled fram…</description>
</item>
</channel>
</rss>
