<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>diffusion Topic Archive</title>
<link>diffusion.html</link>
<description>Long-term tracking RSS feed for the keyword "diffusion", aggregating all historically matched papers.</description>
<language>en-us</language>
<lastBuildDate>Wed, 22 Apr 2026 03:37:20 +0000</lastBuildDate>
<item>
<title>Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval</title>
<link>../papers/arxiv-39272031a7a0.html</link>
<guid>https://arxiv.org/abs/2604.19135v1#2026-04-22#diffusion</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage…</description>
</item>
<item>
<title>ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis</title>
<link>../papers/arxiv-9db4212c18cd.html</link>
<guid>https://arxiv.org/abs/2604.19720v1#2026-04-22#diffusion</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal cons…</description>
</item>
<item>
<title>MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation</title>
<link>../papers/arxiv-2fdaceb58972.html</link>
<guid>https://arxiv.org/abs/2604.19679v1#2026-04-22#diffusion</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMCon…</description>
</item>
<item>
<title>MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention</title>
<link>../papers/arxiv-7e457ae682db.html</link>
<guid>https://arxiv.org/abs/2604.19675v1#2026-04-22#diffusion</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient deterministic generation without relying on stochastic diffusion processes. While generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing approaches are predominantly built upon diffusion models, which incur substantial computational overhead due to iterative sampling and are…</description>
</item>
<item>
<title>RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation</title>
<link>../papers/arxiv-ab2855fef1f1.html</link>
<guid>https://arxiv.org/abs/2604.19570v1#2026-04-22#diffusion</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusi…</description>
</item>
<item>
<title>EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation</title>
<link>../papers/arxiv-3a3bbc2e6e3a.html</link>
<guid>https://arxiv.org/abs/2604.19105v1#2026-04-22#diffusion</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natur…</description>
</item>
<item>
<title>AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model</title>
<link>../papers/arxiv-16e9de02e970.html</link>
<guid>https://arxiv.org/abs/2604.19747v1#2026-04-22#diffusion</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves e…</description>
</item>
<item>
<title>AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation</title>
<link>../papers/arxiv-1d3e8a90a79b.html</link>
<guid>https://arxiv.org/abs/2604.18348v1#2026-04-21#diffusion</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity-preserving clustering meth…</description>
</item>
<item>
<title>DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery</title>
<link>../papers/arxiv-18b224b59dda.html</link>
<guid>https://arxiv.org/abs/2604.18201v1#2026-04-21#diffusion</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Diffusion models have emerged as powerful tools for a wide range of vision tasks, including text-guided image generation and editing. In this work, we explore their potential for object grounding in remote sensing imagery. We propose a hybrid pipeline that integrates diffusion-based localization cues with state-of-the-art segmentation models such as RemoteSAM and SAM3 to obtain more accurate bounding boxes. By leveraging the complementary strengths of generative diffusion models and foundationa…</description>
</item>
<item>
<title>UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models</title>
<link>../papers/arxiv-e5254900d751.html</link>
<guid>https://arxiv.org/abs/2604.18518v1#2026-04-21#diffusion</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>The Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accura…</description>
</item>
<item>
<title>One-Step Diffusion with Inverse Residual Fields for Unsupervised Industrial Anomaly Detection</title>
<link>../papers/arxiv-51a89c3cb173.html</link>
<guid>https://arxiv.org/abs/2604.18393v1#2026-04-21#diffusion</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Diffusion models have achieved outstanding performance in unsupervised industrial anomaly detection (uIAD) by learning a manifold of normal data under the common assumption that off-manifold anomalies are harder to generate, resulting in larger reconstruction errors in data space or lower probability densities in the tractable latent space. However, their iterative denoising and noising nature leads to slow inference. In this paper, we propose OSD-IRF, a novel one-step diffusion with inverse re…</description>
</item>
<item>
<title>Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection</title>
<link>../papers/arxiv-d700189c3374.html</link>
<guid>https://arxiv.org/abs/2604.18313v1#2026-04-21#diffusion</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video contents, inevitably introducing semantic noise and misleading cross-modal alignment. To address this challenge,…</description>
</item>
<item>
<title>An Analysis of Regularization and Fokker-Planck Residuals in Diffusion Models for Image Generation</title>
<link>../papers/arxiv-fb4972e283bf.html</link>
<guid>https://arxiv.org/abs/2604.15171v1#2026-04-17#diffusion</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>Recent work has shown that diffusion models trained with the denoising score matching (DSM) objective often violate the Fokker-Planck (FP) equation that governs the evolution of the true data density. Directly penalizing these deviations in the objective function reduces their magnitude but introduces a significant computational overhead. It is also observed that enforcing strict adherence to the FP equation does not necessarily lead to improvements in the quality of the generated samples, as…</description>
</item>
<item>
<title>RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework</title>
<link>../papers/arxiv-27b12af34ce1.html</link>
<guid>https://arxiv.org/abs/2604.15308v1#2026-04-17#diffusion</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop plann…</description>
</item>
<item>
<title>Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding</title>
<link>../papers/arxiv-60b6a5d36d13.html</link>
<guid>https://arxiv.org/abs/2604.13540v1#2026-04-16#diffusion</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human "Thinking-While-Drawing" paradigm, where…</description>
</item>
<item>
<title>DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer</title>
<link>../papers/arxiv-dd5cdeb4155f.html</link>
<guid>https://arxiv.org/abs/2604.13509v1#2026-04-16#diffusion</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Recent advances in video generation models have significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to app…</description>
</item>
<item>
<title>Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework</title>
<link>../papers/arxiv-93e6775d3e71.html</link>
<guid>https://arxiv.org/abs/2604.13994v1#2026-04-16#diffusion</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Generative diffusion priors have recently achieved state-of-the-art performance in natural image super-resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to remote sensing image super-resolution (RSISR) reveals significant shortcomings. Unlike natural images, remote sensing images exhibit a unique texture distribution where ground objects are globally stochastic yet locally clustered, leading to highly imbalanced textures. Thi…</description>
</item>
<item>
<title>Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model</title>
<link>../papers/arxiv-78d126f125fb.html</link>
<guid>https://arxiv.org/abs/2604.13906v1#2026-04-16#diffusion</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Bitstream-corrupted video recovery aims to restore realistic content degraded during video storage or transmission. Existing methods typically assume that predefined masks of corrupted regions are available, but manually annotating these masks is labor-intensive and impractical in real-world scenarios. To address this limitation, we introduce a new blind video recovery setting that removes the reliance on predefined masks. This setting presents two major challenges: accurately identifying corru…</description>
</item>
<item>
<title>AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation</title>
<link>../papers/arxiv-7f4fd8c173f5.html</link>
<guid>https://arxiv.org/abs/2604.12969v1#2026-04-15#diffusion</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Computational phantoms are widely used in medical imaging research, yet current systems for generating controlled, clinically meaningful anatomical variations remain limited. We present AbdomenGen, a sequential volume-conditioned diffusion framework for controllable abdominal anatomy generation. We introduce the Volume Control Scalar (VCS), a standardized residual that decouples organ size from body habitus, enabling interpretable volume modulation. Organ masks are synthesized sequentiall…</description>
</item>
<item>
<title>Generative Refinement Networks for Visual Synthesis</title>
<link>../papers/arxiv-3ad8f789dcec.html</link>
<guid>https://arxiv.org/abs/2604.13030v1#2026-04-15#diffusion</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>While diffusion models dominate the field of visual generation, they are computationally inefficient, applying uniform computational effort regardless of content complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these…</description>
</item>
<item>
<title>Fragile Reconstruction: Adversarial Vulnerability of Reconstruction-Based Detectors for Diffusion-Generated Images</title>
<link>../papers/arxiv-9e7d5e45d310.html</link>
<guid>https://arxiv.org/abs/2604.12781v1#2026-04-15#diffusion</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention due to their potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods exhibit severe security vulnerabilities to adversarial perturbations; that is, by adding imperceptible adversarial perturbations to input images, the detection accuracy of classifiers collapses to near zer…</description>
</item>
<item>
<title>VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model.</title>
<link>../papers/doi-c5e38020b821.html</link>
<guid>https://pubmed.ncbi.nlm.nih.gov/41979962/#2026-04-15#diffusion</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are accompanied by concerns about biased outputs, a challenge that has yet to be thoroughly explored. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single questioning format and narrow sources of bias. To address this problem, we introduce VLBiasBench, a comprehensive benchmar…</description>
</item>
<item>
<title>Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving</title>
<link>../papers/arxiv-9e98a0ea67e8.html</link>
<guid>https://arxiv.org/abs/2604.11734v1#2026-04-14#diffusion</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion p…</description>
</item>
<item>
<title>Anthropogenic Regional Adaptation in Multimodal Vision-Language Model</title>
<link>../papers/arxiv-bb16b976d1a8.html</link>
<guid>https://arxiv.org/abs/2604.11490v1#2026-04-14#diffusion</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalizat…</description>
</item>
<item>
<title>GazeVaLM: A Multi-Observer Eye-Tracking Benchmark for Evaluating Clinical Realism in AI-Generated X-Rays</title>
<link>../papers/doi-797bb9dad901.html</link>
<guid>https://arxiv.org/abs/2604.11653v1#2026-04-14#diffusion</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>We introduce GazeVaLM, a public eye-tracking dataset for studying clinical perception during chest radiograph authenticity assessment. The dataset comprises 960 gaze recordings from 16 expert radiologists interpreting 30 real and 30 synthetic chest X-rays (generated by diffusion-based generative AI) under two conditions: diagnostic assessment and real-fake classification (Visual Turing test). For each image-observer pair, we provide raw gaze samples, fixation maps, scanpaths, saliency density m…</description>
</item>
<item>
<title>Progressively Texture-Aware Diffusion for Contrast-Enhanced Sparse-View CT</title>
<link>../papers/arxiv-60db98bd1c22.html</link>
<guid>https://arxiv.org/abs/2604.11559v1#2026-04-14#diffusion</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>Diffusion-based sparse-view CT (SVCT) imaging has achieved remarkable advancements in recent years, thanks to its more stable generative capability. However, recovering reliable image content and visually consistent textures is still a crucial challenge. In this paper, we present a Progressively Texture-aware Diffusion (PTD) model, a coarse-to-fine learning framework tailored for SVCT. Specifically, PTD comprises a basic reconstructive module PTD_rec and a conditional diffusion mod…</description>
</item>
<item>
<title>DiffHDR: Re-Exposing LDR Videos with Video Diffusion Models</title>
<link>../papers/arxiv-5fffffb294f8.html</link>
<guid>https://arxiv.org/abs/2604.06161v1#2026-04-08#diffusion</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>Most digital videos are stored in 8-bit low dynamic range (LDR) formats, where much of the original high dynamic range (HDR) scene radiance is lost due to saturation and quantization. This loss of highlight and shadow detail precludes mapping accurate luminance to HDR displays and limits meaningful re-exposure in post-production workflows. Although techniques have been proposed to convert LDR images to HDR through dynamic range expansion, they struggle to restore realistic detail in the over- a…</description>
</item>
<item>
<title>SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation</title>
<link>../papers/arxiv-55e346c2aa3c.html</link>
<guid>https://arxiv.org/abs/2604.06113v1#2026-04-08#diffusion</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either rely on image or video generative models distilled to 3D space, harming geometric coherence and restricting rendering to training views, or are limited to small-scale 3D scene or object-centric generation. In this work, we propose a 3D generative framework based on the Σ-Voxfield grid, a discrete representation wher…</description>
</item>
<item>
<title>HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation</title>
<link>../papers/arxiv-96f82d9e2dc4.html</link>
<guid>https://arxiv.org/abs/2604.05961v1#2026-04-08#diffusion</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motions faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances human motion control with three key designs: 1) Articulated motion-consistent noise sampling that correlates the spatiotemporal distribution of latent noise and replaces the unstructured random Gaussian noise with 3D articu…</description>
</item>
</channel>
</rss>
