<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>video generation Topic Archive</title>
<link>video-generation.html</link>
<description>Long-term tracking RSS feed for the keyword "video generation", aggregating all historically matched papers.</description>
<language>en-US</language>
<lastBuildDate>Wed, 22 Apr 2026 03:37:20 +0000</lastBuildDate>
<item>
<title>ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis</title>
<link>../papers/arxiv-9db4212c18cd.html</link>
<guid>https://arxiv.org/abs/2604.19720v1#2026-04-22#video-generation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal cons…</description>
</item>
<item>
<title>MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation</title>
<link>../papers/arxiv-2fdaceb58972.html</link>
<guid>https://arxiv.org/abs/2604.19679v1#2026-04-22#video-generation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMCon…</description>
</item>
<item>
<title>How Far Are Video Models from True Multimodal Reasoning?</title>
<link>../papers/arxiv-f1cd701c6156.html</link>
<guid>https://arxiv.org/abs/2604.19193v1#2026-04-22#video-generation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models&#x27; zero-shot reasoning capabili…</description>
</item>
<item>
<title>CityRAG: Stepping Into a City via Spatially-Grounded Video Generation</title>
<link>../papers/arxiv-1b780a279c11.html</link>
<guid>https://arxiv.org/abs/2604.19741v1#2026-04-22#video-generation</guid>
<pubDate>Wed, 22 Apr 2026 11:37:03 +0800</pubDate>
<description>We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we pr…</description>
</item>
<item>
<title>AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation</title>
<link>../papers/arxiv-1d3e8a90a79b.html</link>
<guid>https://arxiv.org/abs/2604.18348v1#2026-04-21#video-generation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity-preserving clustering meth…</description>
</item>
<item>
<title>OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation</title>
<link>../papers/arxiv-1f905a979d75.html</link>
<guid>https://arxiv.org/abs/2604.18326v1#2026-04-21#video-generation</guid>
<pubDate>Tue, 21 Apr 2026 11:40:46 +0800</pubDate>
<description>Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individu…</description>
</item>
<item>
<title>Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation</title>
<link>../papers/arxiv-93a3ec1f4d6b.html</link>
<guid>https://arxiv.org/abs/2604.15003v1#2026-04-17#video-generation</guid>
<pubDate>Fri, 17 Apr 2026 11:39:21 +0800</pubDate>
<description>The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present Flow…</description>
</item>
<item>
<title>DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer</title>
<link>../papers/arxiv-dd5cdeb4155f.html</link>
<guid>https://arxiv.org/abs/2604.13509v1#2026-04-16#video-generation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Recent advances in video generation models have significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to app…</description>
</item>
<item>
<title>Seedance 2.0: Advancing Video Generation for World Complexity</title>
<link>../papers/arxiv-fb1144e05ca6.html</link>
<guid>https://arxiv.org/abs/2604.14148v1#2026-04-16#video-generation</guid>
<pubDate>Thu, 16 Apr 2026 11:43:00 +0800</pubDate>
<description>Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. This allows it to support four input modalities: text, image, audio, and video, by integrating one of the most comprehensive suites of multi-modal content reference and editing capabilities ava…</description>
</item>
<item>
<title>Generative Refinement Networks for Visual Synthesis</title>
<link>../papers/arxiv-3ad8f789dcec.html</link>
<guid>https://arxiv.org/abs/2604.13030v1#2026-04-15#video-generation</guid>
<pubDate>Wed, 15 Apr 2026 11:35:50 +0800</pubDate>
<description>While diffusion models dominate the field of visual generation, they are computationally inefficient, applying uniform computational effort regardless of content complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these…</description>
</item>
<item>
<title>OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation</title>
<link>../papers/arxiv-c9ca67ecc0bf.html</link>
<guid>https://arxiv.org/abs/2604.11804v1#2026-04-14#video-generation</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an…</description>
</item>
<item>
<title>HDR Video Generation via Latent Alignment with Logarithmic Encoding</title>
<link>../papers/arxiv-06948b88ac9a.html</link>
<guid>https://arxiv.org/abs/2604.11788v1#2026-04-14#video-generation</guid>
<pubDate>Tue, 14 Apr 2026 11:37:06 +0800</pubDate>
<description>High dynamic range (HDR) imagery offers a rich and faithful representation of scene radiance, but remains challenging for generative models due to its mismatch with the bounded, perceptually compressed data on which these models are trained. A natural solution is to learn new representations for HDR, which introduces additional complexity and data requirements. In this work, we show that HDR generation can be achieved in a much simpler way by leveraging the strong visual priors already captured…</description>
</item>
<item>
<title>Action Images: End-to-End Policy Learning via Multiview Video Generation</title>
<link>../papers/arxiv-7e7a70961341.html</link>
<guid>https://arxiv.org/abs/2604.06168v1#2026-04-08#video-generation</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model the future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model t…</description>
</item>
<item>
<title>OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control</title>
<link>../papers/arxiv-cc8e5d3d0950.html</link>
<guid>https://arxiv.org/abs/2604.06010v1#2026-04-08#video-generation</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>Video fundamentally intertwines two crucial axes: the dynamic content of a scene and the camera motion through which it is observed. However, existing generation models often entangle these factors, limiting independent control. In this work, we introduce OmniCamera, a unified framework designed to explicitly disentangle and command these two dimensions. This compositional approach enables flexible video generation by allowing arbitrary pairings of camera and content conditions, unlocking unpre…</description>
</item>
<item>
<title>HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation</title>
<link>../papers/arxiv-96f82d9e2dc4.html</link>
<guid>https://arxiv.org/abs/2604.05961v1#2026-04-08#video-generation</guid>
<pubDate>Wed, 08 Apr 2026 17:10:24 +0800</pubDate>
<description>Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motions faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances the human motion control with three key designs: 1) Articulated motion-consistent noise sampling that correlates the spatiotemporal distribution of latent noise and replaces the unstructured random Gaussian noise with 3D articu…</description>
</item>
</channel>
</rss>
