YingVideo-MV: Music-Driven Multi-Stage Video Generation

AI Lab, Giant Network

Introduction Video

Abstract

We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot-planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to automatically synthesize high-quality music performance videos from audio signals. To support diverse, high-quality results, we construct a large-scale Music-in-the-Wild Dataset from web data. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into the latent noise. To improve continuity between clips during long-sequence inference, we further propose a time-aware dynamic window strategy that adaptively adjusts the denoising range based on audio embeddings. Comprehensive benchmarks demonstrate that YingVideo-MV achieves outstanding performance in generating coherent, expressive music videos with precise audio-motion-camera synchronization.
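As a rough illustration of the time-aware dynamic window idea, the sketch below scales a per-clip denoising window with the energy of the current audio embedding. All names, the energy heuristic, and the window bounds are assumptions for illustration; the paper's actual strategy may differ.

```python
import numpy as np

def dynamic_window(audio_embed, min_w=8, max_w=32):
    """Hypothetical sketch: adapt the denoising window to audio activity.

    A high-energy audio segment (e.g. around a beat onset) gets a shorter
    window for tighter audio-motion synchronization; quieter segments get
    a longer window for smoother cross-clip continuity.
    """
    # RMS energy of the embedding, clamped to [0, 1] (assumed normalization).
    energy = float(np.linalg.norm(audio_embed) / np.sqrt(audio_embed.size))
    e = min(max(energy, 0.0), 1.0)
    # Interpolate: zero energy -> max_w frames, full energy -> min_w frames.
    w = int(round(max_w - e * (max_w - min_w)))
    return max(min_w, min(max_w, w))
```

For example, a silent segment (all-zero embedding) yields the maximum window, while a unit-RMS embedding yields the minimum.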

Long MV with High Character Consistency across Multiple Shots

High-performance Clip Videos

Camera Motion Videos

Pan Left

Pan Right

Zoom In

Zoom Out

Comparison with Other Methods

StableAvatar

InfiniteTalk

Ours

Framework Overview

YingVideo-MV Pipeline Overview

Our framework integrates multimodal inputs (music, text, and images) to enable segmented generation of music-performance portrait videos under the guidance of a global planning module. The planning agent invokes specialized tools according to sub-task requirements and, conditioned on initial-frame specifications, produces three core outputs: (1) high-fidelity music-performance portrait images, (2) coherent dynamic camera trajectories, and (3) synchronized audio sequences aligned with visual performance cues.
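The tool-invocation pattern described above can be sketched as a simple dispatch loop. The tool names, specs, and return shapes here are illustrative assumptions, not the paper's actual interfaces.

```python
# Hypothetical sketch of the planning agent's tool dispatch: each planned
# sub-task maps to a specialized generator (names assumed for illustration).

def portrait_tool(spec):
    # Would produce the high-fidelity first-frame portrait image.
    return {"image": f"portrait({spec})"}

def camera_tool(spec):
    # Would produce the dynamic camera trajectory for the clip.
    return {"trajectory": f"path({spec})"}

def audio_tool(spec):
    # Would produce the audio segment aligned with the visual performance.
    return {"audio": f"segment({spec})"}

TOOLS = {"portrait": portrait_tool, "camera": camera_tool, "audio": audio_tool}

def plan_and_invoke(subtasks):
    """Run each planned sub-task through its tool and collect the outputs."""
    outputs = {}
    for name, spec in subtasks:
        outputs.update(TOOLS[name](spec))
    return outputs
```

A call such as `plan_and_invoke([("portrait", "singer"), ("camera", "pan_left"), ("audio", "verse1")])` would yield the three conditioned outputs for one clip.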

BibTeX

@article{yingvideo2025,
  title={YingVideo-MV: Music-Driven Multi-Stage Video Generation},
  author={Giant AI Lab},
  year={2025}
}