YingSound: Video-Guided Sound Effects Generation
with Multi-modal Chain-of-Thought Controls

Zihao Chen1, Haomin Zhang1, Xinhan Di1, Haoyu Wang1,3, Sizhe Shan1,3, Junjie Zheng1, Yunming Liang1, Yihan Fan1,4, Xinfa Zhu1,2,
Wenjie Tian1,2, Yihua Wang1, Chaofan Ding1, and Lei Xie2

1AI Lab, Giant Network
2ASLP Lab, Northwestern Polytechnical University
3Zhejiang University
4East China University of Science and Technology

[arXiv]

Contents

Promotional Video
Abstract
Method
V2A Generation Results Visualization
V2A Generation Examples
Audio Generation for Games
Audio Generation for Animation
Audio Generation for Real-World Videos
Audio Generation for Long Videos
Audio Generation for AI-Generated Videos
Audio Generation Comparison with Prior Work
Text Control

Promotional Video

Abstract

Generating sound effects for product-level videos, where only a small amount of labeled data is available across diverse scenes, requires producing high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model for video-guided sound generation that supports high-quality audio generation in few-shot settings. YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment between the audio and visual modalities; it builds a learnable audio-vision aggregator (AVA) that integrates high-resolution visual features with the corresponding audio features at multiple stages. The second module applies a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer-grained sound effects in few-shot settings. Finally, we present an industry-standard video-to-audio (V2A) dataset that covers diverse real-world scenarios. Automated evaluations and human studies show that YingSound generates high-quality, synchronized sounds across diverse conditional inputs.
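
The abstract describes a learnable audio-vision aggregator (AVA) that fuses visual features into audio features at multiple stages. The paper does not spell out the aggregator's internals here, but a common way to realize this kind of fusion is cross-attention from audio tokens to video-frame features with a residual connection. The sketch below is a hypothetical single-head, NumPy-only illustration of one such fusion stage; all names (`ava_fuse`, the weight matrices) are ours, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ava_fuse(audio_feats, visual_feats, W_q, W_k, W_v):
    """Hypothetical AVA stage: audio tokens attend to visual tokens via
    single-head cross-attention, with the result added back residually."""
    q = audio_feats @ W_q                          # queries from audio tokens
    k = visual_feats @ W_k                         # keys from frame features
    v = visual_feats @ W_v                         # values from frame features
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) # (audio, video) weights
    return audio_feats + attn @ v                  # residual fusion

rng = np.random.default_rng(1)
d = 16
audio = rng.standard_normal((10, d))               # 10 audio tokens
video = rng.standard_normal((25, d))               # 25 frame features
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = ava_fuse(audio, video, Wq, Wk, Wv)
print(fused.shape)  # → (10, 16)
```

Stacking such a stage at several depths of the generator would give the "multiple stages" of fusion the abstract refers to.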

Method

Overview of YingSound

Figure 1: Overview of YingSound.

YingSound comprises two key components: Conditional Flow Matching with Transformers, and Adaptive Multi-modal Chain-of-Thought-Based Audio Generation.
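
To make the first component concrete: in conditional flow matching, the network is trained to regress the velocity of a straight-line probability path from noise to data, given the conditioning signal. The NumPy sketch below shows one such training-loss computation under standard flow-matching assumptions; `toy_velocity_model` is a trivial stand-in for the paper's transformer, and all names are illustrative, not YingSound's actual implementation.

```python
import numpy as np

def toy_velocity_model(x_t, t, cond):
    # Hypothetical stand-in for the flow-matching transformer.
    return cond - x_t

def flow_matching_loss(x1, cond, rng):
    """One conditional-flow-matching loss evaluation on a batch.

    x1   : target audio latents, shape (batch, dim)
    cond : visual conditioning features, shape (batch, dim)
    """
    x0 = rng.standard_normal(x1.shape)        # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))    # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1             # linear interpolation path
    v_target = x1 - x0                        # straight-line velocity target
    v_pred = toy_velocity_model(x_t, t, cond)
    return float(np.mean((v_pred - v_target) ** 2))  # regression loss

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))
cond = rng.standard_normal((4, 8))
loss = flow_matching_loss(x1, cond, rng)
print(loss >= 0.0)  # → True
```

At inference time, one would integrate the learned velocity field from noise at t=0 to an audio latent at t=1, conditioned on the video features.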

V2A Generation Results Visualization

Balloon
Yellow Dog
Lion
Gun

Figure 2: Temporal Alignment Comparison.

V2A Generation Examples

Audio Generation for Games

Audio Generation for Animation

Audio Generation for Real-World Videos

Audio Generation for Long Videos

Audio Generation for AI-Generated Videos

Audio Generation Comparison with Prior Work

Ours
GT
FoleyCrafter
Diff-Foley

Text Control

Without Prompt
Prompt: motorcycle engine
Prompt: car horn
Without Prompt
Prompt: bird song
Without Prompt
Prompt: thunder
Without Prompt
Prompt: subway driving