YingMusic-SVC: Real-World Robust Zero-Shot
Singing Voice Conversion with Flow-GRPO and
Singing-Specific Inductive Biases

Gongyu Chen¹,∗, Xiaoyu Zhang¹,²,∗, Zhenqiang Weng¹,³, Junjie Zheng¹, Da Shen⁴, Chaofan Ding¹, Wei-Qiang Zhang⁴, Zihao Chen¹

¹AI Lab, GiantNetwork ²University College London (UCL) ³East China University of Science and Technology ⁴SATLab, Tsinghua University

∗Equal contribution

🐙GitHub Project 🤗Hugging Face Models 📄Paper

1. Overview
2. Abstract
3. Method
4. Audio Demos
5. Citation / BibTeX

1. Overview

Figure 1: Baseline failure modes on professionally produced songs and our improved, singing-oriented SVC pipeline.

2. Abstract

Singing voice conversion (SVC) aims to render the target singer’s timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre–content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness—especially under accompanied and harmony-contaminated conditions—demonstrating its effectiveness for real-world SVC deployment.
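To make the loss term mentioned in the abstract more concrete, here is a minimal sketch of a rectified flow matching objective with a per-bin weighting. The straight-line path and velocity target are the standard rectified flow construction. The weighting scheme (inverse mean energy per frequency bin, normalised to mean 1) is only an assumption used for illustration and is not taken from the paper; the function names `band_weights` and `weighted_flow_loss` are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Rectified flow regresses the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def band_weights(target_frames, eps=1e-8):
    """Hypothetical energy balancing (an assumption, not the paper's formula):
    weight each frequency bin by the inverse of its mean energy, then
    normalise so the weights average to 1."""
    n_bins = len(target_frames[0])
    n_frames = len(target_frames)
    energy = [sum(frame[k] ** 2 for frame in target_frames) / n_frames
              for k in range(n_bins)]
    raw = [1.0 / (e + eps) for e in energy]
    mean_raw = sum(raw) / n_bins
    return [w / mean_raw for w in raw]


def weighted_flow_loss(v_pred, v_target, weights):
    """Weighted mean squared error between predicted and target velocity."""
    total = sum(w * (p - q) ** 2 for w, p, q in zip(weights, v_pred, v_target))
    return total / len(weights)


# Toy usage: two bins, where the second (quieter) bin receives a larger weight.
frames = [[2.0, 1.0], [2.0, 1.0]]
weights = band_weights(frames)
x0, x1 = [0.0, 0.0], frames[0]
x_t = interpolate(x0, x1, 0.5)   # a model would take x_t and t as input
loss = weighted_flow_loss([0.0, 0.0], target_velocity(x0, x1), weights)
```

With uniform weights this reduces to the ordinary flow matching loss; the weighting only changes how much each bin contributes, which is one plausible way to keep low-energy high-frequency bins from being ignored.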

3. Method

Figure 2: Three-stage training framework of the proposed YingMusic-SVC model: continuous pre-training (CPT), robustness-oriented supervised fine-tuning (SFT), and Flow-GRPO reinforcement learning with multi-objective rewards.
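As a rough illustration of the reinforcement learning stage, the sketch below shows the group-relative advantage computation that GRPO-style methods use: several outputs are sampled for the same input, each receives a scalar reward, and rewards are standardised within the group. The reward names and weights in `combine_rewards` are hypothetical placeholders for the multi-objective rewards; the actual reward models and weighting used by YingMusic-SVC are not specified on this page.

```python
from statistics import mean, pstdev


def combine_rewards(scores, weights):
    """Weighted sum of per-objective scores (names and weights are assumptions)."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: three candidate conversions of the same source and reference.
weights = {"timbre": 0.4, "intelligibility": 0.3, "naturalness": 0.3}
group_scores = [
    {"timbre": 0.8, "intelligibility": 0.6, "naturalness": 0.7},
    {"timbre": 0.5, "intelligibility": 0.9, "naturalness": 0.6},
    {"timbre": 0.9, "intelligibility": 0.7, "naturalness": 0.8},
]
rewards = [combine_rewards(s, weights) for s in group_scores]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean get positive advantages and are reinforced, while those below the mean are discouraged; if all samples in a group score the same, every advantage is zero and the group contributes no gradient signal.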

4. Audio Demos

Each example contains four tracks: Source (original song), Reference (target timbre), Baseline (Seed-VC), and Ours (YingMusic-SVC-Full).
Please wear headphones for the best experience. All audio samples are for research demonstration only.

Sample 1: Source, Reference, Baseline, Ours

Sample 2: Source, Reference, Baseline, Ours

Sample 3: Source, Reference, Baseline, Ours

Sample 4: Source, Reference, Baseline, Ours

5. Citation

If you find YingMusic-SVC helpful in your research or product, please consider citing:

@article{chen2025yingmusicsvc,
  title={YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases},
  author={Chen, Gongyu and Zhang, Xiaoyu and Weng, Zhenqiang and Zheng, Junjie and Shen, Da and Ding, Chaofan and Zhang, Wei-Qiang and Chen, Zihao},
  journal={arXiv preprint arXiv:2512.04793},
  year={2025}
}
