Giant Network AI Lab, in collaboration with Tsinghua University SATLab and Northwestern Polytechnical University, has released three multimodal audio-video generation models: the music-driven video generation model YingVideo-MV, the zero-shot singing voice conversion model YingMusic-SVC, and the singing voice synthesis model YingMusic-Singer.

These models represent the team's latest progress in multimodal audio-video generation and will be open-sourced on platforms such as GitHub and HuggingFace. Among them, YingVideo-MV can generate a music video clip from nothing more than a piece of music and a single image of a performer. It performs multimodal analysis of the music's rhythm, emotion, and structure to keep camera movement tightly synchronized with the music, and it supports cinematic camera language such as zoom-in, zoom-out, panning, and tracking shots. A long-range temporal consistency mechanism further alleviates the distortion and frame-skipping artifacts that commonly appear in long videos.
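To make the rhythm-to-camera idea concrete, the sketch below extracts beat times and loudness from a track and turns them into simple camera keyframes. It is only an illustrative toy built on librosa, not YingVideo-MV's actual pipeline; the camera vocabulary and the beat-to-move mapping are invented for the example.

```python
# Illustrative toy only, NOT the YingVideo-MV pipeline: derive camera keyframes
# (which move, when, how strongly) from a track's beats and loudness.
import librosa
import numpy as np

def music_to_camera_keyframes(audio_path: str):
    y, sr = librosa.load(audio_path)

    # Beat positions (in seconds) anchor the boundaries of camera moves.
    _tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)

    # Short-term loudness decides how aggressive each move should be.
    rms = librosa.feature.rms(y=y)[0]
    rms_times = librosa.times_like(rms, sr=sr)

    moves = ["zoom_in", "zoom_out", "pan_left", "pan_right"]  # toy camera vocabulary
    keyframes = []
    for i, t in enumerate(beat_times):
        energy = float(np.interp(t, rms_times, rms))
        keyframes.append({
            "time": float(t),
            "move": moves[i % len(moves)],                     # alternate moves beat by beat
            "intensity": energy / (float(rms.max()) + 1e-8),   # louder music -> stronger move
        })
    return keyframes
```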

On the audio side, YingMusic-SVC focuses on the **"real-song usability"** of zero-shot singing voice conversion. Optimized for real-world music scenarios, it effectively suppresses interference from accompaniment, backing harmonies, and reverb, markedly reducing the risk of pitch drift and distorted high notes, and providing stable technical support for high-quality music re-creation.
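For readers unfamiliar with how such systems are usually structured, the outline below sketches one plausible zero-shot singing-voice-conversion pipeline for real songs: separate the lead vocal, extract the pitch contour, then re-synthesize it with a target-voice embedding. This is an assumption-laden sketch, not YingMusic-SVC's published design; `separate_vocals`, `encode_content`, `encode_speaker`, and `render_voice` are placeholder stubs standing in for real model components.

```python
# Hypothetical outline of a zero-shot singing voice conversion pipeline for
# real songs. NOT YingMusic-SVC's published code: the four stubs below stand
# in for model components that the sketch does not implement.
import librosa
import numpy as np

def separate_vocals(mix: np.ndarray, sr: int) -> np.ndarray:
    raise NotImplementedError("placeholder for a source-separation model")

def encode_content(vocal: np.ndarray, sr: int) -> np.ndarray:
    raise NotImplementedError("placeholder for a speaker-independent content encoder")

def encode_speaker(reference: np.ndarray, sr: int) -> np.ndarray:
    raise NotImplementedError("placeholder for a speaker/timbre encoder")

def render_voice(content, f0, voiced_flag, target_voice) -> np.ndarray:
    raise NotImplementedError("placeholder for the conversion decoder/vocoder")

def convert_song(mix_path: str, reference_voice_path: str, sr: int = 44100) -> np.ndarray:
    mix, _ = librosa.load(mix_path, sr=sr)
    reference, _ = librosa.load(reference_voice_path, sr=sr)

    # 1. Isolate the lead vocal so accompaniment, backing harmonies, and reverb
    #    do not leak into the converted voice.
    vocal = separate_vocals(mix, sr)

    # 2. Extract the pitch contour explicitly; conditioning on F0 helps avoid
    #    off-pitch output and broken high notes after conversion.
    f0, voiced_flag, _ = librosa.pyin(
        vocal, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )

    # 3. Speaker-independent content features plus a target-voice embedding taken
    #    from a short reference recording (zero-shot: no per-singer fine-tuning).
    content = encode_content(vocal, sr)
    target_voice = encode_speaker(reference, sr)

    # 4. Re-synthesize the vocal in the target timbre, keeping the original
    #    melody via the F0 contour.
    return render_voice(content, f0, voiced_flag, target_voice)
```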

The YingMusic-Singer singing synthesis model takes a given melody and arbitrary lyrics as input and generates natural singing with clear pronunciation and a stable melody. Its key strengths are the ability to flexibly adapt to lyrics of varying lengths and support for zero-shot voice cloning, which greatly improve the flexibility and practicality of AI singing in creative work and effectively lower the barrier to music creation.
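Adapting to lyrics of different lengths ultimately means aligning a variable number of syllables with a fixed melody. The toy function below shows one naive way to do that (holding a syllable across extra notes when the lyric is short); it is purely illustrative and is not the alignment method used by YingMusic-Singer.

```python
# Toy illustration only, not YingMusic-Singer's method: evenly map a variable
# number of lyric syllables onto a fixed melody.
from typing import List, Tuple

def align_lyrics_to_melody(syllables: List[str],
                           notes: List[Tuple[str, float]]) -> List[Tuple[str, str, float]]:
    """Return (syllable, pitch, duration) triples, one per note.

    Fewer syllables than notes: a syllable is held across several notes
    (a simple melisma). More syllables than notes: the syllable sequence is
    subsampled evenly so the melody length is still respected.
    """
    aligned = []
    for i, (pitch, duration) in enumerate(notes):
        # Map note index i proportionally into the syllable sequence.
        j = min(i * len(syllables) // len(notes), len(syllables) - 1)
        aligned.append((syllables[j], pitch, duration))
    return aligned

# 3 syllables over a 5-note melody: each syllable is sustained over ~2 notes.
print(align_lyrics_to_melody(
    ["shi", "ning", "star"],
    [("C4", 0.5), ("D4", 0.5), ("E4", 0.5), ("G4", 1.0), ("E4", 1.0)],
))
```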