- DeepMind’s V2A generates soundtracks and dialogue for videos using a diffusion model.
- Combined with video generation, it aims to revolutionize AI media.
- It takes on startups in the space with what DeepMind says are more advanced capabilities.
DeepMind, Google’s renowned AI research lab, has announced its latest groundbreaking development – an AI technology capable of generating soundtracks and dialogue for videos.
This innovative solution, dubbed V2A (short for “video-to-audio”), aims to revolutionize the AI-generated media landscape.
Bringing silence to life
While significant advancements have been made in video generation models, DeepMind recognizes the need for accompanying audio to truly bring these visuals to life.
The company emphasizes, “Video generation models are advancing rapidly, yet many current systems can only generate silent output.”
V2A technology emerges as a promising approach to address this limitation, enabling the creation of music, sound effects, and dialogue synchronized with the generated videos.
Not the first, but aiming higher
DeepMind’s V2A technology leverages a diffusion model trained on a combination of sounds, dialogue transcripts, and video clips.
By associating specific audio events with visual scenes and incorporating information from annotations or transcripts, the AI model learns to generate audio tracks that seamlessly complement the visuals.
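To make the idea concrete, below is a minimal sketch of how a video- and text-conditioned audio diffusion model might sample an audio track. Everything here is an illustrative assumption, not DeepMind's published design: the `Denoiser` module, the tensor shapes, the DDPM-style noise schedule, and the stand-in conditioning embedding are all hypothetical, since V2A's architecture has not been released.

```python
# Illustrative sketch of conditional audio diffusion, loosely in the
# spirit of V2A. All names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Predicts the noise in a noisy audio latent, given a timestep
    and a conditioning vector derived from video frames + transcript."""
    def __init__(self, audio_dim=128, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, audio_dim),
        )

    def forward(self, noisy_audio, t, cond):
        # t is normalized to [0, 1] and appended as a scalar feature.
        t_feat = t.expand(noisy_audio.shape[0], 1)
        return self.net(torch.cat([noisy_audio, cond, t_feat], dim=-1))

@torch.no_grad()
def sample(denoiser, cond, steps=50, audio_dim=128):
    """DDPM-style ancestral sampling: start from pure noise and
    iteratively denoise, conditioned on the video/text embedding."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(cond.shape[0], audio_dim)  # start from pure noise
    for i in reversed(range(steps)):
        t = torch.tensor([i / steps])
        eps = denoiser(x, t, cond)  # predicted noise at this step
        # Standard DDPM posterior mean estimate for the previous step.
        coef = betas[i] / torch.sqrt(1.0 - alpha_bars[i])
        mean = (x - coef * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[i]) * noise
    return x  # an audio latent, to be decoded to a waveform elsewhere

# In V2A the conditioning would come from learned video and transcript
# encoders; a random embedding stands in for illustration here.
video_and_text_embedding = torch.randn(1, 64)
audio_latent = sample(Denoiser(), video_and_text_embedding)
print(audio_latent.shape)  # torch.Size([1, 128])
```

In a real system of this kind, the conditioning vector would be produced by trained video and text encoders, and the sampled latent would pass through a decoder to yield an actual waveform; the toy denoiser above only shows where those pieces plug into the diffusion loop.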
Additionally, DeepMind’s SynthID technology embeds an imperceptible watermark in V2A’s output to help guard against misuse such as deepfakes.
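SynthID’s actual watermarking scheme is not public, but the general idea behind audio watermarking can be illustrated with a classic spread-spectrum approach: embed a keyed, low-amplitude pseudorandom signal, then detect it later by correlation. The sketch below is a generic illustration of that concept, not SynthID’s method; the function names, key, and thresholds are all hypothetical.

```python
# Generic spread-spectrum watermarking sketch; NOT SynthID's scheme.
import numpy as np

def embed_watermark(audio: np.ndarray, key: int, strength: float = 0.05) -> np.ndarray:
    """Add a keyed pseudorandom sequence at low amplitude."""
    mark = np.random.default_rng(key).standard_normal(audio.shape)
    return audio + strength * mark

def detect_watermark(audio: np.ndarray, key: int, threshold: float = 0.025) -> bool:
    """Correlate against the keyed sequence; watermarked audio scores
    well above the correlation noise floor of unmarked audio."""
    mark = np.random.default_rng(key).standard_normal(audio.shape)
    score = float(np.dot(audio, mark)) / audio.size
    return score > threshold

audio = np.random.default_rng(0).standard_normal(16_000)  # 1 s of stand-in audio
marked = embed_watermark(audio, key=42)
print(detect_watermark(marked, key=42))  # expected: True
print(detect_watermark(audio, key=42))   # expected: False
```

Production systems like SynthID additionally need the mark to survive compression, re-encoding, and editing, which is what makes robust watermarking of generated media a hard problem in its own right.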
Notably, AI-powered sound-generating tools are not entirely new to the market. Startups like Stability AI and ElevenLabs have recently released similar solutions, while Microsoft has developed a model to create talking and singing videos from still images.
Platforms such as Pika and GenreX have also trained models to suggest appropriate music or sound effects for given video scenes.
However, DeepMind argues that V2A stands apart: it works directly from a video’s raw pixels and can synchronize generated audio with a clip automatically, with a text prompt being optional rather than required.