Google DeepMind’s new AI tool uses video pixels and text prompts to generate soundtracks

The new video-to-audio tool will automatically match sounds to the appropriate scenes.

[Photo illustration of a brain on a circuit board. Illustration: Cath Virginia / The Verge | Photos: Getty Images]

Google DeepMind has taken the wraps off a new AI tool for generating video soundtracks. Rather than working from a text prompt alone, the tool also takes the contents of the video itself into account when generating audio.

By combining the two, DeepMind says the tool can create scenes with “a dramatic score, realistic sound effects or dialogue that matches the characters and tone of a video.” You can check out some of the examples posted on DeepMind’s website, and they sound pretty good.

For a video of a car driving through a cyberpunk-esque cityscape, Google used the prompt “cars skidding, car engine throttling, angelic electronic music” to generate audio. You can hear how the sounds of skidding match up with the car’s movement. Another example creates an underwater soundscape using the prompt “jellyfish pulsating under water, marine life, ocean.”

Users can include a text prompt, but DeepMind says it’s optional, and they don’t need to meticulously line up the generated audio with the appropriate scenes themselves. According to DeepMind, the tool can also generate an “unlimited” number of soundtracks for any given video, giving users an endless stream of audio options to choose from.

That could help it stand out from other AI tools, like ElevenLabs’ sound effects generator, which relies on text prompts alone. It could also make it easier to pair audio with AI-generated video from tools like DeepMind’s Veo and OpenAI’s Sora (the latter of which is slated to eventually incorporate audio).

DeepMind says it trained its AI tool on video, audio, and annotations containing “detailed descriptions of sound and transcripts of spoken dialogue.” This allows the video-to-audio generator to match audio events with visual scenes.
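DeepMind hasn’t published its training data format, but as a rough illustration, a paired training record along the lines the company describes might look something like the sketch below. Every name and field here is hypothetical; it simply mirrors the ingredients DeepMind says it trained on (video, audio, sound descriptions, and dialogue transcripts).

```python
from dataclasses import dataclass

@dataclass
class V2ATrainingExample:
    """Hypothetical shape of one video-to-audio training record.

    DeepMind has not published its actual data format; these fields
    only reflect the inputs described in its announcement.
    """
    video_clip: bytes          # encoded video frames (the visual input)
    audio_track: bytes         # the matching soundtrack (the target output)
    sound_description: str     # annotation describing the sounds in the clip
    dialogue_transcript: str   # transcript of any spoken dialogue

# An illustrative record pairing street-racing footage with its audio:
example = V2ATrainingExample(
    video_clip=b"...",   # placeholder, not real video data
    audio_track=b"...",  # placeholder, not real audio data
    sound_description="cars skidding, car engine throttling",
    dialogue_transcript="",  # no speech in this clip
)
```

Pairing each clip with both a sound description and a transcript is what would let a model learn which audio events belong to which visual scenes, which is the association the article describes.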

The tool still has some limitations. For example, DeepMind is trying to improve its ability to synchronize lip movement with dialogue, as you can see in this video of a claymation family. DeepMind also notes that its video-to-audio system is dependent on video quality, so anything that’s grainy or distorted “can lead to a noticeable drop in audio quality.”

DeepMind’s tool isn’t generally available just yet, as it will still have to undergo “rigorous safety assessments and testing.” When it does become available, its audio output will include Google’s SynthID watermark to flag that it’s AI-generated.