Whether it’s mingling at a party in the metaverse or watching a home movie in your living room while wearing augmented reality (AR) glasses, acoustics play a role in how these moments will be experienced. We are building mixed reality and virtual reality (VR) experiences like these, and we believe AI will be core to delivering sound quality that realistically matches the settings people are immersed in.
Today, Meta AI researchers, in collaboration with an audio specialist from Meta’s Reality Labs and researchers from the University of Texas at Austin, are open-sourcing three new models for audio-visual understanding of human speech and sounds in video, designed to accelerate our progress toward this reality.
We need AI models that understand a person’s physical surroundings based on both how they look and how things sound. For example, there’s a big difference between how a concert would sound in a large venue versus in your living room. That’s because the geometry of a physical space, the materials and surfaces in the area, and how close or far away the sound sources are all factor into how we hear audio.
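To make the role of geometry and materials concrete, here is a small Python sketch of Sabine’s classic reverberation-time formula, which estimates how long sound lingers in a space (RT60, the time for sound to decay by 60 dB) from the room’s volume and the absorption of its surfaces. The rooms and coefficients below are illustrative numbers, not measurements from this research.

# Minimal sketch of Sabine's reverberation formula. The room dimensions
# and absorption coefficients are illustrative placeholders.

def rt60_sabine(volume_m3, surfaces):
    """Estimate RT60 in seconds. surfaces: list of (area_m2, absorption_coefficient)."""
    total_absorption = sum(area * alpha for area, alpha in surfaces)
    return 0.161 * volume_m3 / total_absorption  # Sabine's equation (metric units)

# A small, carpeted living room vs. a large hall with harder surfaces.
living_room = rt60_sabine(60.0, [(40.0, 0.3), (50.0, 0.1)])
concert_hall = rt60_sabine(12000.0, [(3000.0, 0.25)])
print(f"living room RT60 ~ {living_room:.2f}s, concert hall RT60 ~ {concert_hall:.2f}s")

The same sound decays in roughly half a second in the small, absorbent room but rings for several seconds in the large hall, which is exactly the difference a listener hears between those two settings.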
The research we are sharing today with the AI community focuses on three audio-visual tasks, with models that outperform existing methods on each.

Our Visual Acoustic Matching model, AViTAR, takes an audio clip recorded anywhere, along with an image of a target environment, and transforms the clip to make it sound as if it were recorded in that environment. For example, the model could take an image of a dining room in a restaurant, together with the audio of a voice recorded in a cave, and make that voice sound instead like it was recorded in the pictured restaurant.

The second model, Visually-Informed Dereverberation of Audio (VIDA), does the opposite. Using observed sounds and the visual cues of a space, it removes reverberation, the persistence of a sound as it reflects off the surfaces of the environment where it was recorded. Imagine a violinist performing in a busy train station: this model can distill the essence of the violin’s music without the reverberation bouncing around the massive station.

The third model, VisualVoice, uses visual and audio cues to separate speech from other background sounds and voices, which will be beneficial for human and machine understanding tasks, such as creating better subtitles or mingling at a party in VR.
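To give a sense of the shape of the first of these tasks, here is a minimal, hypothetical PyTorch sketch of conditioning an audio transformation on an image of the target environment. It is not the AViTAR architecture or any of the released models; every layer, dimension, and name below is an illustrative assumption.

# Hypothetical toy sketch (NOT the released AViTAR/VIDA/VisualVoice models):
# it only illustrates the idea of a visual "room" embedding steering an
# audio transformation.
import torch
import torch.nn as nn

class ToyVisualAcousticMatcher(nn.Module):
    def __init__(self, n_freq=257, visual_dim=64):
        super().__init__()
        # Image branch: collapse an RGB image into a single room embedding.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, visual_dim),
        )
        # Audio branch: per-frame transformation of a magnitude spectrogram,
        # modulated (FiLM-style) by the room embedding.
        self.film = nn.Linear(visual_dim, 2 * n_freq)
        self.audio_net = nn.Sequential(
            nn.Linear(n_freq, n_freq), nn.ReLU(),
            nn.Linear(n_freq, n_freq),
        )

    def forward(self, spec, image):
        # spec:  (batch, time, n_freq) magnitude spectrogram of the source audio
        # image: (batch, 3, H, W) picture of the target environment
        room = self.image_encoder(image)                 # (batch, visual_dim)
        scale, shift = self.film(room).chunk(2, dim=-1)  # (batch, n_freq) each
        h = self.audio_net(spec)                         # (batch, time, n_freq)
        # Broadcast the visual conditioning over time frames.
        return h * scale.unsqueeze(1) + shift.unsqueeze(1)

# Example forward pass with dummy data.
model = ToyVisualAcousticMatcher()
spec = torch.rand(2, 100, 257)      # two clips, 100 frames, 257 frequency bins
image = torch.rand(2, 3, 128, 128)  # two target-environment images
out = model(spec, image)            # (2, 100, 257) re-acousticized spectrogram

The released models operate on much richer representations and training data; the sketch only shows the general pattern of visual features steering an audio transformation.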
Visual Acoustic Matching
Anyone who has watched a video where the audio isn’t consistent with the scene knows how disruptive this can feel. Getting audio and video from different environments to match, however, has previously been a challenge. Acoustic simulation models can generate a room impulse response that re-creates the acoustics of a room, but only if the geometry — often in the form of a 3D mesh — and material properties of the space are known, and in most cases this information isn’t available. Acoustic properties can also be estimated from just the audio captured in a particular room, but the reverberation in a single audio sample conveys only limited information about the target space. As a result, these approaches often fall short of producing realistic results.
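For reference, once a room impulse response is in hand, whether simulated or measured, re-creating a room’s acoustics amounts to convolving the dry signal with it. The sketch below uses a toy impulse response made of decaying noise as a stand-in for the output of a real acoustic simulator.

# Minimal sketch: applying a room impulse response (RIR) to a dry signal.
# The RIR here is a toy stand-in (exponentially decaying noise), not the
# output of a real simulator or a measured response.
import numpy as np
from scipy.signal import fftconvolve

sr = 16000                                   # sample rate in Hz
dry = np.random.randn(sr)                    # 1 s of noise as a placeholder for dry speech
t = np.arange(int(0.5 * sr)) / sr            # 0.5 s impulse response
rir = np.random.randn(t.size) * np.exp(-6.9 * t / 0.5)  # ~60 dB decay over 0.5 s
rir /= np.abs(rir).max()

wet = fftconvolve(dry, rir)                  # the dry signal "played" in the room
wet /= np.abs(wet).max()                     # normalize to avoid clipping

The hard part is not this convolution but obtaining a faithful impulse response in the first place, which is exactly what the mesh and material requirements above make impractical for everyday spaces.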
Building a future with AI models that understand the world around us
Existing AI models do a good job of understanding images and are getting better at understanding video. However, if we want to build new, immersive experiences for AR and VR, we need AI models that are multimodal — models that can take audio, video, and text signals all at once and create a much richer understanding of the environment.
This is an area we will continue exploring. AViTAR and VIDA are currently based on only a single image. In the future, we want to explore using video and other dynamics to capture the acoustic properties of a space. This will help bring us closer to our goal of creating multimodal AI that understands real-world environments and how people experience them.
We are excited to share this research with the open source community. We believe AI that understands the world around us can help unlock exciting new possibilities to benefit how people experience and interact in mixed and virtual reality.