Meta (formerly Facebook) has built three new artificial intelligence (AI) models designed to make sound more realistic in mixed and virtual reality experiences.
The three AI models — Visual-Acoustic Matching, Visually-Informed Dereverberation and VisualVoice — focus on human speech and sounds in video and are designed to push “us toward a more immersive reality at a faster rate,” the company said in a statement.
“Acoustics play a role in how sound will be experienced in the metaverse, and we believe AI will be core to delivering realistic sound quality,” said Meta’s AI researchers and audio specialists from its Reality Labs team.
They built the AI models in collaboration with researchers from the University of Texas at Austin, and are making these models for audio-visual understanding open to developers.
The self-supervised Visual-Acoustic Matching model, called AViTAR, adjusts audio to match the space of a target image.
According to Meta, the self-supervised training objective learns acoustic matching from in-the-wild web videos, even though such videos contain no acoustically mismatched audio pairs and carry no labels.
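To illustrate the idea behind visual-acoustic matching, here is a minimal, purely hypothetical sketch: a stand-in for a learned image-to-acoustics model predicts a room impulse response (RIR) from a photo of the target space, and the source audio is convolved with that RIR so the speech “sounds like” it was recorded there. AViTAR itself is a learned cross-modal transformer; the function names, the toy brightness-to-reverb rule and all parameters below are invented for illustration only.

```python
import numpy as np

def predict_rir_from_image(image: np.ndarray, length: int = 2048) -> np.ndarray:
    """Stand-in for a learned image-to-RIR model.

    Fakes a decaying noisy echo tail whose decay rate depends on mean
    image brightness (a toy rule, not anything AViTAR actually does).
    """
    decay = 0.999 - 0.1 * image.mean()  # brighter "room" -> faster decay (toy rule)
    t = np.arange(length)
    rir = (decay ** t) * np.random.default_rng(0).standard_normal(length)
    rir[0] = 1.0                        # direct-path impulse
    return rir / np.abs(rir).sum()      # normalise overall energy

def match_acoustics(dry_audio: np.ndarray, image: np.ndarray) -> np.ndarray:
    """Re-render dry speech with the acoustics implied by the target image."""
    rir = predict_rir_from_image(image)
    return np.convolve(dry_audio, rir)[: len(dry_audio)]

dry = np.random.default_rng(1).standard_normal(16000)  # 1 s of fake speech at 16 kHz
room_photo = np.full((8, 8), 0.5)                      # placeholder target image
wet = match_acoustics(dry, room_photo)
print(wet.shape)  # prints (16000,)
```

The real model replaces the toy `predict_rir_from_image` with a network trained end-to-end on web video, but the signal path — condition on an image, transform the audio to match that space — is the same.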
VisualVoice learns in a way that’s similar to how people master new skills, by learning visual and auditory cues from unlabelled videos to achieve audio-visual speech separation.
For example, imagine attending a group meeting in the metaverse with colleagues from around the world. Instead of voices blending together and people talking over one another, the reverberation and acoustics would adjust as participants moved around the virtual space and joined smaller groups.
“VisualVoice generalises well to challenging real-world videos of diverse scenarios,” said Meta AI researchers.
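For readers unfamiliar with audio-visual speech separation, the following is a hedged sketch of the general mask-based family VisualVoice belongs to. In the real model, a network conditioned on a speaker's face and lip motion predicts a time-frequency mask; here, purely to show how a mask carves one voice out of a mixture, an “ideal” binary mask is computed from ground-truth sources. Everything below (signals, framing parameters, helper names) is illustrative, not Meta's implementation.

```python
import numpy as np

def frame_fft(x: np.ndarray, n: int = 256, hop: int = 128) -> np.ndarray:
    """Windowed short-time spectrum of a 1-D signal (toy STFT)."""
    starts = range(0, len(x) - n + 1, hop)
    return np.array([np.fft.rfft(x[s:s + n] * np.hanning(n)) for s in starts])

rng = np.random.default_rng(0)
speaker_a = np.sin(2 * np.pi * 440 * np.arange(8000) / 16000)  # stand-in voice A
speaker_b = rng.standard_normal(8000) * 0.3                    # stand-in voice B
mixture = speaker_a + speaker_b

A, B, M = frame_fft(speaker_a), frame_fft(speaker_b), frame_fft(mixture)

# Binary mask keeping time-frequency bins where speaker A dominates.
# VisualVoice would *predict* such a mask from the target speaker's face.
mask = (np.abs(A) > np.abs(B)).astype(float)
estimate_a = M * mask  # masked mixture approximates speaker A's spectrogram

# fraction of the mixture's spectral energy the mask keeps for speaker A
kept = np.abs(estimate_a).sum() / np.abs(M).sum()
print(round(kept, 2))
```

The separation quality of a system like VisualVoice comes entirely from how well the learned, face-conditioned mask predictor approximates this oracle mask without ever seeing the clean sources.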