Noise-canceling headphones have become very good at creating an auditory blank slate. But letting selected sounds from a wearer’s environment through the cancellation remains a challenge for researchers. For example, the latest edition of Apple’s AirPods Pro automatically adjusts the sound level for wearers (detecting when they’re on a call, for instance), but the user has little control over whom to listen to or when this happens.
A team from the University of Washington has developed an artificial intelligence system that lets a user wearing headphones “enroll” a speaker by looking at that person for three to five seconds. The system, called “Target Speech Hearing,” then cancels all other sounds in the environment and plays only the enrolled speaker’s voice in real time, even as the listener moves around in noisy places and no longer faces the speaker.
The team presented its findings May 14 in Honolulu at the ACM CHI conference on Human Factors in Computing Systems. The code for the proof-of-concept device is available for others to build on. The system is not commercially available.
“We now tend to think of AI as web-based chatbots that answer questions,” says senior author Shyam Gollakota, a UW professor in the Paul G. Allen School of Computer Science & Engineering. “But in this project we are developing AI to adapt the auditory perception of anyone wearing headphones to his or her preferences. With our devices you can now hear a single speaker clearly, even if you are in a noisy environment where many other people are talking.”
To use the system, a person wearing off-the-shelf headphones fitted with microphones taps a button while pointing their head toward someone who is talking. The sound waves from that speaker’s voice should then reach the microphones on both sides of the headphones at roughly the same time, within a 16-degree margin of error. The headphones send that signal to an on-board computer, where the team’s machine learning software learns the voice patterns of the desired speaker. The system latches onto that speaker’s voice and continues to play it to the listener even as the pair moves around. The system’s ability to focus on the enrolled voice improves as the speaker continues to talk, because this gives the system more training data.
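The enrollment condition described above, in which the target voice reaches both microphones at nearly the same time, can be illustrated with a small geometric sketch. This is not the team’s code. It assumes a simple far-field model, a hypothetical microphone spacing of 0.18 m, and reads the 16-degree margin as 16 degrees to either side of straight ahead. All function and constant names are invented for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature
MIC_SPACING_M = 0.18        # ASSUMPTION: hypothetical left-to-right microphone spacing
TOLERANCE_DEG = 16.0        # margin of error quoted in the article (assumed +/- 16 degrees)


def arrival_delay_s(angle_deg, mic_spacing_m=MIC_SPACING_M):
    """Far-field estimate of the arrival-time difference between the two microphones.

    angle_deg is the speaker's direction relative to straight ahead
    (0 means the listener is facing the speaker directly).
    """
    return mic_spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S


def angle_from_delay_deg(delay_s, mic_spacing_m=MIC_SPACING_M):
    """Invert the far-field model: recover the direction from a measured delay."""
    ratio = delay_s * SPEED_OF_SOUND_M_S / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp so measurement noise cannot break asin
    return math.degrees(math.asin(ratio))


def is_facing_speaker(delay_s, tolerance_deg=TOLERANCE_DEG):
    """True if the implied direction lies within the tolerance of straight ahead."""
    return abs(angle_from_delay_deg(delay_s)) <= tolerance_deg


if __name__ == "__main__":
    for angle in (0.0, 10.0, 30.0):
        delay = arrival_delay_s(angle)
        print(f"{angle:5.1f} deg -> delay {delay * 1e6:7.1f} us, "
              f"facing: {is_facing_speaker(delay)}")
```

Under these assumed values, the delay at the edge of the tolerance is on the order of a tenth of a millisecond, which suggests why “at the same time” is best read as “within a small time window” rather than exact simultaneity. A real system would have to estimate the delay from noisy audio, which this sketch does not attempt.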
The team tested its system on 21 subjects, who on average rated the clarity of the enrolled speaker’s voice nearly twice as high as that of the unfiltered audio.
This work builds on the team’s previous research into “semantic hearing,” which allowed users to select specific sound classes (such as birds or voices) they wanted to hear and cancel out other sounds in the environment.
Currently, the Target Speech Hearing (TSH) system can enroll only one speaker at a time, and only when no other loud voice is coming from the same direction as the target speaker’s voice. If a user is not satisfied with the sound quality, he or she can re-enroll the speaker to improve clarity.
The team is working to expand the system to earbuds and hearing aids in the future.
Other co-authors on the paper were Bandhav Veluri, Malek Itani and Tuochao Chen, UW doctoral students at the Allen School, and Takuya Yoshioka, research director at AssemblyAI.
More information:
Bandhav Veluri et al., Look Once to Hear: Focused Hearing of Speech with Noisy Samples, Proceedings of the CHI Conference on Human Factors in Computing Systems (2024). DOI: 10.1145/3613904.3642057, dl.acm.org/doi/10.1145/3613904.3642057
Provided by the University of Washington
Citation: AI headphones let wearer listen to one person in a crowd just by looking at them once (2024, May 23) retrieved May 24, 2024 from https://techxplore.com/news/2024-05-ai-headphones-wearer-person-crowd.html