With video chats becoming more popular in the age of remote and hybrid workplaces, phrases like “mute yourself” and “I think you’re muted” have entered our everyday lexicon. However, it turns out that muting yourself may not be as safe as you think.
Kevin Fu, a Northeastern University professor of electrical and computer engineering and computer science, has discovered a way to extract audio from images and even muted videos. Fu and his research team developed Side Eye, a machine learning-assisted tool that can discern the gender of someone speaking in the room where a photo was taken, and even the exact words they spoke.
“Imagine someone is doing a TikTok video and they mute it and dub music,” Fu explains. “Have you ever wondered what they’re really saying? ‘Watermelon watermelon’ or ‘Here’s my password?’ Was there someone speaking behind them? You can actually hear what is spoken off camera.”
It sounds like something out of science fiction, and it is. Side Eye was inspired by an episode of the science fiction show “Fringe” in which the main characters, a team of fringe science investigators working for the FBI, extracted audio from a melted pane of glass.
When the episode first aired, one Den of Geek critic dubbed it a “ridiculous pseudo science technique.” Fu was not convinced it was so far-fetched.
“I was like, ‘I bet we can do that,’” Fu recalls. “My lab specializes in the unthinkable. We normally anticipate the initial reaction to anything we do as ‘You can’t do that,’ and we respond, ‘Well, we already did.’”
Side Eye takes advantage of image stabilization technology, which is now almost universal in phone cameras. To prevent blurry photos from a shaky hand, cameras have microscopic springs that keep the lens suspended in liquid. An electromagnet and sensors then push the lens in the equal and opposite direction to counteract camera shake.
However, Fu says that whenever someone speaks near a camera lens, the sound causes tiny vibrations in the springs and bends the light slightly. The light’s angle shifts almost imperceptibly, “unless you’re looking for it,” Fu explains.
Normally, extracting acoustic frequencies from such minuscule vibrations would be difficult. However, Fu says that rolling shutter, a technique used by most phone cameras today, actually makes the task easier.
“The way cameras work today to reduce cost basically is they don’t scan all pixels of an image simultaneously; they do it one row at a time,” Fu explains. “That occurs hundreds of thousands of times in a single photograph. This simply means that you can magnify the amount of frequency information you can acquire, or the granularity of the audio, by over a thousand times.”
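To see why row-by-row scanning matters, a rough back-of-the-envelope calculation helps. The sketch below is not from Fu's work: the frame rate and row count are assumed values for a typical phone video, and it ignores the idle time real sensors spend between frames. It treats each scanned row as one sample of the lens vibration and compares that with sampling once per frame.

```python
def effective_sample_rate(frames_per_second, rows_per_frame):
    """Samples per second if every scanned row counts as one sample.

    Simplification: ignores the idle gap real sensors have between frames.
    """
    return frames_per_second * rows_per_frame


def nyquist_limit(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2.0


# Assumed values for illustration only, not figures from the research.
ASSUMED_FPS = 30.0
ASSUMED_ROWS = 1080

frame_limit = nyquist_limit(ASSUMED_FPS)
row_rate = effective_sample_rate(ASSUMED_FPS, ASSUMED_ROWS)
row_limit = nyquist_limit(row_rate)
gain = row_rate / ASSUMED_FPS

print(f"One sample per frame: up to {frame_limit} Hz")
print(f"One sample per row:   up to {row_limit} Hz")
print(f"Increase in sampling rate: {gain}x")
```

Under these assumed numbers, sampling once per frame could capture only vibrations below 15 Hz, far too low for speech, while sampling once per row reaches well into the range of the human voice. The gain equals the number of rows, which is in line with the "over a thousand times" figure Fu describes.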
Side Eye will work as long as there is some light, though the more imagery it has access to, the better. Fu says that even pointing a camera at a ceiling will allow Side Eye to work its magic.
Even at its best, this method produces audio that sounds like the muffled voices of the grownups in Peanuts cartoons. However, by applying machine learning and training Side Eye on specific words and audio, Fu is able to extract a lot of information.
“If you want to know if I said yes or no, you can train [Side Eye] on people saying yes and no and then look at the patterns and with high confidence when I get an image later know if someone said yes or no,” Fu explains.
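The train-then-recognize idea Fu describes can be illustrated with a toy example. The sketch below is not Side Eye's actual model, which the article does not describe. It swaps in a simple nearest-centroid classifier, and the "yes" and "no" signals are synthetic noisy sine waves standing in for recovered vibration traces. Every function name and parameter here is hypothetical.

```python
import math
import random


def synthetic_signal(cycles, rng, n=200, noise=0.05):
    """Toy stand-in for a recovered vibration trace: a noisy sine wave."""
    return [math.sin(2 * math.pi * cycles * i / n) + rng.gauss(0, noise)
            for i in range(n)]


def features(signal):
    """Two simple features: number of sign changes and mean absolute amplitude."""
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))
    mean_abs = sum(abs(x) for x in signal) / len(signal)
    return (float(crossings), mean_abs)


def train_centroids(labelled_signals):
    """Average the feature vectors of each label (nearest-centroid training)."""
    sums = {}
    for label, signal in labelled_signals:
        f = features(signal)
        total, count = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((total[0] + f[0], total[1] + f[1]), count + 1)
    return {label: (t[0] / c, t[1] / c) for label, (t, c) in sums.items()}


def classify(signal, centroids):
    """Return the label whose centroid is closest to the signal's features."""
    f = features(signal)
    return min(centroids, key=lambda label: math.dist(f, centroids[label]))


# Assumption: "yes" and "no" are represented by different dominant frequencies.
rng = random.Random(0)
training = [("yes", synthetic_signal(5, rng)) for _ in range(20)]
training += [("no", synthetic_signal(12, rng)) for _ in range(20)]
centroids = train_centroids(training)

print(classify(synthetic_signal(5, rng), centroids))
print(classify(synthetic_signal(12, rng), centroids))
```

Real recovered audio is far messier than these clean synthetic waves, so a practical system would need richer features and a stronger model. The structure is the same, though: learn the patterns of known words first, then match a new signal against them.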
Side Eye can even identify the specific person speaking if it has been trained on that person’s speech, though Fu says it is not as accurate yet.
From a cybersecurity standpoint, Side Eye opens up a completely new realm of risks that users and cybersecurity specialists should be aware of. However, Fu believes that the most intriguing application for Side Eye could be as a new type of digital evidence for lawyers and others working in the criminal justice system.
“Maybe there’s an alibi and it’s being admitted to court and somebody wants to prove somebody was or wasn’t there,” Fu speculates. “If you have an authenticated video with a known timestamp, you might be able to use this technique to confirm one way or the other. If you hear the person’s voice, they are most likely present.”