Removing original audio from ambient input

Discussion in 'General Electronics Chat' started by wavepropagation, Jan 7, 2015.

  1. wavepropagation

    Thread Starter New Member

    Jan 7, 2015

    I'm working on a project where I have the original audio (music), which is output to speakers. The output from the speakers is captured by a microphone. The signal captured by the microphone can be a combination of white noise, music, and conversation. What I want is confirmation that someone is speaking while the music is playing. The confirmation can be either a binary 0 / 1 or a confidence level from 0 to 10.
    My current thought is to use a differential amplifier. The original signal, after a gain stage and a buffer, is fed to one input and the microphone signal is fed to the other input. The goal is to match the amplitude of the microphone signal to the amplitude of the original signal and then subtract the two. If no one is speaking I would expect a very low amplitude result, but if someone is speaking I should see a signal whose amplitude exceeds a threshold. This can be fed directly into a comparator that outputs a 1 / 0.
    Also, is the phase shift between the input of the original signal and the signal captured by the microphone something I need to be concerned about?
    Please let me know whether this would work, or whether there are other options to accomplish this goal.
  2. MikeML

    AAC Fanatic!

    Oct 2, 2009
    Ain't going to work because of acoustic delays.

    The world's expert on this problem was banished from this web site. Go ask your question on and hope that Audioguru responds.
  3. AnalogKid

    AAC Fanatic!

    Aug 1, 2013
    An instantaneous comparator as you describe probably won't work consistently for two reasons. First, as noted above there is a group delay between the speaker and the mic that is not in the direct source path. Second, the speaker colors the sound. That is, it doesn't have a flat frequency response and flat phase delay.
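
    To put a rough number on the delay problem, here is a minimal sketch. It assumes a single pure tone, equal amplitudes on both inputs, and sound travelling at about 343 m/s; the function name and its parameters are made up for illustration, and a real speaker and room will behave worse than this.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 degrees C


def subtraction_residual(freq_hz, distance_m):
    """Peak amplitude left after subtracting a unit-amplitude tone from a
    copy of itself delayed by the speaker-to-mic travel time."""
    delay_s = distance_m / SPEED_OF_SOUND_M_S
    # sin(w*t) - sin(w*(t - d)) has peak amplitude 2*|sin(w*d/2)|
    return 2.0 * abs(math.sin(math.pi * freq_hz * delay_s))


# Residual for a mic placed 1 m from the speaker, at a few frequencies.
for f in (100, 500, 1000, 2000):
    print(f, "Hz ->", round(subtraction_residual(f, 1.0), 3))
```

    A residual near 0 means the tone cancels; a residual near 2 means the "difference" is about twice as loud as the tone itself. Because the result swings between those extremes as frequency changes, a plain subtraction cannot cancel broadband music.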

    However, what might work is a low frequency envelope detector scheme. Rectify and lowpass filter the direct and mic signals to get waveforms that vary much more slowly than the audio frequencies in the original signals. Try a lowpass cutoff of 10 to 20 Hz as a start. The nice thing about this is that you can electronically delay the source signal with an all-pass filter stage to match the time shift in the mic signal.
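
    A digital sketch of the envelope idea above, in case it helps. Assumptions: both signals are already sampled at the same rate, the speaker-to-mic delay is known in samples, and the delay is applied as a plain sample shift rather than the all-pass stage mentioned above. The names `envelope` and `speech_flags`, and the threshold value, are hypothetical.

```python
import math


def envelope(signal, sample_rate, cutoff_hz=15.0):
    """Full-wave rectify, then smooth with a one-pole low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    env, y = [], 0.0
    for x in signal:
        y += alpha * (abs(x) - y)  # low-pass the rectified sample
        env.append(y)
    return env


def speech_flags(source, mic, sample_rate, delay_samples=0, gain=1.0,
                 threshold=0.05):
    """Flag samples where the mic envelope exceeds the delayed, scaled
    source envelope by more than `threshold` (an assumed tuning value)."""
    src_env = envelope(source, sample_rate)
    mic_env = envelope(mic, sample_rate)
    flags = []
    for n, m in enumerate(mic_env):
        k = n - delay_samples  # sample shift stands in for the delay stage
        s = gain * src_env[k] if k >= 0 else 0.0
        flags.append(m - s > threshold)
    return flags
```

    The flags play the role of the comparator output in the original question. `gain` and `threshold` would have to be found by experiment, and this only detects extra sound energy in the room, not speech specifically.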

  4. Veracohr

    Well-Known Member

    Jan 3, 2011
    Do you mean you want to isolate conversation from a room source that may also include music and other noise? That's next to impossible, not just because of delays as MikeML says, but also because of reflections and the variability that people moving around in a room introduce.

    Off the top of my head I can imagine a theoretical possibility wherein the room is modeled by an impulse response, then the original music source is processed with an artificial reverb created with this impulse response in an attempt to achieve the same sound the microphone is hearing. It would require a fair amount of processing (convolution reverbs are notoriously processor-hungry) and in reality probably wouldn't work. And that's not even including the people that would be in the room. Bodies affect the reflections in the room, and once they start moving around, you're SOL.
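
    For what it's worth, the impulse-response idea can be sketched in a few lines. This assumes the room's impulse response has already been measured somehow and stays fixed, which is exactly the part that falls apart in a real room. The function names are hypothetical, and the direct convolution here is the slow textbook form, not what a real-time system would use.

```python
import math


def convolve(signal, impulse_response):
    """Direct-form convolution: what the source would sound like after
    passing through a room described by `impulse_response`."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def residual_rms(source, mic, impulse_response):
    """RMS of what remains in the mic signal after subtracting the
    predicted room response. Assumes `mic` is no longer than the prediction."""
    predicted = convolve(source, impulse_response)[:len(mic)]
    residual = [m - p for m, p in zip(mic, predicted)]
    return math.sqrt(sum(r * r for r in residual) / len(residual))
```

    If the model matched the room perfectly, the residual would be near zero with music alone and would rise when someone talks. Any mismatch between the model and the room shows up in the residual too, so in practice the threshold would have to sit above that error floor.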

    A more reasonable approach (maybe) would be to filter the audio to the relatively narrow range that is required for voices to be intelligible, then use speech recognition software to identify a voice. Telephone frequency response has historically been limited to 300 Hz to 3 kHz because that's all that's needed to understand speech.
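
    A rough sketch of just the filtering step, assuming sampled audio and using the 300 Hz and 3 kHz corners mentioned above. It cascades two simple first-order RC-style filters, so the roll-off is gentle; a real front end would likely use something steeper. The function names are hypothetical, and the speech recognition stage is not shown.

```python
import math


def high_pass(signal, sample_rate, cutoff_hz):
    """First-order high-pass (discretized RC filter)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    b = rc / (rc + dt)
    out, y, prev_x = [], 0.0, 0.0
    for x in signal:
        y = b * (y + x - prev_x)
        prev_x = x
        out.append(y)
    return out


def low_pass(signal, sample_rate, cutoff_hz):
    """First-order low-pass (discretized RC filter)."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    a = dt / (rc + dt)
    out, y = [], 0.0
    for x in signal:
        y += a * (x - y)
        out.append(y)
    return out


def voice_band(signal, sample_rate, low_hz=300.0, high_hz=3000.0):
    """Keep roughly the telephone voice band before any speech detection."""
    return low_pass(high_pass(signal, sample_rate, low_hz), sample_rate, high_hz)
```

    Music has plenty of energy inside this band as well, so the filter only narrows the problem; something downstream still has to tell voice from music.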
  5. nsaspook

    AAC Fanatic!

    Aug 27, 2009