Speech Recognition using DSP

Thread Starter


Joined Aug 6, 2010

Firstly, thanks for the time you spend on this thread.

Scenario at hand: I want to build a remote controlled toy-car. Instead of a joystick in the remote, I need it to be voice enabled. 6 predefined instructions (start, stop, front, back, left, right) from either of 2 persons only should be authenticated by the remote controller.

Is it possible to do this with TMS320C6713, or any such DSP for the speech processing part and a regular PIC for maneuvering the car?

Maximum delay is 2 s from having spoken the command to the response of the toy-car.
I need to know if the DSP will have enough prowess to do all the processing.

In addition to this, I want the communication to be wireless. I am planning to use a transceiver pair; will it work fine with the DSP and PIC?

Please reply at your convenience.


Joined Feb 11, 2008
1. Yes it is possible.

2. Based on the questions you are asking, like "I need to know if the DSP will have enough prowess to do all the processing", I think you might be lacking the skills to make it work. It is very difficult to do and requires some complex DSP processing with some specific problems.

Thread Starter


Joined Aug 6, 2010
well, i haven't used a dsp (or, for that matter, even a PIC) to date!!
but i would like to study how to program a dsp. any good links to tutorials or related stuff??
PS: thanks for the frank reply


Joined Feb 11, 2008
Well I'm just guessing because you have not mentioned it, but is this a school project or Uni project?

If so it might be better to start with a simpler project like a remote controlled car or robot using a micro of whatever type you prefer.

Speech recognition is an extremely hard project. Even with a microphone and a good mic amp right at your mouth (like a headset) and a large processor like a PC, it does not work that well. To make a robot or car hear your voice and understand it from across the room is almost impossible, even for someone with a lot of microcontroller experience.

As for learning DSP especially sound decoding and tone recognition etc it is a wide field that will give you a lot of reading (please google). I doubt that you will find something like a hobby speech recognition project where you can copy someone's Arduino code... ;)

Thread Starter


Joined Aug 6, 2010
i don't want the bot to detect a voice across the room, no!
there will be a hand held module with the microphone, DSP and a transmission module. a receiver module in the moving bot will receive directions from the hand-held module and the PIC in the bot will direct it accordingly.

'n yes, this is a college project, and as i have already given the interim report, i have to continue with the voice authenticated navigation system itself; it's only the circuit (and components like PIC/dsPIC/Raspberry Pi) that i have liberty to change now.

so i'll just start studying dsp programming 'n mayb those arduino codes just might come in handy at the end..! ;)


Joined Oct 18, 2012
To chime in here, I agree with Mr. Black: DSP can be pretty hard to do, especially voice recognition.

One thing to remember in a project like this is that restrictions are your friends. Saying that this works only for your voice in a certain tone makes everything much easier for you.

At the very least, you could send the audio data to a computer (I know Microsoft Speech can do speech recognition), have the computer determine the command, and send that over to the robot. I know it isn't pretty, but it can do what you want in the end.


Joined Sep 9, 2010
Why reinvent a (very complex) wheel? Just use an iPod running Dragon Speech and have it send out the interpreted commands.


Joined Feb 11, 2008
Or use the same "handheld control" and put a couple of buttons on it? Then it will be more reliable, AND faster and better to control the vehicle. You just invented the remote control.

There are a lot of good reasons you don't tell your TV, microwave or HiFi what to do by voice...

Thread Starter


Joined Aug 6, 2010
thanks guys, i understand that it's very difficult to get the required output for this project. but in this project, i'm trying to familiarize myself with the DSP rather than to master it.

so i think i'll just go forward with this idea and get stuck somewhere down the line..! :p but i would have learned how to use a DSP [rather, how not to use it :p] 'n that in itself will be an achievement for me.

i hope to contact u guys for further queries related to the same in due time.


Joined Jan 10, 2012
I agree with all of the above. I suggest a much more reasonable project for DSP. Try sampling a signal and using the DSP to smooth it and recover the original signal. That alone will be a very good introduction to DSP. Or how about a spectrum analyzer? Have you taken discrete math or Digital Processing? Or do you plan to use software like Simulink to generate the DSP code without knowledge of the underlying math?


Joined Apr 24, 2007
The general speech recognition case would be very challenging. But you can make it relatively easy, with enough restrictions and conditions, and maybe a bit of cleverness.

With machine recognition of anything, the most important part of the algorithm is what features you have preselected, in this case for it to try to use to discriminate between words or people. You can simplify the tasks performed during operation, tremendously, by requiring, for example, no or low background noise levels, no speech except commands, speaker silence gaps before and after each command, etc. (Also, a hint: Make the car drive quite slowly. Two seconds is a very long time.) (Hint number two: Consider adding the word "NO", for when it makes a mistake, so it will abort or undo the last command.)
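
(A rough illustration of the "silence gaps before and after each command" idea, not code from anyone in this thread. If the speaker pauses around each word, a short-time energy threshold may be enough to find where the word starts and ends. The sample rate, the frame length, the 10% threshold and the synthetic test signal are all assumptions picked for the sketch.)

```matlab
% Sketch: locate an utterance between two silence gaps using short-time energy.
% Assumed values (not from the thread): fs, frameLen, the 10% threshold factor.
fs = 8000;                                   % assumed sample rate in Hz
tone = 0.5 * sin(2*pi*440*(0:3999)/fs);      % stand-in for a spoken word
x = [zeros(1, 4000), tone, zeros(1, 4000)];  % silence, "word", silence

frameLen = 200;                              % 25 ms frames at 8 kHz
numFrames = floor(length(x) / frameLen);
energy = zeros(1, numFrames);
for k = 1:numFrames
    seg = x((k-1)*frameLen + 1 : k*frameLen);
    energy(k) = sum(seg .^ 2) / frameLen;    % mean power of frame k
end

threshold = 0.1 * max(energy);               % assumed: 10% of the loudest frame
active = find(energy > threshold);           % frames that count as speech
startSample = (active(1) - 1) * frameLen + 1;
endSample = active(end) * frameLen;
```

With real microphone data the threshold would need tuning against the background noise, which is exactly why the low-noise restriction helps.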

"Feature extraction" will be very important, before setting up the details of the recognition algorithm part of the application code. You might also want to implement a learning mode, in the application, for new speakers.

Maybe you can find a library of speech-recognition-related code for your DSP, or a similar one, that extracts useful information from an utterance, maybe syllables, or phonemes, or whatever features tend to be useful. But for so few words, you might also be able to just do whatever works; a "nonsense" example: do an FFT a few times a second once a loudness threshold is passed and look for a sequence pattern with threshold bands at certain frequencies, or something trivial like that (and then have your "learning mode" populate the threshold and frequency parameters for any new speaker, unless you can make it speaker-independent).
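
(To make the "nonsense" example above a little more concrete, here is a minimal sketch, under stated assumptions, of summarising one frame by its energy in a few frequency bands and picking the nearest stored template. The band edges, the frame length and the pure tones standing in for words are made up for illustration; real speech would need several frames per word and templates filled in by the "learning mode".)

```matlab
% Sketch: band-energy features from one FFT frame, matched to the nearest template.
% Assumed values (not from the thread): fs, n, bandEdges, and the test tones.
fs = 8000;  n = 256;
t = (0:n-1) / fs;
bandEdges = [0 500 1000 2000 4000];          % Hz, assumed band layout
numBands = length(bandEdges) - 1;
freqs = (0:n/2) * fs / n;                    % frequency of each kept FFT bin

% 0/1 matrix that sums the power spectrum inside each band.
B = zeros(numBands, length(freqs));
for b = 1:numBands
    B(b, freqs >= bandEdges(b) & freqs < bandEdges(b+1)) = 1;
end

w = 0.54 - 0.46 * cos(2*pi*(0:n-1)/(n-1));   % Hamming window, written out

% Rows: template for "word 1", template for "word 2", unknown input.
X = [sin(2*pi*300*t); sin(2*pi*1500*t); sin(2*pi*320*t)];
F = zeros(3, numBands);
for r = 1:3
    S = fft(X(r, :) .* w);
    P = abs(S(1:n/2+1)) .^ 2;                % power spectrum, non-negative frequencies
    F(r, :) = (B * P')';                     % energy per band
    F(r, :) = F(r, :) / sum(F(r, :));        % normalise so loudness does not matter
end

d1 = norm(F(3, :) - F(1, :));                % distance to template 1
d2 = norm(F(3, :) - F(2, :));                % distance to template 2
if d1 < d2, best = 1; else best = 2; end
```

Here the 320 Hz input lands in the same band as the 300 Hz template, so template 1 is the closer match.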

With so few words, which all sound quite different from each other, it might be worth also just recording many copies of the sampled mic data for each word and literally looking at both the time-domain and frequency-domain plots, maybe FFTing and/or averaging (or "other") with different window rates/lengths, and just visually look for differences, in the plots, that might enable an algorithm to perform calculations that would differentiate between the words.

But definitely do some searches and read how it's usually done, first. (Including the word "tutorial" can sometimes work wonders, on google.) Maybe try "speech recognition feature extraction" and "speaker recognition feature extraction", as well as "statistical pattern recognition".

I will say this: A guy I knew implemented a very similar level of speech recognition on a PDP-11/35 running Unix back in 1977 or so, with a total of 80K of RAM, to do voice control of a robotic arm. He simultaneously had it doing machine vision with a video camera, and could tell it to pick up and give him an object, which it also had to recognize. It also had to plan and calculate all of the geometry for its motion and the joints' command voltages as functions of time, to move the gripper to where it needed to be, with acceptable bedside manner. 80K of RAM and some C code. We later focused on motorized prosthetic arms for above-the-elbow amputees, with automatic recognition of the wearer's intent by classification of patterns in the voltage waveforms in their remaining bicep and tricep musculature.

Thread Starter


Joined Aug 6, 2010
Mr. Gootee, thanks for your encouraging words; 'n i'm not losing sight of the hurdles pointed out by my other friends.
Anyway, i'm back with another doubt..

I'm making the Matlab code for voice authentication and recognition, using MFCCs.
In the initial phase,
I will read a .wav file using Matlab,
do pre-emphasis,
separate it into frames,
pass the frames through a hamming window,
take their separate ffts and combine the results to form the complete vector.

Now I want to pass it through a triangular BPF and take the DCT of its log; so as to get the MFCCs. Then I will do the same with another file, and by comparing the MFCC pairs, I hope the authentication can be done.

But I haven't got any clue how to make the same. Can you please help me with the codes?

This is my current Matlab code.

%[sorry i couldn't attach the file, they kept saying it was too long..]

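disp('Program starts');
waveFile='M:\project\MATLAB programs\whatFood_preEmphasis.wav';
[y, fs]=audioread(waveFile); %wavread has been removed from recent Matlab releases
y=y(:,1); %first channel only, as a column vector
y = filter([1, -.95], 1, y); %Pre-emphasis

n=256; %frame length in samples (assumed value; n was never defined)
freq=(0:n/2)'*fs/n; %frequency in Hz of each FFT bin that is kept
%fft_vector=[]; %would hold one column of fft values per frame
magfreq_vector=zeros(0,2); %one column for magnitude and the other for frequency
for startIndex=1:n:length(y)-n+1 %step through the samples; nbits is bits per sample, not the length
    endIndex=startIndex+n-1; %length of frame is n
    y_frame=y(startIndex:endIndex); %framing
    win_y=y_frame.*hamming(n); %windowing function (both are column vectors)
    win_fft=fft(win_y); %taking fft
    mag=abs(win_fft(1:n/2+1)); %magnitude of the non-negative frequency bins
    %fft_vector=cat(2, fft_vector, win_fft); %complete vector, formed by concatenating the different frames
    magfreq_vector=cat(1, magfreq_vector, [mag, freq]);
end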


Joined Apr 24, 2007
For the MFCC, i.e. mel-scale frequency cepstral coefficient (and the DCT, discrete cosine transform), maybe you already found this page:

http://neural.cs.nthu.edu.tw/jang/books/audiosignalprocessing/speechfeaturemfcc.asp?title=12-2 mfcc

They seem to be discussing (and showing how to code) exactly what you are wanting to do.

It looks like there are also a lot of similar discussions at the mathworks.com site.

Remember, Google is your best friend.


ALSO look at the "parent page" for that link! And, at the very bottom, in the "Chapter 19" link, there is a downloadable FUNCTION LIBRARY of program code for speech recognition. I haven't looked at it but that's the type of thing that could make your job a whole lot easier, so you can concentrate more on the overall strategy, instead of getting too bogged-down with all of the little tactical details.

It's at:


If that's not directly applicable, you can search and will find one that is.

On the other hand, if the details are what you're supposed to be learning, then consider the provided code to be only "examples". (If you were working for some company, doing this, you would not want to try to "re-invent the wheel" unless you had a very good reason to do so. However, being a student is different. A lot of things need to be done totally "from scratch", at least once, so you can fully understand them. I wouldn't want you to "cheat yourself out of" (i.e. deprive yourself of) having such an experience, if that is the instructor's intention, in this case.)

Have fun,


Thread Starter


Joined Aug 6, 2010
You're right Gootee, for I had found the first link you posted.

But they haven't given any code for the Triangular BPF, 'n I don't know whether I'm supposed to combine the win_fft results to form the fft_vector [both of these are variables in my previous post's code block] before I apply the TBPF, or whether I should filter the individual frames first and combine the results afterwards.

Also, can I just take dft(fft_vector) or do I have to take the separate ffts and combine them afterwards?
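
(A hedged note on the question above: in the usual MFCC arrangement, the triangular filters, the log and the DCT are applied to each frame separately, and the per-frame coefficient vectors are stacked afterwards, rather than concatenating the FFTs first. The sketch below computes the coefficients for one frame without any toolbox functions. The sample rate, frame length, 20 filters, 12 coefficients and the test tone are assumptions, and the mel formula is the common 2595*log10(1 + f/700) variant, which may differ from the one on the linked page.)

```matlab
% Sketch: MFCCs for ONE windowed frame; repeat per frame and stack the results.
% Assumed values (not from the thread): fs, n, numFilters, numCoeffs.
fs = 8000;  n = 256;  numFilters = 20;  numCoeffs = 12;
frame = sin(2*pi*440*(0:n-1)/fs);            % stand-in for one frame of speech
w = 0.54 - 0.46 * cos(2*pi*(0:n-1)/(n-1));   % Hamming window, written out
S = fft(frame .* w);
P = abs(S(1:n/2+1)) .^ 2;                    % power spectrum, 1 x (n/2+1)

% Triangular filters whose corner frequencies are equally spaced in mel.
hz2mel = @(f) 2595 * log10(1 + f/700);
mel2hz = @(m) 700 * (10.^(m/2595) - 1);
edges = mel2hz(linspace(0, hz2mel(fs/2), numFilters + 2));   % in Hz
binFreqs = (0:n/2) * fs / n;
H = zeros(numFilters, n/2 + 1);
for m = 1:numFilters
    fLow = edges(m);  fMid = edges(m+1);  fHigh = edges(m+2);
    rising  = (binFreqs - fLow) / (fMid - fLow);
    falling = (fHigh - binFreqs) / (fHigh - fMid);
    H(m, :) = max(0, min(rising, falling));  % triangle peaking at fMid
end

logE = log(H * P' + eps);                    % log filter-bank energies, numFilters x 1

% DCT-II written as a matrix product so that no toolbox is needed.
k = (1:numCoeffs)';
q = 1:numFilters;
D = cos(pi * k * (q - 0.5) / numFilters);    % numCoeffs x numFilters
mfcc = D * logE;                             % numCoeffs x 1 for this frame
```

Running this inside the framing loop and appending each mfcc column to a matrix would give one column of coefficients per frame, which is the thing to compare between two recordings.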

And as for the basic motto, I would like to study the basics - I actually know a little, as in TBPFs are used so as to get the audio components on a logarithmic scale; even though I don't know what I'm calling components! - but I have only got till March 1 to submit this project; it is a part of our final year engineering course.

So I would love to have both the codes and the basics; 'n I will study as much of the basic concepts as I can.

I went through the second link you posted, 'n I'm sure they will come in use when I'm at last writing the code into DSP. But as of now, I'm concentrating on Matlab codes.

Please note that your help is very much appreciated and I look forward to the same.

'n one more doubt: when i'm splitting up the input signal into frames, should they be overlapped? Is it necessary for them to be overlapped? what is the extent of this overlap?
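
(A hedged note on the overlap question: overlap is not strictly required, but it is commonly used because the window tapers the ends of each frame, so samples near a frame edge would otherwise contribute very little. An overlap of somewhere between a third and a half of the frame is a common choice. The sketch below uses a 256-sample frame with a hop of 128 samples, i.e. 50% overlap; both numbers and the ramp test signal are assumptions.)

```matlab
% Sketch: split a signal into overlapping frames, one frame per column.
% Assumed values (not from the thread): n = 256 and hop = 128 (50% overlap).
y = (1:1000)';                 % stand-in for the audio samples
n = 256;                       % frame length in samples
hop = 128;                     % step between frame starts; hop < n gives overlap
numFrames = floor((length(y) - n) / hop) + 1;
frames = zeros(n, numFrames);
for k = 1:numFrames
    startIndex = (k - 1) * hop + 1;
    frames(:, k) = y(startIndex : startIndex + n - 1);
end
```

Setting hop equal to n reproduces the non-overlapping framing in the code posted earlier; any samples left over after the last full frame are simply dropped here.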


Joined Apr 24, 2007
I am not really able to help you more than that. I suggest that you try to find a discussion site where there are other people who already know how to do what you want. Alas, I don't.

Try the mathworks site. And look for the relevant usenet-type groups at http://groups.google.com. Like this:


Also check yahoogroups.com:


It looks like there are a LOT of resources available. The yahoogroups groups have FILE LIBRARIES in their "members only" areas. But most of those groups require you to join, first, which can include a DELAY while someone reviews and approves your membership (usually just to stop bots and spammers). So apply for membership, immediately, for ALL of the groups that look like they even might be useful, so you don't run out of time. And ping them after 12-18 hours if you don't get membership rights by then.

You are really running short of time. Try to find a source for the most-suitable Matlab code (or knowledge), within one or maybe two days. But spend enough time to choose wisely. It will take the rest of the time to prepare everything, IF you have chosen the right code (or knowledge source) to begin with. Otherwise, it will be very difficult to complete it on time. Good luck. Try to get a little sleep, regularly, too, and stay hydrated.