As the title suggests, there are several ways to extract important information from speech signals. We’ll dive into each of them.
All speech signals will be pre-emphasized by a pre-emphasis filter of
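Pre-emphasis is commonly implemented as a first-order high-pass filter. Here is a minimal sketch, assuming the form y[n] = x[n] - alpha * x[n-1]. The function name `pre_emphasize` and the default `alpha = 0.97` are assumptions for illustration and not necessarily the values used for the signals in this post.

```python
def pre_emphasize(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] (alpha = 0.97 is an assumed value)."""
    if not signal:
        return []
    out = [signal[0]]  # the first sample has no predecessor, so it is kept as-is
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out
```

A constant (DC) input is attenuated strongly while fast changes pass through, which is the high-frequency boost we want before LPC analysis.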
As we know, the whole process of LPC coefficient extraction can be divided into the following stages:
First, we would like to find the LPC prediction error for the voiced and unvoiced parts of a speech signal. The frames we extracted are /sh/ and /iy/ from the word “she”. Each frame is 30 ms long, which at a sampling rate of 16,000 Hz is 480 samples.
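The frame size above is simple arithmetic, and it can be sketched as follows. The helper names `frame_length` and `extract_frame`, and the start index used in the test, are hypothetical. Where /sh/ and /iy/ actually begin in the recording has to be found by inspection.

```python
def frame_length(duration_ms, sample_rate_hz):
    """Number of samples in a frame of the given duration."""
    return duration_ms * sample_rate_hz // 1000

def extract_frame(signal, start, length):
    """Return `length` consecutive samples beginning at index `start`."""
    frame = signal[start:start + length]
    if len(frame) < length:
        raise ValueError("signal is too short for the requested frame")
    return frame
```

With 30 ms and 16,000 Hz, `frame_length` gives the 480 samples quoted above.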
The error is computed from the equation where is derived from the LPCs.
We may wonder how the error varies with the LPC order, so a plot of error vs. order is shown below:
It can be seen that the error falls as the LPC order increases, and that it changes little once the order is high enough.
An illustration is shown below for Residual Error vs Time for LPCs of order 12.
As we can see, there’s a periodic impulse train for voiced speech. It’s due to the vibration of the vocal folds at the glottis, and its period corresponds to what we call pitch. For unvoiced speech, we don’t see the same structure; the residual looks like white noise.
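The error analysis above can be reproduced with a small sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion and the predictor convention e[n] = s[n] - sum_k a_k * s[n-k]. This is one standard way to obtain LPCs and may not be exactly the routine used for the plots. All function names are our own.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of a single frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a, reflection, energy).

    a          : predictor coefficients a_1..a_p
    reflection : reflection coefficients k_1..k_p
    energy     : prediction-error energy for orders 0..p
    Assumes r[0] > 0, i.e. the frame is not silent.
    """
    a, reflection, energy = [], [], [r[0]]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / energy[-1]
        # Update the lower-order coefficients, then append the new one.
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        reflection.append(k)
        energy.append(energy[-1] * (1.0 - k * k))
    return a, reflection, energy

def residual(frame, a):
    """Prediction error e[n]; samples before the frame are taken as zero."""
    out = []
    for n in range(len(frame)):
        predicted = sum(a[k] * frame[n - 1 - k]
                        for k in range(len(a)) if n - 1 - k >= 0)
        out.append(frame[n] - predicted)
    return out
```

The `energy` list is the error vs. order curve: each new order multiplies the previous energy by (1 - k²), so it can only stay level or fall. For an order-12 analysis of a 480-sample frame, one would call `levinson_durbin(autocorrelation(frame, 12), 12)` and plot `residual(frame, a)` against time.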
The coefficients we obtained from LPC can be implemented using a ladder structure, but most of the time we will want to implement them in a lattice structure. The computation is more efficient, because the error is computed stage by stage through the lattice filter, and we can choose the number of stages we want for our error.
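To make the stage-by-stage idea concrete, here is a minimal sketch of an analysis lattice. It assumes the reflection coefficients k_i are already known and uses the sign convention f_i[n] = f_{i-1}[n] - k_i * b_{i-1}[n-1], b_i[n] = b_{i-1}[n-1] - k_i * f_{i-1}[n]. Other texts flip the sign of k_i. The function name `lattice_errors` is our own.

```python
def lattice_errors(signal, reflection):
    """Forward prediction error after each lattice stage.

    Returns a list whose entry i is the forward error sequence of order i
    (entry 0 is the input signal itself).
    """
    f = list(signal)  # forward error, stage 0
    b = list(signal)  # backward error, stage 0
    stages = [f]
    for k in reflection:
        b_delayed = [0.0] + b[:-1]  # b[n-1], with zero before the frame starts
        f_next = [fn - k * bd for fn, bd in zip(f, b_delayed)]
        b_next = [bd - k * fn for fn, bd in zip(f, b_delayed)]
        f, b = f_next, b_next
        stages.append(f)
    return stages
```

Because every stage outputs a valid prediction error, we can stop after any number of stages without recomputing the earlier ones. That is the practical advantage over recomputing a direct-form filter for each order.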
An illustration of reflection coefficients vs time is shown below:
Next, we will obtain the spectrum derived from the LPC coefficients where the spectrum is given by:
Thus we obtain:
We can see that it’s the envelope of the spectrum of the original signal. It also shows the formants of the speech, which is what we want in speech recognition.
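The envelope can be evaluated directly from the coefficients. The sketch below assumes the usual all-pole model H(e^{jw}) = G / (1 - sum_k a_k e^{-jwk}) with the predictor sign convention. The `gain` and `n_points` parameters and the function name are assumptions for illustration.

```python
import cmath
import math

def lpc_spectrum(a, gain=1.0, n_points=256):
    """|H(e^{jw})| sampled at n_points frequencies from w = 0 to w = pi.

    a holds the predictor coefficients a_1..a_p; n_points must be >= 2.
    """
    mags = []
    for m in range(n_points):
        w = math.pi * m / (n_points - 1)
        denom = 1.0 - sum(ak * cmath.exp(-1j * w * (k + 1))
                          for k, ak in enumerate(a))
        mags.append(gain / abs(denom))
    return mags
```

Peaks in the returned magnitudes sit where the poles of the model come close to the unit circle, which is where we read off the formants.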
Next we’ll obtain the cepstrum of the speech by taking the inverse Fourier transform of the log of the spectrum obtained from LPC.
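As a rough sketch of this step, the function below takes a log-magnitude spectrum sampled uniformly around the whole unit circle and returns its inverse DFT. Since the log magnitude of a real signal is real and even, a cosine sum is enough. It is written as a plain O(N²) loop for readability, and in practice an FFT routine would be used. The function name is our own.

```python
import math

def real_cepstrum(log_mag):
    """Inverse DFT of a real, even log-magnitude spectrum.

    log_mag[m] is log|H| at frequency 2*pi*m/N for m = 0..N-1.
    The index of the returned list is the quefrency in samples.
    """
    n_fft = len(log_mag)
    return [sum(v * math.cos(2.0 * math.pi * m * q / n_fft)
                for m, v in enumerate(log_mag)) / n_fft
            for q in range(n_fft)]
```

For a one-pole model with coefficient a and unit gain, the low-quefrency values come out close to a^q / (2q), which is a handy sanity check.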
The plot of the cepstrum vs quefrency is shown below:
We removed the first coefficient of the cepstrum as it depends only on [ref]. We then perform a discrete cosine transform to obtain the spectrum.
Next, we’d like to introduce MFCC, which is a commonly used method for obtaining the cepstrum of a speech signal. The difference between MFCC and the ordinary method of obtaining the cepstrum is that MFCC warps the frequency axis onto the mel scale, so low-frequency components get finer resolution than high-frequency ones. This simulates the human ear, whose sensitivity to frequency differences decreases roughly log-linearly as frequency increases. The diagram below shows how the mel filter bank places this emphasis:
It can be seen that only 5 filter banks are needed to cover 4 kHz to 8 kHz.
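The warping can be sketched by placing filter centres at equal steps on the mel scale and mapping them back to Hz. We assume the common formula mel = 2595 * log10(1 + f / 700). The number of filters passed in is up to the caller, and the 26 used in the note below is a hypothetical choice, not the setting behind the diagram.

```python
import math

def hz_to_mel(f_hz):
    """Convert Hz to mel using the common 2595 * log10(1 + f/700) formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_low_hz, f_high_hz):
    """Centre frequencies (Hz) of n_filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]
```

Calling `mel_center_frequencies(26, 0.0, 8000.0)` gives centres that are packed closely at low frequencies and spread out towards 8 kHz, so the upper half of the band is covered by far fewer filters than the lower half.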
Next, we would like to use MFCC to obtain the spectrum of /iy/, and then use the DCT to obtain the spectrum of the resulting cepstrum:
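The DCT step can be sketched as follows, assuming the log mel filter-bank energies of the frame have already been computed. We use an orthonormal DCT-II to get the cepstral coefficients and its inverse to go back to a spectrum. The function names and the `n_keep` parameter are our own.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II: log filter-bank energies -> cepstral coefficients."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out

def inverse_dct(coeffs, n_keep=None):
    """Inverse of dct_ii using only the first n_keep coefficients.

    Keeping fewer coefficients yields a smoother reconstructed spectrum.
    """
    n = len(coeffs)
    n_keep = n if n_keep is None else min(n_keep, n)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n_keep):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out
```

Keeping all coefficients reproduces the log energies exactly, while keeping only the first few retains the slowly varying envelope and discards the fine detail.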