LPC & Cepstrum & MFCC

As the title suggests, there are several ways of extracting important information from speech signals. We’ll dive into each of them.

All speech signals will be pre-emphasized by a pre-emphasis filter H(z) = 1 - 0.975z^{-1}.
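As a minimal sketch of this step in Python (NumPy/SciPy assumed; the random signal below is just a stand-in for a real recording):

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                        # sampling rate used throughout this post
speech = np.random.randn(fs)      # stand-in for one second of real speech

# Pre-emphasis: H(z) = 1 - 0.975 z^{-1}, a single-tap FIR high-frequency boost
pre_emphasized = lfilter([1.0, -0.975], [1.0], speech)
```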

 

As we know, the whole process of LPC coefficient extraction can be divided into the following stages:

[Figure: stages of LPC analysis and synthesis. Source: https://www.mathworks.com/help/dsp/examples/lpc-analysis-and-synthesis-of-speech.html]

First, we would like to find the LPC prediction error from voiced and unvoiced parts of a speech signal. The frames we extracted are /sh/ (unvoiced) and /iy/ (voiced) from the word “she”. Each frame is 30 ms long at a sampling rate of 16000 Hz, which is 480 samples.

The error is computed from the equation e[n] = s[n] - \hat{s}[n], i.e. error = original speech - estimated speech, where the estimated speech \hat{s}[n] = \sum_{i=1}^{p} a_i s[n-i] is derived from the LPC coefficients.
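A sketch of this computation, assuming librosa for the LPC fit and reusing the pre-emphasized signal from above (the frame position is hypothetical):

```python
import numpy as np
import librosa
from scipy.signal import lfilter

frame = pre_emphasized[:480]           # hypothetical 30 ms frame at 16 kHz
order = 12
a = librosa.lpc(frame, order=order)    # returns [1, a1, ..., ap] for A(z)

# Filtering the frame by A(z) yields the prediction error directly:
# e[n] = s[n] - s_hat[n]
residual = lfilter(a, [1.0], frame)
error_energy = np.sum(residual ** 2)
```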

We may wonder how the error varies with the LPC order, so a plot of error vs. order is shown below:

[Figure: prediction error vs. LPC order (error_vs_order.jpg)]

It can be seen that the error falls steeply for the first few orders and then changes little as the order increases further.
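That curve can be reproduced by sweeping the order and measuring the residual energy; a quick sketch reusing the frame from the previous snippet:

```python
orders = range(1, 31)
energies = [np.sum(lfilter(librosa.lpc(frame, order=p), [1.0], frame) ** 2)
            for p in orders]
# energies drops quickly at low orders, then flattens out
```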

 

An illustration of residual error vs. time is shown below for LPCs of order 12.

[Figure: residual error vs. time for voiced and unvoiced frames (residual_error_vs_order.jpg)]

As we can see, there’s a periodic impulse train in the residual of voiced speech. It’s due to the vibration of the glottis, and its period is what we call the pitch period. For unvoiced speech we won’t see the same structure; all we see is white noise.

The coefficients we obtained from LPC can be implemented in a ladder (direct-form) structure, but most of the time we will implement the filter as a lattice instead: the computation is more efficient, the error is computed through the stages of the lattice filter, and we can choose the number of stages according to the error we are willing to accept.
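The reflection coefficients k_i that parameterize the lattice fall out of the Levinson-Durbin recursion; below is a self-contained sketch (the function and variable names are ours, not from any library). Note the sign convention: here A(z) = 1 + \sum a_i z^{-i}, so the predictor coefficients of the formula later in this post are -a[1:].

```python
import numpy as np

def levinson_durbin(r, order):
    """Autocorrelation r[0..order] -> LPC a, reflection coeffs k, error E_p."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = -acc / e
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k[i - 1] * a_prev[i - j]
        a[i] = k[i - 1]
        e *= 1.0 - k[i - 1] ** 2    # residual energy shrinks at every stage
    return a, k, e

# Autocorrelation of the frame, lags 0..12
r = np.array([np.dot(frame[:len(frame) - m], frame[m:]) for m in range(13)])
a, k, E_p = levinson_durbin(r, order=12)
```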

An illustration of reflection coefficients k_i vs time is shown below:

[Figure: reflection coefficients k_i vs. time (K_time.jpg)]

Next, we will obtain the spectrum derived from the LPC coefficients, which is given by:

H(e^{j\omega}) = \frac{G}{1 - \sum_{i=1}^{p} a_i e^{-j\omega i}} = \frac{G}{A(e^{j\omega})}; \quad G^2 = E_p

Thus we obtain:

[Figure: LPC spectrum, amplitude vs. frequency (amp_freq.jpg)]

We can see that it’s the envelope of the spectrum of the original signal. It also shows the variation of the formants of the speech, which is what we want in speech recognition.
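Numerically, this envelope is just the frequency response of the all-pole filter G / A(z); a sketch with scipy, taking G = sqrt(E_p) from the recursion above:

```python
from scipy.signal import freqz

G = np.sqrt(E_p)
w, H = freqz([G], a, worN=512, fs=16000)     # evaluates G / A(e^{jw})
envelope_db = 20 * np.log10(np.abs(H))
# Overlaying envelope_db on the frame's FFT magnitude shows it tracing
# the formant peaks
```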

 

Next we’ll obtain the cepstrum of the speech by taking the inverse Fourier transform of the log of the spectrum obtained from LPC.
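A sketch of this step, evaluating the LPC spectrum on a full FFT grid first (the grid size is our choice):

```python
n_fft = 512
H_full = G / np.fft.fft(a, n_fft)                 # LPC spectrum, all n_fft bins
cepstrum = np.fft.ifft(np.log(np.abs(H_full))).real
quefrency = np.arange(n_fft) / 16000.0            # quefrency axis in seconds
```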

The plot of the cepstrum vs quefrency is shown below:

[Figure: cepstrum vs. quefrency (ceps_que)]

We removed the first coefficient of the cepstrum as it depends only on G [ref]. We then perform a Discrete Cosine Transform to obtain the spectrum.

[Figure: spectrum recovered from the cepstrum (amp_freq_3)]

 

Next, we’d like to introduce MFCC, which is a commonly used method for obtaining the cepstrum of a speech signal. The difference between MFCC and the ordinary method of obtaining the cepstrum is that MFCC warps the frequency axis of the speech signal onto the mel scale, allocating fewer, wider filter banks to the high-frequency components. This warping simulates the human ear’s sensitivity to speech, which decreases roughly log-linearly with frequency. We will show how this is done by MFCC in the diagram below:

[Figure: mel-scale triangular filter banks (mel_triangle.jpg)]

It can be seen that only 5 banks are needed to cover 4 kHz to 8 kHz.
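librosa ships a ready-made mel filter bank; a short sketch (the bank count here is our choice) that makes the widening of the triangles toward high frequency easy to inspect:

```python
import librosa

mel_fb = librosa.filters.mel(sr=16000, n_fft=512, n_mels=26)
# mel_fb has shape (26, 257): one triangular filter per row over the FFT bins.
# Rows covering high frequencies are far wider than the low-frequency ones.
```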

 

Next, we would like to use MFCC to obtain the cepstrum of /iy/ and then use the DCT to obtain the spectrum from the resulting cepstrum:

[Figure: spectrum of /iy/ recovered from the MFCC cepstrum (amp_freq_5.jpg)]
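A sketch of that pipeline, reusing mel_fb and the frame from the earlier snippets (the filter and coefficient counts are assumptions):

```python
from scipy.fft import dct, idct

power = np.abs(np.fft.rfft(frame, n=512)) ** 2   # power spectrum of the frame
log_mel = np.log(mel_fb @ power + 1e-10)         # warped log filter-bank energies
mfcc = dct(log_mel, type=2, norm='ortho')        # cepstrum on the mel scale

# Keep low-quefrency coefficients (dropping c0 as before) and invert the
# DCT to recover a smoothed spectrum on the mel axis
kept = np.zeros_like(mfcc)
kept[1:13] = mfcc[1:13]
smoothed_log_mel = idct(kept, type=2, norm='ortho')
```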

 

 
