GMM-Based Speaker Recognition

Introduction

GMM vs K-Means

First, we’ll have to understand what hard decisions and soft decisions are.

Hard Decision

A data point is assigned to exactly one cluster, and the assignment is final. This is what K-means does.

Soft Decision

A data point is modeled by a distribution over clusters: each cluster has some probability of having generated the point, so there is no definite assignment to any particular cluster. This is what a GMM does.
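To make the contrast concrete, here is a minimal scikit-learn sketch (not part of the pipeline used below) comparing a hard K-means assignment with a soft GMM assignment:

# Minimal sketch: hard vs. soft cluster assignments with scikit-learn.
# Illustrative only; the experiments below use HTK features and our own GMM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 2-D blobs
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(2.5, 1.0, (100, 2))])

# Hard decision: each point gets exactly one cluster label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.predict(X[:3]))              # e.g. [0 0 1] -- final assignments

# Soft decision: each point gets a probability for every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X[:3]).round(3))  # e.g. [[0.93 0.07] ...]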

Implementation

Feature Extraction

Pre-processing is a vital step for speech recognition tasks. The most commonly used features extracted from speech signals are MFCCs (Mel-Frequency Cepstral Coefficients). We won’t be implementing the feature extraction ourselves; instead we’ll use the Hidden Markov Model Toolkit (HTK).

The conversion step can be done using the following command:

HCopy -C config your_wav_file.wav mfcc_file.mfc

Where the contents of config are as follows:

# Waveform parameters
SOURCEFORMAT = ALIEN    # raw format, header length given below
HEADERSIZE = 1024
SOURCERATE = 625.0      # sample period, unit: 100 ns (i.e. 16 kHz)
# Coding parameters
TARGETKIND = MFCC_E     # output is MFCC plus energy
TARGETRATE = 100000.0   # window shift for analysis, unit: 100 ns (10 ms)
SAVECOMPRESSED = F
SAVEWITHCRC = T
WINDOWSIZE = 320000.0   # window width for analysis, unit: 100 ns (32 ms)
ZMEANSOURCE = T         # remove signal bias (DC offset)
USEHAMMING = T          # apply Hamming window before FFT
PREEMCOEF = 0.97        # pre-emphasis: 1 - 0.97z^-1
NUMCHANS = 24           # number of filterbank channels
USEPOWER = F
CEPLIFTER = 22          # cepstral liftering of the MFCCs
LOFREQ = 0              # filterbank begins at 0 Hz
HIFREQ = 8000           # filterbank stops at 8000 Hz
NUMCEPS = 12            # number of cepstral coefficients
ENORMALISE = T
ALLOWCXTEXP = F

If you have multiple wav files to convert, you can use the following command:

HCopy -C config -S convert.scp

where convert.scp is a script file that specifies the source and destination of each file. An example is shown below:

WAV/FBCG1/SA1.WAV MFCC/FBCG1/1.mfc
WAV/FBCG1/SA2.WAV MFCC/FBCG1/2.mfc
WAV/FBCG1/SI1612.WAV MFCC/FBCG1/3.mfc
WAV/FBCG1/SI2242.WAV MFCC/FBCG1/4.mfc
WAV/FBCG1/SI982.WAV MFCC/FBCG1/5.mfc
WAV/FBCG1/SX172.WAV MFCC/FBCG1/6.mfc
WAV/FBCG1/SX262.WAV MFCC/FBCG1/7.mfc
WAV/FBCG1/SX352.WAV MFCC/FBCG1/8.mfc
WAV/FBCG1/SX442.WAV MFCC/FBCG1/9.mfc
WAV/FBCG1/SX82.WAV MFCC/FBCG1/10.mfc
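To work with the extracted features in Python, the binary .mfc files can be parsed directly. Below is a minimal sketch assuming the standard uncompressed big-endian HTK header; the read_htk name and layout are my own:

# Minimal sketch: read an uncompressed HTK feature file (.mfc) into numpy.
# Assumes the standard big-endian HTK header; with SAVEWITHCRC = T a short
# checksum may trail the data, which this reader simply ignores.
import struct
import numpy as np

def read_htk(path):
    with open(path, "rb") as f:
        # 12-byte header: nSamples, sampPeriod (int32), sampSize, parmKind (int16)
        n_samples, samp_period, samp_size, parm_kind = struct.unpack(">iihh", f.read(12))
        n_dims = samp_size // 4                   # float32 values per frame
        data = np.frombuffer(f.read(n_samples * samp_size), dtype=">f4")
    return data.reshape(n_samples, n_dims), samp_period, parm_kind

feats, period, kind = read_htk("MFCC/FBCG1/1.mfc")
print(feats.shape)  # (num_frames, 13) for MFCC_E: 12 cepstra + energy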

The plot below shows two MFCC features for two individuals; differences between the speakers can be observed.

Tricks in copying files with same extension in Windows

On *nix systems we can dump filenames to a file using ls > dest_file, but this doesn’t work on Windows, so I use the following instead:

  • dir /b > dest_file // dumps only the filenames to dest_file
  • dir > dest_file // dumps the full listing of the current directory to dest_file

This is really useful for creating the convert.scp mentioned above; a cross-platform alternative is sketched below.
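For completeness, here is a hypothetical Python sketch that generates convert.scp directly, assuming the WAV/<speaker>/ and MFCC/<speaker>/ layout shown earlier (lexicographic sorting happens to reproduce the file order listed above):

# Minimal sketch: build convert.scp from a directory of .WAV files.
# Assumes the WAV/<speaker>/ and MFCC/<speaker>/ layout used above.
from pathlib import Path

speaker = "FBCG1"
wav_files = sorted(Path("WAV", speaker).glob("*.WAV"))

with open("convert.scp", "w") as scp:
    for i, wav in enumerate(wav_files, start=1):
        scp.write(f"{wav.as_posix()} MFCC/{speaker}/{i}.mfc\n")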

Building Gaussian Mixture Model

Next, we need a good model to base our features upon. As Gaussians are smooth functions that work well for modeling natural signals, the Gaussian Mixture Model (GMM) is a widely used method for modeling features in speech tasks. A GMM is a weighted sum of Gaussian components, each with its own weight, mean, and covariance. An illustration is shown below: [source]

[Figure: a Gaussian mixture density formed from weighted Gaussian components]

The equation for Gaussian Mixture Density is shown below:

p(x|\lambda) = \sum_{i=1}^{M} p_i\, b_i(x)

Component Density

b_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2} } e^{-\frac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) }

Speaker Model

\lambda = \{p_i, \mu_i, \Sigma_i\}, \quad i = 1, \dots, M

From the equations above, the only information we have is x, the MFCC feature vectors extracted from the speech signal. We want to estimate p_i, \mu_i, and \Sigma_i, the unobserved parameters of the model.

The standard approach to estimating these parameters is the well-known Expectation-Maximization (EM) algorithm, which iteratively refines the model so that p(X|\hat{\lambda}) \geq p(X|\lambda). On each EM iteration, a monotonic increase in the model’s likelihood is guaranteed.

EM algorithm can be broken down into the following 2 steps:

Expectation

The a posteriori probability of acoustic class i is computed by:

p(i|x_t,\lambda) = \frac{p_i\, b_i(x_t)}{\sum_{k=1}^{M} p_k\, b_k(x_t)}

Maximization

The posteriors computed in the expectation step are then used to re-estimate p_i, \mu_i, and \Sigma_i as given below:

Mixture Weights

\hat{p}_i = \frac{1}{T}\sum_{t=1}^{T} p(i|x_t,\lambda)

Means

\hat{\mu}_i = \frac{\sum_{t=1}^{T} p(i|x_t,\lambda)\, x_t}{\sum_{t=1}^{T} p(i|x_t,\lambda)}

Variances

\hat{\sigma}_i^2 = \frac{\sum_{t=1}^{T} p(i|x_t,\lambda)\, x_t^2}{\sum_{t=1}^{T} p(i|x_t,\lambda)} - \hat{\mu}_i^2

Convergence

Convergence of the EM algorithm can be observed through the value of the log-likelihood: we can stop once the change in log-likelihood between iterations becomes small. The log-likelihood is given below:

\log l(\lambda) = \sum_{t=1}^{T} \log p(x_t|\lambda) = \sum_{t=1}^{T} \log \sum_{k=1}^{M} p_k\, b_k(x_t)
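Putting the two steps together, below is a minimal NumPy sketch of EM for a diagonal-covariance GMM (diagonal covariances are a common choice for MFCC features; this is an illustrative re-implementation, not the code used for the experiments, and the names are my own):

# Minimal EM sketch for a diagonal-covariance GMM (illustrative, unoptimized).
import numpy as np

def fit_gmm(X, M, n_iter=100, tol=1e-4, seed=0):
    T, D = X.shape
    rng = np.random.default_rng(seed)
    p = np.full(M, 1.0 / M)                       # mixture weights p_i
    mu = X[rng.choice(T, size=M, replace=False)]  # means picked from the data
    var = np.tile(X.var(axis=0), (M, 1))          # diagonal variances sigma_i^2
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: log b_i(x_t) for every frame t and component i -> (T, M)
        log_b = -0.5 * np.sum((X[:, None, :] - mu) ** 2 / var
                              + np.log(2 * np.pi * var), axis=2)
        log_w = np.log(p) + log_b
        ll = np.sum(np.logaddexp.reduce(log_w, axis=1))   # log-likelihood
        post = np.exp(log_w - np.logaddexp.reduce(log_w, axis=1, keepdims=True))
        # M-step: re-estimate weights, means and variances (formulas above)
        Ni = post.sum(axis=0)
        p = Ni / T
        mu = post.T @ X / Ni[:, None]
        var = post.T @ (X ** 2) / Ni[:, None] - mu ** 2
        var = np.maximum(var, 1e-6)               # guard against collapse
        if ll - prev_ll < tol:                    # convergence check
            break
        prev_ll = ll
    return p, mu, var, ll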

Visualization

As MFCC features have more than two dimensions (13 here: 12 cepstral coefficients plus energy), it’s hard to visualize GMM results directly. Thus, we’ll use only the first two MFCC features to visualize the fitted models, as sketched below.
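Here is a minimal matplotlib sketch of such a 2-D visualization, reusing the read_htk and fit_gmm sketches from earlier:

# Minimal sketch: contour plot of a diagonal-covariance GMM fitted on the
# first two MFCC features (uses read_htk and fit_gmm sketched above).
import numpy as np
import matplotlib.pyplot as plt

feats, _, _ = read_htk("MFCC/FBCG1/1.mfc")
X2 = feats[:, :2]                                  # first two MFCC dimensions
p, mu, var, _ = fit_gmm(X2, M=8)

# Evaluate the mixture density on a grid covering the data
gx, gy = np.meshgrid(np.linspace(X2[:, 0].min(), X2[:, 0].max(), 200),
                     np.linspace(X2[:, 1].min(), X2[:, 1].max(), 200))
grid = np.column_stack([gx.ravel(), gy.ravel()])
log_b = -0.5 * np.sum((grid[:, None, :] - mu) ** 2 / var
                      + np.log(2 * np.pi * var), axis=2)
density = np.exp(np.logaddexp.reduce(np.log(p) + log_b, axis=1))

plt.scatter(X2[:, 0], X2[:, 1], s=2, alpha=0.3)
plt.contour(gx, gy, density.reshape(gx.shape), levels=10)
plt.xlabel("MFCC 1"); plt.ylabel("MFCC 2")
plt.show()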

Histogram (Continuous)

The histogram plots of the first two MFCC features for two different speakers are shown below:

It can be seen that different speakers have distinctive feature distributions based on the histograms alone.

GMM Results (2 Dimension)

The GMM results for two individuals, using different numbers of mixtures (8, 16, 32, 64), are shown below:

Dataset: FCEG0

[Figures: GMM fits for FCEG0 with 8, 16, 32, and 64 mixtures]

Dataset: MBCG0

[Figures: GMM fits for MBCG0 with 8, 16, 32, and 64 mixtures]

Speaker Recognition using GMM Model

I’ve built my models using the TIMIT dataset with the following speakers (the order remains the same throughout this experiment):

FBCG1
FCEG0
FCLT0
FJRB0
FKLH0
FMBG0
FNKL0
FPLS0
MBVG0
MBSB0

From the models built above, we are able to identify a speaker using the formula:

\underset{i}{\arg\max}\ \log P(O_1, O_2, \dots, O_N \mid \Lambda_i), \quad \Lambda_i = (p_i, \mu_i, \Sigma_i)

where

\log P(O_1, O_2, \dots, O_N \mid \Lambda_i) = \sum_{n=1}^{N} \log P(O_n \mid \Lambda_i)

\forall i = 1, \dots, K

where N is the number of test samples and K is the number of speaker models trained.
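As a sketch, this identification rule can be implemented by scoring the test frames against every speaker model and taking the arg max (reusing fit_gmm from the EM sketch above):

# Minimal sketch: identify a speaker as the model with the highest total
# log-likelihood over the test frames.
import numpy as np

def log_likelihood(X, p, mu, var):
    # sum over frames of log p(x_t | lambda), diagonal-covariance GMM
    log_b = -0.5 * np.sum((X[:, None, :] - mu) ** 2 / var
                          + np.log(2 * np.pi * var), axis=2)
    return np.sum(np.logaddexp.reduce(np.log(p) + log_b, axis=1))

# models[k] = (p, mu, var, ll) trained on speaker k's enrollment features
def identify(X_test, models):
    scores = [log_likelihood(X_test, *m[:3]) for m in models]
    return int(np.argmax(scores))

# e.g. models = [fit_gmm(load_speaker_feats(s), M=8) for s in speakers]
# print(identify(test_feats, models))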

I’ve trained one model per speaker using 2 .wav files each (approximately 6 seconds of audio in total). In the TIMIT dataset given, each speaker has 10 .wav files, where each audio clip consists of approximately 3 seconds of speech.

Using the trained models, I’ve tested with the 3rd to 10th audio clips of each speaker and obtained the following accuracies:

8 Gaussian Mixtures

[Figure: recognition accuracy with 8 mixtures]

16 Gaussian Mixtures

[Figure: recognition accuracy with 16 mixtures]

32 Gaussian Mixtures

[Figure: recognition accuracy with 32 mixtures]

The accuracy here is the number of speakers guessed correctly out of all speakers in the test set.

It can be observed that the accuracy starts to fall as the number of Gaussian mixtures increases. This is due to over-fitting of the Gaussian components, which can be seen in the 3D GMM plots given above: the extreme case of 64 mixtures shows clear signs of over-fitting (thin, elongated Gaussian components). The over-fitting stems from the limited variation in MFCC features: MFCCs characterize the vocal tract, which can already be represented well with around 8 components (much as it can be modeled well with 8 LPCs).

Observing speech of second best speakers

This experiment is done using 8 Gaussian mixtures, as that gave the best results. The second-best speaker for each test speaker is:

5 1 5 5 1 1 5 6 3 6

where the numbers index into the speaker list given above.

This doesn’t match my expectations, as the second-best match for a male speaker should also be male. I’m not sure what I failed to take into consideration.

Speaker Recognition (Without Consideration of Energy Band)

I’ve repeated the entire experiment by changing the parameter in the config file for HTK’s HCopy from

TARGETKIND = MFCC_E

to

TARGETKIND = MFCC

so that the energy coefficient of the MFCC is not taken into account in the recognition task. It’s unfair to include energy, since speech signals of different lengths will contain different amounts of energy.
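Equivalently, instead of re-extracting the features, the energy term can be dropped from already-extracted MFCC_E files. A minimal sketch, relying on HTK appending the energy after the 12 cepstra:

# Minimal sketch: drop the appended energy term from MFCC_E features.
# With TARGETKIND = MFCC_E, each frame is 12 cepstra followed by energy,
# so energy is the last column.
feats, period, kind = read_htk("MFCC/FBCG1/1.mfc")  # reader sketched earlier
feats_no_energy = feats[:, :-1]                     # keep the 12 cepstra only
print(feats_no_energy.shape)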

8 Gaussian Mixtures

[Figure: recognition accuracy without energy, 8 mixtures]

16 Gaussian Mixtures

[Figure: recognition accuracy without energy, 16 mixtures]

32 Gaussian Mixtures

[Figure: recognition accuracy without energy, 32 mixtures]

It can be observed that the accuracy is higher compared to the previous experiment, which included energy in the MFCC, for each respective number of Gaussian mixtures.

Observing speech of second best speakers

The same procedure as above is repeated; the second-best speakers are:

6 3 5 7 4 8 5 6 3 6

Using Different Length of Test Data

This experiment repeats the recognition using approximately 6 seconds of test data instead of a single 3-second clip.

[Figure: recognition accuracy with longer (6-second) test data]

As expected, the accuracy is higher with longer test data.

Source Code

References

GMM based Language Identification using MFCC and SDC Features

Tutorial on GMM 1

Tutorial on GMM 2

 
