# GMM-Based Speaker Recognition

## Introduction

### GMM vs K-Means

First, we’ll have to understand the difference between hard decisions and soft decisions.

#### Hard Decision

A data point is assigned to exactly one cluster, and that assignment is final.

#### Soft Decision

A data point is modeled by a distribution over clusters: every assignment is probabilistic, and no point is definitively tied to a single cluster.
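The contrast can be sketched on toy data with scikit-learn (assumed available here): K-Means returns a single hard label per point, while a GMM returns a probability for every cluster.

```python
# Sketch: hard vs. soft cluster assignments on toy 1-D data,
# using scikit-learn's KMeans and GaussianMixture.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 1-D clusters
X = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(3.0, 1.0, 200)]).reshape(-1, 1)

# Hard decision: each point gets exactly one cluster label
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft decision: each point gets a probability for every cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)   # shape (400, 2), each row sums to 1

print(hard[:3])               # integer labels
print(soft[:3].round(2))      # per-cluster probabilities
```

Points near the overlap region get probabilities close to 0.5/0.5 from the GMM, information that the hard K-Means label throws away.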

## Implementation

### Feature Extraction

Pre-processing is a vital step for speech recognition tasks. The most commonly used method for extracting features from speech signals is MFCC (Mel-Frequency Cepstral Coefficients). We won’t be implementing this feature extraction ourselves; instead, we’ll use the Hidden Markov Model Toolkit (HTK).

The conversion step can be done using the following command:

```
HCopy -C config your_wav_file.wav mfcc_file.mfc
```

Where the contents of config are as follows:

```
# Waveform parameters
SOURCEFORMAT = ALIEN    # headerless file (header length = 0)
SOURCERATE = 625.0      # sample period, unit: 100 ns (i.e. 16 kHz)
# Coding parameters
TARGETKIND = MFCC_E     # output is MFCC plus energy
TARGETRATE = 100000.0   # window shift for analysis, unit: 100 ns (10 ms)
SAVECOMPRESSED = F
SAVEWITHCRC = T
WINDOWSIZE = 320000.0   # window width for analysis, unit: 100 ns (32 ms)
ZMEANSOURCE = T         # remove signal bias
USEHAMMING = T          # apply Hamming window before FFT
PREEMCOEF = 0.97        # pre-emphasis: 1 - 0.97z^-1
NUMCHANS = 24           # number of filterbank channels
USEPOWER = F
CEPLIFTER = 22          # cepstral liftering of the MFCCs
LOFREQ = 0              # filterbank starts at 0 Hz
HIFREQ = 8000           # filterbank stops at 8000 Hz
NUMCEPS = 12            # number of cepstral coefficients
ENORMALISE = T
ALLOWCXTEXP = F
```
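To use the resulting `.mfc` files outside HTK, they can be read with a few lines of NumPy. This is a sketch for the uncompressed case only (`SAVECOMPRESSED = F`, as in the config above); HTK stores a 12-byte big-endian header followed by big-endian 32-bit floats.

```python
# Sketch: read an uncompressed HTK parameter file (e.g. an .mfc
# produced by HCopy) into a NumPy array of shape (frames, dims).
import struct
import numpy as np

def read_htk(path):
    with open(path, "rb") as f:
        # 12-byte header: nSamples, sampPeriod (100 ns units),
        # sampSize (bytes per frame), parmKind (feature type code)
        n_samples, samp_period, samp_size, parm_kind = struct.unpack(
            ">iihh", f.read(12))
        n_dims = samp_size // 4  # 4-byte big-endian floats per coefficient
        data = np.frombuffer(f.read(n_samples * samp_size), dtype=">f4")
    return data.reshape(n_samples, n_dims), samp_period, parm_kind
```

With `MFCC_E` and `NUMCEPS = 12`, each frame should come out as 13 values (12 cepstra plus energy). Any trailing CRC bytes (`SAVEWITHCRC = T`) are simply ignored by this reader.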

If you have multiple .wav files to convert, you can use the following command:

```
HCopy -C config -S convert.scp
```

Here, convert.scp is a script that specifies your source and destination files; an example is shown below:

```
WAV/FBCG1/SA1.WAV MFCC/FBCG1/1.mfc
WAV/FBCG1/SA2.WAV MFCC/FBCG1/2.mfc
WAV/FBCG1/SI1612.WAV MFCC/FBCG1/3.mfc
WAV/FBCG1/SI2242.WAV MFCC/FBCG1/4.mfc
WAV/FBCG1/SI982.WAV MFCC/FBCG1/5.mfc
WAV/FBCG1/SX172.WAV MFCC/FBCG1/6.mfc
WAV/FBCG1/SX262.WAV MFCC/FBCG1/7.mfc
WAV/FBCG1/SX352.WAV MFCC/FBCG1/8.mfc
WAV/FBCG1/SX442.WAV MFCC/FBCG1/9.mfc
WAV/FBCG1/SX82.WAV MFCC/FBCG1/10.mfc
```
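A script like this can be generated automatically instead of typed by hand. The sketch below walks a `WAV/` tree laid out like the example (one subdirectory per speaker); the `WAV/` and `MFCC/` root names are assumptions matching the paths above.

```python
# Sketch: generate a convert.scp for HCopy by walking a WAV/ directory
# tree with one subdirectory per speaker, numbering each speaker's
# files 1.mfc, 2.mfc, ... as in the example.
import os

def write_scp(wav_root="WAV", mfc_root="MFCC", scp_path="convert.scp"):
    with open(scp_path, "w") as scp:
        for speaker in sorted(os.listdir(wav_root)):
            spk_dir = os.path.join(wav_root, speaker)
            # Ensure the destination directory exists; HCopy won't create it
            os.makedirs(os.path.join(mfc_root, speaker), exist_ok=True)
            wavs = sorted(f for f in os.listdir(spk_dir)
                          if f.upper().endswith(".WAV"))
            for i, wav in enumerate(wavs, start=1):
                src = os.path.join(spk_dir, wav)
                dst = os.path.join(mfc_root, speaker, f"{i}.mfc")
                scp.write(f"{src} {dst}\n")
```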

The plot below shows two MFCC features for two individuals; clear differences between the speakers can be observed.

#### Tricks in copying files with same extension in Windows

On *nix systems we can dump filenames to a file with `ls > dest_file`, but this doesn’t work on Windows, so I use the following instead:

- `dir /b > dest_file` (dumps only the filenames to `dest_file`)
- `dir > dest_file` (dumps the full directory listing to `dest_file`)

This is really useful when creating the convert.scp mentioned above.
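A small cross-platform alternative is to let Python do the listing, so the same helper works on both Windows and *nix:

```python
# Sketch: cross-platform equivalent of `ls > dest_file` / `dir /b > dest_file`,
# writing one filename per line.
import os

def dump_filenames(directory, dest_file):
    with open(dest_file, "w") as out:
        for name in sorted(os.listdir(directory)):
            out.write(name + "\n")
```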

### Building Gaussian Mixture Model

Next, we need a good model for our features. Since Gaussians are smooth functions that work well for modeling natural signals, the Gaussian Mixture Model (GMM) is widely used for modeling features in speech recognition tasks. A GMM is a weighted sum of Gaussian component densities, each with its own weight, mean and covariance. An illustration is shown below: [source]

The equation for Gaussian Mixture Density is shown below:

$p(x|\lambda) = \sum_{i=1}^{M} p_i b_i(x)$

Component Density

$b_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2} } e^{-\frac{1}{2} (x - \mu_i)' \Sigma_i^{-1} (x - \mu_i) }$

Speaker Model

$\lambda = \{p_i, \mu_i, \Sigma_i\},\quad i = 1,\dots,M$

From the equations above, the only information we have is $x$ which is the MFCC features extracted from speech signals. We will try to obtain $p_i$, $\mu_i$ and $\Sigma_i$ which are unobserved parameters of the model.
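The two densities above translate directly into NumPy. This sketch assumes diagonal covariances (a common choice for MFCC features), so each $\Sigma_i$ is represented by its diagonal:

```python
# Sketch: evaluate the component density b_i(x) and the mixture
# density p(x|lambda) from the equations above, assuming diagonal
# covariance matrices (var holds the diagonal of Sigma_i).
import numpy as np

def component_density(x, mu, var):
    """b_i(x) for a diagonal-covariance Gaussian."""
    d = x.shape[-1]
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.prod(var))
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var, axis=-1)) / norm

def mixture_density(x, weights, means, variances):
    """p(x|lambda) = sum_i p_i * b_i(x)."""
    return sum(w * component_density(x, mu, var)
               for w, mu, var in zip(weights, means, variances))
```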

The standard approach for obtaining those parameters is the well-known Expectation-Maximization (EM) algorithm. The algorithm seeks a model such that $p(X|\hat{\lambda}) \geq p(X|\lambda)$; on each EM iteration, a monotonic increase in the model’s likelihood is guaranteed.

EM algorithm can be broken down into the following 2 steps:

##### Expectation

The a posteriori probability of acoustic class $i$ is computed by:

$p(i|x_t,\lambda) = \frac{p_i b_i(x_t)}{\sum_{k=1}^{M} p_k b_k(x_t)}$

##### Maximization

The posterior probabilities computed in the expectation step are used here to re-estimate $p_i$, $\mu_i$ and $\Sigma_i$ as follows:

Mixture Weights

$\hat{p}_i = \frac{1}{T}\sum_{t=1}^{T} p(i|x_t,\lambda)$

Means

$\hat{\mu}_i = \frac{\sum_{t=1}^{T} p(i|x_t,\lambda)\, x_t}{\sum_{t=1}^{T} p(i|x_t,\lambda)}$

Variances

$\hat{\sigma}_i^2 = \frac{\sum_{t=1}^{T} p(i|x_t,\lambda)\, x_t^2}{\sum_{t=1}^{T} p(i|x_t,\lambda)} - \hat{\mu}_i^2$

##### Convergence

Convergence of the EM algorithm can be observed through the value of the log-likelihood. We can stop once the change in log-likelihood between iterations is small. The log-likelihood is given by:

$\log l(\lambda) = \sum_{i=1}^{N}\log p(x_i|\lambda) = \sum_{i=1}^{N}\left(\log \sum_{k=1}^{M} p_k b_k(x_i)\right)$
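The two steps and the convergence check above can be sketched as one NumPy loop for a diagonal-covariance GMM. This is a minimal illustration of the update equations; a production implementation should compute the densities in the log domain for numerical stability.

```python
# Sketch: EM for a diagonal-covariance GMM, implementing the
# expectation, maximization and convergence steps described above.
import numpy as np

def em_gmm(X, M, n_iter=50, tol=1e-4, seed=0):
    T, D = X.shape
    rng = np.random.default_rng(seed)
    weights = np.full(M, 1.0 / M)
    means = X[rng.choice(T, M, replace=False)].astype(float)
    variances = np.tile(X.var(axis=0), (M, 1))
    prev_ll = -np.inf
    for _ in range(n_iter):
        # Expectation: posterior p(i | x_t, lambda) for every frame
        norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.prod(variances, axis=1))
        diff = X[:, None, :] - means[None, :, :]            # (T, M, D)
        dens = np.exp(-0.5 * np.sum(diff**2 / variances, axis=2)) / norm
        joint = weights * dens                              # (T, M)
        ll = np.sum(np.log(joint.sum(axis=1)))              # log-likelihood
        post = joint / joint.sum(axis=1, keepdims=True)
        # Maximization: re-estimate weights, means, variances
        Ni = post.sum(axis=0)                               # (M,)
        weights = Ni / T
        means = (post.T @ X) / Ni[:, None]
        variances = (post.T @ X**2) / Ni[:, None] - means**2
        # Convergence: stop once the log-likelihood barely moves
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, variances, ll
```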

#### Visualization

As MFCC features have more than two dimensions (13 here), the GMM results are hard to visualize directly. We’ll therefore use only the first two MFCC coefficients to illustrate the results.

##### Histogram (Continuous)

The histograms of the first two MFCC features for two different speakers are shown below:

It can be seen that different speakers have distinctive features based on the histograms alone.

##### GMM Results (2 Dimension)

The GMM results for 2 individuals (using different number of mixtures [8,16,32,64]) are shown below:
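Plots of this kind can be produced by fitting GMMs of each size to the two-dimensional features and contouring the resulting density. The sketch below uses scikit-learn and matplotlib (both assumed available) with random placeholder data standing in for the real MFCC coefficients:

```python
# Sketch: fit GMMs with [8, 16, 32, 64] mixtures to 2-D features and
# draw their density contours in a 2x2 panel figure.
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render off-screen
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 2))   # placeholder for real MFCC[:, :2]

fig, axes = plt.subplots(2, 2, figsize=(8, 8))
xs, ys = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 4, 100))
grid = np.column_stack([xs.ravel(), ys.ravel()])
for ax, m in zip(axes.ravel(), [8, 16, 32, 64]):
    gmm = GaussianMixture(n_components=m, covariance_type="diag",
                          random_state=0).fit(feats)
    z = -gmm.score_samples(grid).reshape(xs.shape)  # negative log-density
    ax.contour(xs, ys, z, levels=15)
    ax.scatter(feats[:, 0], feats[:, 1], s=2, alpha=0.3)
    ax.set_title(f"{m} mixtures")
fig.savefig("gmm_contours.png")
```

With real MFCC data, the thin elongated components discussed later in the over-fitting analysis become visible in the 64-mixture panel.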

### Speaker Recognition using GMM Model

I’ve built my models using the TIMIT dataset with the following speakers (the order remains the same throughout this experiment):

```
FBCG1
FCEG0
FCLT0
FJRB0
FKLH0
FMBG0
FNKL0
FPLS0
MBVG0
MBSB0
```

From the models built above, we can identify a speaker using the formula:

$\underset{i}{\arg\max}\ log\ (P(O_1,O_2,...,O_N|\Lambda = (p_i,\mu_i,\Sigma_i) ) )$

where

$\log(P(O_1,O_2,\dots,O_N|\Lambda = (p_i,\mu_i,\Sigma_i))) = \sum_{n=1}^{N} \log(P(O_n|\Lambda = (p_i,\mu_i,\Sigma_i)))$

$\forall i = 1,...,K$

where $N$ is the number of test samples and $K$ is the number of speaker models trained.
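The arg-max rule above amounts to summing the per-frame log-densities under every speaker model and picking the highest total. A sketch, where `models` is a hypothetical list of per-speaker `(weights, means, variances)` tuples with diagonal covariances:

```python
# Sketch: score test frames against every speaker model and pick the
# model with the highest total log-likelihood (the arg-max rule above).
import numpy as np

def log_likelihood(X, weights, means, variances):
    D = X.shape[1]
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.prod(variances, axis=1))
    diff = X[:, None, :] - means[None, :, :]            # (T, M, D)
    dens = np.exp(-0.5 * np.sum(diff**2 / variances, axis=2)) / norm
    return np.sum(np.log(dens @ weights))               # sum over frames

def identify(X, models):
    scores = [log_likelihood(X, *m) for m in models]
    return int(np.argmax(scores)), scores
```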

I’ve trained models for 9 speakers using 2 .wav files each (approximately 6 seconds of audio per speaker). In the TIMIT dataset given, each speaker has 10 .wav files, each consisting of approximately 3 seconds of speech.

Using the trained models, I tested with the 3rd to 10th audio clips of each speaker and obtained the following accuracies:

###### 32 Gaussian Mixtures

The accuracy here is the fraction of speakers guessed correctly out of all speakers in the test set.

It can be observed that accuracy starts to fall as the number of Gaussian mixtures increases. This is due to over-fitting of the Gaussian components, which can be seen in the 3-D GMM plots above: the extreme case of 64 mixtures shows clear signs of over-fitting (thin, elongated Gaussian components). The root cause is the limited variation of the MFCC features: MFCCs characterize the vocal tract, which can already be modeled well with around 8 LPC coefficients, so many extra components end up fitting noise.

#### Observing speech of second best speakers

This experiment is done using 8 Gaussian mixtures, as that gave the best results. The second-best speakers are:

```
5 1 5 5 1 1 5 6 3 6
```

where the order follows the speaker list given above.

This doesn’t match my expectation: I expected the second-best match for male speakers to also be male. I’m not sure what I failed to take into account.
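Given the per-speaker log-likelihood scores from the identification step, the second-best speaker falls out of a simple sort:

```python
# Sketch: best and second-best speaker indices from a list of
# per-speaker log-likelihood scores.
import numpy as np

def top_two(scores):
    order = np.argsort(scores)[::-1]      # indices by descending score
    return int(order[0]), int(order[1])   # (best, second best)
```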

### Speaker Recognition (Without Consideration of Energy Band)

I’ve repeated the entire experiment, changing the parameter in the config file for HTK’s HCopy from

```
TARGETKIND = MFCC_E
```

to

```
TARGETKIND = MFCC
```

so that the energy term of the MFCC is not taken into account in the recognition task, since it’s unfair to rely on energy (speech signals of different lengths will have different amounts of energy).

###### 32 Gaussian Mixtures

It can be observed that the accuracy is higher than in the previous experiment, which included energy in the MFCC, for each respective number of Gaussian mixtures.

#### Observing speech of second best speakers

Same procedure as above is repeated.

```
6 3 5 7 4 8 5 6 3 6
```

### Using Different Length of Test Data

This experiment is done using approximately 6 seconds of test data.

As expected, the accuracy is higher with the longer test data.

## References

- GMM based Language Identification using MFCC and SDC Features
- Tutorial on GMM 1
- Tutorial on GMM 2