In this experiment, we fine-tune a VGG19 network pre-trained on ImageNet for the Cifar-10 dataset, using PyTorch. (A Keras version is also available.) VGG19 is well known for producing strong results thanks to its depth; the “19” refers to the number of weight layers it has. The architecture is shown below:

Many VGG19 models pre-trained on ImageNet are available. We will adapt one of them to Cifar-10.

**Optimizer:** SGD with Nesterov Momentum (momentum = 0.9)

**Mini-batch size:** 128

**Total epochs:** 164

**Initial learning rate:** 0.01 (0.001 without Batch Normalization), divided by 10 at epochs 81 and 122

**Loss function:** cross-entropy

**Data Augmentation:** Random Crop + Horizontal Flip
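The learning-rate schedule listed above can be written as a small step function. This is only a sketch: the function name `learning_rate` is hypothetical, and counting epochs from 0 (so the first drop applies from epoch 81 onward) is an assumption, since the post does not say how epochs are indexed.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(81, 122), factor=0.1):
    """Return the learning rate for a given epoch (assumed 0-based).

    The rate is multiplied by `factor` once for every milestone
    that has already been reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


if __name__ == "__main__":
    for epoch in (0, 80, 81, 121, 122, 163):
        print(epoch, learning_rate(epoch))
```

For the runs without Batch Normalization, the same function would be called with `base_lr=0.001`.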

**PyTorch** has its own model zoo, provided by Torchvision, which includes a pre-trained VGG19. I used the model provided here instead, for comparison. The pre-trained model in Torchvision’s model zoo is slightly better than the model I used, which can be seen by testing both pre-trained models on a single image as shown below:

For the model I used, the per-channel means 103.939, 116.779 and 123.68 have to be subtracted from the pixel values of the B, G and R channels respectively.
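A minimal sketch of this preprocessing in plain Python is given below. It assumes the image arrives as rows of (R, G, B) tuples in the 0 to 255 range and has to be reordered to B, G, R before the means are subtracted; the function names are hypothetical and this is not necessarily how the original code did it.

```python
# Per-channel means in B, G, R order, as quoted in the text.
BGR_MEAN = (103.939, 116.779, 123.68)


def preprocess_pixel(rgb):
    """Convert one (R, G, B) pixel to a mean-subtracted (B, G, R) pixel."""
    r, g, b = rgb
    bgr = (b, g, r)  # assumed channel reordering
    return tuple(value - mean for value, mean in zip(bgr, BGR_MEAN))


def preprocess_image(image):
    """Apply preprocess_pixel to an image stored as rows of (R, G, B) tuples."""
    return [[preprocess_pixel(px) for px in row] for row in image]


if __name__ == "__main__":
    print(preprocess_image([[(255, 128, 0), (0, 0, 0)]]))
```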

**notans:** No Pre-trained Model (Random Initialization)

**bn:** Batch Normalization (bn2 indicates a second run with the same settings)

**transfer:** With Pre-trained Model

It can be seen above that the pre-trained model performs well compared to the randomly initialized one. I was unable to get the randomly initialized model to train in PyTorch, but I got it working in the Keras version, weird…

With batch normalization, all networks can be trained successfully. The reason is that batch normalization reduces the dependence on a good initialization, as mentioned in [1]. This allows a fair comparison between the pre-trained model and the randomly initialized model. It can be observed that the pre-trained model has a gain of at least 2% over the latter.

The final test accuracy obtained for this experiment is around 92%. We could have got a better result if the experiment were repeated multiple times, as a good initialization is crucial for obtaining high test accuracy.

As can be observed from the test accuracy of this experiment, initialization plays an important role in the final test accuracy. Batch normalization does help with this problem.

As for framework comparison, I prefer PyTorch over TensorFlow and Keras as a deep learning framework due to its speed and versatility. The drawback of using PyTorch is that there’s no ready-made wrapper for the *embeddings* and *graph* in TensorBoard.

[1] Ioffe, Sergey, and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” *International Conference on Machine Learning*. 2015.

In this lab, we will be implementing Network in Network *[1]*, whose purpose is to enhance model discriminability for local patches within the receptive field. Conventional convolutional layers use linear filters followed by a nonlinear activation function. The downside of the conventional method is that the local receptors are too simple and don’t project local features of an input to a simpler subspace. Network in Network (NiN) addresses this by introducing a micro network right after the convolutional filters, where the fully connected part is shared across the preceding layer (the input layer for the first conv. filter). This concept is shown in the illustration below:

The dataset that we will be using is Cifar-10 *[2]*, which is great for benchmarking purposes.

| Layer | Filter Size | Filter Num | Pad | Stride | Activation |
| --- | --- | --- | --- | --- | --- |
| Conv1 | 5×5 | 192 | 2 | 1 | ReLU |
| mlp1 | 1×1 | 160 | 0 | 1 | ReLU |
| mlp2 | 1×1 | 96 | 0 | 1 | ReLU |
| Max Pool 1 | 3×3 | | 1 | 2 | |
| Dropout 0.5 | | | | | |
| Conv2 | 5×5 | 192 | 2 | 1 | ReLU |
| mlp2-1 | 1×1 | 192 | 0 | 1 | ReLU |
| mlp2-2 | 1×1 | 192 | 0 | 1 | ReLU |
| Max Pool 2 | 3×3 | | 1 | 2 | |
| Dropout 0.5 | | | | | |
| Conv3 | 3×3 | 192 | 1 | 1 | ReLU |
| mlp3-1 | 1×1 | 192 | 0 | 1 | ReLU |
| mlp3-2 | 1×1 | 10 | 0 | 1 | ReLU |
| Avg Pool | 8×8 | | 1 | 2 | |
| Softmax | | | | | |
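As a sanity check on the table, the spatial size of the feature map can be traced layer by layer. The sketch below assumes 32×32 Cifar-10 inputs, the usual floor convention for output sizes, and that the two max-pooling rows mean pad 1 and stride 2 (the table leaves this slightly ambiguous). Under those assumptions the map entering the average pool is 8×8, which matches the 8×8 pooling window.

```python
def out_size(n, kernel, pad, stride):
    """Spatial output size of a conv/pool layer (floor convention)."""
    return (n + 2 * pad - kernel) // stride + 1


# (name, kernel, pad, stride); the pooling values are an assumed reading of the table.
LAYERS = [
    ("Conv1", 5, 2, 1), ("mlp1", 1, 0, 1), ("mlp2", 1, 0, 1),
    ("Max Pool 1", 3, 1, 2),
    ("Conv2", 5, 2, 1), ("mlp2-1", 1, 0, 1), ("mlp2-2", 1, 0, 1),
    ("Max Pool 2", 3, 1, 2),
    ("Conv3", 3, 1, 1), ("mlp3-1", 1, 0, 1), ("mlp3-2", 1, 0, 1),
]


def trace_sizes(n=32):
    """Return (layer name, output size) pairs for an n x n input."""
    sizes = []
    for name, kernel, pad, stride in LAYERS:
        n = out_size(n, kernel, pad, stride)
        sizes.append((name, n))
    return sizes


if __name__ == "__main__":
    for name, size in trace_sizes():
        print("%-10s %dx%d" % (name, size, size))
```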

Optimizer: RMSprop

Loss Function: Cross Entropy

All weights are initialized from a normal distribution with a standard deviation of 0.05, except the first convolutional filter, which is initialized with a standard deviation of 0.01. This prevents large gradient updates at the beginning, which would push the weights to highly positive or highly negative values, as described in *[3]*.

All biases are initialized to a constant 0.

To avoid confusion, the data used for training consists of partially augmented data (combined flip and random crop, plus the original data) unless labelled as *aug* or *noaug*.

The labels for training are as follows:

- RMS: RMSprop as optimizer
- aug: Data augmentation (Flip + Random Crop + Original Data)
- noaug: No Data Augmentation (Original Data Only)
- whitten: Normalized by mean and standard deviation computed from training dataset inclusive of augmented data
- fixwhite: Normalized by mean and standard deviation computed from training dataset exclusive of augmented data
- fixwhite2: Normalized by mean and standard deviation computed from whole dataset exclusive of augmented data
- decay: Weight Decay

The combination with the highest test accuracy of ~87% is Data Augmentation + Weight Decay + RMSprop as optimizer.

It can be observed that data augmentation does indeed help the overall accuracy, because the model learns to generalize over a broader range of data.

It can be observed that the test accuracy is better without data normalization, which is really counter-intuitive. I’m still unable to find a good explanation for this.

In this experiment, I began with the Momentum optimizer provided by TensorFlow. My model was unable to learn from the Cifar-10 training dataset despite a low learning rate. My assumption is that the initial gradients push the weights of my network to highly negative or positive values, which stalls the whole training process.

I ended up using the RMSprop optimizer, which adapts the step size for individual weights. This method prevents fluctuation of the weights as momentum does. By adopting this method, I was able to successfully train my network.

Using weight decay does indeed help my model generalize on the Cifar-10 dataset by suppressing the weights. A gain of 1% is obtained with this method.
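To make the per-weight step size concrete, here is a minimal sketch of one RMSprop update in plain Python. It follows the usual formulation from the lecture slides cited as [3]; the function name and the default hyperparameters (`lr`, `decay`, `eps`) are assumptions, not the values used in this experiment.

```python
import math


def rmsprop_step(weights, grads, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSprop update; returns the new weights and the new cache.

    `cache` holds a running average of squared gradients per weight.
    Dividing by its square root gives every weight its own step size.
    """
    new_weights, new_cache = [], []
    for w, g, c in zip(weights, grads, cache):
        c = decay * c + (1.0 - decay) * g * g
        new_cache.append(c)
        new_weights.append(w - lr * g / (math.sqrt(c) + eps))
    return new_weights, new_cache


if __name__ == "__main__":
    # Two weights with very different gradient magnitudes move by similar amounts.
    w, c = rmsprop_step([1.0, 1.0], [0.01, 100.0], [0.0, 0.0])
    print(w)
```

Because the gradient is divided by its own running magnitude, a very large initial gradient no longer translates into a very large weight update.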

[1] Lin, Min, Qiang Chen, and Shuicheng Yan. “Network in network.” *arXiv preprint arXiv:1312.4400* (2013).

[2] https://www.cs.toronto.edu/~kriz/cifar.html

[3] http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

- The variation of accuracy and correctness based on the number of observation mixtures.
- How does adding noise affect our recognition accuracy (The experiment done in our last post involves clean test input)
- The confusion matrix of our test results

It can be observed that the test signal with additive Subway noise performs the worst here due to its non-interpretability compared to additive white noise.

This experiment is done with an insertion penalty of -60 (-p -60 in HVite).

As expected, additive subway noise performs the worst here.

Reading across the rows, %c indicates the number of correct instances divided by the total number of instances in the row. %e is the number of incorrect instances in the row divided by the total number of instances (N).
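The two row statistics described above can be computed from a confusion matrix as follows. This is a sketch with a made-up 2×2 matrix; it assumes rows are the reference words and columns the recognized words.

```python
def row_stats(confusion, row):
    """Return (%c, %e) for one row of a confusion matrix.

    %c: correct instances in the row / total instances in the row.
    %e: incorrect instances in the row / total instances overall (N).
    """
    total = sum(sum(r) for r in confusion)
    row_total = sum(confusion[row])
    correct = confusion[row][row]
    pct_c = 100.0 * correct / row_total
    pct_e = 100.0 * (row_total - correct) / total
    return pct_c, pct_e


if __name__ == "__main__":
    example = [[8, 2],
               [1, 9]]  # hypothetical counts, not the results of this post
    print(row_stats(example, 0))
```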

It can be observed that the word misclassified most often is “yi”, which is frequently recognized as “ling”. Physically, the “e” sound in both is really similar, which explains the high misclassification rate.

The second worst-performing word is “liu”, which is misclassified as “jiu”. This is because they share the same phone “iu” at the end.

OS: Linux 4.9.27-1

Tools: Wavesurfer, HTK, Python2.7

The data set that we will be using is the NCTUDS-100 database. The file names follow this format:

    md010101.pcm ==> male,group-01,speaker-01,sentence-01
    fd020301.pcm ==> female,group-02,speaker-03,sentence-01
    md030501.pcm ==> male,group-03,speaker-05,sentence-01

We will use speakers x0 as our test dataset and the others as the training dataset, so we need to separate them. The way I separate them is as follows:
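One possible way to do this split in Python is sketched below. It relies on two assumptions: that every file name follows the pattern in the examples (gender letter, the letter `d`, then two digits each for group, speaker and sentence), and that “speakers x0” means speakers whose two-digit number ends in 0. The helper names are hypothetical.

```python
import re

# gender, literal "d", group, speaker, sentence (pattern inferred from the examples)
NAME_RE = re.compile(r"^([mf])d(\d{2})(\d{2})(\d{2})\.pcm$")


def parse_name(filename):
    """Split a file name such as md010101.pcm into its fields."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError("unexpected file name: " + filename)
    gender, group, speaker, sentence = match.groups()
    return {
        "gender": "male" if gender == "m" else "female",
        "group": group,
        "speaker": speaker,
        "sentence": sentence,
    }


def split_train_test(filenames):
    """Put speakers whose number ends in 0 into the test set (assumed meaning of x0)."""
    train, test = [], []
    for name in filenames:
        target = test if parse_name(name)["speaker"].endswith("0") else train
        target.append(name)
    return train, test


if __name__ == "__main__":
    print(split_train_test(["md010101.pcm", "fd021001.pcm"]))
```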

Before data can be recorded:

- Phone set must be defined
- Task grammar must be defined

Grammar

For voice dialing application, an example grammar would be:

    $digit = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | OH | ZERO;
    $name = [ JOOP ] JANSEN | [ JULIAN ] ODELL | [ DAVE ] OLLASON | [ PHIL ] WOODLAND | [ STEVE ] YOUNG;
    ( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name) SENT-END )

In our application, our Grammar would look like this:

    $digit = ling | yi | er | san | si | wu | liu | qi | ba | jiu;
    ( SENT-START <$digit> SENT-END )

HParse gram wdnet

will create an equivalent word network in the file wdnet

For our task, we can easily create a list of required words by hand. In a more complicated task where the data set is large, we would need a global dictionary that provides the word-to-phone mapping. HDMan would be useful here, as shown in the diagram below:

HDMan -m -w wlist -n monophones1 -l dlog dict beep names

First, we would need to generate a word level transcription. Since the data set that we are given is labelled, we can generate the word level transcript using that label file. We will write a script using Python to convert the labels given to the format needed by HTK.
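A sketch of such a conversion script is shown below. The layout of the original label files is not given in this post, so the input here is simply assumed to be a dictionary mapping each utterance name to its list of words; the output follows the HTK master label file layout (a `#!MLF!#` header, a quoted label-file pattern per utterance, one word per line, and a terminating period).

```python
def mlf_text(transcripts):
    """Build a word-level MLF string.

    `transcripts` maps an utterance name without extension to a list
    of words (an assumed input format, for illustration only).
    """
    lines = ["#!MLF!#"]
    for name in sorted(transcripts):
        lines.append('"*/%s.lab"' % name)
        lines.extend(transcripts[name])
        lines.append(".")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = {"md010101": ["yi", "er", "san"]}  # hypothetical transcript
    with open("words.mlf", "w") as handle:
        handle.write(mlf_text(example))
```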

We would need to generate a phone-level transcription for further processing. Here, our phone would be the same as our word for ease of implementation.

The dictionary that we will be using is hand-crafted as shown below:

The command to do this conversion would be:

HLEd -l '*' -d dict -i phones0.mlf mkphones0.led words.mlf

The contents of mkphones0.led would be:

    EX
    IS sil sil
    DE sp

- EX: Replace each word in words.mlf by the corresponding pronunciation in the dictionary file dict.
- IS: Inserts a silence model sil at the start and end of every utterance.
- DE: Deletes all short-pause sp labels, which are not wanted in the transcription labels at this point.

Next, we would need to extract MFCC features from our audio data. HCopy will be used here.

HCopy -T 1 -C wav_config "wave file" "mfcc file"

or

HCopy -T 1 -C wav_config -S mfc_train.scp

Here, we will be creating single-Gaussian monophone HMMs. At first, there will be a set of identical monophone HMMs in which every mean and variance is identical. These are then retrained, short-pause models are added and the silence model is extended slightly. The monophones are then retrained.

Once reasonable monophone HMMs have been created, the recogniser tool HVite can be used to perform a forced alignment of the training data.

The parameters of this model are not important; its purpose is to define the model topology. For phone-based systems, a good topology to use is 3-state left-right with no skips, such as the following:

where each ellipsed vector is of length 39. This number, 39, is computed from the length of the parameterised static vector (MFCC 0 = 13) plus the delta coefficients (+13) plus the acceleration coefficients (+13).
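For reference, a prototype of this shape can be generated with a few lines of Python. This is a sketch rather than the file used in this post: the parameter kind `MFCC_0_D_A` and the transition probabilities (0.6 self-loop, 0.4 forward) are assumptions taken from the usual HTK tutorial layout, and HCompV overwrites the means and variances anyway.

```python
def make_proto(vec_size=39, kind="MFCC_0_D_A", num_emitting=3):
    """Return the text of a left-right, no-skip HTK prototype HMM."""
    n = num_emitting + 2  # HTK also counts the non-emitting entry and exit states
    zeros = " ".join(["0.0"] * vec_size)
    ones = " ".join(["1.0"] * vec_size)
    lines = ["~o <VecSize> %d <%s>" % (vec_size, kind),
             '~h "proto"',
             "<BeginHMM>",
             "<NumStates> %d" % n]
    for state in range(2, n):  # emitting states are numbered 2 .. n-1
        lines += ["<State> %d" % state,
                  "<Mean> %d" % vec_size, zeros,
                  "<Variance> %d" % vec_size, ones]
    lines.append("<TransP> %d" % n)
    for i in range(n):
        row = [0.0] * n
        if i == 0:
            row[1] = 1.0          # entry state always moves to the first emitting state
        elif i < n - 1:
            row[i] = 0.6          # self-loop (assumed value)
            row[i + 1] = 0.4      # step forward (assumed value)
        lines.append(" ".join("%.1f" % p for p in row))
    lines.append("<EndHMM>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_proto())
```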

The HTK tool HCompV will scan a set of data files, compute the global mean and variance and set all of the Gaussians in a given HMM to have the same mean and variance. Hence, assuming that a list of all the training files is stored in train.scp, the following command creates a new version of proto in the directory hmm0:

HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto

The -f option causes a variance floor macro (called vFloors) to be generated which is equal to 0.01 times the global variance. This is a vector of values which will be used to set a floor on the variances estimated in the subsequent steps. The -m option asks for means to be computed as well as variances. Given this new prototype model stored in the directory hmm0, a Master Macro File (MMF) called hmmdefs containing a copy for each of the required monophone HMMs is constructed by manually copying the prototype and relabeling it for each required monophone (including “sil”). The format of an MMF is similar to that of an MLF and it serves a similar purpose in that it avoids having a large number of individual HMM definition files.

The flat start monophones stored in the directory hmm0 are re-estimated using the embedded re-estimation tool HERest invoked as follows

HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophone0.lst

Most of the files used in this invocation of HERest have already been described. The exception is the file macros. This should contain a so-called global options macro and the variance floor macro vFloors generated earlier. The global options macro simply defines the HMM parameter kind and the vector size.

Execution of HERest should be repeated twice more, changing the name of the input and output directories (set with the options -H and -M) each time, until the directory hmm3 contains the final set of initialised monophone HMMs.
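The three passes can be scripted instead of typed by hand. The sketch below only builds the command lines, reusing the options from the HERest invocation above; the function name is hypothetical.

```python
def herest_commands(first=0, rounds=3, hmm_list="monophone0.lst", mlf="phones0.mlf"):
    """Build one HERest command per re-estimation pass (hmmN -> hmmN+1)."""
    commands = []
    for i in range(first, first + rounds):
        src, dst = "hmm%d" % i, "hmm%d" % (i + 1)
        commands.append([
            "HERest", "-C", "config", "-I", mlf,
            "-t", "250.0", "150.0", "1000.0",
            "-S", "train.scp",
            "-H", src + "/macros", "-H", src + "/hmmdefs",
            "-M", dst, hmm_list,
        ])
    return commands


if __name__ == "__main__":
    for command in herest_commands():
        print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)`; the output directories are assumed to exist already.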

We make the model more robust by allowing individual states to absorb the various impulsive noises in the training data.

A one-state short-pause model sp should be created. This should be a so-called tee-model, which has a direct transition from the entry to the exit node. This sp has its emitting state tied to the centre state of the silence model.

These silence models can be created in two stages

- Use a text editor on the file hmm3/hmmdefs to copy the centre state of the sil model to make a new sp model and store the resulting MMF hmmdefs, which includes the new sp model, in the new directory hmm4.
- Run the HMM editor HHEd to add the extra transitions required and tie the sp state to the centre sil state

HHEd works in a similar way to HLEd. It applies a set of commands in a script to modify a set of HMMs. In this case, it is executed as follows

HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophone1.lst

where sil.hed contains the following commands

    AT 2 4 0.2 {sil.transP}
    AT 4 2 0.2 {sil.transP}
    AT 1 3 0.3 {sp.transP}
    TI silst {sil.state[3],sp.state[2]}

- AT: Add transitions to the given transition matrices
- TI: Creates a tied-state called silst.

The parameters of this tied-state are stored in the hmmdefs file and within each silence model, the original state parameters are replaced by the name of this macro. Macros are described in more detail below. For now it is sufficient to regard them simply as the mechanism by which HTK implements parameter sharing. Note that the phone list used here has been changed, because the original list monophone0.lst has been extended by the new sp model. The new file is called monophone1.lst and has been used in the above HHEd command.

After HHEd, two more re-estimations (HERest) are done.

Before re-estimation, we have to create a new phone-level master label file (phones1.mlf), this time without deleting the sp labels.

HLEd -l '*' -d dict -i phones1.mlf mkphones1.led words.mlf

    EX
    IS sil sil

For the final re-estimation, we can add the option *“-s stats_mix1”*. It will dump a file which describes the state occupation of every model unit. This will be helpful when we need to increase the number of mixtures later on.

The phone models created so far can be used to realign the training data and create new transcriptions.

This command uses the HMMs stored in *hmm8* to transform the input word level transcription *data.mlf* to the new phone level transcriptions in the directory *f_align* using the pronunciations stored in the dictionary *dict*

The recogniser now considers all pronunciations for each word and outputs the pronunciation that best matches the acoustic data.

It can be observed that the results we obtained are accurate.

If *test.scp* holds a list of the coded test files, then each test file will be recognised and its transcription output to an MLF called *recout.mlf* by executing the following

HVite -H hmm8/macros -H hmm8/hmmdefs -S test.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict monophone1.lst

The actual performance can be determined by running HResults as follows

HResults -I test.mlf monophone0.lst recout.mlf

The results we get:

It can be observed that the results are really bad: the percentage of correct words is much higher than the accuracy, which implies that the number of insertions is too high. We then looked at *test.mlf* and *recout.mlf* to see what happened:

It can be observed that the words we need match, but there are a lot of extra phones recognized.
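The gap between the two figures follows directly from how HResults defines them: the percentage correct ignores insertions, while the accuracy subtracts them. A small sketch with made-up counts (not the numbers from this experiment) shows the effect:

```python
def htk_scores(n, deletions, substitutions, insertions):
    """Return (percent correct, accuracy) as defined by HResults.

    hits = N - D - S
    percent correct = hits / N
    accuracy = (hits - I) / N
    """
    hits = n - deletions - substitutions
    percent_correct = 100.0 * hits / n
    accuracy = 100.0 * (hits - insertions) / n
    return percent_correct, accuracy


if __name__ == "__main__":
    # Hypothetical counts: few deletions/substitutions but many insertions.
    print(htk_scores(n=100, deletions=2, substitutions=3, insertions=40))
```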

We can fix this by tuning the “-p” option of HVite, which is the word insertion penalty. We ran through a range from 0 to 400 using a Python script and got the following results:

It can be observed that the *accuracy* peaks at around p=-60 and then falls, along with the *correct hits*, after that.
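A sweep of this kind can be scripted roughly as follows. The sketch only builds the HVite command lines, reusing the options of the recognition command shown earlier; the step size, the use of negative penalty values, and the output file names are assumptions, since the original script is not shown.

```python
def hvite_command(penalty, out_mlf):
    """HVite recognition command for one word insertion penalty."""
    return ["HVite", "-H", "hmm8/macros", "-H", "hmm8/hmmdefs",
            "-S", "test.scp", "-l", "*", "-i", out_mlf,
            "-w", "wdnet", "-p", str(float(penalty)), "-s", "5.0",
            "dict", "monophone1.lst"]


def sweep(penalties):
    """One command per penalty, each writing to its own (hypothetical) MLF file."""
    return [hvite_command(p, "recout_p%d.mlf" % abs(p)) for p in penalties]


if __name__ == "__main__":
    for command in sweep(range(0, -401, -20)):
        print(" ".join(command))
```

Each resulting MLF would then be scored with HResults to produce one point on the curve.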


The advantage of neural networks over other methods comes from their non-linearity. The non-linearity comes from applying activation functions to linear combinations of the inputs. The activation functions that we will be using here are Sigmoid and ReLU.

Let $z$ be the output of a neuron after a linear combination of its input neurons and weights, and let $a = f(z)$ be a mapping where $f$ is the activation function mentioned above. $f$ can be the sigmoid, $f(z) = \frac{1}{1 + e^{-z}}$, or the ReLU, $f(z) = \max(0, z)$.

We will be working with neural networks of 1 and 2 hidden layers.

The following is my implementation of the networks mentioned above. We’ll be using Stochastic Gradient Descent (SGD) to update our weights. Theoretically, the input data has to be shuffled for SGD to work, but in my case it doesn’t work when the data is shuffled. (I need some clarification and help here.) Since the input data fed into the network is nicely ordered (data of the same class are grouped together), it’s easy for the error to fall into a local minimum.

As before, we’ll still be using the same dataset (Database of Faces) provided by AT&T Laboratories. We’ll also reduce the dimension of the images to 2 using PCA to better illustrate how the boundaries are drawn. After PCA, I usually whiten my data (divide each component by the square root of its eigenvalue, $\sqrt{\lambda}$) before feeding it into the network. (I once forgot the square root and my network couldn’t be trained successfully; I still have no idea why such a minor difference would cause training to fail.) After various attempts, what surprises me is that not whitening sometimes produces a better result. Observing the output of the activation function of the first layer, not whitening does indeed produce abrupt changes after softmax for the 1st layer. I also realized that not whitening causes the ReLU values in the 2-layer network to overflow.

SGD works by feeding data into the network one sample at a time and updating the weights after every sample. Compared to other methods (mini-batch, batch), SGD will be the fastest to converge if tuned correctly. The downside is that it might not converge at all if the changes are too abrupt, and it can jump out of a local minimum whenever a single sample is highly influential.

For the implementation of a network with a single hidden layer, the numbers of neurons that I tried and that worked are in the range of 4 to 50; it seems that the likelihood of overfitting is really low here. The sigmoid activation function is used here.

The training error and validation error are shown in the graph below:

It can be observed that the error converges to a local minimum and stays there once it is found.

The decision boundary for this method is shown below:

Observe the non-linearity of the decision boundary, which is different from what we observed before for linear classification methods.

The accuracy obtained here is 86.67%.

The implementation of the 2-hidden-layer network using ReLU as the activation function is a little tricky. The input values have to be whitened before being fed into the network. This is not important when sigmoid activation is used, because large values end up close to 1 after passing through a sigmoid. For ReLU, the value is not clipped and grows linearly, so the only way to deal with this problem is to whiten the data (divide by the square root of the eigenvalue after PCA).

The error function for this method is shown below:

As we can observe, the validation error fell to a very low value and then bounced back up. This is due to the nature of SGD, where it’s easy to bounce away from a local minimum. To deal with this problem, I always store the weights that produce the smallest validation error.
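Keeping the best weights is straightforward to sketch. In the snippet below, `train_step` and `validation_error` are hypothetical callables standing in for one epoch of training and for the validation pass; the actual network code is not shown in this post.

```python
import copy


def train_keep_best(weights, train_step, validation_error, epochs):
    """Train for `epochs` steps and return the weights with the lowest validation error."""
    best_weights = copy.deepcopy(weights)
    best_error = validation_error(weights)
    for _ in range(epochs):
        weights = train_step(weights)
        error = validation_error(weights)
        if error < best_error:
            # Copy, so later updates cannot modify the stored snapshot.
            best_error = error
            best_weights = copy.deepcopy(weights)
    return best_weights, best_error


if __name__ == "__main__":
    # Toy example: the "weights" walk past the optimum at 3.0 and keep going.
    step = lambda w: [w[0] + 1.0]
    error = lambda w: (w[0] - 3.0) ** 2
    print(train_keep_best([0.0], step, error, epochs=6))
```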

The decision boundary of this method is shown below:

As expected, we can observe that a 2-layer network tends to produce a much more non-linear boundary than a 1-layer network.

The accuracy obtained for this method is 80%, which is a little below the 1-layer network. From here, we can see that the number of neurons is important.

This method falls between SGD and Batch Gradient Descent. Let’s say an epoch consists of all the data used for training; mini-batch cuts the data into groups to be fed into the network, and the weights are updated once per group. All steps are the same as in SGD except for the gradient descent step, where the gradient of the error is averaged over a single batch. This method is comparatively more stable than the previous method (SGD) because it takes many data points into account before making a single update.

The implementation here is the same as before and the only difference is mentioned above, so I’ll dive straight into showing the plots.

The error function of this method is shown below:

It can be observed that both errors fall steadily. The error is still falling and would produce a better result if I continued, but as long as the idea is clear, I don’t see any point in doing that.
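The difference in the update step can be illustrated on a toy problem. The sketch below fits a one-dimensional linear model with squared error; it is not the network from this post, and the model, learning rate and batch size are assumptions chosen only to keep the example short. Setting `batch_size=1` gives plain SGD.

```python
def minibatch_epoch(w, b, data, lr=0.1, batch_size=4):
    """One epoch of mini-batch gradient descent on y ~ w * x + b."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Average the per-sample gradients of 0.5 * (prediction - y)^2 over the batch.
        grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
        grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
        w -= lr * grad_w  # one weight update per batch
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    data = [(i / 7.0, 2.0 * (i / 7.0) + 1.0) for i in range(8)]  # points on y = 2x + 1
    w, b = 0.0, 0.0
    for _ in range(2000):
        w, b = minibatch_epoch(w, b, data)
    print(w, b)
```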

The decision boundary is shown below:

It can be observed that the decision boundary here is similar to the 1-layer implementation of SGD. This is because not much can be learnt from 2D data that is already clearly clustered.

The accuracy obtained here is 84%.

The error function is shown below:

It can be observed that it converges nicely.

The decision boundary is shown below:

It can be observed that lots of blue points are lost with this decision boundary, which indicates that a bad local minimum was found. The accuracy obtained here is 74.33%, which supports this deduction.

Dynamic Time Warping (DTW) is an algorithm used to match two speech sequences that have the same content but may differ in the length of certain parts of speech (phones, for example). Here, we’ll not be using phones as the basic unit but frames of MFCC features obtained through a sliding window. We will be using 12 MFCC features per frame, which excludes energy.

We will obtain the MFCC features using the HTK toolbox instead of implementing the extraction ourselves. If you noticed, the audio file that we will be using here has a different format from the previous posts. Before, we were using an audio file in an alien format. Here, we will be using an audio file with a fixed header format, which is RIFF.

The changes made are as follows:

From

    SOURCEFORMAT = ALIEN #file with head length = 0
    HEADERSIZE = 1024
    SOURCERATE = 625.0 #sampling rate, unit: 100 nsec

To

SOURCEFORMAT = WAV #RIFF File Header

If you’re curious, the new header size is 44 and the sample rate is defined in the header.

First, we need to clarify that there’s a difference between DTW and Viterbi. According to the link, the Viterbi algorithm is a pattern matching algorithm based on statistical probability, whereas DTW is a pattern matching algorithm based on template matching. The algorithm we’ll be using here is DTW, as no probability is involved.

We’ll implement two methods which differ in the path restriction. For the first method, every step is restricted to moving by (0,+1), (+1,0) or (+1,+1) from the current point, as shown in the diagram below:
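A minimal sketch of DTW under this step restriction is given below. It assumes each frame is a list of MFCC values and uses the Euclidean distance between frames as the local cost, which is an assumption since the post does not state the distance used. The allowed steps are a parameter, so other step sets can be tried with the same function.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames (assumed local cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_cost(template, test, steps=((0, 1), (1, 0), (1, 1))):
    """Accumulated cost of the best warping path from (0, 0) to the last frame pair."""
    n, m = len(template), len(test)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            local = frame_distance(template[i], test[j])
            if i == 0 and j == 0:
                acc[i][j] = local
                continue
            # Cheapest predecessor reachable with one of the allowed steps.
            best = min((acc[i - di][j - dj] for di, dj in steps
                        if i - di >= 0 and j - dj >= 0), default=inf)
            acc[i][j] = best + local
    return acc[n - 1][m - 1]


if __name__ == "__main__":
    template = [[0.0], [1.0], [2.0]]
    slower = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
    print(dtw_cost(template, template), dtw_cost(template, slower))
```

Matching a sequence against itself gives a cost of 0 along the diagonal, which is the straight-line case described in the text.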

We’ll show the result of DTW algorithm one-by-one. The template and test result that we will use here is the phrase 交通大學. We’ll start by matching the template with itself to show that it works as we expect (a straight line should appear):

We then compare with the same phrase, but the test data used here is a faster version of the template:

It can be observed that the curve grows horizontally really fast. We will compare this result with the same speech spoken at a slower pace, shown below:

It can be observed that it grows vertically faster than the one above. (Observe the vertical axis.)

Now, we’ll show the results obtained when comparing with different speech sequences. (We’ll be changing the test data and using speech at the same pace.)

Observe how the last extra word corresponds to the DTW graph.

Observe how bad the result produced by an entirely different phrase is.

A comparison with different speech sequences is shown below (medium-pace 交通大學 as the template; the prefixes 快, 慢 and 中 denote fast, slow and medium pace):

| Test Data | Cost |
| --- | --- |
| *快-交通大學* | *3.5393e+03* |
| 快-交通大隊 | 5.3293e+03 |
| 快-信號處理 | 7.5717e+03 |
| 快-語音信號 | 1.0811e+04 |
| 快-語音處理 | 7.9386e+03 |
| *慢-交通大學* | *4.7898e+03* |
| 慢-交通大隊 | 8.2430e+03 |
| 慢-信號處理 | 9.5801e+03 |
| 慢-語音信號 | 8.8775e+03 |
| 慢-語音處理 | 1.0253e+04 |
| *中-交通大學* | *0* |
| 中-交通大學贊 | 5.4024e+03 |
| 中-交通大隊 | 5.4029e+03 |
| 中-交通大隊爛 | 7.9093e+03 |
| 中-信號處理 | 9.2503e+03 |
| 中-語音信號 | 9.4377e+03 |
| 中-語音處理 | 9.5596e+03 |

The path with the lowest cost for each pace is shown in italics.

Next, we would like to experiment with a looser constraint on the available steps ( (0,+2) (0,+1) (+1,+1) (+1,0) (+2,0) ) that can be taken from the current node. An illustration is shown below:

I’ll only show some of the optimal paths for this method as it’s similar to the paths shown above.

The most obvious difference can be seen for the DTW of 交通大學贊. There’s some difference at the “tail” of the graph.

| Test Data | Cost |
| --- | --- |
| *快-交通大學* | *2.5807e+03* |
| 快-交通大隊 | 3.7465e+03 |
| 快-信號處理 | 5.6295e+03 |
| 快-語音信號 | 8.4524e+03 |
| 快-語音處理 | 7.3564e+03 |
| *慢-交通大學* | *3.4671e+03* |
| 慢-交通大隊 | 5.6360e+03 |
| 慢-信號處理 | 7.6668e+03 |
| 慢-語音信號 | 7.2069e+03 |
| 慢-語音處理 | 7.6250e+03 |
| *中-交通大學* | *0* |
| 中-交通大學贊 | 4.0314e+03 |
| 中-交通大隊 | 3.9625e+03 |
| 中-交通大隊爛 | 6.3642e+03 |
| 中-信號處理 | 5.4268e+03 |
| 中-語音信號 | 6.6767e+03 |
| 中-語音處理 | 8.9559e+03 |

It can be observed that the overall cost is lower than with the tighter constraint, as expected.

Now we would like to clip our template and compare it with the given test data. The templates that we will be clipping are “交通大學贊” and “交通大隊爛”, and we expect to clip the last word out of the sentence, resulting in “交通大學” and “交通大隊” respectively. We will set our clip at 0.7 of the template length.

First we show a clipped and unclipped version of “交通大學贊” when matched with “交通大學”

The error obtained when compared with others using “交通大學贊” as a reference is shown in the table below:

| Test Data | Cost |
| --- | --- |
| *中-交通大學* | *3.5683e+03* |
| 中-交通大學贊 | 0 |
| 中-交通大隊 | 5.5088e+03 |
| 中-交通大隊爛 | 9.6360e+03 |
| 中-信號處理 | 6.3100e+03 |
| 中-語音信號 | 8.4559e+03 |
| 中-語音處理 | 9.2887e+03 |

Next, we’ll show a clipped and unclipped version of “交通大隊爛” when matched with “交通大隊”

The error obtained when compared with others using “交通大隊爛” as a reference is shown in the table below:

| Test Data | Cost |
| --- | --- |
| 中-交通大學 | 5.9552e+03 |
| 中-交通大學贊 | 9.8786e+03 |
| *中-交通大隊* | *4.3075e+03* |
| 中-交通大隊爛 | 0 |
| 中-信號處理 | 7.9714e+03 |
| 中-語音信號 | 8.7883e+03 |
| 中-語音處理 | 9.9951e+03 |

First, we’ll have to understand what **hard decisions** and **soft decisions** are.

**Hard decision:** A data point is assigned to a single cluster and the result is final.

**Soft decision:** A data point is modeled by a distribution over clusters, so its membership is defined probabilistically and there’s no definite assignment to a particular cluster.

Pre-processing is a vital step for speech recognition tasks. The most commonly used method for extracting features from speech signals is MFCC. We won’t be implementing this feature extraction ourselves; instead we’ll be using a toolkit called the Hidden Markov Model Toolkit (HTK).

The conversion step can be done using the following command:

HCopy -C config your_wav_file.wav mfcc_file.mfc

Where the contents of config are as follows:

    # Waveform parameters
    SOURCEFORMAT = ALIEN #file with head length = 0
    HEADERSIZE = 1024
    SOURCERATE = 625.0 #sampling rate, unit: 100 nsec
    # Coding parameters
    TARGETKIND = MFCC_E #output is MFCC and energy
    TARGETRATE = 100000.0 #window shift for analysis, unit: 100 nsec
    SAVECOMPRESSED = F
    SAVEWITHCRC = T
    WINDOWSIZE = 320000.0 #window width for analysis, unit: 100 nsec
    ZMEANSOURCE = T #remove signal bias
    USEHAMMING = T #take hamming windows before FFT
    PREEMCOEF = 0.97 #pre-emphasis : 1-0.97Z^-1
    NUMCHANS = 24 #number of filter bands
    USEPOWER = F
    CEPLIFTER = 22 #weighting MFCCs
    LOFREQ = 0 #filter band begin from 0Hz
    HIFREQ = 8000 #filter band stop at 8000Hz
    NUMCEPS = 12 #number of cosine transform
    ENORMALISE = T
    ALLOWCXTEXP = F

If you have multiple wav files to convert, you can use the following command:

HCopy -C config -S convert.scp

Where convert.scp is a script that specifies your source and destination files; an example is shown below:

    WAV/FBCG1/SA1.WAV MFCC/FBCG1/1.mfc
    WAV/FBCG1/SA2.WAV MFCC/FBCG1/2.mfc
    WAV/FBCG1/SI1612.WAV MFCC/FBCG1/3.mfc
    WAV/FBCG1/SI2242.WAV MFCC/FBCG1/4.mfc
    WAV/FBCG1/SI982.WAV MFCC/FBCG1/5.mfc
    WAV/FBCG1/SX172.WAV MFCC/FBCG1/6.mfc
    WAV/FBCG1/SX262.WAV MFCC/FBCG1/7.mfc
    WAV/FBCG1/SX352.WAV MFCC/FBCG1/8.mfc
    WAV/FBCG1/SX442.WAV MFCC/FBCG1/9.mfc
    WAV/FBCG1/SX82.WAV MFCC/FBCG1/10.mfc

Below shows the plot of 2 MFCC features for 2 individuals. Differences can be observed.

On a \*nix system, we can dump filenames to a file using *ls > dest_file*, but on Windows I don’t think that works, so I use the following instead:

- dir /b > dest_file // To dump only filename to dest_file
- dir > dest_file // To dump all information of current directory to dest_file

This is really useful in creating convert.scp mentioned above.

Next, we need a good model for our features. As Gaussian models are smooth functions that work well in modeling natural signals, the Gaussian Mixture Model (GMM) is a widely used method for modeling features in speech recognition tasks. A GMM is basically a weighted sum (mixture) of Gaussians, each with its own weight, mean and covariance. An illustration is shown below: [source]

The equation for Gaussian Mixture Density is shown below:

*Component Density*

*Speaker Model*
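The equations themselves did not survive in this copy of the post. For reference, the standard form of a GMM is reproduced below; the notation ($M$ components, $D$-dimensional feature vector $\vec{x}$, weights $p_i$, means $\vec{\mu}_i$, covariances $\Sigma_i$) is an assumption and may differ from the symbols originally used. The three lines correspond to the mixture density, the component density and the speaker model.

```latex
p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(\vec{x}),
\qquad \sum_{i=1}^{M} p_i = 1

b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}
\exp\left\{ -\frac{1}{2} (\vec{x} - \vec{\mu}_i)^{T} \Sigma_i^{-1} (\vec{x} - \vec{\mu}_i) \right\}

\lambda = \{ p_i, \vec{\mu}_i, \Sigma_i \}, \quad i = 1, \dots, M
```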

From the equations above, the only information we have is the observed data, which is the MFCC features extracted from the speech signals. We will try to obtain the mixture weights, means and covariances, which are the unobserved parameters of the model.

These parameters are obtained with the well-known Expectation-Maximization (EM) algorithm. Starting from the current model, each iteration produces a new model whose likelihood is at least as high; a monotonic increase in the model’s likelihood is guaranteed on each EM iteration.

EM algorithm can be broken down into the following 2 steps:

The a posteriori probability for acoustic class *i *is computed by:

The pdf computed on the expectation step is fitted into this step to optimize for , and given below:

*Mixture Weights*

*Means*

*Variances*
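The update equations are also missing from this copy. Assuming diagonal covariances, $T$ training vectors $\vec{x}_t$ and the same hypothetical notation as a standard GMM ($p_i$, $\vec{\mu}_i$, $\vec{\sigma}_i^2$), the usual EM expressions are as follows: first the a posteriori probability from the expectation step, then the re-estimated mixture weights, means and variances from the maximization step.

```latex
p(i \mid \vec{x}_t, \lambda) = \frac{p_i \, b_i(\vec{x}_t)}{\sum_{k=1}^{M} p_k \, b_k(\vec{x}_t)}

\bar{p}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda)

\bar{\vec{\mu}}_i = \frac{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda) \, \vec{x}_t}{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda)}

\bar{\vec{\sigma}}_i^{2} = \frac{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda) \, \vec{x}_t^{2}}{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda)} - \bar{\vec{\mu}}_i^{2}
```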

Convergence of the EM algorithm can be observed through the value of the log-likelihood. We can stop once the difference in the log-likelihood between iterations is small. The log-likelihood equation is given below:

As MFCC features have more than 2 dimensions (13 here), it’s hard to visualize GMM results in that many dimensions. Thus, we’ll be using the first 2 MFCC features to show the results of the GMM.

The histogram plots of the first two MFCC features for 2 different speakers are shown below:

It can be seen from the histograms alone that different speakers have distinctive features.
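In its usual form, the log-likelihood of the training data is $\log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(\vec{x}_t \mid \lambda)$ (the notation is an assumption, as the original equation is missing here). The sketch below runs EM on one-dimensional data in plain Python and stops when this value improves by less than a tolerance. It is an illustration only: the real features are multi-dimensional, and the initial values, tolerance and variance floor are arbitrary choices.

```python
import math


def gauss(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, tol=1e-6, max_iter=200):
    """Fit a 1-D GMM with EM; returns (weights, means, variances, log-likelihood).

    The returned log-likelihood is the one computed in the last E-step.
    """
    weights, means, variances = list(weights), list(means), list(variances)
    previous = -float("inf")
    loglik = previous
    for _ in range(max_iter):
        # E-step: posterior probability of each component for each sample.
        resp, loglik = [], 0.0
        for x in data:
            joint = [w * gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        if loglik - previous < tol:
            break  # the log-likelihood has stopped improving
        previous = loglik
        # M-step: re-estimate weights, means and variances.
        for i in range(len(means)):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / len(data)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var, 1e-6)  # floor to avoid collapsing components
    return weights, means, variances, loglik


if __name__ == "__main__":
    samples = [0.0, 0.1, -0.1, 0.2, -0.2, 5.0, 5.1, 4.9, 5.2, 4.8]
    print(em_gmm_1d(samples, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5]))
```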

The GMM results for 2 individuals (using different number of mixtures [8,16,32,64]) are shown below:

I’ve built my model using TIMIT dataset with the speakers: (The order will remain the same through this experiment)

FBCG1 FCEG0 FCLT0 FJRB0 FKLH0 FMBG0 FNKL0 FPLS0 MBVG0 MBSB0

From the model we built above, we are able to identify a speaker using the formula:

*where*

Here $T$ is the number of samples used for testing and $M$ is the number of clusters trained.
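The identification rule can be sketched as follows, assuming the usual maximum-likelihood decision: sum the per-frame log-likelihoods under each speaker model and pick the model with the highest total. The functions `log_likelihood` and `rank_speakers` are hypothetical names, and each model is assumed to be a `(weights, means, variances)` tuple of a diagonal-covariance GMM.

```python
import numpy as np

def log_likelihood(X, model):
    """Sum over all test frames of log p(x_t | model)."""
    weights, means, variances = model
    D = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (D * np.log(2.0 * np.pi)
                       + np.log(variances).sum(axis=1)[None, :]
                       + (diff ** 2 / variances[None, :, :]).sum(axis=2))
    log_w = np.log(weights)[None, :] + log_comp
    m = log_w.max(axis=1, keepdims=True)
    return float((m + np.log(np.exp(log_w - m).sum(axis=1, keepdims=True))).sum())

def rank_speakers(X, models):
    """Return speaker indices sorted from best to worst match, plus the scores."""
    scores = np.array([log_likelihood(X, m) for m in models])
    return np.argsort(scores)[::-1], scores
```

With this ranking, `ranking[0]` is the identified speaker and `ranking[1]` is the second-best match.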

I’ve trained the model of 9 speakers with 2 .wav files for each (approximately 6 seconds of audio). In the TIMIT dataset given, each speaker has 10 .wav files, where each audio clip consists of approximately 3 seconds of speech.

Using the trained model, I’ve tested on the 3rd to 10th audio clips of each speaker and obtained the following accuracy:

The accuracy here is the number of speakers identified correctly out of all speakers in the test set.

It can be observed that the accuracy starts to fall as the number of Gaussian mixtures increases. This is due to over-fitting of the Gaussian basis functions, which can be seen in the 3D GMM plots given above. The extreme case of 64 Gaussian mixtures shows clear signs of over-fitting (thin and long Gaussian basis functions). The root of the over-fitting lies in the variation of the MFCC features: MFCC features reflect the shape of our vocal tract, which can be represented well using 8 LPCs.

This experiment is done using 8 Gaussian mixtures, as this gives the best results. The second-best match for each speaker is given by:

5 1 5 5 1 1 5 6 3 6

where the indices follow the order of the speaker list given above.

This is not what I expected: the second-best match for a male speaker should also be male. I’m not sure what I failed to take into consideration.

I’ve repeated the entire experiment by changing the parameter in the config file for HTK’s HCopy from

TARGETKIND = MFCC_E

to

TARGETKIND = MFCC

so that the energy term of the MFCC is not taken into consideration in the recognition task, as it is unfair to take energy into account (speech signals of different lengths will have different amounts of energy).

It can be observed that, for each number of Gaussian mixtures, the accuracy is higher than in the previous experiment, which included the energy term in the MFCC.

The same procedure as above is repeated.

6 3 5 7 4 8 5 6 3 6

This experiment is done using approximately 6 seconds of test data.

As expected, the accuracy is higher.

GMM based Language Identification using MFCC and SDC Features


There are 3 major approaches to classification:

- Discriminant Function
- Probabilistic Generative Model
- Probabilistic Discriminative Model

The first approach is a brute-force method, which is what neural networks use. It includes least-squares classification, Fisher’s linear discriminant and the perceptron algorithm.

Our focus will be on the following 2 models, which involve a probabilistic viewpoint.

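The main goal of this model is to model the density function entirely from the given training data points. By entirely, we mean $p(x, C_k)$, where $k$ is the class number; it is the joint PDF of class $C_k$ and data point $x$. For the binary case, from Bayes’ theorem we know that $p(C_1 \mid x) = \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_1)\,p(C_1) + p(x \mid C_2)\,p(C_2)}$. It can be written as $\sigma(a) = \frac{1}{1 + \exp(-a)}$, which is known as the *sigmoid function*, where $a = \ln \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)}$. For multi-class problems, it becomes $p(C_k \mid x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$ with $a_k = \ln\big(p(x \mid C_k)\,p(C_k)\big)$, which is also known as the *softmax function*.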
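The relation between Bayes’ theorem, the sigmoid and its multi-class counterpart can be checked numerically. The sketch below assumes the standard definitions, with the class scores $a_k$ taken as log-likelihood plus log-prior; `class_posteriors` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, the binary-case posterior as a function of the log odds."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Normalized exponential; subtracting the max keeps exp() from overflowing."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def class_posteriors(log_likelihoods, priors):
    """Posterior p(C_k | x) from log p(x | C_k) and p(C_k) via Bayes' theorem."""
    return softmax(np.asarray(log_likelihoods) + np.log(priors))

# With two classes, the multi-class form reduces to the sigmoid of the log odds
post = class_posteriors([-1.0, -3.0], [0.5, 0.5])
```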

As always, we’ll try to solve for optimal parameters given by the posterior probability . We’ll start with

For the multi-class case, we are trying to solve for:

Where by solving for the maximum likelihood for a Gaussian Distribution, we obtain:

For the maximum likelihood solution for a Gaussian Distribution:

*where*

For this model, we are trying to reach the posterior probability without going through all the steps mentioned above (obtaining the * maximum likelihood *and

Our goal in this model is to optimize for $w$ in the posterior PDF below (logistic regression):

$\phi$ is a basis function of $x$ (to increase the non-linearity)

For a binary classification example, our goal is to minimize the **cross-entropy error function** given below:

The $w$ that we are trying to optimize is responsible for the “placement” of the sigmoid function shown in the diagram above; it determines what is called the decision boundary.

Due to the non-linearity of the logistic sigmoid function, we are unable to solve for $w$ using a closed-form solution. Thus, there are 2 approaches for dealing with this problem.

So the term $\eta$ here is basically the learning rate of the discriminative model. It is the only user-defined parameter here.
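Sequential learning with a fixed learning rate can be sketched as follows. This assumes the standard binary cross-entropy and its gradient $\Phi^T (y - t)$ for logistic regression; the batch form of the update is used for brevity, the names `cross_entropy` and `gradient_descent` are hypothetical, and the default `eta` only mirrors the 0.002 value quoted later in the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    """Binary cross-entropy error for weights w, design matrix Phi and targets t."""
    y = np.clip(sigmoid(Phi.dot(w)), 1e-12, 1.0 - 1e-12)   # avoid log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def gradient_descent(Phi, t, eta=0.002, n_iter=1000):
    """Minimize the cross-entropy with a fixed, user-chosen learning rate eta."""
    w = np.zeros(Phi.shape[1])
    errors = []
    for _ in range(n_iter):
        y = sigmoid(Phi.dot(w))
        w = w - eta * Phi.T.dot(y - t)     # step against the gradient
        errors.append(cross_entropy(w, Phi, t))
    return w, errors
```

Tracking `errors` per iteration makes it easy to see whether a given learning rate converges smoothly or oscillates.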

The sequential learning method is actually a simplified version of the Newton-Raphson method. The equation used for the Newton-Raphson method is shown below:

*Hessian Matrix*

We can see that $\eta$ has been replaced by $H^{-1}$. In practice, models are usually trained with the former method, because computing the Hessian matrix is relatively expensive and slows down training. Sequential learning is also called gradient descent, and it is the most commonly used method for training neural networks. The terms $\eta$ and $H^{-1}$ set the step size of the algorithm, so a larger value means the algorithm moves faster towards the global minimum of the model. There is a catch: a learning rate that is too large may cause the learning algorithm to oscillate, and we might never reach the global minimum of our model. We will use the Newton-Raphson method for the rest of the implementation, so this method is elaborated in detail below.

Gradient of cross-entropy error function

Hessian Matrix

R: diagonal matrix with

Since $0 < y_n < 1$, it follows that $u^T H u > 0$ for an arbitrary vector $u$. The error function is therefore a convex function of $w$, and it has a unique minimum.
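One way to sketch the Newton-Raphson update for binary logistic regression is shown below. It assumes the standard forms of the gradient, $\Phi^T (y - t)$, and of the Hessian, $\Phi^T R \Phi$ with $R_{nn} = y_n (1 - y_n)$; the function name `newton_raphson` is hypothetical, and the sketch is not the original implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_raphson(Phi, t, n_iter=20):
    """Newton-Raphson (iterative reweighted least squares) for logistic regression."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi.dot(w))
        grad = Phi.T.dot(y - t)            # gradient of the cross-entropy error
        r = y * (1.0 - y)                  # diagonal entries of R
        H = (Phi.T * r).dot(Phi)           # Hessian: Phi^T R Phi
        w = w - np.linalg.solve(H, grad)   # step scaled by the inverse Hessian
    return w
```

Note that this only converges to finite weights when the classes overlap. For perfectly separable data the weights grow without bound and the Hessian becomes ill-conditioned.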

We can substitute all of the equations above into and obtain (after some algebraic manipulation):

*where *

For the multi-class case:

*where*

From that, we’re able to get the cross-entropy error function:

There are a few things that we have to pay attention to. One of them is the $R$ in the Hessian matrix: because we now have multiple classes, we end up with multiple $R$ matrices, where we now have:

,

We will be working with images from the Database of Faces (AT&T Laboratories Cambridge), which are of size 30×30. We’ll apply both generative and discriminative models to our data.

We will be using Python 2.7 for the development of this algorithm. To work with the given data, we first have to read an image file (.bmp) using Python. For this we need the “pillow” library, which enables us to read image files. The next step is converting the image files we read into a single numpy array.

The following code will convert the given image into a numpy array:

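```python
from PIL import Image
import numpy as np

# Read one .bmp file and convert it into an integer numpy array
img = Image.open("Data_Train/Class1/faceTrain1_755.bmp")
data = np.asarray(img, dtype="int32")
```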

You can also read all images from a directory iteratively using the following method:

```python
import glob
import numpy as np
from PIL import Image

imageFolderPath = 'Data_Train/Class1'
imagePath = glob.glob(imageFolderPath + '/*.bmp')
train_size = 100
# Stack the first train_size images into a single array
im_array_1 = np.array([np.array(Image.open(imagePath[i])) for i in range(train_size)])
```

The glob library is used here to find all pathnames matching a specified pattern, following the rules used by the Unix shell.

To reduce the computation needed to process the images, we have to reduce the dimensionality of the given dataset. Thus Principal Component Analysis (PCA) is used for dimensionality reduction.

The implementation of PCA can be found here

PCA must be fitted on the training data only, to ensure that we have a general model. The **mean** of the training data, the **eigenvectors** used for projection and the corresponding **eigenvalues** are reused later during testing, so they have to be stored. Here is how PCA is applied when there are both training and test datasets: when a test sample comes in, the mean is first subtracted from it, then it is projected onto the basis (eigenvectors), and finally it is whitened by dividing by the corresponding eigenvalue.
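The train/test split of the PCA procedure can be sketched as follows. This is an illustration rather than the linked implementation: `pca_fit` and `pca_transform` are hypothetical names, and the whitening here divides by the square root of each eigenvalue, which is an assumption on my part (it gives every projected component unit variance on the training set).

```python
import numpy as np

def pca_fit(X_train, n_components):
    """Estimate mean, eigenvectors and eigenvalues from the training data only."""
    mean = X_train.mean(axis=0)
    cov = np.cov(X_train - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    return mean, eigvecs[:, order], eigvals[order]

def pca_transform(X, mean, eigvecs, eigvals, whiten=True):
    """Apply the stored training statistics to any data (training or test)."""
    Z = (X - mean).dot(eigvecs)        # subtract the mean, project onto the basis
    if whiten:
        Z = Z / np.sqrt(eigvals)       # assumption: scale to unit variance
    return Z
```

For the face images, each 30×30 image would first be flattened into a 900-dimensional row vector, so `X_train` has one row per image.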

By using this model, we are able to model the data completely using Gaussian Distributions. An illustration of a scatter plot + Contour plot and scatter plot + surface plot is shown below:

It can be observed that red (group 2) and green (group 3) data points are closely clustered together which is a problem when we’re trying to classify them.

We got an accuracy of 81.3%
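A generative classifier of this kind can be sketched as follows. It assumes one Gaussian per class, each with its own full covariance matrix, fitted by maximum likelihood and combined with the class priors through Bayes’ theorem. The names `fit_gaussian_classes` and `predict_classes` are hypothetical, and the original experiment may have used a different covariance structure.

```python
import numpy as np

def fit_gaussian_classes(X, labels):
    """Maximum-likelihood prior, mean and covariance for each class."""
    params = {}
    for k in np.unique(labels):
        Xk = X[labels == k]
        prior = len(Xk) / float(len(X))
        params[k] = (prior, Xk.mean(axis=0), np.cov(Xk, rowvar=False, bias=True))
    return params

def predict_classes(X, params):
    """Pick the class with the largest log p(x | C_k) + log p(C_k)."""
    classes = sorted(params)
    scores = []
    for k in classes:
        prior, mu, cov = params[k]
        diff = X - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = np.sum(diff.dot(np.linalg.inv(cov)) * diff, axis=1)
        # the constant D*log(2*pi)/2 is the same for every class and is dropped
        scores.append(np.log(prior) - 0.5 * (logdet + maha))
    return np.array(classes)[np.argmax(np.array(scores), axis=0)]
```

Because the priors are estimated from class counts, reducing the number of samples in one class changes both its prior and the quality of its mean and covariance estimates.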

For this model, we solve for the weights directly, so we cannot obtain the mean and variance of the Gaussians that model our data. We are, however, still able to plot the decision boundary for this model. There are two ways to implement this model: one is Newton-Raphson iterative optimization, the other uses a fixed learning rate, obtained by replacing the Hessian matrix term with a fixed value.

It can be observed that the boundaries are linear.

Throughout this method, we can observe the error function and check its convergence. Since it is a convex function, it will fall to a global minimum.

The final log probability error we got here is: 964.4353

The accuracy we got here is 85%

The learning rate we used here is 0.002. Finding a good learning rate is really important for this method: with a bad learning rate, the algorithm might oscillate and never converge. Even when it converges, it takes much longer than the Newton-Raphson method, because the step size is now fixed rather than adapted by the Hessian matrix. In the end, it will converge to the same global minimum as the previous method if the initial conditions are the same.

Next, we want to see how unbalanced training data affects our performance. Here, we reduce the number of data points in the 3rd training dataset from 900 to 300 and see how it affects the performance.

It can be observed that datasets 2 and 3 (red and green) are closely blended together. This will of course affect the overall performance of this model. The accuracy we got here is 77.3%, which is lower, as expected.

The final log probability error we got here is 640.7852, which is lower than before. This is probably caused by the closely clustered datasets 2 and 3 (red and green), where both are assigned a high probability for each posterior class.

The accuracy we got here is 76%, which is understandable because we did not arrive at an optimal model.

http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf

All speech signals will be pre-emphasized by a pre-emphasis filter of

As we know, the whole process of LPC coefficient extraction can be divided into the following stages:

source: https://www.mathworks.com/help/dsp/examples/lpc-analysis-and-synthesis-of-speech.html

First, we would like to find the LPC prediction error for the **voiced** and **unvoiced** parts of a speech signal. The frames that we extracted are **/sh/** and **/iy/** from the word **“she”**. Each frame is 30 ms long, which at a sampling rate of 16000 Hz is 480 samples.

The error is computed from the equation where is derived from the LPCs.

We may wonder how the error varies with the order of the LPCs, so a diagram of Error vs Order is shown below:

It can be seen that the error falls as the LPC order increases and then changes little at higher orders.
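The error-versus-order behaviour can be reproduced with a short LPC routine. The sketch below assumes the autocorrelation method solved with the Levinson-Durbin recursion, which may differ from the tool used for the plots; `lpc` and `residual` are hypothetical names. The recursion also yields the reflection coefficients, one per stage.

```python
import numpy as np

def lpc(frame, order):
    """LPC analysis by the autocorrelation method (Levinson-Durbin recursion).

    Returns:
      a   -- prediction-error filter coefficients [1, a1, ..., ap]
      k   -- reflection coefficients, one per stage
      err -- prediction error energy after each order 0..p
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # autocorrelation lags r[0..order]
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = np.zeros(order + 1)
    err[0] = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        ki = -acc / err[i - 1]
        prev = a.copy()
        a[1:i] = prev[1:i] + ki * prev[i - 1:0:-1]
        a[i] = ki
        k[i - 1] = ki
        err[i] = err[i - 1] * (1.0 - ki ** 2)   # error can only shrink per stage
    return a, k, err

def residual(frame, a):
    """Prediction error e[n] = sum_j a[j] * s[n - j] (inverse filtering)."""
    return np.convolve(frame, a)[:len(frame)]
```

Since each stage multiplies the error by $1 - k_i^2$, the values in `err` can never increase with the order, and plotting them gives an error-versus-order curve.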

An illustration is shown below for Residual Error vs Time for LPCs of order 12.

As we can see, there is a periodic impulse train for voiced speech. It is due to the vibration of the glottis, and it is what we call pitch. For unvoiced speech, we do not see the same structure; all we see is white noise.

The coefficients we obtained from LPC can be implemented using a ladder structure, but most of the time we implement them in a lattice structure, as the computation is more efficient: the error is computed through the stages of a lattice filter, and we can choose the number of stages we want for our error.

An illustration of reflection coefficients vs time is shown below:

Next, we will obtain the spectrum derived from the LPC coefficients where the spectrum is given by:

Thus we obtain:

We can see that it is the envelope of the spectrum of the original signal. It also shows the variation of the formants of the speech, which is what we want in speech recognition.
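The LPC spectral envelope can be sketched with a single FFT of the coefficient vector. This assumes the usual all-pole model, where the spectrum is the gain divided by the frequency response of the prediction-error filter $A(z)$; the names `lpc_envelope_db` and `gain` and the FFT size are hypothetical choices.

```python
import numpy as np

def lpc_envelope_db(a, gain=1.0, nfft=512, fs=16000.0):
    """Magnitude (in dB) of the all-pole spectrum gain / A(e^{jw}).

    a    -- prediction-error filter coefficients [1, a1, ..., ap]
    gain -- model gain (assumed known, e.g. from the residual energy)
    """
    A = np.fft.rfft(a, nfft)                   # A(z) sampled on the unit circle
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)  # frequency axis in Hz
    return freqs, 20.0 * np.log10(gain / np.abs(A))

# Example: a single pole near DC gives an envelope that falls with frequency
freqs, env = lpc_envelope_db(np.array([1.0, -0.9]))
```

Overlaying `env` on the FFT magnitude of the original frame shows how the envelope follows the spectral peaks.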

Next, we obtain the cepstrum of the speech by taking the inverse Fourier transform of the log of the spectrum obtained from LPC.

The plot of the cepstrum vs quefrency is shown below:

We removed the first coefficient of the cepstrum as it depends only on [ref]. We then perform a Discrete Cosine Transform to obtain the spectrum.

Next, we would like to introduce MFCC, which is a commonly used method for obtaining the cepstrum of a speech signal. The difference between MFCC and the ordinary method of obtaining the cepstrum is that MFCC gives more resolution to the low-frequency components of a speech signal, which is what we call warping. This simulates the human ear’s sensitivity to speech signals, where the frequency resolution of the human ear decreases roughly logarithmically with increasing frequency. The diagram below shows how this emphasis is done by MFCC:

It can be seen that only 5 banks are needed to cover 4 kHz to 8 kHz.
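The uneven spacing of the filter banks can be illustrated by placing filter centers uniformly on the mel scale. The sketch assumes the common conversion formula $m = 2595 \log_{10}(1 + f / 700)$; the filter count of 26 is a hypothetical choice, so the exact number of banks above 4 kHz need not match the diagram.

```python
import numpy as np

def hz_to_mel(f):
    """Common mel-scale formula (an assumption; other variants exist)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies (Hz) of filters spaced uniformly on the mel scale."""
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    return mel_to_hz(mels[1:-1])      # drop the two band edges

centers = mel_center_frequencies(26, 0.0, 8000.0)
n_upper = int(np.sum(centers >= 4000.0))   # banks covering 4 kHz to 8 kHz
n_lower = int(np.sum(centers < 4000.0))    # banks covering 0 Hz to 4 kHz
```

The spacing between consecutive centers grows with frequency, which is why far fewer banks end up in the upper half of the band than in the lower half.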

Next, we would like to use MFCC to obtain the spectrum of **/iy/** and then use DCT to obtain the spectrum of the resulting cepstrum:
