Disclaimer: most of the content is from the HTKBook; this is just a summary based on my own Chinese digit recognition dataset.
Development Platform
OS: Linux 4.9.27-1
Tools: Wavesurfer, HTK, Python 2.7
Data Set Segregation
The data set we will be using is the NCTUDS-100 database. The files are named as follows:
md010101.pcm ==> male, group-01, speaker-01, sentence-01
fd020301.pcm ==> female, group-02, speaker-03, sentence-01
md030501.pcm ==> male, group-03, speaker-05, sentence-01
We will use speakers x0 (i.e. speaker IDs ending in 0) as our test dataset and the rest as our training dataset. To do this, we need to separate the files; the way I separate them is sketched below.
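A minimal Python sketch of the split, assuming all the .pcm files sit in a flat data/ directory (adjust the paths to your layout):

```python
import os
import shutil

SRC = 'data'          # assumed flat directory holding all *.pcm files
TRAIN, TEST = 'train', 'test'

for d in (TRAIN, TEST):
    if not os.path.exists(d):
        os.makedirs(d)

for name in os.listdir(SRC):
    if not name.endswith('.pcm'):
        continue
    # naming scheme: [mf]d<group:2><speaker:2><sentence:2>.pcm
    speaker = name[4:6]
    # speakers x0 (10, 20, ...) go to the test set, the rest to training
    dest = TEST if speaker.endswith('0') else TRAIN
    shutil.copy(os.path.join(SRC, name), os.path.join(dest, name))
```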
Data Preparation
Before data can be recorded:
- Phone set must be defined
- Task grammar must be defined
Grammar
For a voice dialing application, an example grammar would be:
$digit = ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | OH | ZERO;
$name = [ JOOP ] JANSEN | [ JULIAN ] ODELL | [ DAVE ] OLLASON | [ PHIL ] WOODLAND | [ STEVE ] YOUNG;
( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name) SENT-END )
In our application, our Grammar would look like this:
$digit = ling | yi | er | san | si | wu | liu | qi | ba | jiu;
( SENT-START <$digit> SENT-END )
HParse gram wdnet
will create an equivalent word network in the file wdnet.
The Dictionary
For our task, we can easily create the list of required words by hand. In more complicated tasks where the data set is large, we would need a global dictionary that provides the word-to-phone mapping. HDMan is useful here, as shown below:
HDMan -m -w wlist -n monophones1 -l dlog dict beep names
Creating Transcription Files
First, we need to generate a word-level transcription. Since the data set we are given is labelled, we can generate the word-level transcript from those label files. We will write a Python script to convert the given labels to the format needed by HTK.
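A sketch of such a conversion script, assuming each utterance has a companion .txt label file in labels/ containing the spoken digits separated by spaces (the dataset's actual label format may differ):

```python
import glob
import os

# Assumed input: one .txt label file per utterance, e.g. "wu san ling yi"
# Output: an HTK word-level Master Label File (words.mlf)
with open('words.mlf', 'w') as mlf:
    mlf.write('#!MLF!#\n')
    for path in glob.glob('labels/*.txt'):
        base = os.path.splitext(os.path.basename(path))[0]
        mlf.write('"*/%s.lab"\n' % base)
        with open(path) as f:
            for word in f.read().split():
                mlf.write(word + '\n')
        mlf.write('.\n')    # "." terminates each utterance in an MLF
```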
Conversion of Word-level to Phone-level Transcription
We need to generate a phone-level transcription for further processing. Here, each phone is the same as its word, for ease of implementation.
The dictionary that we will be using is hand-crafted as shown below:
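A sketch of what this dictionary looks like, following the HTKBook convention of appending a short pause sp to each pronunciation (the DE sp command used below strips it for the first pass) and mapping the grammar's SENT-START and SENT-END to silence:

```
SENT-END   [] sil
SENT-START [] sil
ba         ba sp
er         er sp
jiu        jiu sp
ling       ling sp
liu        liu sp
qi         qi sp
san        san sp
si         si sp
sil        sil
wu         wu sp
yi         yi sp
```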
The command to do this conversion would be:
HLEd -l '*' -d dict -i phones0.mlf mkphones0.led words.mlf
The contents of mkphones0.led would be:
EX
IS sil sil
DE sp
- EX: Replace each word in words.mlf by the corresponding pronunciation in the dictionary file dict.
- IS: Inserts a silence model sil at the start and end of every utterance.
- DE: Deletes all short-pause sp labels, which are not wanted in the transcription labels at this point.
Feature Extraction
Next, we would need to extract MFCC features from our audio data. HCopy will be used here.
HCopy -T 1 -C wav_config "wave file" "mfcc file"
or
HCopy -T 1 -C wav_config -S mfc_train.scp
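In the second form, each line of mfc_train.scp pairs a source waveform with its target feature file, for example (paths are illustrative):

```
train/md010101.pcm  mfcc/md010101.mfc
train/md010102.pcm  mfcc/md010102.mfc
```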
wav_config
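The contents of wav_config should resemble the following sketch, assuming headerless 16 kHz, 16-bit raw PCM input and 39-dimensional MFCC_0_D_A features (adjust SOURCERATE and friends to your data):

```
SOURCEFORMAT = NOHEAD     # raw .pcm files have no header
SOURCEKIND   = WAVEFORM
SOURCERATE   = 625        # 16 kHz sample period, in 100 ns units
TARGETKIND   = MFCC_0_D_A # 13 static + 13 delta + 13 accel = 39
TARGETRATE   = 100000.0   # 10 ms frame shift
WINDOWSIZE   = 250000.0   # 25 ms analysis window
USEHAMMING   = T
PREEMCOEF    = 0.97
NUMCHANS     = 26
CEPLIFTER    = 22
NUMCEPS      = 12
```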
Creating Monophone HMMs
Here, we will create a set of single-Gaussian monophone HMMs. At first, there will be a set of identical monophone HMMs in which every mean and variance is the same. These are then retrained, a short-pause model is added, and the silence model is extended slightly. The monophones are then retrained.
Once reasonable monophone HMMs have been created, the recogniser tool HVite can be used to perform a forced alignment of the training data.
Creating Flat Start Monophones
The parameters of this model are not important; its purpose is to define the model topology. For phone-based systems, a good topology to use is 3-state left-right with no skips, such as the following:
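Below is a prototype definition (proto) in the style of the HTKBook example; the actual numbers are placeholders and only the topology matters:

```
~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 39
      0.0 0.0 0.0 ...
    <Variance> 39
      1.0 1.0 1.0 ...
  <State> 3
    <Mean> 39
      0.0 0.0 0.0 ...
    <Variance> 39
      1.0 1.0 1.0 ...
  <State> 4
    <Mean> 39
      0.0 0.0 0.0 ...
    <Variance> 39
      1.0 1.0 1.0 ...
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
<EndHMM>
```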
where each ellipsed vector is of length 39. This number, 39, is computed from the length of the parameterised static vector (MFCC 0 = 13) plus the delta coefficients (+13) plus the acceleration coefficients (+13).
The HTK tool HCompV will scan a set of data files, compute the global mean and variance and set all of the Gaussians in a given HMM to have the same mean and variance. Hence, assuming that a list of all the training files is stored in train.scp, the command
HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto
will create a new version of proto in the directory hmm0, in which the zero means and unit variances have been replaced by the global speech means and variances. The -f option causes a variance floor macro (called vFloors) to be generated which is equal to 0.01 times the global variance. This is a vector of values which will be used to set a floor on the variances estimated in the subsequent steps. The -m option asks for means to be computed as well as variances. Given this new prototype model stored in the directory hmm0, a Master Macro File (MMF) called hmmdefs containing a copy for each of the required monophone HMMs is constructed by manually copying the prototype and relabeling it for each required monophone (including "sil"). The format of an MMF is similar to that of an MLF, and it serves a similar purpose in that it avoids having a large number of individual HMM definition files.
Model Re-estimation
The flat start monophones stored in the directory hmm0 are re-estimated using the embedded re-estimation tool HERest invoked as follows
HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophone0.lst
Most of the files used in this invocation of HERest have already been described. The exception is the file macros. This should contain a so-called global options macro and the variance floor macro vFloors generated earlier. The global options macro simply defines the HMM parameter kind and the vector size.
macros
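The macros file holds the global options macro, roughly as below (the vector size and parameter kind must match the features produced by HCopy):

```
~o
<STREAMINFO> 1 39
<VECSIZE> 39<NULLD><MFCC_0_D_A>
```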
vFloors
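vFloors, written by HCompV, holds the variance-floor macro; its 39 floor values depend on the training data, so they are elided here:

```
~v "varFloor1"
<Variance> 39
  ...
```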
Execution of HERest should be repeated twice more, changing the name of the input and output directories (set with the options -H and -M) each time, until the directory hmm3 contains the final set of initialised monophone HMMs.
Fixing the Silence Model
We make the model more robust by allowing individual states to absorb the various impulsive noises in the training data.
A one-state short-pause model sp should be created. This should be a so-called tee-model, which has a direct transition from its entry node to its exit node. The sp model has its single emitting state tied to the centre state of the silence model.
These silence models can be created in two stages
- Use a text editor on the file hmm3/hmmdefs to copy the centre state of the sil model to make a new sp model and store the resulting MMF hmmdefs, which includes the new sp model, in the new directory hmm4.
- Run the HMM editor HHEd to add the extra transitions required and tie the sp state to the centre sil state
HHEd works in a similar way to HLEd. It applies a set of commands in a script to modify a set of HMMs. In this case, it is executed as follows
HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophone1.lst
where sil.hed contains the following commands
AT 2 4 0.2 {sil.transP}
AT 4 2 0.2 {sil.transP}
AT 1 3 0.3 {sp.transP}
TI silst {sil.state[3],sp.state[2]}
- AT: Add transitions to the given transition matrices
- TI: Creates a tied-state called silst.
The parameters of this tied state are stored in the hmmdefs file, and within each silence model the original state parameters are replaced by the name of this macro. Macros are described in more detail in the HTKBook; for now it is sufficient to regard them simply as the mechanism by which HTK implements parameter sharing. Note that the phone list used here has changed: the original list monophone0.lst has been extended with the new sp model. The new file is called monophone1.lst and is used in the above HHEd command.
monophone1.lst
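For this task, it would simply list every model, including the new sp:

```
ling
yi
er
san
si
wu
liu
qi
ba
jiu
sil
sp
```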
After HHEd, two more re-estimations (HERest) are done.
Before re-estimation, we have to create a new phone-level master label file, this time keeping the sp labels:
HLEd -l '*' -d dict -i phones1.mlf mkphones1.led words.mlf
mkphones1.led
EX
IS sil sil
For the final re-estimation, we can add the option "-s stats_mix1". It will dump a file that describes the state occupation of every model unit. This will be helpful when we need to increase the number of mixtures later on.
Realigning the Training Data
The phone models created so far can be used to realign the training data and create new transcriptions.
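A sketch of the invocation, adapted from the HTKBook's forced-alignment example to the file names used here (the -b sil boundary word and the pruning threshold are assumptions):

HVite -a -m -o SWT -b sil -C config -H hmm8/macros -H hmm8/hmmdefs -l f_align -y lab -t 250.0 -I data.mlf -S train.scp dict monophone1.lst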
This command uses the HMMs stored in hmm8 to transform the input word level transcription data.mlf to the new phone level transcriptions in the directory f_align using the pronunciations stored in the dictionary dict
The recogniser now considers all pronunciations for each word and outputs the pronunciation that best matches the acoustic data.
It can be observed that the results we obtained are accurate.
Recognizer Evaluation
Assuming test.scp holds a list of the coded test files, each test file will be recognised and its transcription written to an MLF called recout.mlf by executing the following:
HVite -H hmm8/macros -H hmm8/hmmdefs -S test.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict monophone1.lst
The actual performance can be determined by running HResults as follows
HResults -I test.mlf monophone0.lst recout.mlf
The results we get:
It can be observed that the results we get are really bad: the percentage of correct words is much higher than the accuracy, which implies that there are too many insertion errors. We then looked at test.mlf and recout.mlf to see what happened:
It can be observed that the words we need all match, but a lot of extra phones are recognised.
We can fix this by tuning the "-p" option of HVite, which is the word insertion penalty. We ran it over a range from 0 to -400 using a Python script (sketched below) and got the following results:
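Here is a rough Python 2 sketch of that sweep, reusing the HVite/HResults invocations above; parsing the %Corr/Acc figures out of the HResults summary line is an assumption about its usual output format:

```python
import re
import subprocess

# Sweep the word insertion penalty and record %Corr / %Acc from HResults
for p in range(0, -401, -20):
    subprocess.call(['HVite', '-H', 'hmm8/macros', '-H', 'hmm8/hmmdefs',
                     '-S', 'test.scp', '-l', '*', '-i', 'recout.mlf',
                     '-w', 'wdnet', '-p', str(p), '-s', '5.0',
                     'dict', 'monophone1.lst'])
    out = subprocess.check_output(['HResults', '-I', 'test.mlf',
                                   'monophone0.lst', 'recout.mlf'])
    # e.g. "WORD: %Corr=98.04, Acc=95.10 [H=100, D=1, S=1, I=3, N=102]"
    m = re.search(r'Corr=([\d.]+).*Acc=([-\d.]+)', out)
    if m:
        print('%s %s %s' % (p, m.group(1), m.group(2)))
```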
It can be observed that the accuracy peaks at around p = -60, after which both the accuracy and the number of correct hits continue to fall.
Source Code and Dataset
The complete source code is available at https://github.com/eugenelet/HTK-Force-Alignment. The dataset belongs to my University, so I can't release it here; feel free to PM me to get it.