Forced Alignment using Hidden Markov Models

Disclaimer: Most of the content is from the HTKBook; this is just a summary based on my own Chinese digit recognition dataset.

Development Platform

OS: Linux 4.9.27-1

Tools: Wavesurfer, HTK, Python2.7

Data Set Segregation

The data set that we will be using is the NCTUDS-100 database. The files are named as follows:

 md010101.pcm ==> male,group-01,speaker-01,sentence-01
 fd020301.pcm ==> female,group-02,speaker-03,sentence-01
 md030501.pcm ==> male,group-03,speaker-05,sentence-01

We will be using speakers whose IDs end in 0 (x0) as our test set and the others as the training set. To do this, we need to separate them, which I did as follows:

find . | grep -P "[a-z]{2}[0-9]{3}[0]" | xargs -I{} cp {} test/

find . | grep -P "[a-z]{2}[0-9]{3}[^0]" | xargs -I{} cp {} train/
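The same split can be sketched in Python. This assumes the naming convention described above: two letters, then group, speaker and sentence numbers of two digits each, so the fourth digit is the last digit of the speaker ID.

```python
import re

def split_target(filename):
    """Return 'test' if the speaker ID ends in 0, else 'train'.

    Assumes the mdGGSSUU.pcm naming convention: two letters,
    then group (2 digits), speaker (2 digits), sentence (2 digits).
    """
    m = re.match(r"[a-z]{2}\d{2}(\d{2})\d{2}\.pcm$", filename)
    if m is None:
        raise ValueError("unexpected filename: %s" % filename)
    speaker = m.group(1)
    return "test" if speaker.endswith("0") else "train"
```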



Data Preparation

Before data can be recorded:

  1. Phone set must be defined
  2. Task grammar must be defined


For voice dialing application, an example grammar would be:

$digit = ONE | TWO | THREE | FOUR | FIVE | 
         SIX | SEVEN | EIGHT | NINE | OH | ZERO; 
$name = [ JOOP ] JANSEN |
        [ JULIAN ] ODELL |
        [ DAVE ] OLLASON |
        [ PHIL ] WOODLAND |
        [ STEVE ] YOUNG; 
( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name) SENT-END )

In our application, our Grammar would look like this:

$digit = ling | yi | er | san | si | wu | liu | qi | ba | jiu; 
( SENT-START <$digit> SENT-END )


Running

HParse gram wdnet

will create an equivalent word network in the file wdnet.
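To sanity-check a grammar, HTK provides HSGen, which generates random sentences from the word network. A rough Python equivalent for our simple digit-loop grammar (the upper bound of 8 digits per sentence is an arbitrary choice for illustration):

```python
import random

DIGITS = ["ling", "yi", "er", "san", "si", "wu", "liu", "qi", "ba", "jiu"]

def random_sentence(rng=random):
    """Generate a random sentence accepted by the digit grammar:
    SENT-START, then one or more digits (HTK's <...> notation
    means one or more repetitions), then SENT-END."""
    n = rng.randint(1, 8)  # arbitrary bound, for illustration only
    words = [rng.choice(DIGITS) for _ in range(n)]
    return " ".join(["SENT-START"] + words + ["SENT-END"])
```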


The Dictionary

For our task, we can easily create the list of required words by hand. In more complicated tasks where the data set is large, we would need a global dictionary providing the word-to-phone mapping. HDMan would be useful here:


HDMan -m -w wlist -n monophones1 -l dlog dict beep names

Creating Transcription Files

First, we would need to generate a word level transcription. Since the data set that we are given is labelled, we can generate the word level transcript using that label file. We will write a script using Python to convert the labels given to the format needed by HTK.
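A minimal sketch of such a conversion, assuming each label file simply lists one word per line (the actual NCTUDS-100 label format may differ):

```python
def labels_to_mlf(label_words, lab_name):
    """Build a word-level MLF entry for one utterance.

    label_words: list of word labels for the utterance.
    lab_name: the utterance's label file name, e.g. 'md010101.lab'.
    HTK MLF format: header '#!MLF!#', then a '"*/name.lab"' block
    per utterance, one label per line, terminated by a '.'.
    """
    lines = ['#!MLF!#', '"*/%s"' % lab_name]
    lines.extend(label_words)
    lines.append('.')
    return "\n".join(lines) + "\n"
```

For multiple utterances the header appears once and the per-utterance blocks are simply concatenated.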

Conversion of Word-Level to Phone-Level Transcription

We would need to generate a phone-level transcription for further processing. Here, our phone would be the same as our word for ease of implementation.

The dictionary that we will be using is hand-crafted; since our phones are the same as our words, each entry simply maps a digit word to itself, followed by a short pause sp.


The command to do this conversion would be:

HLEd -l '*' -d dict -i phones0.mlf mkphones0.led words.mlf

The contents of mkphones0.led would be:

EX
IS sil sil 
DE sp
  • EX: Replace each word in words.mlf by the corresponding pronunciation in the dictionary file dict.
  • IS: Inserts a silence model sil at the start and end of every utterance.
  • DE: Deletes all short-pause sp labels, which are not wanted in the transcription labels at this point.
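The effect of these three commands can be illustrated with a small Python sketch, where pron_dict maps each word to its phone sequence and, as is conventional in HTK dictionaries, every pronunciation ends in sp:

```python
def words_to_phones(words, pron_dict):
    """Mimic HLEd's EX, IS and DE commands on one utterance.

    EX: expand each word into its pronunciation,
    IS: insert sil at the start and end,
    DE: delete all sp labels.
    """
    phones = []
    for w in words:
        phones.extend(pron_dict[w])          # EX
    phones = ["sil"] + phones + ["sil"]      # IS sil sil
    return [p for p in phones if p != "sp"]  # DE sp
```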



Feature Extraction

Next, we would need to extract MFCC features from our audio data. HCopy will be used here.


HCopy -T 1 -C wav_config "wave file" "mfcc file"
HCopy -T 1 -C wav_config -S mfc_train.scp
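The script (.scp) file for HCopy lists a source wave file and a target feature file on each line. A small Python sketch to generate it (paths are illustrative):

```python
import os

def make_hcopy_scp(pcm_files, out_dir):
    """Build HCopy script lines: '<source>.pcm <target>.mfc' per line."""
    lines = []
    for pcm in pcm_files:
        base = os.path.splitext(os.path.basename(pcm))[0]
        lines.append("%s %s" % (pcm, os.path.join(out_dir, base + ".mfc")))
    return "\n".join(lines) + "\n"
```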



Creating Monophone HMMs

Here, we will be creating single-Gaussian monophone HMMs. At first, there’ll be a set of identical monophone HMMs in which every mean and variance is identical. These are then retrained, a short-pause model is added and the silence model is extended slightly. The monophones are then retrained.

Once reasonable monophone HMMs have been created, the recogniser tool HVite can be used to perform a forced alignment of the training data.

Creating Flat Start Monophone

The parameters of this model are not important; its purpose is to define the model topology. For phone-based systems, a good topology to use is 3-state left-right with no skips, such as the following:


where each ellipsed vector is of length 39. This number, 39, is computed from the length of the parameterised static vector (MFCC 0 = 13) plus the delta coefficients (+13) plus the acceleration coefficients (+13).
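A prototype definition for this topology might look like the following sketch. The actual mean and variance values are irrelevant, since HCompV will overwrite them, and the runs of 39 values are abbreviated here with ellipses:

```
~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 39
      0.0 0.0 ... 0.0
    <Variance> 39
      1.0 1.0 ... 1.0
  <State> 3
    ...
  <State> 4
    ...
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
<EndHMM>
```

States 1 and 5 are the non-emitting entry and exit states, so only states 2–4 carry output distributions.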

The HTK tool HCompV will scan a set of data files, compute the global mean and variance and set all of the Gaussians in a given HMM to have the same mean and variance. Hence, assuming that a list of all the training files is stored in train.scp, the following command creates an initialised version of proto in the directory hmm0:

HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto

The -f option causes a variance floor macro (called vFloors) to be generated which is equal to 0.01 times the global variance. This is a vector of values which will be used to set a floor on the variances estimated in the subsequent steps. The -m option asks for means to be computed as well as variances.

Given this new prototype model stored in the directory hmm0, a Master Macro File (MMF) called hmmdefs containing a copy for each of the required monophone HMMs is constructed by manually copying the prototype and relabeling it for each required monophone (including “sil”). The format of an MMF is similar to that of an MLF and it serves a similar purpose in that it avoids having a large number of individual HMM definition files.

Model Re-estimation

The flat start monophones stored in the directory hmm0 are re-estimated using the embedded re-estimation tool HERest invoked as follows

HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophone0.lst

Most of the files used in this invocation of HERest have already been described. The exception is the file macros. This should contain a so-called global options macro and the variance floor macro vFloors generated earlier. The global options macro simply defines the HMM parameter kind and the vector size.

Execution of HERest should be repeated twice more, changing the name of the input and output directories (set with the options -H and -M) each time, until the directory hmm3 contains the final set of initialised monophone HMMs.
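The repeated invocations differ only in the directory names; a small Python sketch that builds the three command lines:

```python
def herest_commands(first, last):
    """Build the HERest command for each re-estimation pass,
    reading from hmm<i> and writing to hmm<i+1>."""
    template = ("HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 "
                "-S train.scp -H hmm%d/macros -H hmm%d/hmmdefs "
                "-M hmm%d monophone0.lst")
    return [template % (i, i, i + 1) for i in range(first, last)]
```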

Fixing the Silence Model

We make the model more robust by allowing individual states to absorb the various impulsive noises in the training data.

A one-state short-pause sp model should be created. This should be a so-called tee-model, which has a direct transition from the entry to the exit node. The sp model has its emitting state tied to the centre state of the silence model.


These silence models can be created in two stages

  • Use a text editor on the file hmm3/hmmdefs to copy the centre state of the sil model to make a new sp model and store the resulting MMF hmmdefs, which includes the new sp model, in the new directory hmm4.
  • Run the HMM editor HHEd to add the extra transitions required and tie the sp state to the centre sil state

HHEd works in a similar way to HLEd. It applies a set of commands in a script to modify a set of HMMs. In this case, it is executed as follows

HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophone1.lst

where sil.hed contains the following commands

AT 2 4 0.2 {sil.transP} 
AT 4 2 0.2 {sil.transP} 
AT 1 3 0.3 {sp.transP} 
TI silst {sil.state[3],sp.state[2]}
  • AT: Add transitions to the given transition matrices
  • TI: Creates a tied-state called silst.

The parameters of this tied-state are stored in the hmmdefs file and within each silence model, the original state parameters are replaced by the name of this macro. Macros are described in more detail below. For now it is sufficient to regard them simply as the mechanism by which HTK implements parameter sharing. Note that the phone list used here has changed: the original list monophone0.lst has been extended with the new sp model. The new file is called monophone1.lst and is used in the HHEd command above.




After HHEd, two more re-estimations (HERest) are done.

Before re-estimation, we have to regenerate the phone-level master label file, this time keeping the short-pause sp labels:

HLEd -l '*' -d dict -i phones1.mlf mkphones1.led words.mlf

where mkphones1.led contains:

EX
IS sil sil

For the final re-estimation, we can add the option “-s stats_mix1”. This dumps a file describing the state occupation counts of every model, which will be helpful later when we need to increase the number of mixtures.
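As a preview, the simplest way to increase the number of mixtures later is HHEd's MU command. A minimal mixture-up script (the file name mix.hed is hypothetical) that splits every emitting state of every model into 2 mixture components:

```
MU 2 {*.state[2-4].mix}
```

It would be applied with HHEd in the same way as sil.hed above (directory names illustrative), followed by further HERest passes.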

Realigning the Training Data

The phone models created so far can be used to realign the training data and create new transcriptions.

HVite -a -b sil -o SW -m -y lab -I words.mlf -S train.scp -C config -H hmm8/hmmdefs -H hmm8/macros -l f_align/ dict monophone1.lst

This command uses the HMMs stored in hmm8 to transform the input word-level transcription words.mlf to new phone-level transcriptions in the directory f_align, using the pronunciations stored in the dictionary dict.

The recogniser now considers all pronunciations for each word and outputs the pronunciation that best matches the acoustic data.



It can be observed that the results we obtained are accurate.

Recognizer Evaluation

Assuming test.scp holds a list of the coded test files, each test file will be recognised and its transcription written to an MLF called recout.mlf by executing the following:

HVite -H hmm8/macros -H hmm8/hmmdefs -S test.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict monophone1.lst

The actual performance can be determined by running HResults as follows

HResults -I test.mlf monophone0.lst recout.mlf

The results we get:


It can be observed that the results we get are really bad: the percentage of correct words is much higher than the accuracy, which implies that the number of insertions is too high. We then looked at test.mlf and recout.mlf to see what happened:

It can be observed that the words we need are matched, but a lot of extra phones are recognised.

We can fix this by tuning the “-p” option of HVite, which is the word insertion penalty. We ran through a range from 0 to -400 using a Python script and got the following results:
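The sweep script only needs to rerun HVite with different -p values and parse the HResults summary line. A sketch of the parsing step, relying on HResults' standard word-level output format (the numbers in the docstring are illustrative, not our actual results):

```python
import re

def parse_hresults(line):
    """Extract %Corr and Acc from an HResults word-level summary line,
    e.g. 'WORD: %Corr=98.04, Acc=85.10 [H=100, D=1, S=1, I=13, N=102]'.
    Acc may be negative when insertions dominate."""
    m = re.search(r"%Corr=([\d.]+), Acc=(-?[\d.]+)", line)
    if m is None:
        raise ValueError("not an HResults summary line")
    return float(m.group(1)), float(m.group(2))
```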


It can be observed that the accuracy peaks around p = -60, and both the accuracy and the number of correct hits fall after that.


