Dynamic Time Warping for Speech Recognition


Dynamic Time Warping (DTW) is an algorithm for matching two speech sequences that have the same content but may differ in the duration of certain parts of speech (phones, for example). Here we do not use the phone as the basic unit; instead we use frames of MFCC features, obtained by running feature extraction over a sliding window. We use 12 MFCC features per frame, excluding energy.

MFCC Feature Extraction

We obtain the MFCC features with the HTK toolkit rather than implementing the extraction ourselves. As you may notice, the audio file used here has a different format from the one in the previous posts. Previously we used a headerless (“alien”) audio file; here we use an audio file with a fixed header format, RIFF.

The changes made are as follows:


SOURCEFORMAT = ALIEN #file with head length = 0
SOURCERATE = 625.0 #sampling rate, unit: 100 nsec



If you’re curious, the new header is 44 bytes long and the sample rate is stored in the header itself.
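
As a rough illustration of where that information lives, here is a minimal Python sketch that builds a canonical 44-byte RIFF/WAVE header and reads the sample rate back from byte offset 24. The helper names are made up for this example, and the 16 kHz rate is an assumption derived from the old SOURCERATE of 625 (625 × 100 ns = 62.5 µs per sample). Real files may carry extra chunks that make the header longer.

```python
import struct

def build_wav_header(sample_rate, num_samples, channels=1, bits=16):
    """Build a canonical 44-byte RIFF/WAVE header for PCM audio (hypothetical helper)."""
    block_align = channels * bits // 8
    data_size = num_samples * block_align
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_size, b"WAVE",      # RIFF chunk descriptor
        b"fmt ", 16, 1, channels, sample_rate,  # fmt chunk, format 1 = PCM
        sample_rate * block_align, block_align, bits,
        b"data", data_size,                     # data chunk header
    )

def read_sample_rate(header):
    """The sample rate is a little-endian uint32 at byte offset 24."""
    return struct.unpack_from("<I", header, 24)[0]

# Assumed rate: 16000 Hz, matching the old SOURCERATE of 625 (100 ns units).
header = build_wav_header(sample_rate=16000, num_samples=0)
print(len(header), read_sample_rate(header))
```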


First, we need to clarify that DTW and Viterbi are different algorithms. According to the linked reference, the Viterbi algorithm is a pattern-matching algorithm based on statistical probability, whereas DTW is a pattern-matching algorithm based on template matching. We use DTW here because no probability is involved.

3-Steps Constraint

We’ll implement two methods that differ in their path restrictions. In the first method, every step from the current point is restricted to (0,+1), (+1,0), or (+1,+1), as shown in the diagram below:


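The 3-step recursion can be sketched in Python roughly as follows. This is a minimal sketch rather than the code used for the results in this post: the Euclidean frame distance, the unweighted steps, and the function names are assumptions, and the first index is taken to be the template frame and the second the test frame. The toy frames have 2 features instead of 12 to keep the example short.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed local cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_cost_3step(template, test):
    """Accumulated DTW cost with steps (0,+1), (+1,0) and (+1,+1)."""
    n, m = len(template), len(test)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = frame_distance(template[i], test[j])
            if i == 0 and j == 0:
                acc[i][j] = d
                continue
            left = acc[i][j - 1] if j > 0 else inf                 # arrived by (0,+1)
            below = acc[i - 1][j] if i > 0 else inf                # arrived by (+1,0)
            diag = acc[i - 1][j - 1] if i > 0 and j > 0 else inf   # arrived by (+1,+1)
            acc[i][j] = min(left, below, diag) + d
    return acc[n - 1][m - 1]

# Toy data (made up): a template, a time-stretched copy, and a different sequence.
template = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
stretched = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
other = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]

print(dtw_cost_3step(template, template))   # 0.0
print(dtw_cost_3step(template, stretched))  # 0.0
print(dtw_cost_3step(template, other))      # 1.0
```

The stretched copy still costs zero because the repeated frames are absorbed by (0,+1) steps, which is exactly the time-warping behaviour we want.
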
We’ll show the results of the DTW algorithm one by one. The template and test data used here are recordings of the phrase 交通大學. We start by matching the template with itself to show that it works as expected (a straight line should appear):


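To see why a straight line is expected, here is a small Python sketch that also keeps back-pointers and recovers the optimal path. It is an illustration under assumptions, not the code behind the plots: frames are reduced to single numbers, the local cost is the absolute difference, and ties are broken in favour of the diagonal step.

```python
def dtw_path(template, test):
    """Return (cost, path) for 1-D sequences with steps (0,+1), (+1,0), (+1,+1)."""
    n, m = len(template), len(test)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(template[i] - test[j])
            if i == 0 and j == 0:
                acc[i][j] = d
                continue
            # Diagonal predecessor listed first so that ties prefer the straight line.
            candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            candidates = [(p, q) for p, q in candidates if p >= 0 and q >= 0]
            p, q = min(candidates, key=lambda c: acc[c[0]][c[1]])
            acc[i][j] = acc[p][q] + d
            back[i][j] = (p, q)
    # Walk the back-pointers from the end node to the start node.
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return acc[n - 1][m - 1], path

seq = [1.0, 3.0, 2.0, 5.0]  # made-up 1-D "frames"
cost, path = dtw_path(seq, seq)
print(cost, path)  # 0.0 [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Matching a sequence with itself gives zero cost along the diagonal, so the recovered path is the straight line (0,0), (1,1), (2,2), and so on.
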
We then compare the template with the same phrase, but the test data used here is a faster version of the template:


It can be observed that the curve grows horizontally very quickly. We compare this result with the same speech spoken at a slower pace, shown below:


It can be observed that this curve grows vertically faster than the one above (observe the vertical axis).

Now we’ll show the results obtained when comparing with different speech sequences. (We change the test data and use speech spoken at the same pace.)



Observe how the extra last word corresponds to the DTW graph.







Observe how poor the result is for entirely different words.

A comparison with different speech sequences is shown below (medium-pace 交通大學 as the template):

Test Data Cost
快-交通大學 3.5393e+03
快-交通大隊 5.3293e+03
快-信號處理 7.5717e+03
快-語音信號 1.0811e+04
快-語音處理 7.9386e+03
慢-交通大學 4.7898e+03
慢-交通大隊 8.2430e+03
慢-信號處理 9.5801e+03
慢-語音信號 8.8775e+03
慢-語音處理 1.0253e+04
中-交通大學 0
中-交通大學贊 5.4024e+03
中-交通大隊 5.4029e+03
中-交通大隊爛 7.9093e+03
中-信號處理 9.2503e+03
中-語音信號 9.4377e+03
中-語音處理 9.5596e+03

For each pace, the entry with the lowest cost is 交通大學, the correct match (shown in italics in the original table).
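
One way to read the table is as a nearest-template decision: for each pace, pick the test utterance with the lowest cost. The Python sketch below does this with the costs copied from the table above. The grouping by the pace prefix and the function name are assumptions made for illustration only.

```python
# Costs copied from the 3-step table (template: medium-pace 交通大學).
costs = {
    "快-交通大學": 3.5393e+03, "快-交通大隊": 5.3293e+03, "快-信號處理": 7.5717e+03,
    "快-語音信號": 1.0811e+04, "快-語音處理": 7.9386e+03,
    "慢-交通大學": 4.7898e+03, "慢-交通大隊": 8.2430e+03, "慢-信號處理": 9.5801e+03,
    "慢-語音信號": 8.8775e+03, "慢-語音處理": 1.0253e+04,
    "中-交通大學": 0.0, "中-交通大學贊": 5.4024e+03, "中-交通大隊": 5.4029e+03,
    "中-交通大隊爛": 7.9093e+03, "中-信號處理": 9.2503e+03,
    "中-語音信號": 9.4377e+03, "中-語音處理": 9.5596e+03,
}

def best_per_pace(costs):
    """Return {pace: (label, cost)} with the lowest-cost entry for each pace prefix."""
    best = {}
    for label, cost in costs.items():
        pace = label.split("-", 1)[0]  # 快 = fast, 慢 = slow, 中 = medium
        if pace not in best or cost < best[pace][1]:
            best[pace] = (label, cost)
    return best

for pace, (label, cost) in best_per_pace(costs).items():
    print(pace, label, cost)
```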

5-Steps Constraint

Next, we experiment with a looser constraint on the steps that can be taken from the current node: (0,+2), (0,+1), (+1,+1), (+1,0), and (+2,0). An illustration is shown below:


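To make the effect of the extra steps concrete, here is a hedged Python sketch in which the set of allowed steps is a parameter. It assumes 1-D frames, an absolute-difference local cost, and unweighted steps where a skipped frame contributes no cost. None of this is confirmed to match the code behind the results in this post.

```python
def dtw_cost(template, test, steps):
    """Accumulated DTW cost for 1-D sequences under a given set of allowed steps."""
    n, m = len(template), len(test)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = abs(template[i] - test[j])
            if i == 0 and j == 0:
                acc[i][j] = d
                continue
            # Cheapest predecessor reachable by one allowed step (di, dj).
            best = min(
                (acc[i - di][j - dj] for di, dj in steps if i >= di and j >= dj),
                default=inf,
            )
            acc[i][j] = best + d
    return acc[n - 1][m - 1]

THREE_STEPS = ((0, 1), (1, 0), (1, 1))
FIVE_STEPS = ((0, 2), (0, 1), (1, 1), (1, 0), (2, 0))

template = [0.0, 1.0, 2.0]
test = [0.0, 5.0, 1.0, 2.0]  # made-up data with one outlier frame inserted

print(dtw_cost(template, test, THREE_STEPS))  # 4.0
print(dtw_cost(template, test, FIVE_STEPS))   # 1.0
```

Because the five steps are a superset of the three steps, the 5-step cost can never be higher than the 3-step cost for the same pair. In this toy case the (0,+2) step lets the path jump over the outlier frame.
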
I’ll only show some of the optimal paths for this method, as they’re similar to the paths shown above.

Faster version of the template


Slower version of the template


Same pace 交通大學贊


The most obvious difference appears in the DTW path for 交通大學贊, at the “tail” of the graph.

Test Data Cost
快-交通大學 2.5807e+03
快-交通大隊 3.7465e+03
快-信號處理 5.6295e+03
快-語音信號 8.4524e+03
快-語音處理 7.3564e+03
慢-交通大學 3.4671e+03
慢-交通大隊 5.6360e+03
慢-信號處理 7.6668e+03
慢-語音信號 7.2069e+03
慢-語音處理 7.6250e+03
中-交通大學 0
中-交通大學贊 4.0314e+03
中-交通大隊 3.9625e+03
中-交通大隊爛 6.3642e+03
中-信號處理 5.4268e+03
中-語音信號 6.6767e+03
中-語音處理 8.9559e+03

As expected, the overall costs are lower than with the tighter 3-step constraint.

Speech Clipping

Now we would like to clip our template and compare it with the given test data. The templates to be clipped are “交通大學贊” and “交通大隊爛”; we expect to clip the last word out of each phrase, resulting in “交通大學” and “交通大隊” respectively. We set our clip at 0.7 of the template length.
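
The clipping step itself is simple. A minimal Python sketch is given below, assuming that “clip at 0.7” means keeping the first 70% of the template frames and discarding the rest. The function name and the toy frames are hypothetical.

```python
def clip_template(frames, ratio=0.7):
    """Keep the leading `ratio` fraction of the frames (assumed meaning of the clip)."""
    # The small epsilon guards against floating-point results just below an integer.
    keep = int(len(frames) * ratio + 1e-9)
    return frames[:keep]

# Toy template of 10 frames, each a made-up 2-feature vector.
template = [[float(i), 0.0] for i in range(10)]
clipped = clip_template(template)
print(len(template), len(clipped))  # 10 7
```

The clipped template is then matched against the test data with the same DTW procedure as before.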

First, we show the clipped and unclipped versions of “交通大學贊” matched with “交通大學”:





The error (cost) obtained when comparing the other test data against “交通大學贊” as the reference is shown in the table below:

Test Data Cost
中-交通大學 3.5683e+03
中-交通大學贊 0
中-交通大隊 5.5088e+03
中-交通大隊爛 9.6360e+03
中-信號處理 6.3100e+03
中-語音信號 8.4559e+03
中-語音處理 9.2887e+03

Next, we’ll show the clipped and unclipped versions of “交通大隊爛” matched with “交通大隊”:





The error (cost) obtained when comparing the other test data against “交通大隊爛” as the reference is shown in the table below:


Test Data Cost
中-交通大學 5.9552e+03
中-交通大學贊 9.8786e+03
中-交通大隊 4.3075e+03
中-交通大隊爛 0
中-信號處理 7.9714e+03
中-語音信號 8.7883e+03
中-語音處理 9.9951e+03


Source Code


