Force Alignment using HMM 2

This post continues the previous one. It covers:

  1. How accuracy and correctness vary with the number of observation mixtures.
  2. How adding noise affects our recognition accuracy (the experiment in our last post used clean test input).
  3. The confusion matrix of our test results.



Correctness and Accuracy vs Insertion Penalty



Subway Noise (SNR=10dB)


White Noise (SNR=10dB)


White Noise (SNR=20dB)


It can be observed that the test signal with additive subway noise performs the worst here, probably because subway noise is far less uniform and predictable than additive white noise.
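To make the SNR figures above concrete, here is a minimal sketch of mixing a noise recording into a clean signal at a target SNR. It is not the procedure used for these experiments; the function names and the synthetic signals are assumptions for illustration only.

```python
import math
import random

def mean_power(samples):
    """Average power (mean of squared samples)."""
    return sum(v * v for v in samples) / len(samples)

def add_noise_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so the mixture has the requested SNR in dB."""
    # Repeat the noise if it is shorter than the clean signal, then trim it.
    reps = -(-len(clean) // len(noise))  # ceiling division
    noise = (list(noise) * reps)[:len(clean)]
    # SNR_dB = 10*log10(P_clean / P_noise), so solve for the noise gain.
    target_noise_power = mean_power(clean) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [c + scale * n for c, n in zip(clean, noise)]

def measured_snr_db(clean, noisy):
    """SNR of a mixture, given the clean signal it was built from."""
    residual = [y - c for c, y in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(residual))

# Synthetic stand-ins: a 440 Hz tone as "speech" and Gaussian white noise.
random.seed(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
white = [random.gauss(0.0, 1.0) for _ in range(4000)]

noisy_10db = add_noise_at_snr(clean, white, 10.0)
noisy_20db = add_noise_at_snr(clean, white, 20.0)
```

The same function would work with a recorded subway-noise clip in place of `white`; only the gain is computed from the signal powers, so the character of the noise is left unchanged.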

Correctness and Accuracy vs Number of Mixtures

This experiment uses an insertion penalty of -60 (s=-60 in HVite).
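For readers unfamiliar with the two metrics, the sketch below follows the definitions used by HTK's HResults tool: correctness ignores insertions, while accuracy subtracts them. That is why the insertion penalty matters for accuracy in particular. The counts are made up for illustration and are not results from this experiment.

```python
def correctness_and_accuracy(n_ref, deletions, substitutions, insertions):
    """Return (%Correct, %Accuracy) from alignment error counts.

    n_ref is the number of labels in the reference transcription.
    """
    hits = n_ref - deletions - substitutions
    percent_correct = 100.0 * hits / n_ref
    # Accuracy additionally penalises inserted labels.
    percent_accuracy = 100.0 * (hits - insertions) / n_ref
    return percent_correct, percent_accuracy

# Hypothetical counts, for illustration only.
corr, acc = correctness_and_accuracy(
    n_ref=200, deletions=4, substitutions=6, insertions=10
)
print(f"Corr={corr:.2f}%  Acc={acc:.2f}%")
```

With many insertions, accuracy can fall well below correctness, so the two curves are worth reading together.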



Subway Noise (SNR=10dB)


White Noise (SNR=10dB)


White Noise (SNR=20dB)


As expected, the test signal with additive subway noise again performs the worst.

Confusion Matrix


Reading across a row, %c is the number of correct instances divided by the total number of instances in that row. %e is the number of incorrect instances in that row divided by the total number of instances overall (N).
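The two per-row figures can be computed as in the sketch below. The matrix holds invented counts for four of the digit words, not the results from this experiment, and it leaves out the deletion and insertion entries that a full confusion matrix would also carry.

```python
def row_stats(labels, matrix):
    """Return {label: (%c, %e)} for a square confusion matrix.

    matrix[i][j] counts reference word i recognised as word j.
    """
    total = sum(sum(row) for row in matrix)  # N: all instances
    stats = {}
    for i, (label, row) in enumerate(zip(labels, matrix)):
        row_total = sum(row)
        correct = row[i]
        # %c: correct instances relative to this row only.
        pct_c = 100.0 * correct / row_total if row_total else 0.0
        # %e: this row's errors relative to all instances.
        pct_e = 100.0 * (row_total - correct) / total if total else 0.0
        stats[label] = (pct_c, pct_e)
    return stats

# Hypothetical counts, for illustration only.
labels = ["yi", "ling", "liu", "jiu"]
matrix = [
    [16, 4, 0, 0],
    [1, 19, 0, 0],
    [0, 0, 17, 3],
    [0, 0, 1, 19],
]
stats = row_stats(labels, matrix)
for label, (pct_c, pct_e) in stats.items():
    print(f"{label}: %c={pct_c:.1f} %e={pct_e:.2f}")
```

Because %e is normalised by N rather than by the row total, the %e values of all rows add up to the overall error rate.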

It can be observed that the word misclassified most often is “yi”, which is frequently recognized as “ling”. Physically, the “e” sound in both words is very similar, which explains the high misclassification rate.

The second-worst word is “liu”, which is misclassified as “jiu”. This is because both words end with the same phone, “iu”.


