I am doing supervised learning on speech audio files using neural networks. To do this, I have to extract features from each audio file. Because audio is a time-varying signal, it is usually divided into frames, and features such as MFCCs are extracted from each frame. How should I encode the feature vector for each training example (audio file), given that files of different durations are divided into different numbers of frames?
Asked By : ksb
Answered By : Nikolay Shmyrev
A simple neural network has no built-in invariance to deformations of the time scale, which is why it is impractical to apply it directly to time series recognition. Time series are usually recognized with a generic communication model, the hidden Markov model (HMM). A neural network can be used together with an HMM to classify individual frames of speech. In such an HMM-ANN configuration, the audio is split into frames, slices of frames are passed to the ANN to compute phoneme probabilities, and the whole probability sequence is then searched for the best match using dynamic search with the HMM.
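As a rough illustration of the dynamic search step, here is a minimal Viterbi sketch in plain Python. It assumes the network has already produced one row of state probabilities per frame. The two-state left-to-right model, the transition matrix, and the numbers are made up for the example, and the sketch uses the network outputs directly as frame scores (a real hybrid system would typically rescale them, which is omitted here).

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(frame_log_probs, log_trans, log_init):
    """Return the best state sequence for per-frame state scores.

    frame_log_probs: T rows, each holding S log-scores (e.g. ANN outputs)
    log_trans[i][j]: log-probability of moving from state i to state j
    log_init[s]:     log-probability of starting in state s
    """
    n_states = len(log_init)
    # best[s] = log-score of the best path that ends in state s at this frame
    best = [log_init[s] + frame_log_probs[0][s] for s in range(n_states)]
    backptrs = []
    for scores in frame_log_probs[1:]:
        new_best, ptr = [], []
        for j in range(n_states):
            # Pick the predecessor that gives the best score for state j.
            i_best = max(range(n_states), key=lambda i: best[i] + log_trans[i][j])
            new_best.append(best[i_best] + log_trans[i_best][j] + scores[j])
            ptr.append(i_best)
        best = new_best
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical ANN outputs for 4 frames and 2 states (illustrative numbers).
posteriors = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
trans = [[0.7, 0.3], [0.0, 1.0]]   # left-to-right: state 1 cannot go back
init = [1.0, 0.0]                  # always start in state 0

path = viterbi(
    [[safe_log(p) for p in row] for row in posteriors],
    [[safe_log(p) for p in row] for row in trans],
    [safe_log(p) for p in init],
)
print(path)  # [0, 0, 1, 1]
```

The number of frames can differ from file to file; the search simply runs over however many rows of probabilities the network produced.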
To train an HMM-ANN system you need a segmentation of the speech into states. An HMM-ANN system usually requires initialization from a more robust HMM-GMM system, so there are no standalone HMM-ANN implementations; they are usually part of a complete speech recognition toolkit. Among the popular toolkits, Kaldi has an implementation of HMM-ANN and even of HMM-DNN (deep neural networks).
There are also several more complex types of neural networks that are designed to model sequence data. They belong to the class of recurrent neural networks (RNNs), in which the connections form cycles. A recurrent neural network can process feature sequences of arbitrary length. RNN systems are state of the art these days and you can train a very accurate recognition system with them; however, training is not simple. You can check RNN toolkits such as CURRENNT.
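To make the "arbitrary length" point concrete, here is a minimal sketch of a single recurrent layer in plain Python. The weights are random and untrained, and the names `make_rnn` and `rnn_encode` as well as the sizes (13 coefficients per frame, 8 hidden units) are assumptions made for illustration only. The point is that the same weights are applied frame by frame, so files with different numbers of frames all end up as a hidden state of the same size.

```python
import math
import random

def make_rnn(n_in, n_hidden, seed=0):
    """Create random (untrained) weights for one simple recurrent layer."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_hidden)]
    bias = [0.0] * n_hidden
    return w_in, w_rec, bias

def rnn_encode(frames, params):
    """Run the layer over a sequence of frames and return the final hidden state."""
    w_in, w_rec, bias = params
    h = [0.0] * len(bias)
    for x in frames:
        # New state depends on the current frame and on the previous state.
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[k], x))
                + sum(w * hj for w, hj in zip(w_rec[k], h))
                + bias[k]
            )
            for k in range(len(bias))
        ]
    return h

# Two hypothetical utterances: 5 frames and 12 frames of 13 coefficients each.
rng = random.Random(1)
short_utt = [[rng.gauss(0, 1) for _ in range(13)] for _ in range(5)]
long_utt = [[rng.gauss(0, 1) for _ in range(13)] for _ in range(12)]

params = make_rnn(n_in=13, n_hidden=8)
enc_short = rnn_encode(short_utt, params)
enc_long = rnn_encode(long_utt, params)
print(len(enc_short), len(enc_long))  # 8 8
```

In a trained system the final state (or the per-frame states) would be fed to an output layer; toolkits handle that and the training, which this sketch leaves out.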
A special case of recurrent neural networks is the long short-term memory (LSTM) network. LSTMs are the state of the art for speech recognition these days. You can learn more about them from the following publication:
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/41976