Speech Enhancement Based on Combined DNN Structure
In this paper, we propose a speech enhancement method that stacks a long short-term memory (LSTM) network and a deep neural network (DNN). Previous studies have shown that a DNN is good at modeling the complex relation between input feature vectors and their desired target vectors through its multiple nonlinear hidden layers. Accordingly, DNN-based speech enhancement methods such as de-noising and de-reverberation have been proposed that learn the mapping between noisy input feature vectors and clean target feature vectors. However, although adjacent frames are correlated, a DNN cannot model the temporal context of sequential data such as speech or video, because the mapping between the input feature vectors and the target feature vectors is performed independently for each frame. Recently, thanks to the recurrent connections in each hidden layer, the LSTM has been successfully applied to various sequence prediction and sequence labeling tasks such as speech recognition and speech enhancement. We therefore stack an LSTM and a DNN to exploit their complementary strengths. The LSTM models the temporal properties of speech using long-range history. The DNN is then trained on the LSTM-enhanced speech signals together with the noisy speech signals, i.e., the LSTM output and input. The proposed method is evaluated in terms of objective measures and shows a significant improvement over conventional single-DNN and stacked-DNN-based speech enhancement methods.
Keywords- Speech Enhancement, De-noising, Long Short-Term Memory, Deep Neural Network.
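The stacked architecture described above can be sketched as a two-stage forward pass: an LSTM first enhances each frame using temporal context, and a feed-forward DNN then refines the result from the LSTM output concatenated with the original noisy frame. The sketch below is a minimal, illustrative NumPy implementation with assumed layer sizes and random (untrained) weights; the paper does not specify feature dimensions, layer counts, or activations, so all of those are hypothetical choices here.

```python
import numpy as np

# Hypothetical sketch of the stacked LSTM -> DNN enhancement pipeline.
# Layer sizes, activations, and weights are illustrative assumptions,
# not the configuration used in the paper.

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Single LSTM cell (forward pass only)."""
    def __init__(self, n_in, n_hid):
        s = 1.0 / np.sqrt(n_in + n_hid)
        # Stacked gate weights: input, forget, cell-candidate, output gates.
        self.W = rng.uniform(-s, s, (4 * n_hid, n_in + n_hid))
        self.b = np.zeros(4 * n_hid)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def dnn_forward(x, layers):
    """Feed-forward DNN: ReLU hidden layers, linear output layer."""
    for W, b in layers[:-1]:
        x = np.maximum(0.0, W @ x + b)
    W, b = layers[-1]
    return W @ x + b

def make_dnn(n_in, n_hidden, n_out, depth=2):
    sizes = [n_in] + [n_hidden] * depth + [n_out]
    return [(rng.standard_normal((b, a)) * 0.1, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

# Assumed dimensions: 40-dim spectral features per frame.
n_feat, n_hid = 40, 64
lstm = LSTMCell(n_feat, n_hid)
proj = rng.standard_normal((n_feat, n_hid)) * 0.1   # LSTM output projection
# Stage-2 DNN input = LSTM-enhanced frame concatenated with the noisy frame.
dnn = make_dnn(2 * n_feat, 128, n_feat)

def enhance(noisy_frames):
    """noisy_frames: (T, n_feat) sequence of noisy feature vectors."""
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    out = []
    for x in noisy_frames:
        h, c = lstm.step(x, h, c)   # stage 1: LSTM models temporal context
        lstm_enh = proj @ h         # LSTM-enhanced frame estimate
        # stage 2: DNN refines using both the LSTM output and the noisy input
        out.append(dnn_forward(np.concatenate([lstm_enh, x]), dnn))
    return np.stack(out)

clean_est = enhance(rng.standard_normal((5, n_feat)))
print(clean_est.shape)  # one enhanced feature vector per input frame
```

In a real system both stages would be trained (e.g., the LSTM on noisy-to-clean targets, then the DNN on the LSTM outputs paired with the noisy inputs, as the abstract describes); this forward-pass sketch only shows how the two networks are wired together.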