Leveraging Jointly Spatial, Temporal And Modulation Enhancement In Creating Noise-Robust Features For Speech Recognition
This paper proposes fusing speech feature enhancement techniques from the spatial, temporal and modulation domains in order to achieve superior speech recognition performance in noise-corrupted environments. With mel-frequency cepstral coefficients (MFCC) as the standard speech feature representation, the spatial-domain techniques perform short-time intra-frame feature enhancement, while the temporal-domain techniques compensate for the noise distortion present in the long-term inter-frame MFCC time stream. The modulation-domain techniques, in turn, operate on the Fourier transform of an MFCC time stream. Evaluation experiments conducted on the connected-digit Aurora-2 database reveal that each of the spatial/temporal enhancement techniques adopted here performs better than the unprocessed MFCC baseline, and that integrating methods from the spatial, temporal and modulation domains can yield even better recognition accuracy than any individual component method under a wide range of noise-corrupted environments. These results demonstrate that the methods in the three domains address different aspects of the noise and are therefore complementary to one another.
Keywords- Noise Robustness, Speech Recognition, Spatial Processing, Temporal Processing, Modulation Domain
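As a rough illustration of the three domains named above, the following minimal sketch shows how one MFCC sequence can be viewed frame by frame (spatial), as a per-coefficient time stream (temporal), and as the Fourier transform of that time stream (modulation). The function names, the toy data, and the use of mean and variance normalization as the temporal-domain step are assumptions made for illustration only; they are not necessarily the techniques evaluated in this paper.

```python
import cmath
import math


def time_stream(features, k):
    """Temporal view: trajectory of coefficient k across all frames."""
    return [frame[k] for frame in features]


def mean_variance_normalize(stream):
    """Illustrative temporal-domain step (an assumption, not the paper's
    stated method): scale one time stream to zero mean, unit variance."""
    n = len(stream)
    mean = sum(stream) / n
    var = sum((x - mean) ** 2 for x in stream) / n
    std = math.sqrt(var) if var > 0 else 1.0  # guard against constant streams
    return [(x - mean) / std for x in stream]


def modulation_spectrum(stream):
    """Modulation view: magnitude of the DFT of one time stream."""
    n = len(stream)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * m * t / n)
                for t, x in enumerate(stream)))
        for m in range(n)
    ]


# Hypothetical toy data: 4 frames, 2 cepstral coefficients per frame.
features = [[1.0, 0.5], [3.0, 0.5], [1.0, 0.5], [3.0, 0.5]]

frame0 = features[0]                      # spatial (intra-frame) view
stream0 = time_stream(features, 0)        # temporal (inter-frame) view
norm0 = mean_variance_normalize(stream0)  # temporal-domain processing
spec0 = modulation_spectrum(norm0)        # modulation-domain view
```

In this toy case the first coefficient alternates between two values, so after normalization its modulation spectrum concentrates in the highest modulation-frequency bin. A spatial-domain method would instead modify each frame vector such as `frame0` on its own, without reference to neighboring frames.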