Researchers’ data set pinpoints challenges adapting speech recognition models to new hardware

A new study from researchers affiliated with University College London, Nokia Bell Labs Cambridge, and the University of Oxford shows how differences in microphone quality can impact speech recognition accuracy. The coauthors use a custom data set, Libri-Adapt, containing 7,200 hours of English speech to test whether Mozilla’s DeepSpeech model handles unique environments and microphones well. The findings suggest there is a noticeable degradation in accuracy during certain “domain shifts,” with word error rate increasing to as high as 28% after switching microphones.

Automatic speech recognition models must perform well across hardware if they’re to be reliable. For instance, customers expect the models powering Alexa to work similarly on different smart speakers, smart displays, and smart devices. But not all models achieve this ideal because they’re not consistently trained with diverse corpora. That is to say, some corpora don’t contain speech recorded with microphones of varying quality and in novel settings.

Libri-Adapt is designed to expose these flaws with speech recorded using the microphones in six different products: a PlayStation Eye camera, a generic USB mic, a Google Nexus 6 smartphone, the Shure MV5, a Raspberry Pi accessory called ReSpeaker, and the Matrix Voice developer kit. The corpus has speech data in three English accents, namely U.S. English, British English, and Indian English, which came from 251 U.S. speakers and synthetic voices generated by Google Cloud Platform’s text-to-speech API. Beyond this, Libri-Adapt contains wind, rain, and laughter background noises intended to serve as added confounders.

Figure: Libri-Adapt word error rate. Word error rate of a fine-tuned DeepSpeech model trained and tested on various microphone pairs for U.S. English speech. The columns correspond to the training microphone domain and rows correspond to the test microphone domain.

During experiments, the researchers compared the speech recognition performance of a pre-trained DeepSpeech model (version 0.5.0) across the aforementioned six devices. They found that when data from the same microphone was used for training and testing the model, DeepSpeech unsurprisingly achieved the smallest error rate (e.g., 11.39% in the case of PlayStation Eye). Conversely, when there was a mismatch between the training and testing sets, the word error rate jumped substantially (e.g., 24.18% when a model trained on PlayStation Eye-recorded speech was tested on Matrix Voice speech).
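For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of words in the reference. The Python sketch below illustrates that standard definition. It is not the researchers' evaluation code, and the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name), not taken from the study.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up transcripts: one substitution and one deletion.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

In the sample call, the hypothesis has one substituted word and one dropped word against a six-word reference, so the rate works out to 2/6. Because insertions count as errors, the metric can exceed 100% when the output is much longer than the reference.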

The researchers say that Libri-Adapt, which is available in open source, can be used to create scenarios that test the generalizability of speech recognition algorithms. As an example, they tested a DeepSpeech model trained on U.S.-accented speech collected by a ReSpeaker microphone against Indian-accented speech with rain background noise recorded by a PlayStation Eye. The results show the model suffered an error rate uptick of nearly 29.8%, pointing to poor robustness on the model’s part.

Although the coauthors claim to have manually verified hundreds of Libri-Adapt’s recordings, they caution that some might be incomplete or noisy. That’s why they plan to develop unsupervised domain adaptation algorithms in future work to tackle domain shifts in the data set.