OpenAI recently released Whisper, a 1.6-billion-parameter AI model capable of transcribing and translating speech audio in 97 different languages, with robust performance across a wide range of automatic speech recognition (ASR) tasks. The model, trained on 680,000 hours of audio data collected from the web, was soon published as open source on GitHub.
The Whisper neural network
Whisper uses a transformer encoder-decoder architecture: the input audio is split into 30-second chunks, converted to a log-Mel spectrogram, and then passed through an encoder. Unlike most state-of-the-art ASR models, it has not been fine-tuned on a specific dataset; instead, it has been trained with weak supervision on a large-scale, noisy dataset collected from the Internet. Although it does not beat models specialized for the LibriSpeech benchmark, in zero-shot evaluations on diverse datasets Whisper proved to be more robust, making 50% fewer errors than those models.
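Each step of this pipeline is exposed in the openai/whisper repository. The following minimal sketch follows the usage example from the project's README; the model size ("base") and the audio file name are illustrative placeholders.

    import whisper

    model = whisper.load_model("base")

    # load the audio and pad/trim it to fit a 30-second window
    audio = whisper.load_audio("audio.mp3")
    audio = whisper.pad_or_trim(audio)

    # compute the log-Mel spectrogram and move it to the model's device
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # detect the spoken language
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

    # decode the audio with the transformer decoder
    options = whisper.DecodingOptions()
    result = whisper.decode(model, mel, options)
    print(result.text)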
According to OpenAI:
“The primary intended users of Whisper models are AI researchers studying the robustness, generalizability, capabilities, biases and constraints of the current model. However, Whisper is also potentially very useful as an automatic speech recognition solution for developers, especially for English speech recognition.”
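For that developer use case, the high-level transcribe helper in the open-source package is the simplest entry point. A minimal sketch, where the English-only "base.en" checkpoint and the file name are illustrative choices:

    import whisper

    # English-only checkpoint, well suited to the English ASR use case above
    model = whisper.load_model("base.en")

    # transcribe a local recording (placeholder file name)
    result = model.transcribe("meeting.wav")
    print(result["text"])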
A model trained in part on noisy transcripts
Training a deep learning speech recognition model with purely supervised learning requires a very large labeled dataset, so researchers typically turn to transfer learning.
The researchers instead chose to train Whisper on a very large audio dataset retrieved from the Internet: 680,000 hours including “a large amount of poor transcripts”, 117,000 hours of which cover languages other than English, with the model tasked with either transcribing in the original language or translating into English.
Although this approach favors quantity over quality, the model performs well on a wide range of tasks, including transcription in multiple languages, translation into English, and language identification.
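These tasks map onto options of the same transcribe call in the open-source package. A minimal sketch, assuming a French-language recording; the file name and model size are placeholders:

    import whisper

    model = whisper.load_model("medium")  # multilingual checkpoint

    # transcribe in the original language (French in this example)
    transcription = model.transcribe("interview_fr.mp3", language="fr")

    # translate the same audio into English instead
    translation = model.transcribe("interview_fr.mp3", task="translate")

    print(transcription["text"])
    print(translation["text"])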
However, OpenAI recognizes that Whisper has its limitations, especially in the area of text prediction. Since it has been trained on a large amount of noisy data, it could include words in its transcripts that were not actually spoken. Furthermore, Whisper is not equally accurate across languages, with a higher error rate when dealing with speakers of languages that are poorly represented in the training data.
OpenAI states on GitHub:
“While Whisper models cannot be used for real-time transcription out of the box, their speed and size suggest that others may be able to build applications on top of them that enable near real-time speech recognition and translation. The real value of beneficial applications built on Whisper models suggests that the disparate performance of these models may have real economic implications… While we hope that the technology will be used primarily for beneficial purposes, making automatic speech recognition technology more accessible could allow more actors to build capable surveillance technologies or scale up existing surveillance efforts, as the speed and accuracy allow for affordable automatic transcription and translation of large volumes of audio communication.”
Translated from Focus sur Whisper, le système de reconnaissance vocale automatique d’OpenAI