How to use TalTechNLP/whisper-medium-et with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="TalTechNLP/whisper-medium-et")

# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("TalTechNLP/whisper-medium-et")
model = AutoModelForSpeechSeq2Seq.from_pretrained("TalTechNLP/whisper-medium-et")
```
---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: TalTechNLP/whisper-medium-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 14.66
    - name: Test CER
      type: cer
      value: 3.76
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 13.793
    - name: Test CER
      type: cer
      value: 3.194
---
# Whisper-medium-et

This is the [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) model fine-tuned on around 800 hours of diverse Estonian data.

## Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.

## Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.

## How to use

Use it like any other Whisper model via Hugging Face Transformers, or with a faster decoder such as [faster-whisper](https://github.com/guillaumekln/faster-whisper).
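
As a minimal sketch of the Transformers route, the snippet below transcribes a single audio file with the standard `automatic-speech-recognition` pipeline. The helper names (`build_generate_kwargs`, `transcribe`), the 30-second chunk length, and the file path in the usage note are illustrative assumptions, not part of the model release. Passing a file path to the pipeline also assumes that `ffmpeg` is available for audio decoding.

```python
from typing import Any, Dict

MODEL_ID = "TalTechNLP/whisper-medium-et"


def build_generate_kwargs(num_beams: int = 1) -> Dict[str, Any]:
    """Return generation options; num_beams=1 corresponds to greedy decoding."""
    return {"num_beams": num_beams}


def transcribe(audio_path: str, num_beams: int = 1) -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=30,  # assumed window size for recordings longer than 30 s
    )
    result = asr(audio_path, generate_kwargs=build_generate_kwargs(num_beams))
    return result["text"]
```

Calling `transcribe("sample.wav")`, where `sample.wav` is a placeholder for your own recording, downloads the model on first use and returns the transcript as a string. A larger `num_beams` trades speed for a wider search.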

#### Limitations and bias

Since this model was trained mostly on broadcast speech and texts from the web, it may have problems correctly decoding the following:

* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech

## Training data

Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|:----------:|
| Broadcast speech      | 591        |
| Spontaneous speech    | 53         |
| Elderly speech corpus | 53         |
| Talks, lectures       | 49         |
| Parliament speeches   | 31         |
| *Total*               | *761*      |

## Training procedure

The model was fine-tuned using ESPnet and then converted to the Transformers format using [this](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672) script.

The fine-tuning procedure is similar to that of [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model.

## Evaluation results

### WER

WER results below are obtained using greedy decoding (i.e., beam size 1).

| Dataset           | WER  |
|-------------------|------|
| Common Voice 8.0  | 13.8 |
| Common Voice 11.0 | 14.7 |
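
For readers unfamiliar with the metric, the sketch below shows one common way to compute WER: the word-level edit distance between reference and hypothesis, divided by the number of reference words, expressed as a percentage. The text normalization used for the figures above is not stated in this card, so the plain whitespace tokenization here is an assumption and the function is not guaranteed to reproduce the reported numbers.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between token lists."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference tokens matches an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, using whitespace tokenization (an assumption)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)
```

For a corpus-level score, the usual convention is to sum the edit distances over all utterances and divide by the total number of reference words, rather than averaging per-utterance rates.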