The Eesen framework drastically simplifies the existing pipeline to build state-of-the-art ASR systems. Acoustic models in Eesen are deep bidirectional RNNs trained with the CTC objective function.
Eesen provides four key components that enable end-to-end ASR:
- Acoustic Model — Bi-directional RNNs with LSTM units.
- Training — Connectionist temporal classification (CTC) as the training objective.
- WFST Decoding — A principled decoding approach based on Weighted Finite-State Transducers (WFSTs).
- RNN-LM Decoding — Decoding based on (character-level) RNN language models, available when using TensorFlow.
Acoustic modeling in Eesen involves learning a single recurrent neural network (RNN) predicting context-independent targets (phonemes or characters). To remove the need for pre-generated frame labels, the connectionist temporal classification (CTC) objective function is adopted to infer the alignments between speech and label sequences. A distinctive feature of Eesen is a generalized decoding approach based on weighted finite-state transducers (WFSTs), which enables the efficient incorporation of lexicons and language models into CTC decoding. Experiments show that compared with the standard hybrid DNN systems, Eesen achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
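The alignment idea behind CTC can be made concrete with its many-to-one mapping B, which collapses consecutive repeated symbols and then removes blanks; CTC's objective sums the probabilities of every frame-level path that collapses to the target label sequence. The sketch below is plain illustrative Python, not Eesen code:

```python
BLANK = "-"

def collapse(path):
    """CTC's many-to-one mapping B: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# Many frame-level paths collapse to the same label sequence "cat":
print(collapse("cc-aa-tt"))  # -> cat
print(collapse("-c-a-t-"))   # -> cat
# A blank is required between genuinely repeated labels:
print(collapse("ca-at"))     # -> caat
```

Because B absorbs the timing information, no pre-generated frame labels are needed: the network is free to place labels and blanks anywhere, as long as the collapsed result matches the transcript.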
Key Features
- The WFST-based decoding approach can incorporate lexicons and language models into CTC decoding in an effective and efficient way.
- The RNN-LM decoding approach does not require a fixed lexicon.
- GPU implementation of LSTM training and CTC learning, now also available via TensorFlow.
- Multiple utterances are processed in parallel for training speed-up.
- Fully-fledged example setups to demonstrate end-to-end system building, with both phonemes and characters as labels, following Kaldi recipes and conventions.
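Eesen's WFST decoder composes token, lexicon, and grammar transducers into a single search graph, which is beyond a short sketch. The baseline it improves on, however, is greedy best-path CTC decoding, shown below in a few lines of plain Python (illustrative only, not Eesen's implementation):

```python
def best_path_decode(posteriors, symbols, blank=0):
    """Greedy CTC decoding: take the argmax symbol per frame,
    merge consecutive repeats, and drop blanks."""
    prev, out = blank, []
    for frame in posteriors:
        idx = max(range(len(frame)), key=frame.__getitem__)
        if idx != prev and idx != blank:
            out.append(symbols[idx])
        prev = idx
    return "".join(out)

# Toy 3-frame posteriors over the alphabet {blank, 'a', 'b'}:
probs = [[0.1, 0.8, 0.1],   # argmax: 'a'
         [0.7, 0.2, 0.1],   # argmax: blank
         [0.1, 0.1, 0.8]]   # argmax: 'b'
print(best_path_decode(probs, ["-", "a", "b"]))  # -> ab
```

WFST decoding replaces this per-frame argmax with a search over the composed graph, which is how lexicon and language-model constraints enter the picture.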
Website: github.com/srvk/eesen
Support:
Developer: Yajie Miao and contributors
License: Apache License 2.0
Eesen is written in C++. Learn C++ with our recommended free books and free tutorials.
Related Software
| Speech Recognition Tools | Description |
|---|---|
| Whisper | Automatic speech recognition system trained on 680,000 hours of data |
| Flashlight | Fast, flexible machine learning library written entirely in C++. |
| Coqui STT | Deep-learning toolkit for training and deploying speech-to-text models |
| Kaldi | C++ toolkit designed for speech recognition researchers. |
| SpeechBrain | All-in-one conversational AI toolkit based on PyTorch |
| Handy | Offline speech-to-text application |
| ESPnet | End-to-End speech processing toolkit |
| deepspeech.pytorch | Implementation of DeepSpeech2 using Baidu Warp-CTC. |
| Whispering | Transcription application with global speech-to-text functionality |
| Julius | Two-pass large vocabulary continuous speech recognition engine |
| CMUSphinx | Speech recognition system for mobile and server applications |
| Simon | Flexible speech recognition software |
| hyprwhspr | Native speech-to-text designed for Arch / Omarchy |
| ostt | Open Speech-to-Text |
| DeepSpeech | TensorFlow implementation of Baidu's DeepSpeech architecture. |
| OpenSeq2Seq | TensorFlow-based toolkit for sequence-to-sequence models |
| Eesen | End-to-End Speech Recognition |
Read our verdict in the software roundup.

