Moshi is a speech-text foundation model and full-duplex spoken dialogue framework.
It’s designed for real-time spoken interaction and uses Mimi, a streaming neural audio codec, to handle low-latency audio processing. The project includes multiple inference stacks for different use cases, covering PyTorch for research, MLX for on-device inference on Apple hardware, Rust for production deployments, and a web client used for the live demo.
Mimi is a neural audio codec that compresses 24 kHz audio down to a 12.5 Hz representation at a bandwidth of 1.1 kbps in a fully streaming manner (latency of 80 ms, the frame size), yet performs better than existing non-streaming codecs such as SpeechTokenizer (50 Hz, 4 kbps) or SemantiCodec (50 Hz, 1.3 kbps).
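The figures quoted for Mimi are linked by simple arithmetic, which the following Python sketch works through. It is only an illustration built from the numbers in the paragraph above; the helper name `codec_frame_stats` is hypothetical and is not part of the Moshi or Mimi code.

```python
# Illustrative arithmetic only: relates the codec figures quoted in the text.
# `codec_frame_stats` is a hypothetical helper, not part of Moshi or Mimi.

def codec_frame_stats(frame_rate_hz, bitrate_bps):
    """Return (frame duration in ms, bits available per frame)."""
    frame_ms = 1000.0 / frame_rate_hz
    bits_per_frame = bitrate_bps / frame_rate_hz
    return frame_ms, bits_per_frame

# (frame rate in Hz, bitrate in bits per second), as quoted in the text
codecs = {
    "Mimi": (12.5, 1_100),
    "SpeechTokenizer": (50, 4_000),
    "SemantiCodec": (50, 1_300),
}

for name, (frame_rate_hz, bitrate_bps) in codecs.items():
    frame_ms, bits = codec_frame_stats(frame_rate_hz, bitrate_bps)
    print(f"{name}: {frame_ms:g} ms per frame, {bits:g} bits per frame")

# Mimi takes 24 kHz input, so each 12.5 Hz frame spans this many samples.
mimi_samples_per_frame = 24_000 / 12.5
print(f"Mimi: {mimi_samples_per_frame:g} input samples per frame")
```

For Mimi this gives an 80 ms frame, matching the stated latency, with 1,920 input samples and 88 bits per frame.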
This is free and open source software.
Key Features
- Provides a speech-text foundation model for real-time dialogue.
- Supports full-duplex spoken interaction.
- Uses the Mimi streaming neural audio codec for low-latency audio processing.
- Includes PyTorch, MLX, and Rust inference stacks for research, on-device use, and production deployment.
- Offers on-device inference support for iPhone and Mac through MLX.
- Includes a web UI client for interactive browser-based use.
- Can run related Kyutai multi-stream models including simultaneous speech translation and speech generation systems.
Website: moshi.chat
Support:
Developer: Kyutai Labs
License: Apache License 2.0 / MIT License
Moshi is written in Python and Rust.
Related Software
| Speech Tools | |
|---|---|
| Piper | Fast, local neural text to speech system |
| Tortoise | Multi-voice text-to-speech system trained with an emphasis on quality |
| Coqui TTS | Offers pretrained models in more than 1,100 different languages |
| Bark | Transformer-based text-to-audio model |
| Festival | General multi-lingual speech synthesis system |
| Praat | Software for speech analysis and synthesis |
| Speech Note | Speech to Text, Text to Speech and Machine Translation |
| Mimic 3 | Lightweight Text to Speech engine |
| Orca | Scriptable screen reader |
| Flite | Small, fast run time text to speech synthesis engine |
| RHVoice | Gives the visually impaired a synthesis voice with their screen reader |
| eSpeak NG | Continuation of the eSpeak project |
| eSpeak | Speech synthesizer using a formant synthesis method |
| Gespeaker | GTK-based frontend for eSpeak |

