VoxCPM2 is a free, open-source text-to-speech model from OpenBMB. It's a 2B-parameter model trained on 2 million+ hours of multilingual audio, and it can clone voices, design voices from text descriptions, and generate 48kHz studio-quality audio. No API key. No monthly fee. Runs on your own machine.
What Makes VoxCPM2 Different
Most TTS systems work by breaking speech into tokens (discrete chunks), which can limit naturalness because detail is lost when audio is quantized. VoxCPM2 skips that step entirely. It uses a tokenizer-free diffusion architecture that generates speech directly in a continuous latent space.
- 2B parameters: bigger than most open-source TTS models
- 30 languages: no language tag needed, it auto-detects
- 48kHz output: studio quality, no external upsampler
- ~8GB VRAM
- Apache 2.0 license: fully open, commercial use allowed
Requirements
- Python 3.10 to 3.12 (not 3.13, it's not supported yet)
- PyTorch 2.5.0+
- CUDA 12.0+ (you need an NVIDIA GPU; CPU inference is very slow)
- ~8GB VRAM
- ~10GB disk space for the model weights
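Before installing, it can help to check your machine against the list above. The script below is a minimal sketch, not an official VoxCPM2 tool: the helper names (`python_version_supported`, `enough_disk_space`, `describe_gpu`) are made up for this example, and the thresholds are simply copied from the requirements list. The GPU check only runs if PyTorch is already installed.

```python
import shutil
import sys

# Thresholds taken from the requirements list above.
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 12)
MIN_FREE_GB = 10


def python_version_supported(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.10, 3.11, or 3.12."""
    major_minor = tuple(version_info[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def enough_disk_space(path=".", required_gb=MIN_FREE_GB):
    """Return True if the drive holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


def describe_gpu():
    """Summarize the PyTorch/CUDA setup, or say what is missing."""
    try:
        import torch  # optional here: may not be installed yet
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__}: no CUDA device visible"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1e9
    return (
        f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}, "
        f"{props.name} with {vram_gb:.1f} GB VRAM"
    )


if __name__ == "__main__":
    print("Python version OK:", python_version_supported())
    print("Disk space OK:", enough_disk_space())
    print("GPU:", describe_gpu())
```

If the Python check fails on 3.13, the simplest fix is usually a fresh virtual environment built on 3.10 to 3.12.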
Installation
Install the package with pip:
pip install voxcpm

That's it. The model weights download automatically from HuggingFace the first time you run it (~8GB, so give it a few minutes on first load).
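To confirm that pip actually installed the package into the environment you are using, you can ask Python for its version. This is a small sketch that relies only on the standard library; `installed_version` is a helper name invented for this example, and it does not load the model or trigger the weight download.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("voxcpm")
if version is None:
    print("voxcpm is not installed in this environment")
else:
    print("voxcpm version:", version)
```

If this reports that the package is missing right after a successful install, pip most likely installed it into a different Python environment than the one running the script.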