VoxCPM2 is a free, open-source text-to-speech model from OpenBMB. It's a 2B-parameter model trained on more than 2 million hours of multilingual audio, and it can clone voices, design voices from text descriptions, and generate 48kHz studio-quality audio. No API key. No monthly fee. Runs on your own machine.
What makes VoxCPM2 different
Most TTS systems first convert speech into discrete tokens, and that quantization step can limit naturalness. VoxCPM2 skips it entirely: it uses a tokenizer-free diffusion architecture that generates speech directly in a continuous latent space.
- 2B parameters: bigger than most open-source TTS models
- 30 languages: no language tag needed, it auto-detects
- 48kHz output: studio quality, no external upsampler
- ~8GB VRAM
- Apache 2.0 license: fully open, commercial use allowed
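To make the point about discrete tokens a bit more concrete, here is a toy sketch of what quantization does to a smoothly varying signal. It is not VoxCPM2's code or architecture; the `latents` values, the four-entry `codebook`, and the `quantize` helper are all made up for illustration.

```python
# Toy illustration only: this is NOT VoxCPM2's architecture or code.
# It shows why snapping continuous values to a small codebook loses detail.

def quantize(value, codebook):
    """Return the codebook entry closest to value."""
    return min(codebook, key=lambda entry: abs(entry - value))

# A hypothetical 1-D "latent" trajectory with smoothly varying values.
latents = [0.03, 0.12, 0.27, 0.41, 0.58, 0.66, 0.79, 0.93]

# A hypothetical codebook with only four discrete levels.
codebook = [0.0, 0.33, 0.66, 1.0]

# Tokenized view: every value is replaced by its nearest codebook entry.
tokens = [quantize(v, codebook) for v in latents]

# The difference between each original value and its token is lost detail.
errors = [abs(v - t) for v, t in zip(latents, tokens)]

print("continuous:", latents)
print("discrete:  ", tokens)
print("largest rounding error:", round(max(errors), 3))
```

Several distinct input values collapse onto the same codebook entry, which is the kind of detail a continuous representation does not have to throw away.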
Requirements
- Python 3.10 to 3.12 (not 3.13, it's not supported yet)
- PyTorch 2.5.0+
- CUDA 12.0+ (you need an NVIDIA GPU; CPU inference is very slow)
- ~8GB VRAM
- ~10GB disk space for the model weights
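If you want to check these requirements before installing, a small script along these lines can help. It is a rough sketch: `python_supported`, `vram_sufficient`, and `check_environment` are helper names invented for this example, and the 8GB threshold is taken from the approximate figure above, so treat the result as a hint, not a guarantee.

```python
import sys

def python_supported(version=None):
    """True if the interpreter is Python 3.10, 3.11, or 3.12."""
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and 10 <= minor <= 12

def vram_sufficient(total_bytes, required_gb=8):
    """Rough check against the ~8GB figure (decimal gigabytes)."""
    return total_bytes >= required_gb * 10**9

def check_environment():
    """Collect a small report; PyTorch is optional so this runs anywhere."""
    report = {"python_ok": python_supported()}
    try:
        import torch
    except ImportError:
        report["torch"] = None  # PyTorch is not installed yet
        return report
    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # total_memory is the first GPU's memory in bytes
        props = torch.cuda.get_device_properties(0)
        report["vram_ok"] = vram_sufficient(props.total_memory)
    return report

if __name__ == "__main__":
    print(check_environment())
```

The PyTorch import sits inside the function so the script still reports the Python version on a machine where PyTorch is missing.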
Installation
Install the package with pip:
pip install voxcpm

That's it. The model weights download automatically from Hugging Face the first time you run it (~8GB, so give it a few minutes on first load).
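To confirm the install worked without triggering the weight download, you can ask Python which version of the package it sees. This is a minimal sketch using only the standard library; `installed_version` is a helper name made up for this example.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# "voxcpm" is the package name used in the pip command above.
version = installed_version("voxcpm")
print("voxcpm version:", version if version else "not installed")
```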
Feature 1: basic text-to-speech
The simplest use case: give it text, get a .wav file back.
from voxcpm import VoxCPM
import soundfile as sf

# Load the model (weights download on first use)
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

# Generate speech from text
wav = model.generate(
    text="VoxCPM2 just became my favourite open-source TTS model.",
    cfg_value=2.0,
    inference_timesteps=10,
)

# Save the result using the model's own sample rate
sf.write("output.wav", wav, model.tts_model.sample_rate)

What the parameters mean:
- load_denoiser=False: skips a secondary denoising pass, about 2x faster with minimal quality loss
- cfg_value: controls how closely the model follows your instructions (2.0 is a good default)
- inference_timesteps: 10 is fast, 20 to 30 is higher quality
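If you switch between quick drafts and final renders often, it can be handy to keep these values in one place. The sketch below is just one way to do that: the preset names and the `generation_settings` helper are invented for this example, and the numbers come from the defaults described above.

```python
# Hypothetical presets: the names "fast" and "quality" are made up here.
PRESETS = {
    "fast": {"cfg_value": 2.0, "inference_timesteps": 10},
    "quality": {"cfg_value": 2.0, "inference_timesteps": 30},
}

def generation_settings(preset="fast", **overrides):
    """Return a copy of a preset, with any keyword overrides applied."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    settings = dict(PRESETS[preset])  # copy so the preset stays unchanged
    settings.update(overrides)
    return settings

settings = generation_settings("quality", inference_timesteps=20)
print(settings)

# With the model loaded as shown earlier:
# wav = model.generate(text="Hello there.", **settings)
```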
Feature 2: voice design (no audio needed)
This is where it gets interesting. You can describe a voice in plain English and the model generates it. No reference audio required.
wav = model.generate(
    text="(A middle-aged man, deep authoritative voice, calm and measured pace) Welcome to the build. Let's get into it.",
    cfg_value=2.0,
    inference_timesteps=10,
)

You can describe:
- Gender and age: young woman, elderly man, teenage boy
- Tone: warm, authoritative, nervous, cheerful, dry
- Pace: slow and deliberate, fast-paced, measured
- Emotion: excited, sad, sarcastic, enthusiastic
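Since the description is just text in parentheses at the start of the prompt, you can assemble it from separate traits. Here is a small sketch; `build_voice_prompt` is a helper name made up for this example, and it only produces the string format used in the call above.

```python
def build_voice_prompt(text, *traits):
    """Prefix text with a parenthesized, comma-separated voice description."""
    cleaned = [t.strip() for t in traits if t and t.strip()]
    if not cleaned:
        return text  # no description: plain text-to-speech
    return f"({', '.join(cleaned)}) {text}"

prompt = build_voice_prompt(
    "Welcome to the build. Let's get into it.",
    "A middle-aged man",
    "deep authoritative voice",
    "calm and measured pace",
)
print(prompt)

# With the model loaded as shown earlier:
# wav = model.generate(text=prompt, cfg_value=2.0, inference_timesteps=10)
```

Keeping the traits separate makes it easy to swap one attribute, such as pace or emotion, while leaving the rest of the voice unchanged.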