VoxCPM2: Open-Source TTS, Voice Cloning & Voice Design

VoxCPM2 is a free, open-source text-to-speech model from OpenBMB. It's a 2B-parameter model trained on more than 2 million hours of multilingual audio. It can clone voices, design voices from text descriptions, and generate 48kHz studio-quality audio. No API key. No monthly fee. Runs on your own machine.

What makes VoxCPM2 different

Many TTS systems first convert speech into discrete tokens, and that quantization step can discard fine acoustic detail and limit naturalness. VoxCPM2 skips that step entirely. It uses a tokenizer-free diffusion architecture that generates speech directly in a continuous latent space.

2B parameters: bigger than most open-source TTS models
30 languages: no language tag needed, it auto-detects
48kHz output: studio quality, no external upsampler
~8GB VRAM: the approximate GPU memory needed to run it
Apache 2.0 license: fully open, commercial use allowed

Requirements

Python 3.10 to 3.12 (3.13 is not supported yet)
PyTorch 2.5.0+
CUDA 12.0+ (you need an NVIDIA GPU; CPU inference is very slow)
~8GB VRAM
~10GB disk space for the model weights
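
Before installing, it can help to confirm that your machine matches the list above. The following is a minimal sketch, not part of VoxCPM2: the helper names python_supported and gpu_report are made up for this example, and the GPU check only uses standard PyTorch calls, falling back to a message if PyTorch is not installed yet.

```python
import sys


def python_supported(version_info=sys.version_info):
    """Return True for Python 3.10 through 3.12, the range listed above."""
    return (3, 10) <= tuple(version_info[:2]) <= (3, 12)


def gpu_report():
    """Describe the visible CUDA setup, or say why it could not be checked."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} found, but no CUDA device is visible"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3  # bytes to GiB
    return f"PyTorch {torch.__version__}, {props.name}, {vram_gb:.1f} GB VRAM"


if __name__ == "__main__":
    print("Python OK:", python_supported())
    print(gpu_report())
```

If the reported VRAM is well below 8 GB, expect out-of-memory errors or very slow generation.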

Installation

Install the package with pip:

pip install voxcpm

That's it. The model weights download automatically from Hugging Face the first time you load the model (~8GB, so give it a few minutes on first load).

Feature 1: basic text-to-speech

The simplest use case: give it text, get a .wav file back.

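from voxcpm import VoxCPM
import soundfile as sf

# Load the model; the weights are downloaded on the first run.
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,  # do not load the optional denoiser
)

# Synthesize speech from plain text.
wav = model.generate(
    text="VoxCPM2 just became my favourite open-source TTS model.",
    cfg_value=2.0,  # guidance strength
    inference_timesteps=10,  # fewer steps means faster generation
)

# Save the audio at the model's native sample rate.
sf.write("output.wav", wav, model.tts_model.sample_rate)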

What the parameters mean:

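load_denoiser=False: does not load the optional denoiser, so the secondary denoising pass is skipped; roughly 2x faster with minimal quality loss
cfg_value: guidance strength; higher values make the model follow your text and instructions more closely (2.0 is a good default)
inference_timesteps: number of generation steps; 10 is fast, 20 to 30 gives higher quality at the cost of speed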
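
To get a feel for what a guidance value does, here is a toy sketch of the textbook classifier-free guidance formula. It assumes cfg_value is a guidance scale of this kind, as the name suggests. It is not VoxCPM2's internal code, and apply_guidance is a made-up name for this illustration.

```python
def apply_guidance(unconditional, conditional, cfg_value):
    """Push the prediction away from the unconditional one, scaled by cfg_value."""
    return [u + cfg_value * (c - u) for u, c in zip(unconditional, conditional)]


# Toy numbers standing in for two model predictions.
uncond = [0.0, 1.0]  # prediction that ignores the prompt
cond = [1.0, 3.0]    # prediction that uses the prompt

print(apply_guidance(uncond, cond, 1.0))  # [1.0, 3.0] -> plain conditional prediction
print(apply_guidance(uncond, cond, 2.0))  # [2.0, 5.0] -> pushed further toward the prompt
```

With a scale of 1.0 the result is just the conditional prediction. Larger values exaggerate the difference, which is why very high settings can start to sound unnatural.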

Feature 2: voice design (no audio needed)

This is where it gets interesting. You can describe a voice in plain English and the model generates it. No reference audio required.

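# Reuses the model and the sf import from Feature 1.
# The voice description goes in parentheses at the start of the text.
wav = model.generate(
    text="(A middle-aged man, deep authoritative voice, calm and measured pace) Welcome to the build. Let's get into it.",
    cfg_value=2.0,
    inference_timesteps=10,
)

sf.write("voice_design.wav", wav, model.tts_model.sample_rate)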
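
The example above puts the voice description in parentheses and follows it with the text to speak. If you want to try many voices, a small helper keeps that format consistent. This is a sketch based only on the example above: design_prompt is a hypothetical helper, not part of the voxcpm package.

```python
def design_prompt(description, text):
    """Prefix the spoken text with a parenthesised voice description."""
    # Tolerate descriptions that already include the parentheses.
    description = description.strip().strip("()").strip()
    if not description:
        return text.strip()
    return f"({description}) {text.strip()}"


prompt = design_prompt(
    "A young woman, bright and energetic",
    "Hello there.",
)
print(prompt)  # (A young woman, bright and energetic) Hello there.
```

Pass the result as the text argument of model.generate, exactly as in the example above.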

You can describe: