Builds·3 min read·Apr 28, 2026

VoxCPM2: Open-Source TTS, Voice Cloning & Voice Design

A free, open-source 2B parameter TTS model. Clone voices, design voices from text, generate studio-quality audio. No API key.

VoxCPM2 is a free, open-source text-to-speech model from OpenBMB. It's a 2B parameter model trained on 2 million+ hours of multilingual audio, and it can clone voices, design voices from text descriptions, and generate 48kHz studio-quality audio. No API key. No monthly fee. Runs on your own machine.

What makes VoxCPM2 different

Most TTS systems work by breaking speech into tokens (discrete chunks), which limits naturalness. VoxCPM2 skips that entirely. It uses a tokenizer-free diffusion architecture that generates speech directly in a continuous latent space.

  • 2B parameters: bigger than most open-source TTS models
  • 30 languages: no language tag needed, it auto-detects
  • 48kHz output: studio quality, no external upsampler
  • ~8GB VRAM
  • Apache 2.0 license: fully open, commercial use allowed
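To get a feel for the tokens-versus-continuous point, here's a toy sketch. It is not how VoxCPM2 or any real speech tokenizer works; it only shows that snapping a signal to a small fixed set of values (a stand-in for discrete tokens) throws away detail, and that the loss shrinks as the set grows. The function names are made up for this illustration.

```python
import math


def quantize(value, levels):
    """Snap a value in [-1, 1] to the nearest of `levels` evenly spaced points."""
    step = 2.0 / (levels - 1)
    return round((value + 1.0) / step) * step - 1.0


def max_quantization_error(levels, samples=1000):
    """Largest gap between one cycle of a sine wave and its quantized copy."""
    wave = [math.sin(2 * math.pi * i / samples) for i in range(samples)]
    return max(abs(x - quantize(x, levels)) for x in wave)


# More levels means a finer grid and a smaller worst-case error.
for levels in (4, 16, 256):
    print(levels, round(max_quantization_error(levels), 4))
```

A continuous representation has no grid to snap to, which is the intuition behind generating speech directly in a continuous latent space.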

Requirements

  • Python 3.10 to 3.12 (3.13 is not supported yet)
  • PyTorch 2.5.0+
  • CUDA 12.0+ (you need an NVIDIA GPU; CPU inference is very slow)
  • ~8GB VRAM
  • ~10GB disk space for the model weights
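If you want to check your machine against that list before installing anything, a small script along these lines should do it. This is a sketch, not part of VoxCPM2: the helper names are made up, the thresholds are copied from the list above, and the GPU check only works once PyTorch is installed.

```python
import shutil
import sys

SUPPORTED_PYTHON = ((3, 10), (3, 12))  # inclusive range from the requirements list
MIN_FREE_DISK_GB = 10                  # approximate space needed for the weights


def python_supported(version=None):
    """Return True if the (major, minor) version falls in the supported range."""
    major, minor = (version or sys.version_info)[:2]
    low, high = SUPPORTED_PYTHON
    return low <= (major, minor) <= high


def enough_disk(path=".", required_gb=MIN_FREE_DISK_GB):
    """Return True if the filesystem holding `path` has the required free space."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb


def gpu_summary():
    """Describe the first CUDA device, or explain why none was found."""
    try:
        import torch  # imported here so the script still runs before PyTorch is installed
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return f"PyTorch {torch.__version__} found, but no CUDA device is visible"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    return f"PyTorch {torch.__version__}, {props.name}, {vram_gb:.1f} GB VRAM"


if __name__ == "__main__":
    print("Python supported:", python_supported())
    print("Enough disk space:", enough_disk())
    print("GPU:", gpu_summary())
```

Note that `enough_disk()` checks the current directory's drive. The weights may be cached on a different drive, so point it at your cache location if that applies.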

Installation

Install the package with pip:

pip install voxcpm

That's it. The model weights download automatically from HuggingFace the first time you run it (~8GB, so give it a few minutes on first load).

Feature 1: basic text-to-speech

The simplest use case: give it text, get a .wav file back.

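from voxcpm import VoxCPM
import soundfile as sf

# Load the model; the weights download from HuggingFace on the first run.
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,  # skip the secondary denoising pass
)

# Generate speech for the given text.
wav = model.generate(
    text="VoxCPM2 just became my favourite open-source TTS model.",
    cfg_value=2.0,           # how closely the output follows the input
    inference_timesteps=10,  # fewer steps is faster, more is higher quality
)

# Save the audio at the model's own sample rate.
sf.write("output.wav", wav, model.tts_model.sample_rate)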

What the parameters mean:

  • load_denoiser=False: skips a secondary denoising pass, 2x faster with minimal quality loss
  • cfg_value: controls how closely the model follows your instructions (2.0 is a good default)
  • inference_timesteps: 10 is fast, 20 to 30 is higher quality
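If you switch between quick drafts and final renders a lot, it can help to keep those settings in one place. The sketch below is a hypothetical helper, not part of the voxcpm package: the preset names and the choice of 25 steps for "final" are my own picks from the 20 to 30 range above.

```python
# Hypothetical presets; the names and the 25-step value are illustrative.
PRESETS = {
    "draft": {"cfg_value": 2.0, "inference_timesteps": 10},
    "final": {"cfg_value": 2.0, "inference_timesteps": 25},
}


def generation_kwargs(text, preset="draft", **overrides):
    """Build keyword arguments for model.generate() from a named preset."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}, expected one of {sorted(PRESETS)}")
    kwargs = {"text": text, **PRESETS[preset]}
    kwargs.update(overrides)  # explicit overrides win over the preset
    return kwargs


# Usage, assuming `model` was loaded as in the example above:
# wav = model.generate(**generation_kwargs("Hello there.", preset="final"))
```

Iterate on the script with "draft", then re-render the keeper with "final".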

Feature 2: voice design (no audio needed)

This is where it gets interesting. You can describe a voice in plain English and the model generates it. No reference audio required.

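# Reuses `model` and `sf` from the basic text-to-speech example.
wav = model.generate(
    text="(A middle-aged man, deep authoritative voice, calm and measured pace) Welcome to the build. Let's get into it.",
    cfg_value=2.0,
    inference_timesteps=10,
)

sf.write("voice_design.wav", wav, model.tts_model.sample_rate)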

You can describe:

  • Gender and age: young woman, elderly man, teenage boy
  • Tone: warm, authoritative, nervous, cheerful, dry
  • Pace: slow and deliberate, fast-paced, measured
  • Emotion: excited, sad, sarcastic, enthusiastic
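Since the description is just a parenthesised prefix on the text, you can build it from parts instead of hand-typing it each time. A minimal sketch, assuming the "(description) text" format shown in the example above; `voice_design_prompt` is a made-up helper, not something the voxcpm package provides.

```python
def voice_design_prompt(text, *traits):
    """Prefix `text` with a parenthesised voice description built from `traits`."""
    cleaned = [t.strip() for t in traits if t and t.strip()]
    if not cleaned:
        return text  # no description given, so this is plain text-to-speech
    description = ", ".join(cleaned)
    return f"({description}) {text}"


prompt = voice_design_prompt(
    "Welcome to the build. Let's get into it.",
    "elderly man",
    "warm",
    "slow and deliberate",
)
print(prompt)
```

Pass the result as the `text` argument to `model.generate()`. That makes it easy to try the same line with different combinations of age, tone, pace, and emotion.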
