yt-dub is a Python CLI that takes a YouTube URL, downloads the video, transcribes it with Whisper, translates the transcript, and re-synthesizes the audio in the original speaker's cloned voice using VoxCPM2. The output is a dubbed video that sounds like the creator actually recorded it in another language. Apache 2.0 model, MIT tool, total API cost: $0.
Repo: github.com/niccholasw/yt-dub
The Problem Nobody Talks About
YouTube creators who want to reach non-English audiences have two options. Pay ElevenLabs $22/month (minimum) for voice dubbing, which caps out fast on longer videos. Or use YouTube's auto-dub, which sounds robotic and strips the creator's personality from the audio.
Both options treat the creator's voice as disposable. The dubbed version sounds like a different person reading a translation. For creators whose personality IS the product, that kills the video.
The third option didn't exist until VoxCPM2 dropped: a 2-billion-parameter open-source model that clones any voice from a 15-second sample and synthesizes new speech in 30 languages. Apache 2.0 licensed. No API key. No credits to burn through. OpenBMB trained it on 2 million+ hours of multilingual audio.
yt-dub wraps VoxCPM2 into a single command that takes a YouTube URL and a language code and outputs a fully dubbed video.
Why This Works
Voice cloning, not voice replacement. VoxCPM2 extracts the speaker's timbre, accent, and tone from a reference clip. The dubbed version preserves who the speaker sounds like. A deep male voice stays deep. A fast-talking energetic delivery stays energetic. (A sketch of what that cloning call might look like follows this feature rundown.)
30 languages, one model. No separate model per language. No language tags to configure. VoxCPM2 auto-detects the target language from the text and handles the phoneme mapping internally. English to Japanese, Spanish to Korean, Arabic to French. Same command.
$0 in recurring costs. ElevenLabs charges per character. YouTube's built-in dubbing is locked behind the Partner Program. yt-dub runs entirely on your local GPU. The only cost is electricity.
48kHz studio output. Native, no upsampler. The output audio is production-ready, not lo-fi demo quality.
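To make the cloning step concrete, here is a minimal sketch of what a single synthesis call might look like in Python. Treat every name as an assumption: the voxcpm module, the from_pretrained model ID, and the generate() signature are modeled on VoxCPM-style APIs and are not taken from the yt-dub source.

import soundfile as sf
from voxcpm import VoxCPM  # assumed module/class name, not verified

# Hypothetical: load the model once, then clone from a short reference clip.
model = VoxCPM.from_pretrained("openbmb/VoxCPM2")  # assumed model ID

wav = model.generate(
    text="Hola a todos, bienvenidos de nuevo.",  # a translated segment
    prompt_wav_path="voice_sample.wav",          # ~15 s of the original speaker
    prompt_text="Hey everyone, welcome back.",   # transcript of that clip
)
sf.write("segment_0001.wav", wav, 48000)  # 48 kHz, per the native-output claim

The key design point is that the reference clip rides along with every call, so each synthesized segment is conditioned on the same voice.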
Step 1: Prerequisites
You need four things installed before yt-dub:
Python 3.10 to 3.12. VoxCPM2 doesn't support 3.13 yet.

python --version

GPU with 8GB+ VRAM. NVIDIA RTX 3080 or better, or Apple Silicon (M1 Pro and up). VoxCPM2 runs natively on both CUDA and MPS. CPU inference is technically possible but too slow to be useful on real videos.
# NVIDIA
nvidia-smi
# Mac (Apple Silicon detected automatically by PyTorch)
python -c "import torch; print(torch.backends.mps.is_available())"ffmpeg:
ffmpeg:
# Mac
brew install ffmpeg
# Ubuntu/Debian
sudo apt install ffmpeg

yt-dlp:
pip install yt-dlp
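Before moving on, you can sanity-check all four prerequisites at once with a short throwaway script (a convenience check added here for illustration, not something yt-dub ships):

import shutil
import sys

# Confirm the interpreter version and that both external tools are on PATH.
ok = (3, 10) <= sys.version_info[:2] <= (3, 12)
print(f"Python {sys.version.split()[0]}: {'ok' if ok else 'unsupported'}")
for tool in ("ffmpeg", "yt-dlp"):
    print(f"{tool}: {'found' if shutil.which(tool) else 'MISSING'}")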
Step 2: Install yt-dub
Clone the repo and install in editable mode:
git clone https://github.com/niccholasw/yt-dub.git
cd yt-dub
pip install -e .

This pulls in VoxCPM2, Whisper, the translation layer, and all dependencies. First run will download model weights (~8GB for VoxCPM2, ~3GB for Whisper large-v3). Budget 10 to 15 minutes on first install, depending on your connection.
For Claude-powered translation (optional, better quality on nuanced content):
pip install -e ".[claude]"
export ANTHROPIC_API_KEY=your-key
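As an illustration of what the optional Claude path could look like, here is a sketch of a single-segment translation using the official anthropic SDK. The model ID and prompt wording are placeholders; yt-dub's actual prompt and wiring may differ.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate_segment(text: str, target_lang: str) -> str:
    # One short completion per segment; prompt wording is illustrative.
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (f"Translate the following into {target_lang}, "
                        f"preserving tone and informal phrasing. "
                        f"Return only the translation.\n\n{text}"),
        }],
    )
    return message.content[0].text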
Step 3: Dub Your First Video
Pick any YouTube video. Run one command:
yt-dub "https://youtube.com/watch?v=VIDEO_ID" --lang esThe CLI runs the full pipeline:
- Downloads the video and extracts audio (yt-dlp)
- Transcribes the audio with Whisper large-v3
- Translates every segment to Spanish (Google Translate, free)
- Extracts a 15-second voice sample from the source audio
- Feeds each translated segment + voice sample into VoxCPM2
- Stitches the synthesized segments into a single audio track
- Muxes the new audio back into the video with ffmpeg
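Every stage maps to a well-known library. The sketch below shows roughly how they could be wired together with yt-dlp, Whisper, deep-translator, and ffmpeg. It is an assumed reconstruction, not yt-dub's actual source: the file names, the choice of deep-translator for the free Google Translate layer, and the omitted VoxCPM2 call are all illustrative.

import os
import subprocess
import yt_dlp
import whisper
from deep_translator import GoogleTranslator

url = "https://youtube.com/watch?v=VIDEO_ID"
os.makedirs("output", exist_ok=True)

# 1. Download the video, then split out a WAV track for transcription.
with yt_dlp.YoutubeDL({"outtmpl": "source.%(ext)s", "format": "mp4"}) as ydl:
    ydl.download([url])
subprocess.run(["ffmpeg", "-y", "-i", "source.mp4", "audio.wav"], check=True)

# 2. Transcribe; Whisper returns timestamped segments.
result = whisper.load_model("large-v3").transcribe("audio.wav")

# 3. Translate each segment (deep-translator wraps the free Google endpoint).
translator = GoogleTranslator(source="auto", target="es")
segments = [
    {"start": s["start"], "end": s["end"], "text": translator.translate(s["text"])}
    for s in result["segments"]
]

# 4. Cut a 15-second reference clip for voice cloning.
subprocess.run(["ffmpeg", "-y", "-i", "audio.wav", "-t", "15",
                "voice_sample.wav"], check=True)

# 5-6. Synthesize each segment with VoxCPM2 and stitch them into dubbed.wav
#      (see the hypothetical cloning sketch earlier; omitted here).

# 7. Mux the dubbed track back over the original video frames.
subprocess.run(["ffmpeg", "-y", "-i", "source.mp4", "-i", "dubbed.wav",
                "-map", "0:v", "-map", "1:a", "-c:v", "copy",
                "output/dubbed_es.mp4"], check=True)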
Output lands in output/dubbed_es.mp4.