yt-dub is a Python CLI that takes a YouTube URL, downloads the video, transcribes it with Whisper, translates the transcript, and re-synthesizes the audio in the original speaker's cloned voice using VoxCPM2. The output is a dubbed video that sounds like the creator actually recorded it in another language. Apache 2.0 model, MIT tool, total API cost: $0.
Repo: github.com/niccholasw/yt-dub
The Problem Nobody Talks About
YouTube creators who want to reach non-English audiences have two options. Pay ElevenLabs $22/month (minimum) for voice dubbing, which caps out fast on longer videos. Or use YouTube's auto-dub, which sounds robotic and strips the creator's personality from the audio.
Both options treat the creator's voice as disposable. The dubbed version sounds like a different person reading a translation. For creators whose personality IS the product, that kills the video.
The third option didn't exist until VoxCPM2 dropped: a 2-billion-parameter open-source model that clones any voice from a 15-second sample and synthesizes new speech in 30 languages. Apache 2.0 licensed. No API key. No credits to burn through. OpenBMB trained it on 2 million+ hours of multilingual audio.
yt-dub wraps VoxCPM2 into a single command that takes a YouTube URL and a language code and outputs a fully dubbed video.
Why This Works
Voice cloning, not voice replacement. VoxCPM2 extracts the speaker's timbre, accent, and tone from a reference clip. The dubbed version preserves who the speaker sounds like. A deep male voice stays deep. A fast-talking energetic delivery stays energetic. (A sketch of what that cloning call might look like follows this feature rundown.)
30 languages, one model. No separate model per language. No language tags to configure. VoxCPM2 auto-detects the target language from the text and handles the phoneme mapping internally. English to Japanese, Spanish to Korean, Arabic to French. Same command.
$0 in recurring costs. ElevenLabs charges per character. YouTube's built-in dubbing is locked behind the Partner Program. yt-dub runs entirely on your local GPU. The only cost is electricity.
48kHz studio output. Native, no upsampler. The output audio is production-ready, not lo-fi demo quality.
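To make the cloning step concrete, here is a minimal sketch of what a single synthesis call might look like in Python. Treat every name as an assumption: the voxcpm module, the from_pretrained model ID, and the generate() signature are modeled on VoxCPM-style APIs and are not taken from the yt-dub source.

import soundfile as sf
from voxcpm import VoxCPM  # assumed module/class name, not verified

# Hypothetical: load the model once, then clone from a short reference clip.
model = VoxCPM.from_pretrained("openbmb/VoxCPM2")  # assumed model ID

wav = model.generate(
    text="Hola a todos, bienvenidos de nuevo.",  # a translated segment
    prompt_wav_path="voice_sample.wav",          # ~15 s of the original speaker
    prompt_text="Hey everyone, welcome back.",   # transcript of that clip
)
sf.write("segment_0001.wav", wav, 48000)  # 48 kHz, per the native-output claim

The key design point is that the reference clip rides along with every call, so each synthesized segment is conditioned on the same voice.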
Step 1: Prerequisites
You need four things installed before yt-dub:
Python 3.10 to 3.12. VoxCPM2 doesn't support 3.13 yet.

python --version

GPU with 8GB+ VRAM. NVIDIA RTX 3080 or better, or Apple Silicon (M1 Pro and up). VoxCPM2 runs natively on both CUDA and MPS. CPU inference is technically possible but too slow to be useful on real videos.
# NVIDIA
nvidia-smi
# Mac (Apple Silicon detected automatically by PyTorch)
python -c "import torch; print(torch.backends.mps.is_available())"ffmpeg:
ffmpeg:
# Mac
brew install ffmpeg
# Ubuntu/Debian
sudo apt install ffmpeg

yt-dlp:
pip install yt-dlp
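Before moving on, you can sanity-check all four prerequisites at once with a short throwaway script (a convenience check added here for illustration, not something yt-dub ships):

import shutil
import sys

# Confirm the interpreter version and that both external tools are on PATH.
ok = (3, 10) <= sys.version_info[:2] <= (3, 12)
print(f"Python {sys.version.split()[0]}: {'ok' if ok else 'unsupported'}")
for tool in ("ffmpeg", "yt-dlp"):
    print(f"{tool}: {'found' if shutil.which(tool) else 'MISSING'}")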
Step 2: Install yt-dub
Clone the repo and install in editable mode:
git clone https://github.com/niccholasw/yt-dub.git
cd yt-dub
pip install -e .

This pulls in VoxCPM2, Whisper, the translation layer, and all dependencies. First run will download model weights (~8GB for VoxCPM2, ~3GB for Whisper large-v3). Budget 10 to 15 minutes on first install, depending on your connection.
For Claude-powered translation (optional, better quality on nuanced content):
pip install -e ".[claude]"
export ANTHROPIC_API_KEY=your-key
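As an illustration of what the optional Claude path could look like, here is a sketch of a single-segment translation using the official anthropic SDK. The model ID and prompt wording are placeholders; yt-dub's actual prompt and wiring may differ.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate_segment(text: str, target_lang: str) -> str:
    # One short completion per segment; prompt wording is illustrative.
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (f"Translate the following into {target_lang}, "
                        f"preserving tone and informal phrasing. "
                        f"Return only the translation.\n\n{text}"),
        }],
    )
    return message.content[0].text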
Step 3: Dub Your First Video
Pick any YouTube video. Run one command:
yt-dub "https://youtube.com/watch?v=VIDEO_ID" --lang esThe CLI runs the full pipeline:
- Downloads the video and extracts audio (yt-dlp)
- Transcribes the audio with Whisper large-v3
- Translates every segment to Spanish (Google Translate, free)
- Extracts a 15-second voice sample from the source audio
- Feeds each translated segment + voice sample into VoxCPM2
- Stitches the synthesized segments into a single audio track
- Muxes the new audio back into the video with ffmpeg
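Every stage maps to a well-known library. The sketch below shows roughly how they could be wired together with yt-dlp, Whisper, deep-translator, and ffmpeg. It is an assumed reconstruction, not yt-dub's actual source: the file names, the choice of deep-translator for the free Google Translate layer, and the omitted VoxCPM2 call are all illustrative.

import os
import subprocess
import yt_dlp
import whisper
from deep_translator import GoogleTranslator

url = "https://youtube.com/watch?v=VIDEO_ID"
os.makedirs("output", exist_ok=True)

# 1. Download the video, then split out a WAV track for transcription.
with yt_dlp.YoutubeDL({"outtmpl": "source.%(ext)s", "format": "mp4"}) as ydl:
    ydl.download([url])
subprocess.run(["ffmpeg", "-y", "-i", "source.mp4", "audio.wav"], check=True)

# 2. Transcribe; Whisper returns timestamped segments.
result = whisper.load_model("large-v3").transcribe("audio.wav")

# 3. Translate each segment (deep-translator wraps the free Google endpoint).
translator = GoogleTranslator(source="auto", target="es")
segments = [
    {"start": s["start"], "end": s["end"], "text": translator.translate(s["text"])}
    for s in result["segments"]
]

# 4. Cut a 15-second reference clip for voice cloning.
subprocess.run(["ffmpeg", "-y", "-i", "audio.wav", "-t", "15",
                "voice_sample.wav"], check=True)

# 5-6. Synthesize each segment with VoxCPM2 and stitch them into dubbed.wav
#      (see the hypothetical cloning sketch earlier; omitted here).

# 7. Mux the dubbed track back over the original video frames.
subprocess.run(["ffmpeg", "-y", "-i", "source.mp4", "-i", "dubbed.wav",
                "-map", "0:v", "-map", "1:a", "-c:v", "copy",
                "output/dubbed_es.mp4"], check=True)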
Output lands in output/dubbed_es.mp4.