| docs | ||
| vtbp | ||
| .gitignore | ||
| README.md | ||
| requirements-api.txt | ||
| requirements.txt | ||
| setup.py | ||
| setup_env.sh | ||
| voice-translate-bed-preserve-POC.md | ||
VTBP - Voice Translate Bed Preserve
A Python CLI tool that translates voice in videos while preserving background music and sound effects. Now with Google API integration for Mac compatibility! Choose between local processing or cloud APIs for best performance on your platform.
Features
🚀 Hybrid Processing Options
- Local Processing: Traditional local models (Faster-Whisper, OPUS-MT, Piper TTS)
- API Integration: Google Gemini 2.5 Pro + Google Cloud TTS (recommended for Mac)
- Mix & Match: Use APIs for some components, local for others
🎵 Core Capabilities
- Audio Separation: Separate voice from music/SFX using Demucs (MPS accelerated on Apple Silicon)
- Speech Recognition: Choose between Faster-Whisper (local) or Gemini 2.5 Pro (API)
- Machine Translation: OPUS-MT models (local) or Gemini context-aware translation (API)
- Text-to-Speech: Piper TTS (local) or Google Neural2 voices (API)
- Time Alignment: librosa-based stretching (Mac compatible, pure Python)
- Audio Mixing: Professional mixing with sidechain ducking and loudness normalization
- Video Remuxing: Stream-copy video for fast output generation
Installation
Prerequisites
- Python 3.9 or higher
- FFmpeg 5.0 or higher
macOS (Homebrew)
brew install ffmpeg
Ubuntu/Debian
sudo apt-get update && sudo apt-get install -y ffmpeg
Quick Start (API Mode - Recommended for Mac)
-
Clone and setup
git clone <repository-url> cd voice_translate_video python3 -m venv venv source venv/bin/activate -
Install API-optimized dependencies
pip install -r requirements-api.txt pip install -e . -
Automated setup (uses pre-configured service account)
./setup_env.sh -
Ready to go!
vtbp translate input.mp4 output.mp4 --asr-provider gemini --tts-provider google
Manual API Setup (Alternative)
If you prefer manual configuration or have your own service account:
# Set environment variables
export GEMINI_API_KEY="your-gemini-api-key"
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-service-account.json"
# Or use default credentials
gcloud auth application-default login
Local Processing Setup
-
Clone and setup
git clone <repository-url> cd voice_translate_video python3 -m venv venv source venv/bin/activate -
Install local dependencies
pip install -r requirements.txt pip install -e . -
Additional setup for local mode
# macOS: No additional requirements (uses librosa) # Linux: Install rubberband if you want optimal stretching sudo apt-get install librubberband-dev # Optional
API Setup (Recommended)
Google Gemini API
- Get your API key from Google AI Studio
- Set environment variable:
export GEMINI_API_KEY="AIzaSyAuuVGcvqfoP7pqX-YwieGszPsNSeAft-0"
Google Cloud TTS
- Create a Google Cloud project
- Enable the Text-to-Speech API
- Set up authentication:
# Option A: Default credentials (easiest) gcloud auth application-default login export GOOGLE_CLOUD_PROJECT="your-project-id" # Option B: Service account (production) export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json" export GOOGLE_CLOUD_PROJECT="your-project-id"
Validate Setup
# Check API credentials
vtbp translate --validate-apis
# Estimate costs for a video
vtbp translate input.mp4 output.mp4 --estimate-cost --asr-provider gemini --tts-provider google
Local TTS Setup (Optional)
Piper TTS Setup
Download Piper TTS and voice models:
- Download Piper binary from releases
- Download voice models from Piper voices
- Ensure
piperis in your PATH
Example voice downloads:
# Spanish voice
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/es/es_ES/nicone/medium/es_ES-nicone-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/es/es_ES/nicone/medium/es_ES-nicone-medium.onnx.json
# French voice
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/fr/fr_FR/sylvain/high/fr_FR-sylvain-high.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/fr/fr_FR/sylvain/high/fr_FR-sylvain-high.onnx.json
Usage
🚀 API Mode (Recommended for Mac)
Basic API translation:
vtbp translate input.mp4 output.mp4 \
--src-lang en --tgt-lang es \
--asr-provider gemini --tts-provider google
With specific Neural2 voice:
vtbp translate input.mp4 output.mp4 \
--src-lang en --tgt-lang es \
--asr-provider gemini --tts-provider google \
--voice "es-ES-Neural2-A"
Full API mode with all Gemini:
vtbp translate input.mp4 output.mp4 \
--src-lang en --tgt-lang fr \
--asr-provider gemini \
--tts-provider google \
--translation-provider gemini \
--voice "fr-FR-Neural2-A"
🏠 Local Mode
Traditional local processing:
vtbp translate input.mp4 output.mp4 \
--src-lang en --tgt-lang es \
--voice spanish_voice.onnx
🔀 Hybrid Mode
Mix API and local components:
# Use Gemini for ASR, local for everything else
vtbp translate input.mp4 output.mp4 \
--asr-provider gemini \
--voice spanish_voice.onnx
# Use APIs for ASR/TTS, local translation
vtbp translate input.mp4 output.mp4 \
--asr-provider gemini \
--tts-provider google \
--translation-provider opus
💰 Cost Management
Estimate costs before running:
vtbp translate input.mp4 output.mp4 \
--asr-provider gemini --tts-provider google \
--estimate-cost
Validate API setup:
vtbp translate --validate-apis \
--asr-provider gemini --tts-provider google
Command Options
vtbp translate [OPTIONS] INPUT_VIDEO OUTPUT_VIDEO
Provider Options (NEW):
--asr-provider TEXT ASR provider (whisper/gemini)
--tts-provider TEXT TTS provider (piper/google)
--translation-provider TEXT Translation provider (opus/gemini)
--voice TEXT Voice model path or name
--estimate-cost Show API cost estimation
--validate-apis Check API credentials
Core Options:
--src-lang TEXT Source language code (default: auto)
--tgt-lang TEXT Target language code (default: es)
--sep TEXT Audio separation model (default: htdemucs)
--device TEXT Processing device (cpu/cuda/mps/auto)
--work-dir TEXT Working directory (default: work)
--keep-temp Keep temporary files
Audio Processing:
--lufs FLOAT Target LUFS for loudness (-16.0)
--duck-threshold FLOAT Ducking threshold 0.0-1.0 (0.08)
--duck-ratio FLOAT Ducking compression ratio (6.0)
--duck-attack FLOAT Ducking attack time ms (5.0)
--duck-release FLOAT Ducking release time ms (250.0)
--voice-gain FLOAT Voice gain adjustment dB (0.0)
--bed-gain FLOAT Bed gain adjustment dB (-3.0)
--no-duck Disable sidechain ducking
--no-loudnorm Disable loudness normalization
--sample-rate INTEGER Audio sample rate (48000)
Advanced Examples
High-quality API translation with custom settings:
vtbp translate video.mp4 french.mp4 \
--src-lang en --tgt-lang fr \
--asr-provider gemini \
--tts-provider google \
--translation-provider gemini \
--voice "fr-FR-Neural2-A" \
--lufs -18 \
--duck-threshold 0.06 \
--voice-gain 1.0 \
--bed-gain -4.0
Fast processing with Apple Silicon:
vtbp translate video.mp4 output.mp4 \
--src-lang auto --tgt-lang es \
--asr-provider gemini \
--tts-provider google \
--device mps \
--sep htdemucs
Debug mode (keep temporary files):
vtbp translate video.mp4 output.mp4 \
--src-lang en --tgt-lang de \
--asr-provider gemini \
--tts-provider google \
--voice "de-DE-Neural2-A" \
--keep-temp \
--work-dir debug_work
💰 API Pricing & Cost Management
Estimated Costs (5-minute video)
- Gemini ASR: ~$0.0075 (multimodal processing)
- Google TTS Neural2: ~$0.10-0.20 (depending on text length)
- Gemini Translation: ~$0.05-0.10 (context-aware)
- Total per video: ~$0.20-0.35
Cost Optimization Tips
# Use cost estimation before processing
vtbp translate video.mp4 output.mp4 --estimate-cost --asr-provider gemini --tts-provider google
# Hybrid mode: API for quality, local for cost
vtbp translate video.mp4 output.mp4 --asr-provider gemini --translation-provider opus --voice local.onnx
# Batch processing: Process multiple videos in one session to amortize setup costs
🖥️ Platform Compatibility
macOS (Recommended: API Mode)
- Best: API mode eliminates all compatibility issues
- Good: Local mode with librosa (pure Python, no binary deps)
- Issues Fixed: No more pyrubberband or Piper installation problems
Linux/Windows
- Best: Hybrid mode (Gemini ASR + Google TTS, local separation)
- Good: Full local mode (all dependencies available)
- Good: Full API mode (cloud processing)
Language Support
Supported Language Codes
Common languages supported by OPUS-MT:
en- Englishes- Spanishfr- Frenchde- Germanit- Italianpt- Portugueseru- Russianzh- Chineseja- Japaneseko- Koreanar- Arabichi- Hindinl- Dutchsv- Swedishda- Danishno- Norwegianfi- Finnishpl- Polishtr- Turkishth- Thaivi- Vietnamese
Finding Translation Models
VTBP automatically downloads OPUS-MT models for supported language pairs from Helsinki-NLP.
Common model patterns:
Helsinki-NLP/opus-mt-en-es(English → Spanish)Helsinki-NLP/opus-mt-fr-en(French → English)Helsinki-NLP/opus-mt-mul-en(Multiple languages → English)
Pipeline Overview
- Audio Extraction - Extract stereo audio from video (48kHz)
- Source Separation - Split into voice and bed using Demucs HTDemucs
- Speech Recognition - Transcribe voice with word timestamps (Faster-Whisper)
- Text Translation - Translate segments preserving timing (OPUS-MT)
- Speech Synthesis - Generate translated speech (Piper TTS)
- Time Alignment - Stretch synthesized audio to match original timing (Rubber Band)
- Audio Mixing - Mix translated voice with bed using sidechain ducking
- Loudness Normalization - Normalize to broadcast standards (EBU R128)
- Video Remuxing - Combine with original video stream
Quality Guidelines
Best Results
- Clean speech with minimal background noise
- Consistent speaker throughout video
- Good separation between voice and music
- Target duration: 60s to 10 minutes per video
Audio Standards
- Loudness: -16 LUFS integrated (streaming standard)
- Dynamics: ≤ 7 LU loudness range
- Peak: ≤ -1.5 dBTP true peak
- Sample Rate: 48 kHz (preserves quality)
Performance Optimization
Device Selection
- CPU: Works on all systems, slower
- CUDA: Best for NVIDIA GPUs, requires CUDA toolkit
- MPS: Apple Silicon Macs (M1/M2), good performance
Model Size Trade-offs
- ASR Model:
large-v2(best quality) vsmedium(faster) - Separation:
htdemucs(balanced) vsmdx23c(experimental)
Troubleshooting
Common Issues
"Piper not found"
# Install Piper and add to PATH
wget https://github.com/rhasspy/piper/releases/latest/download/piper_linux_x86_64.tar.gz
tar -xzf piper_linux_x86_64.tar.gz
sudo cp piper/piper /usr/local/bin/
"No OPUS-MT model found"
- Check language codes are correct
- Verify internet connection for model download
- Try alternative language codes (e.g.,
japinstead ofjafor Japanese)
"CUDA out of memory"
# Use smaller models
vtbp translate video.mp4 output.mp4 --asr medium --device cpu
Poor voice separation
- Try different separation models:
--sep mdx23c - Check if voice is clearly audible in original
- Consider preprocessing audio to reduce noise
Timing issues
# Adjust segment duration limits
vtbp translate video.mp4 output.mp4 --min-segment 1.0 --max-segment 5.0
Getting Help
vtbp --help # General help
vtbp translate --help # Command-specific help
vtbp info # Show available components
vtbp version # Show version
Development
Running Tests
# Install development dependencies
pip install -e .[dev]
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=vtbp --cov-report=html
Code Quality
# Format code
black vtbp/
# Lint
flake8 vtbp/
# Type checking
mypy vtbp/
License
MIT License - see LICENSE file for details.
Contributing
- Fork the repository
- Create feature branch (
git checkout -b feature/amazing-feature) - Commit changes (
git commit -m 'Add amazing feature') - Push to branch (
git push origin feature/amazing-feature) - Open Pull Request
Acknowledgments
- Demucs - Facebook Research (audio separation)
- Faster-Whisper - SYSTRAN (speech recognition)
- OPUS-MT - University of Helsinki (machine translation)
- Piper - Rhasspy (text-to-speech)
- Rubber Band - Breakfast Quay (time stretching)
- FFmpeg - FFmpeg team (audio/video processing)
Note: This is a proof-of-concept implementation. For production use, consider additional error handling, batch processing capabilities, and voice model licensing compliance.