
VTBP - Voice Translate Bed Preserve

A Python CLI tool that translates the spoken voice in a video while preserving background music and sound effects. It integrates Google APIs for better compatibility on macOS, and each stage can run either locally or through a cloud API, so you can pick whichever performs best on your platform.

Features

🚀 Hybrid Processing Options

  • Local Processing: Traditional local models (Faster-Whisper, OPUS-MT, Piper TTS)
  • API Integration: Google Gemini 2.5 Pro + Google Cloud TTS (recommended for Mac)
  • Mix & Match: Use APIs for some components, local for others

🎵 Core Capabilities

  • Audio Separation: Separate voice from music/SFX using Demucs (MPS accelerated on Apple Silicon)
  • Speech Recognition: Choose between Faster-Whisper (local) or Gemini 2.5 Pro (API)
  • Machine Translation: OPUS-MT models (local) or Gemini context-aware translation (API)
  • Text-to-Speech: Piper TTS (local) or Google Neural2 voices (API)
  • Time Alignment: librosa-based stretching (Mac compatible, pure Python)
  • Audio Mixing: Professional mixing with sidechain ducking and loudness normalization
  • Video Remuxing: Stream-copy video for fast output generation
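
To make the first and last capabilities concrete, here is a minimal sketch of how audio extraction and stream-copy remuxing can be expressed as FFmpeg invocations. The helper names `extract_audio_cmd` and `remux_cmd`, the file paths, and the choice of AAC for the output audio are assumptions for illustration; VTBP's actual commands may differ.

```python
from typing import List


def extract_audio_cmd(video: str, wav_out: str, sample_rate: int = 48000) -> List[str]:
    """Build an ffmpeg command that extracts stereo PCM audio from a video."""
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-vn",                    # drop the video stream
        "-ac", "2",               # stereo
        "-ar", str(sample_rate),  # resample to the working rate
        "-c:a", "pcm_s16le",      # uncompressed 16-bit WAV
        wav_out,
    ]


def remux_cmd(video: str, mixed_audio: str, out: str) -> List[str]:
    """Build an ffmpeg command that keeps the original video stream untouched."""
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", mixed_audio,
        "-map", "0:v:0",   # video from the original file
        "-map", "1:a:0",   # audio from the new mix
        "-c:v", "copy",    # stream copy: no video re-encode
        "-c:a", "aac",     # assumed audio codec for the MP4 container
        "-shortest",       # stop at the shorter of the two inputs
        out,
    ]


if __name__ == "__main__":
    print(" ".join(extract_audio_cmd("input.mp4", "work/audio.wav")))
    print(" ".join(remux_cmd("input.mp4", "work/mix.wav", "output.mp4")))
```

Because the video stream is copied rather than re-encoded, the remux step takes roughly as long as reading and writing the file.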

Installation

Prerequisites

  • Python 3.9 or higher
  • FFmpeg 5.0 or higher

macOS (Homebrew)

brew install ffmpeg

Ubuntu/Debian

sudo apt-get update && sudo apt-get install -y ffmpeg

API Processing Setup
  1. Clone and setup

    git clone <repository-url>
    cd voice_translate_video
    python3 -m venv venv
    source venv/bin/activate
    
  2. Install API-optimized dependencies

    pip install -r requirements-api.txt
    pip install -e .
    
  3. Automated setup (uses pre-configured service account)

    ./setup_env.sh
    
  4. Ready to go!

    vtbp translate input.mp4 output.mp4 --asr-provider gemini --tts-provider google
    

Manual API Setup (Alternative)

If you prefer manual configuration or have your own service account:

# Set environment variables
export GEMINI_API_KEY="your-gemini-api-key"
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-service-account.json"

# Or use default credentials
gcloud auth application-default login
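
Before running a translation, it can help to confirm that the variables above are actually set. The following is a small sketch, not part of VTBP itself; the function name `missing_api_env` and its `use_default_credentials` flag are hypothetical.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the manual setup instructions above.
REQUIRED_VARS = (
    "GEMINI_API_KEY",
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_APPLICATION_CREDENTIALS",
)


def missing_api_env(
    env: Optional[Mapping[str, str]] = None,
    use_default_credentials: bool = False,
) -> List[str]:
    """Return the names of API variables that are unset or empty.

    With `gcloud auth application-default login`, no service-account file
    is needed, so GOOGLE_APPLICATION_CREDENTIALS is skipped in that case.
    """
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        if use_default_credentials and name == "GOOGLE_APPLICATION_CREDENTIALS":
            continue
        if not env.get(name, "").strip():
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_api_env()
    print("All API variables set" if not problems else f"Missing: {problems}")
```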

Local Processing Setup

  1. Clone and setup

    git clone <repository-url>
    cd voice_translate_video
    python3 -m venv venv
    source venv/bin/activate
    
  2. Install local dependencies

    pip install -r requirements.txt
    pip install -e .
    
  3. Additional setup for local mode

    # macOS: No additional requirements (uses librosa)
    # Linux: Install rubberband if you want optimal stretching
    sudo apt-get install librubberband-dev  # Optional
    

Google Gemini API

  1. Get your API key from Google AI Studio
  2. Set environment variable:
    export GEMINI_API_KEY="your-gemini-api-key"
    

Google Cloud TTS

  1. Create a Google Cloud project
  2. Enable the Text-to-Speech API
  3. Set up authentication:
    # Option A: Default credentials (easiest)
    gcloud auth application-default login
    export GOOGLE_CLOUD_PROJECT="your-project-id"
    
    # Option B: Service account (production)
    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
    export GOOGLE_CLOUD_PROJECT="your-project-id"
    

Validate Setup

# Check API credentials
vtbp translate --validate-apis

# Estimate costs for a video
vtbp translate input.mp4 output.mp4 --estimate-cost --asr-provider gemini --tts-provider google

Local TTS Setup (Optional)

Piper TTS Setup

Download Piper TTS and voice models:

  1. Download Piper binary from releases
  2. Download voice models from Piper voices
  3. Ensure piper is in your PATH

Example voice downloads:

# Spanish voice
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/es/es_ES/nicone/medium/es_ES-nicone-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/es/es_ES/nicone/medium/es_ES-nicone-medium.onnx.json

# French voice  
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/fr/fr_FR/sylvain/high/fr_FR-sylvain-high.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/fr/fr_FR/sylvain/high/fr_FR-sylvain-high.onnx.json

Usage

🌐 API Mode

Basic API translation:

vtbp translate input.mp4 output.mp4 \
  --src-lang en --tgt-lang es \
  --asr-provider gemini --tts-provider google

With specific Neural2 voice:

vtbp translate input.mp4 output.mp4 \
  --src-lang en --tgt-lang es \
  --asr-provider gemini --tts-provider google \
  --voice "es-ES-Neural2-A"

Full API mode with all Gemini:

vtbp translate input.mp4 output.mp4 \
  --src-lang en --tgt-lang fr \
  --asr-provider gemini \
  --tts-provider google \
  --translation-provider gemini \
  --voice "fr-FR-Neural2-A"

🏠 Local Mode

Traditional local processing:

vtbp translate input.mp4 output.mp4 \
  --src-lang en --tgt-lang es \
  --voice spanish_voice.onnx

🔀 Hybrid Mode

Mix API and local components:

# Use Gemini for ASR, local for everything else
vtbp translate input.mp4 output.mp4 \
  --asr-provider gemini \
  --voice spanish_voice.onnx

# Use APIs for ASR/TTS, local translation
vtbp translate input.mp4 output.mp4 \
  --asr-provider gemini \
  --tts-provider google \
  --translation-provider opus

💰 Cost Management

Estimate costs before running:

vtbp translate input.mp4 output.mp4 \
  --asr-provider gemini --tts-provider google \
  --estimate-cost

Validate API setup:

vtbp translate --validate-apis \
  --asr-provider gemini --tts-provider google

Command Options

vtbp translate [OPTIONS] INPUT_VIDEO OUTPUT_VIDEO

Provider Options (NEW):
  --asr-provider TEXT           ASR provider (whisper/gemini)
  --tts-provider TEXT           TTS provider (piper/google) 
  --translation-provider TEXT   Translation provider (opus/gemini)
  --voice TEXT                  Voice model path or name
  --estimate-cost               Show API cost estimation
  --validate-apis               Check API credentials

Core Options:
  --src-lang TEXT               Source language code (default: auto)
  --tgt-lang TEXT               Target language code (default: es)
  --sep TEXT                    Audio separation model (default: htdemucs)
  --device TEXT                 Processing device (cpu/cuda/mps/auto)
  --work-dir TEXT               Working directory (default: work)
  --keep-temp                   Keep temporary files

Audio Processing:
  --lufs FLOAT                  Target LUFS for loudness (-16.0)
  --duck-threshold FLOAT        Ducking threshold 0.0-1.0 (0.08)
  --duck-ratio FLOAT            Ducking compression ratio (6.0)
  --duck-attack FLOAT           Ducking attack time ms (5.0)
  --duck-release FLOAT          Ducking release time ms (250.0)
  --voice-gain FLOAT            Voice gain adjustment dB (0.0)
  --bed-gain FLOAT              Bed gain adjustment dB (-3.0)
  --no-duck                     Disable sidechain ducking
  --no-loudnorm                 Disable loudness normalization
  --sample-rate INTEGER         Audio sample rate (48000)
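
To give a feel for what the ducking and gain options control, here is a minimal sketch of a conventional sidechain compressor, using the defaults listed above. It is an illustration of the general technique under stated assumptions, not VTBP's implementation; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB (e.g. --bed-gain -3.0) to an amplitude multiplier."""
    return 10.0 ** (gain_db / 20.0)


def duck_gain(voice_level: float, threshold: float = 0.08, ratio: float = 6.0) -> float:
    """Bed gain for a given voice envelope level (linear, 0.0-1.0).

    Below the threshold the bed is untouched. Above it, the overshoot in dB
    is reduced by the compression ratio, and the difference is removed
    from the bed.
    """
    if voice_level <= threshold:
        return 1.0
    over_db = 20.0 * math.log10(voice_level / threshold)
    reduction_db = over_db * (1.0 - 1.0 / ratio)
    return db_to_linear(-reduction_db)


def smooth(gains: Sequence[float], attack_ms: float = 5.0,
           release_ms: float = 250.0, sample_rate: int = 48000) -> List[float]:
    """One-pole smoothing: fast when the gain falls, slow when it recovers."""
    attack = math.exp(-1.0 / (attack_ms * 0.001 * sample_rate))
    release = math.exp(-1.0 / (release_ms * 0.001 * sample_rate))
    out, state = [], 1.0
    for g in gains:
        coeff = attack if g < state else release  # falling gain = attack phase
        state = coeff * state + (1.0 - coeff) * g
        out.append(state)
    return out


if __name__ == "__main__":
    for level in (0.05, 0.08, 0.3, 0.8):
        print(f"voice level {level:.2f} -> bed gain {duck_gain(level):.3f}")
```

A lower `--duck-threshold` makes the bed drop for quieter speech, while a higher `--duck-ratio` makes it drop further once the threshold is crossed.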

Advanced Examples

High-quality API translation with custom settings:

vtbp translate video.mp4 french.mp4 \
  --src-lang en --tgt-lang fr \
  --asr-provider gemini \
  --tts-provider google \
  --translation-provider gemini \
  --voice "fr-FR-Neural2-A" \
  --lufs -18 \
  --duck-threshold 0.06 \
  --voice-gain 1.0 \
  --bed-gain -4.0

Fast processing with Apple Silicon:

vtbp translate video.mp4 output.mp4 \
  --src-lang auto --tgt-lang es \
  --asr-provider gemini \
  --tts-provider google \
  --device mps \
  --sep htdemucs

Debug mode (keep temporary files):

vtbp translate video.mp4 output.mp4 \
  --src-lang en --tgt-lang de \
  --asr-provider gemini \
  --tts-provider google \
  --voice "de-DE-Neural2-A" \
  --keep-temp \
  --work-dir debug_work

💰 API Pricing & Cost Management

Estimated Costs (5-minute video)

  • Gemini ASR: ~$0.0075 (multimodal processing)
  • Google TTS Neural2: ~$0.10-0.20 (depending on text length)
  • Gemini Translation: ~$0.05-0.10 (context-aware)
  • Total per video: ~$0.16-0.31 (sum of the three components above)
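
As a worked example, the per-component figures listed above can be summed and scaled to other video lengths. This sketch assumes cost grows linearly with duration, which is an assumption made here for illustration and not a statement about Google's pricing; `estimate_cost` is a hypothetical helper, separate from the `--estimate-cost` flag.

```python
from typing import Dict, Tuple

# (low, high) USD estimates for a 5-minute video, copied from the list above.
COMPONENTS_5MIN: Dict[str, Tuple[float, float]] = {
    "gemini_asr": (0.0075, 0.0075),
    "google_tts_neural2": (0.10, 0.20),
    "gemini_translation": (0.05, 0.10),
}


def estimate_cost(minutes: float) -> Tuple[float, float]:
    """Return a (low, high) USD estimate for a video of the given length."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    scale = minutes / 5.0  # assumption: linear scaling with duration
    low = sum(lo for lo, _ in COMPONENTS_5MIN.values()) * scale
    high = sum(hi for _, hi in COMPONENTS_5MIN.values()) * scale
    return low, high


if __name__ == "__main__":
    for minutes in (1, 5, 10):
        low, high = estimate_cost(minutes)
        print(f"{minutes:>2} min: ${low:.2f}-${high:.2f}")
```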

Cost Optimization Tips

# Use cost estimation before processing
vtbp translate video.mp4 output.mp4 --estimate-cost --asr-provider gemini --tts-provider google

# Hybrid mode: API for quality, local for cost
vtbp translate video.mp4 output.mp4 --asr-provider gemini --translation-provider opus --voice local.onnx

# Batch processing: Process multiple videos in one session to amortize setup costs

🖥️ Platform Compatibility

macOS

  • Best: API mode eliminates all compatibility issues
  • Good: Local mode with librosa (pure Python, no binary deps)
  • Issues Fixed: No more pyrubberband or Piper installation problems

Linux/Windows

  • Best: Hybrid mode (Gemini ASR + Google TTS, local separation)
  • Good: Full local mode (all dependencies available)
  • Good: Full API mode (cloud processing)

Language Support

Supported Language Codes

Common languages supported by OPUS-MT:

  • en - English
  • es - Spanish
  • fr - French
  • de - German
  • it - Italian
  • pt - Portuguese
  • ru - Russian
  • zh - Chinese
  • ja - Japanese
  • ko - Korean
  • ar - Arabic
  • hi - Hindi
  • nl - Dutch
  • sv - Swedish
  • da - Danish
  • no - Norwegian
  • fi - Finnish
  • pl - Polish
  • tr - Turkish
  • th - Thai
  • vi - Vietnamese

Finding Translation Models

VTBP automatically downloads OPUS-MT models for supported language pairs from Helsinki-NLP.

Common model patterns:

  • Helsinki-NLP/opus-mt-en-es (English → Spanish)
  • Helsinki-NLP/opus-mt-fr-en (French → English)
  • Helsinki-NLP/opus-mt-mul-en (Multiple languages → English)
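
The naming pattern above can be sketched as a small lookup helper. The function name `opus_mt_candidates` and the fallback order are assumptions for illustration; VTBP's own model resolution may work differently, and not every language pair has a published model.

```python
from typing import List


def opus_mt_candidates(src: str, tgt: str) -> List[str]:
    """Return Hugging Face model ids to try, most specific first."""
    src, tgt = src.strip().lower(), tgt.strip().lower()
    if not src or not tgt:
        raise ValueError("source and target language codes are required")
    candidates = [f"Helsinki-NLP/opus-mt-{src}-{tgt}"]
    if tgt == "en":
        # The multilingual-to-English model can serve as a fallback.
        candidates.append("Helsinki-NLP/opus-mt-mul-en")
    return candidates


if __name__ == "__main__":
    print(opus_mt_candidates("en", "es"))
    print(opus_mt_candidates("fr", "en"))
```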

Pipeline Overview

  1. Audio Extraction - Extract stereo audio from video (48kHz)
  2. Source Separation - Split into voice and bed using Demucs HTDemucs
  3. Speech Recognition - Transcribe voice with timestamps (Faster-Whisper word timestamps locally, or Gemini via API)
  4. Text Translation - Translate segments preserving timing (OPUS-MT locally, or Gemini via API)
  5. Speech Synthesis - Generate translated speech (Piper TTS locally, or Google Cloud TTS via API)
  6. Time Alignment - Stretch synthesized audio to match original timing (librosa; Rubber Band optional on Linux)
  7. Audio Mixing - Mix translated voice with bed using sidechain ducking
  8. Loudness Normalization - Normalize to broadcast standards (EBU R128)
  9. Video Remuxing - Combine with original video stream
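
The time-alignment step comes down to choosing a playback-rate factor for each segment. Below is a minimal sketch of that calculation; the function name `stretch_rate` and the clamping bounds are hypothetical, chosen here only to show why extreme stretching is usually limited.

```python
def stretch_rate(synth_seconds: float, slot_seconds: float,
                 min_rate: float = 0.8, max_rate: float = 1.25) -> float:
    """Playback-rate factor that fits synthesized speech into the original slot.

    A rate above 1 speeds the audio up (shorter); below 1 slows it down.
    The result is clamped because large factors make speech sound unnatural.
    """
    if synth_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")
    rate = synth_seconds / slot_seconds
    return max(min_rate, min(max_rate, rate))


if __name__ == "__main__":
    # 3.0 s of synthesized speech must fit a 2.5 s slot in the original.
    rate = stretch_rate(3.0, 2.5)
    print(f"rate = {rate:.2f}")
    # With librosa installed, the factor would be applied as:
    #   stretched = librosa.effects.time_stretch(samples, rate=rate)
```

When the clamp is hit, the segment still overruns or underruns its slot, so a real pipeline needs a policy for the remainder, such as trimming silence or borrowing time from a neighboring gap.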

Quality Guidelines

Best Results

  • Clean speech with minimal background noise
  • Consistent speaker throughout video
  • Good separation between voice and music
  • Target duration: 60 seconds to 10 minutes per video

Audio Standards

  • Loudness: -16 LUFS integrated (streaming standard)
  • Dynamics: ≤ 7 LU loudness range
  • True peak: ≤ -1.5 dBTP
  • Sample Rate: 48 kHz (preserves quality)
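
These targets map directly onto the `I`, `TP`, and `LRA` options of FFmpeg's `loudnorm` filter. The sketch below builds such a filter string and checks a set of measurements against the targets. Whether VTBP normalizes with `loudnorm` is an assumption, and the helper names and the 1 LU tolerance are hypothetical.

```python
def loudnorm_filter(lufs: float = -16.0, true_peak: float = -1.5,
                    lra: float = 7.0) -> str:
    """Build an FFmpeg loudnorm filter string for the given targets."""
    return f"loudnorm=I={lufs}:TP={true_peak}:LRA={lra}"


def meets_targets(integrated: float, true_peak: float, lra: float,
                  target_lufs: float = -16.0, tolerance: float = 1.0) -> bool:
    """Check measured loudness values against the standards listed above."""
    return (
        abs(integrated - target_lufs) <= tolerance  # integrated loudness
        and true_peak <= -1.5                       # dBTP ceiling
        and lra <= 7.0                              # loudness range in LU
    )


if __name__ == "__main__":
    # Would be passed to ffmpeg as: -af "<filter string>"
    print(loudnorm_filter())
    print(meets_targets(integrated=-16.3, true_peak=-1.8, lra=5.0))
```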

Performance Optimization

Device Selection

  • CPU: Works on all systems, slower
  • CUDA: Best for NVIDIA GPUs, requires CUDA toolkit
  • MPS: Apple Silicon Macs (M1 and later), good performance
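
One plausible way `--device auto` could resolve to a concrete device is sketched below. The function name `pick_device` and the preference order are assumptions. In a PyTorch-based tool, the two availability flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`; they are plain parameters here so the example runs without PyTorch.

```python
def pick_device(requested: str = "auto", cuda_available: bool = False,
                mps_available: bool = False) -> str:
    """Resolve a --device value to one of 'cuda', 'mps', or 'cpu'."""
    available = {"cpu": True, "cuda": cuda_available, "mps": mps_available}
    if requested == "auto":
        # Prefer a GPU backend when one is present, otherwise use the CPU.
        for name in ("cuda", "mps"):
            if available[name]:
                return name
        return "cpu"
    if requested not in available:
        raise ValueError(f"unknown device: {requested!r}")
    # Fall back to the CPU if the requested backend is not usable.
    return requested if available[requested] else "cpu"


if __name__ == "__main__":
    print(pick_device("auto", mps_available=True))
    print(pick_device("cuda"))
```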

Model Size Trade-offs

  • ASR Model: large-v2 (best quality) vs medium (faster)
  • Separation: htdemucs (balanced) vs mdx23c (experimental)

Troubleshooting

Common Issues

"Piper not found"

# Install Piper and add to PATH
wget https://github.com/rhasspy/piper/releases/latest/download/piper_linux_x86_64.tar.gz
tar -xzf piper_linux_x86_64.tar.gz
sudo cp piper/piper /usr/local/bin/

"No OPUS-MT model found"

  • Check language codes are correct
  • Verify internet connection for model download
  • Try alternative language codes (e.g., jap instead of ja for Japanese)

"CUDA out of memory"

# Use a smaller ASR model and run on the CPU instead of the GPU
vtbp translate video.mp4 output.mp4 --asr medium --device cpu

Poor voice separation

  • Try different separation models: --sep mdx23c
  • Check if voice is clearly audible in original
  • Consider preprocessing audio to reduce noise

Timing issues

# Adjust segment duration limits
vtbp translate video.mp4 output.mp4 --min-segment 1.0 --max-segment 5.0

Getting Help

vtbp --help           # General help
vtbp translate --help # Command-specific help  
vtbp info            # Show available components
vtbp version         # Show version

Development

Running Tests

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=vtbp --cov-report=html

Code Quality

# Format code
black vtbp/

# Lint
flake8 vtbp/

# Type checking  
mypy vtbp/

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request

Acknowledgments

  • Demucs - Facebook Research (audio separation)
  • Faster-Whisper - SYSTRAN (speech recognition)
  • OPUS-MT - University of Helsinki (machine translation)
  • Piper - Rhasspy (text-to-speech)
  • Rubber Band - Breakfast Quay (time stretching)
  • FFmpeg - FFmpeg team (audio/video processing)

Note: This is a proof-of-concept implementation. For production use, consider additional error handling, batch processing capabilities, and voice model licensing compliance.