VTBP - Voice Translate Bed Preserve

A Python CLI tool that translates the voice track of a video while preserving background music and sound effects. Each stage can run on local models or on Google cloud APIs. API mode is recommended on macOS, where the local dependencies are harder to install.

Features

🚀 Hybrid Processing Options

Local Processing: Traditional local models (Faster-Whisper, OPUS-MT, Piper TTS)
API Integration: Google Gemini 2.5 Pro + Google Cloud TTS (recommended for Mac)
Mix & Match: Use APIs for some components, local for others

🎵 Core Capabilities

Audio Separation: Separate voice from music/SFX using Demucs (MPS accelerated on Apple Silicon)
Speech Recognition: Choose between Faster-Whisper (local) or Gemini 2.5 Pro (API)
Machine Translation: OPUS-MT models (local) or Gemini context-aware translation (API)
Text-to-Speech: Piper TTS (local) or Google Neural2 voices (API)
Time Alignment: librosa-based stretching (Mac compatible, pure Python)
Audio Mixing: Professional mixing with sidechain ducking and loudness normalization
Video Remuxing: Stream-copy video for fast output generation

Installation

Prerequisites

Python 3.9 or higher
FFmpeg 5.0 or higher

macOS (Homebrew)

brew install ffmpeg

Ubuntu/Debian

sudo apt-get update && sudo apt-get install -y ffmpeg

Quick Start (API Mode - Recommended for Mac)

Clone and setup

git clone <repository-url>
cd voice_translate_video
python3 -m venv venv
source venv/bin/activate

Install API-optimized dependencies

pip install -r requirements-api.txt
pip install -e .

Automated setup (uses pre-configured service account)
```
./setup_env.sh
```

Ready to go!

vtbp translate input.mp4 output.mp4 --asr-provider gemini --tts-provider google

Manual API Setup (Alternative)

If you prefer manual configuration or have your own service account:

# Set environment variables
export GEMINI_API_KEY="your-gemini-api-key"
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-service-account.json"

# Or use default credentials
gcloud auth application-default login
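
Before running the tool, it can help to confirm that the variables above are actually visible to Python. The sketch below is not part of VTBP: `missing_credentials` is a hypothetical helper, and it only checks that the variables are set, not that the keys are valid. It assumes `GOOGLE_APPLICATION_CREDENTIALS` is optional, since `gcloud auth application-default login` can supply credentials without it.

```python
import os

# Variables named in this README. GOOGLE_APPLICATION_CREDENTIALS is treated
# as optional here because gcloud default credentials can stand in for it
# (an assumption of this sketch, not documented VTBP behavior).
REQUIRED = ("GEMINI_API_KEY", "GOOGLE_CLOUD_PROJECT")
OPTIONAL = ("GOOGLE_APPLICATION_CREDENTIALS",)


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required variables are set.")
    for name in OPTIONAL:
        state = "set" if os.environ.get(name) else "not set"
        print(f"{name}: {state}")
```

For an authoritative check, use `vtbp translate --validate-apis`.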

Local Processing Setup

Clone and setup

git clone <repository-url>
cd voice_translate_video
python3 -m venv venv
source venv/bin/activate

Install local dependencies

pip install -r requirements.txt
pip install -e .

Additional setup for local mode

# macOS: No additional requirements (uses librosa)
# Linux: Install rubberband if you want optimal stretching
sudo apt-get install librubberband-dev  # Optional

API Setup (Recommended)

Google Gemini API

Get your API key from Google AI Studio

Set environment variable:

export GEMINI_API_KEY="your-gemini-api-key"

Google Cloud TTS

Create a Google Cloud project
Enable the Text-to-Speech API

Set up authentication:

# Option A: Default credentials (easiest)
gcloud auth application-default login
export GOOGLE_CLOUD_PROJECT="your-project-id"

# Option B: Service account (production)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
export GOOGLE_CLOUD_PROJECT="your-project-id"

Validate Setup

# Check API credentials
vtbp translate --validate-apis

# Estimate costs for a video
vtbp translate input.mp4 output.mp4 --estimate-cost --asr-provider gemini --tts-provider google

Local TTS Setup (Optional)

Piper TTS Setup

Download Piper TTS and voice models:

Download Piper binary from releases
Download voice models from Piper voices
Ensure piper is in your PATH

Example voice downloads:

# Spanish voice
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/es/es_ES/nicone/medium/es_ES-nicone-medium.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/es/es_ES/nicone/medium/es_ES-nicone-medium.onnx.json

# French voice  
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/fr/fr_FR/sylvain/high/fr_FR-sylvain-high.onnx
wget https://huggingface.co/rhasspy/piper-voices/resolve/v1.0.0/fr/fr_FR/sylvain/high/fr_FR-sylvain-high.onnx.json

Usage

🚀 API Mode (Recommended for Mac)

Basic API translation:

vtbp translate input.mp4 output.mp4 \
  --src-lang en --tgt-lang es \
  --asr-provider gemini --tts-provider google

With specific Neural2 voice:

vtbp translate input.mp4 output.mp4 \
  --src-lang en --tgt-lang es \
  --asr-provider gemini --tts-provider google \
  --voice "es-ES-Neural2-A"

Full API mode (Gemini for ASR and translation, Google Cloud TTS for speech):

vtbp translate input.mp4 output.mp4 \
  --src-lang en --tgt-lang fr \
  --asr-provider gemini \
  --tts-provider google \
  --translation-provider gemini \
  --voice "fr-FR-Neural2-A"

🏠 Local Mode

Traditional local processing:

vtbp translate input.mp4 output.mp4 \
  --src-lang en --tgt-lang es \
  --voice spanish_voice.onnx

🔀 Hybrid Mode

Mix API and local components:

# Use Gemini for ASR, local for everything else
vtbp translate input.mp4 output.mp4 \
  --asr-provider gemini \
  --voice spanish_voice.onnx

# Use APIs for ASR/TTS, local translation
vtbp translate input.mp4 output.mp4 \
  --asr-provider gemini \
  --tts-provider google \
  --translation-provider opus

💰 Cost Management

Estimate costs before running:

vtbp translate input.mp4 output.mp4 \
  --asr-provider gemini --tts-provider google \
  --estimate-cost

Validate API setup:

vtbp translate --validate-apis \
  --asr-provider gemini --tts-provider google

Command Options

vtbp translate [OPTIONS] INPUT_VIDEO OUTPUT_VIDEO

Provider Options (NEW):
  --asr-provider TEXT           ASR provider (whisper/gemini)
  --tts-provider TEXT           TTS provider (piper/google) 
  --translation-provider TEXT   Translation provider (opus/gemini)
  --voice TEXT                  Voice model path or name
  --estimate-cost               Show API cost estimation
  --validate-apis               Check API credentials

Core Options:
  --src-lang TEXT               Source language code (default: auto)
  --tgt-lang TEXT               Target language code (default: es)
  --sep TEXT                    Audio separation model (default: htdemucs)
  --device TEXT                 Processing device (cpu/cuda/mps/auto)
  --work-dir TEXT               Working directory (default: work)
  --keep-temp                   Keep temporary files

Audio Processing:
  --lufs FLOAT                  Target LUFS for loudness (-16.0)
  --duck-threshold FLOAT        Ducking threshold 0.0-1.0 (0.08)
  --duck-ratio FLOAT            Ducking compression ratio (6.0)
  --duck-attack FLOAT           Ducking attack time ms (5.0)
  --duck-release FLOAT          Ducking release time ms (250.0)
  --voice-gain FLOAT            Voice gain adjustment dB (0.0)
  --bed-gain FLOAT              Bed gain adjustment dB (-3.0)
  --no-duck                     Disable sidechain ducking
  --no-loudnorm                 Disable loudness normalization
  --sample-rate INTEGER         Audio sample rate (48000)
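
To give a feel for what the ducking options control, here is a minimal sketch of a conventional sidechain compressor curve. It assumes `--duck-threshold` is a linear amplitude in 0.0-1.0 and `--duck-ratio` is a standard compression ratio. VTBP's actual implementation may differ, and the function names are illustrative only.

```python
import math


def duck_gain(level, threshold=0.08, ratio=6.0):
    """Static linear gain applied to the bed for a given voice level.

    Below the threshold the bed passes unchanged. Above it, the bed is
    attenuated by (1 - 1/ratio) dB for every dB the voice exceeds the
    threshold.
    """
    if level <= threshold:
        return 1.0
    return (level / threshold) ** (1.0 / ratio - 1.0)


def smooth(gains, sample_rate=48000, attack_ms=5.0, release_ms=250.0):
    """One-pole smoothing: fast when the gain falls, slow when it recovers."""
    a_att = math.exp(-1.0 / (attack_ms * 1e-3 * sample_rate))
    a_rel = math.exp(-1.0 / (release_ms * 1e-3 * sample_rate))
    out, g = [], 1.0
    for target in gains:
        coeff = a_att if target < g else a_rel
        g = coeff * g + (1.0 - coeff) * target
        out.append(g)
    return out
```

With the defaults, a voice level ten times the threshold (0.8) pulls the bed down to roughly 15% of its amplitude, and the 5 ms attack reaches that level far faster than the 250 ms release lets it recover.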

Advanced Examples

High-quality API translation with custom settings:

vtbp translate video.mp4 french.mp4 \
  --src-lang en --tgt-lang fr \
  --asr-provider gemini \
  --tts-provider google \
  --translation-provider gemini \
  --voice "fr-FR-Neural2-A" \
  --lufs -18 \
  --duck-threshold 0.06 \
  --voice-gain 1.0 \
  --bed-gain -4.0

Fast processing with Apple Silicon:

vtbp translate video.mp4 output.mp4 \
  --src-lang auto --tgt-lang es \
  --asr-provider gemini \
  --tts-provider google \
  --device mps \
  --sep htdemucs

Debug mode (keep temporary files):

vtbp translate video.mp4 output.mp4 \
  --src-lang en --tgt-lang de \
  --asr-provider gemini \
  --tts-provider google \
  --voice "de-DE-Neural2-A" \
  --keep-temp \
  --work-dir debug_work

💰 API Pricing & Cost Management

Estimated Costs (5-minute video)

Gemini ASR: ~$0.0075 (multimodal processing)
Google TTS Neural2: ~$0.10-0.20 (depending on text length)
Gemini Translation: ~$0.05-0.10 (context-aware)
Total per video: ~$0.16-0.31
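
The listed figures are rough estimates. As a worked check, summing the three component ranges exactly as written can be done as follows (the dictionary keys are illustrative labels, not VTBP identifiers):

```python
# Per-component estimates for a 5-minute video, in USD (low, high),
# copied from the list above.
ESTIMATES = {
    "gemini_asr": (0.0075, 0.0075),
    "google_tts_neural2": (0.10, 0.20),
    "gemini_translation": (0.05, 0.10),
}


def total_range(estimates):
    """Sum the low and high ends of each component range."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high


low, high = total_range(ESTIMATES)
print(f"${low:.4f} to ${high:.4f}")  # $0.1575 to $0.3075
```

Use `--estimate-cost` for a figure based on the actual video.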

Cost Optimization Tips

# Use cost estimation before processing
vtbp translate video.mp4 output.mp4 --estimate-cost --asr-provider gemini --tts-provider google

# Hybrid mode: API for quality, local for cost
vtbp translate video.mp4 output.mp4 --asr-provider gemini --translation-provider opus --voice local.onnx

# Batch processing: Process multiple videos in one session to amortize setup costs
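
The batch-processing tip could be scripted roughly as below, using only flags documented in this README. The helper names and the `_<lang>.mp4` output naming are assumptions of this sketch. Each call starts a new `vtbp` process, so on its own this does not amortize model loading.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, tgt_lang="es"):
    """Build a `vtbp translate` argument list using flags from this README."""
    return [
        "vtbp", "translate", str(src), str(dst),
        "--tgt-lang", tgt_lang,
        "--asr-provider", "gemini",
        "--tts-provider", "google",
    ]


def translate_folder(in_dir, out_dir, tgt_lang="es", dry_run=True):
    """Translate every .mp4 in in_dir; with dry_run, only return the commands."""
    out = Path(out_dir)
    commands = []
    for src in sorted(Path(in_dir).glob("*.mp4")):
        cmd = build_command(src, out / f"{src.stem}_{tgt_lang}.mp4", tgt_lang)
        commands.append(cmd)
        if not dry_run:
            out.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)  # Stop at the first failure.
    return commands
```

Calling `translate_folder("videos", "translated", "fr")` returns the commands without running them. Pass `dry_run=False` to execute.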

🖥️ Platform Compatibility

macOS (Recommended: API Mode)

Best: API mode eliminates all compatibility issues
Good: Local mode with librosa (pure Python, no binary deps)
Issues Fixed: No more pyrubberband or Piper installation problems

Linux/Windows

Best: Hybrid mode (Gemini ASR + Google TTS, local separation)
Good: Full local mode (all dependencies available)
Good: Full API mode (cloud processing)

Language Support

Supported Language Codes

Common languages supported by OPUS-MT:

en - English
es - Spanish
fr - French
de - German
it - Italian
pt - Portuguese
ru - Russian
zh - Chinese
ja - Japanese
ko - Korean
ar - Arabic
hi - Hindi
nl - Dutch
sv - Swedish
da - Danish
no - Norwegian
fi - Finnish
pl - Polish
tr - Turkish
th - Thai
vi - Vietnamese

Finding Translation Models

VTBP automatically downloads OPUS-MT models for supported language pairs from Helsinki-NLP.

Common model patterns:

Helsinki-NLP/opus-mt-en-es (English → Spanish)
Helsinki-NLP/opus-mt-fr-en (French → English)
Helsinki-NLP/opus-mt-mul-en (Multiple languages → English)
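
The naming pattern above is regular enough to generate candidate model names for a language pair. This is a sketch, not VTBP's lookup logic: the function name is hypothetical, and the only alternate code included is the Japanese one mentioned under Troubleshooting.

```python
# Alternate codes to try when the plain two-letter code has no model.
# Only the Japanese entry comes from this README's troubleshooting notes.
ALTERNATE_CODES = {"ja": ["jap"]}


def opus_model_candidates(src, tgt):
    """Return Hugging Face repo names to try, most specific first."""
    names = []
    for s in [src] + ALTERNATE_CODES.get(src, []):
        for t in [tgt] + ALTERNATE_CODES.get(tgt, []):
            names.append(f"Helsinki-NLP/opus-mt-{s}-{t}")
    if tgt == "en":
        # Multilingual-to-English fallback from the list above.
        names.append("Helsinki-NLP/opus-mt-mul-en")
    return names


print(opus_model_candidates("en", "es"))
# ['Helsinki-NLP/opus-mt-en-es']
```

Whether a given candidate actually exists still has to be checked against the Helsinki-NLP model listing.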

Pipeline Overview

Audio Extraction - Extract stereo audio from video (48kHz)
Source Separation - Split into voice and bed using Demucs HTDemucs
Speech Recognition - Transcribe voice with word timestamps (Faster-Whisper or Gemini)
Text Translation - Translate segments preserving timing (OPUS-MT or Gemini)
Speech Synthesis - Generate translated speech (Piper TTS or Google Cloud TTS)
Time Alignment - Stretch synthesized audio to match original timing (librosa; Rubber Band optional on Linux)
Audio Mixing - Mix translated voice with bed using sidechain ducking
Loudness Normalization - Normalize integrated loudness, measured per EBU R128 (default target -16 LUFS)
Video Remuxing - Combine with original video stream
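
The Time Alignment step comes down to choosing a playback-rate factor per segment. A minimal sketch follows, assuming the usual convention (also used by `librosa.effects.time_stretch`) that a rate above 1.0 shortens the audio. The clamp limits are illustrative and are not VTBP defaults.

```python
def stretch_rate(synth_duration, target_duration, min_rate=0.8, max_rate=1.25):
    """Rate factor that fits synthesized speech into the original segment.

    A rate above 1.0 speeds the audio up; below 1.0 slows it down. The
    result is clamped because extreme stretching sounds unnatural.
    """
    if synth_duration <= 0 or target_duration <= 0:
        raise ValueError("durations must be positive")
    rate = synth_duration / target_duration
    return max(min_rate, min(max_rate, rate))


def stretched_duration(synth_duration, rate):
    """Duration of the audio after stretching by `rate`."""
    return synth_duration / rate
```

For example, 3.0 s of synthesized speech in a 2.5 s slot gives a rate of 1.2. When the clamp kicks in, the stretched audio no longer fits exactly and the remainder has to be absorbed by padding or trimming.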

Quality Guidelines

Best Results

Clean speech with minimal background noise
Consistent speaker throughout video
Good separation between voice and music
Target duration: 60s to 10 minutes per video

Audio Standards

Loudness: -16 LUFS integrated (streaming standard)
Dynamics: ≤ 7 LU loudness range
Peak: ≤ -1.5 dBTP true peak
Sample Rate: 48 kHz (preserves quality)
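
The dB figures above and the `--voice-gain` / `--bed-gain` options convert to linear amplitude as sketched below. The helper names are illustrative. The peak check uses sample peak, which only approximates true peak (dBTP), since true peak requires oversampling.

```python
import math


def db_to_linear(db):
    """Convert a gain or level in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def sample_peak_dbfs(samples):
    """Sample peak in dBFS for float samples in [-1.0, 1.0].

    True peak can be higher than this, so treat it as a lower bound.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    return -math.inf if peak == 0.0 else 20.0 * math.log10(peak)


def within_peak_limit(samples, limit_db=-1.5):
    """Check the sample peak against the -1.5 dB ceiling listed above."""
    return sample_peak_dbfs(samples) <= limit_db
```

A -1.5 dB ceiling corresponds to a linear amplitude of about 0.841, and the default bed gain of -3.0 dB scales the bed by about 0.708.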

Performance Optimization

Device Selection

CPU: Works on all systems, slower
CUDA: Best for NVIDIA GPUs, requires CUDA toolkit
MPS: Apple Silicon Macs (M1 and later), good performance
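
One plausible way for `--device auto` to resolve is to ask PyTorch what is available, preferring CUDA, then MPS, then CPU. This README does not show VTBP's own selection logic, so the function below is a hypothetical sketch.

```python
def pick_device(preferred="auto"):
    """Resolve 'auto' to 'cuda', 'mps', or 'cpu' based on what PyTorch reports."""
    if preferred != "auto":
        return preferred
    try:
        import torch
    except ImportError:
        return "cpu"  # No PyTorch installed, so no accelerator to use.
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds have no MPS backend, hence the getattr guard.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

An explicit value such as `--device mps` skips detection entirely.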

Model Size Trade-offs

ASR Model: large-v2 (best quality) vs medium (faster)
Separation: htdemucs (balanced) vs mdx23c (experimental)

Troubleshooting

Common Issues

"Piper not found"

# Install Piper and add to PATH
wget https://github.com/rhasspy/piper/releases/latest/download/piper_linux_x86_64.tar.gz
tar -xzf piper_linux_x86_64.tar.gz
sudo cp piper/piper /usr/local/bin/

"No OPUS-MT model found"

Check language codes are correct
Verify internet connection for model download
Try alternative language codes (e.g., jap instead of ja for Japanese)

"CUDA out of memory"

# Use a smaller ASR model and run on CPU instead of the GPU
vtbp translate video.mp4 output.mp4 --asr medium --device cpu

Poor voice separation

Try different separation models: --sep mdx23c
Check if voice is clearly audible in original
Consider preprocessing audio to reduce noise

Timing issues

# Adjust segment duration limits
vtbp translate video.mp4 output.mp4 --min-segment 1.0 --max-segment 5.0

Getting Help

vtbp --help           # General help
vtbp translate --help # Command-specific help  
vtbp info            # Show available components
vtbp version         # Show version

Development

Running Tests

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=vtbp --cov-report=html

Code Quality

# Format code
black vtbp/

# Lint
flake8 vtbp/

# Type checking  
mypy vtbp/

License

MIT License - see LICENSE file for details.

Contributing

Fork the repository
Create feature branch (git checkout -b feature/amazing-feature)
Commit changes (git commit -m 'Add amazing feature')
Push to branch (git push origin feature/amazing-feature)
Open Pull Request

Acknowledgments

Demucs - Facebook Research (audio separation)
Faster-Whisper - SYSTRAN (speech recognition)
OPUS-MT - University of Helsinki (machine translation)
Piper - Rhasspy (text-to-speech)
Rubber Band - Breakfast Quay (time stretching)
FFmpeg - FFmpeg team (audio/video processing)

Note: This is a proof-of-concept implementation. For production use, consider additional error handling, batch processing capabilities, and voice model licensing compliance.