voice2text/CLAUDE.md
2025-11-03 15:23:55 +00:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Voice to Text is a web application that transcribes audio files with OpenAI Whisper and optionally translates the transcripts through the DeepL API. The application consists of:

  • Python Flask API (backend): Handles transcription and translation
  • PHP web interface (frontend): User interface, authentication, and request handling
  • Microsoft Azure AD SSO: OAuth2 with PKCE flow for authentication

Development Setup

Initial Setup

# 1. Configure authentication
cp .env.example .env
# Edit .env with your Azure AD credentials

# 2. Install PHP dependencies
composer install

# 3. Install Python dependencies and create virtual environment
./setup.sh

# 4. Start the Python API server
./start_api.sh

# Or manually:
source venv/bin/activate
python api.py

The API listens on http://localhost:5010 by default. Serve the PHP frontend with MAMP, Apache, or any other PHP-enabled web server, and enable HTTPS in production.

Testing the Application

# Check if Python API is running
curl http://localhost:5010/health

# Or use the PHP diagnostic page
# Visit: check_api.php in browser

# Test downloads
# Visit: test_download.php in browser
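
The curl check above can also be scripted, for example from a deployment or monitoring job. The following is a minimal sketch using only the Python standard library; the function name `api_is_healthy` is hypothetical, and it assumes that /health answers with HTTP 200 when the API is up.

```python
import urllib.error
import urllib.request


def api_is_healthy(base_url: str = "http://localhost:5010", timeout: float = 3.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200 (assumed contract)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

Calling `api_is_healthy()` with no arguments probes the default port documented above.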

Python Version Compatibility

  • Recommended: Python 3.10 or 3.11
  • Supported: Python 3.8+
  • Warning: Python 3.12+ may have compatibility issues with some dependencies

Architecture

Three-Layer Design

The application separates concerns into three layers:

  1. Authentication Layer: Microsoft Azure AD SSO with OAuth2 PKCE flow
  2. Python API (api.py): Computation-heavy tasks (Whisper transcription, DeepL translation)
  3. PHP Frontend: User interface, session management, file handling, and proxying requests to Python API

Authentication Flow

User Browser
    ↓
login.php (landing page)
    ↓ (clicks "Sign in with Microsoft")
auth.php
    ↓ (generates PKCE code_verifier & code_challenge)
Azure AD OAuth2 Authorization Endpoint
    ↓ (user authenticates)
auth.php (callback)
    ↓ (exchanges code + code_verifier for token)
Microsoft Graph API (/me)
    ↓ (retrieves user info)
Session initialized:
    - $_SESSION['authenticated'] = true
    - $_SESSION['user_id'], ['user_name'], ['user_email']
    - $_SESSION['user_files'] = []
    ↓
index.php (main app)

Request Flow (After Authentication)

User Browser (index.php)
    ↓ (jQuery AJAX + FormData)
process.php
    ↓ (auth check via isAuthenticated())
    ↓ (cURL to Python API)
api.py (Flask)
    ↓ (Whisper transcription)
    ↓ (Optional: DeepL translation)
outputs/ directory
    ↓ (files tracked in $_SESSION['user_files'])
download.php
    ↓ (auth + ownership check)
User Browser (download)

Key Components

Authentication & Configuration Files:

auth_config.php (Authentication & Environment Configuration):

  • Loads environment variables from .env using vlucas/phpdotenv
  • Defines Azure AD configuration constants (CLIENT_ID, AUTHORITY, REDIRECT_URI)
  • Configures secure session settings (httponly, secure, samesite)
  • Provides helper functions:
    • isAuthenticated(): Check if user is logged in and session is valid
    • requireAuth(): Redirect to login.php if not authenticated
    • getCurrentUser(): Get current user info from session

login.php (Landing Page):

  • First page users see when not authenticated
  • Displays "Sign in with Microsoft" button with Microsoft logo
  • Matches black/gold theme of main application
  • Redirects to index.php if already authenticated

auth.php (OAuth2 PKCE Handler):

  • Implements OAuth2 Authorization Code flow with PKCE
  • Step 1: Generates code_verifier (64-char random string) and code_challenge (SHA-256 hash of the verifier, base64url-encoded as RFC 7636 requires for the S256 method)
  • Step 2: Redirects to Azure AD with PKCE parameters
  • Step 3: Handles callback, verifies state (CSRF protection)
  • Step 4: Exchanges authorization code + code_verifier for access token
  • Step 5: Calls Microsoft Graph API to get user info
  • Step 6: Initializes session with user data and empty file list
  • Step 7: Redirects to index.php

logout.php (Session Destruction):

  • Clears all session variables
  • Destroys session cookie
  • Destroys session
  • Redirects to login.php

config.php (Configuration Loader):

  • Requires auth_config.php
  • Starts session if not already started
  • All configuration now loaded from .env via auth_config.php

API & Core Files:

api.py (Flask REST API - Port 5010):

  • /health: Health check endpoint
  • /transcribe: Main endpoint - accepts audio file, format (txt/vtt/srt), translation settings
  • /download/<filename>: Serves transcribed files
  • Whisper model loaded once at startup and kept in memory
  • DeepL translator initialized at startup
  • Generates both original and translated files when translation is enabled

process.php (PHP request handler):

  • Auth check: Calls isAuthenticated() - returns error if not authenticated
  • Receives multipart/form-data from frontend
  • Validates file size (350MB limit)
  • Forwards to Python API via cURL
  • File tracking: Adds original and translated filenames to $_SESSION['user_files']
  • Returns formatted HTML for display (truncated at 10,000 chars for preview)
  • Provides download links for full files

index.php (Main UI):

  • Auth required: Calls requireAuth() at top - redirects to login.php if not authenticated
  • Displays user header with name, email, and logout button
  • jQuery-based AJAX file upload
  • Format selector (txt/vtt/srt)
  • Translation toggle with language selector (30+ languages)
  • Real-time progress bar during processing
  • In-page preview of transcriptions
  • Download buttons for original and translated files

download.php (File server):

  • Auth required: Calls isAuthenticated() - returns 401 if not authenticated
  • Ownership check: Verifies requested file is in $_SESSION['user_files']
  • Returns 403 Forbidden if user doesn't own the file
  • Logs unauthorized download attempts
  • Serves files from outputs/ directory
  • Security: Uses basename() to prevent directory traversal
  • Sets proper Content-Type headers based on file extension
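
The checks listed for download.php run in a fixed order: authentication, then filename sanitizing, then ownership. The sketch below mirrors that order in Python purely as an illustration; the real handler is PHP, and `authorize_download` and `OUTPUT_DIR` are hypothetical names.

```python
import os
from typing import Optional, Tuple

OUTPUT_DIR = "outputs"  # assumed to match the directory named above


def authorize_download(session: dict, requested: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, path to serve or None)."""
    if not session.get("authenticated"):
        return 401, None                    # not logged in
    name = os.path.basename(requested)      # drop any directory components
    if name not in session.get("user_files", []):
        return 403, None                    # file is not tracked for this session
    return 200, os.path.join(OUTPUT_DIR, name)
```

Sanitizing before the ownership lookup means a traversal attempt such as `../../etc/passwd` is reduced to `passwd` and then rejected with 403, because that name was never added to the session.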

.env (Environment Variables):

  • AZURE_CLIENT_ID: Azure AD application client ID
  • AZURE_AUTHORITY: Azure AD authority URL with tenant ID
  • AZURE_REDIRECT_URI: OAuth2 redirect URI (must match Azure AD config)
  • DEEPL_API_KEY: DeepL API key for translation
  • PYTHON_API_URL: Python Flask API endpoint (default: http://localhost:5010)
  • SESSION_TIMEOUT: Session timeout in seconds (default: 28800 = 8 hours)
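
As an illustration of how these variables and their documented defaults fit together, here is a hedged Python sketch. It is not the project's loader (the PHP side uses vlucas/phpdotenv); the function names and the choice of which variables count as required are assumptions.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the documented variables, applying the documented defaults."""
    return {
        "azure_client_id": env.get("AZURE_CLIENT_ID", ""),
        "azure_authority": env.get("AZURE_AUTHORITY", ""),
        "azure_redirect_uri": env.get("AZURE_REDIRECT_URI", ""),
        "deepl_api_key": env.get("DEEPL_API_KEY", ""),
        "python_api_url": env.get("PYTHON_API_URL", "http://localhost:5010"),
        "session_timeout": int(env.get("SESSION_TIMEOUT", "28800")),  # 8 hours
    }


def missing_required(settings: dict) -> list:
    """Names of Azure credentials that are still empty (assumed to be required)."""
    required = ("azure_client_id", "azure_authority", "azure_redirect_uri")
    return [key for key in required if not settings[key]]
```

Checking `missing_required(load_settings())` at startup gives a clear error before the first login attempt fails.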

Output Formats

Text (.txt)

Plain text transcription - full text of audio

VTT (.vtt)

WebVTT subtitle format with timestamps:

WEBVTT

00:00:00.000 --> 00:00:05.123
First segment text

00:00:05.123 --> 00:00:10.456
Second segment text

SRT (.srt)

SubRip subtitle format with timestamps:

1
00:00:00,000 --> 00:00:05,123
First segment text

2
00:00:05,123 --> 00:00:10,456
Second segment text

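Key difference: VTT separates seconds from milliseconds with a period (.), SRT with a comma (,). As the examples show, a VTT file also starts with a WEBVTT header line, while SRT cues are numbered from 1.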
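
The two subtitle formats can be produced from the same timed segments. The sketch below assumes segments shaped like the ones Whisper returns (dicts with `start`, `end` in seconds, and `text`); the function names are hypothetical and this is not necessarily how api.py renders its output.

```python
def format_timestamp(seconds: float, separator: str) -> str:
    """Format seconds as HH:MM:SS<separator>mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def to_vtt(segments) -> str:
    """WebVTT: 'WEBVTT' header, '.' before the milliseconds, no cue numbers."""
    cues = [
        f"{format_timestamp(s['start'], '.')} --> {format_timestamp(s['end'], '.')}\n{s['text'].strip()}"
        for s in segments
    ]
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


def to_srt(segments) -> str:
    """SubRip: cues numbered from 1, ',' before the milliseconds."""
    cues = [
        f"{i}\n{format_timestamp(s['start'], ',')} --> {format_timestamp(s['end'], ',')}\n{s['text'].strip()}"
        for i, s in enumerate(segments, start=1)
    ]
    return "\n\n".join(cues) + "\n"
```

Working in integer milliseconds avoids rounding drift that can otherwise push a value such as 5.9996 seconds to a four-digit millisecond field.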

Whisper Models

Available models (edit api.py line 26 to change):

  • tiny: Fastest, least accurate
  • base: Default - good balance
  • small: Better accuracy, slower
  • medium: High accuracy, much slower
  • large: Best accuracy, very slow

Changing the model:

# In api.py line 26:
model = whisper.load_model("small")  # Change from "base" to desired model

File Size and Timeout Limits

  • Maximum file size: 350MB (configured in .htaccess and process.php)
  • Processing timeout: 5 minutes (300 seconds in process.php)
  • PHP settings (.htaccess):
    • upload_max_filesize: 350M
    • post_max_size: 350M
    • max_execution_time: 1200 seconds
    • memory_limit: 512M
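
For reference, PHP's `350M` shorthand means 350 × 1024 × 1024 bytes. The following sketch shows an equivalent size check in Python; it assumes the process.php check uses the same byte count, and the names are hypothetical.

```python
MAX_UPLOAD_BYTES = 350 * 1024 * 1024  # PHP "350M" shorthand = 367,001,600 bytes
PROCESSING_TIMEOUT_SECONDS = 300      # 5 minutes, the process.php timeout listed above


def check_upload_size(size_bytes: int) -> None:
    """Raise ValueError if the upload is empty or exceeds the limit."""
    if size_bytes <= 0:
        raise ValueError("empty upload")
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(
            f"file is {size_bytes / 1024 / 1024:.1f} MB; limit is 350 MB"
        )
```

Note that the 1200-second max_execution_time is deliberately longer than the 300-second processing timeout, so the cURL call times out before PHP aborts the script.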

Translation

Translation is powered by DeepL API:

  • Supports 30+ languages
  • Translation happens after transcription
  • Original language is auto-detected by Whisper
  • Both original and translated files are saved with suffixes:
    • filename_original.{ext}
    • filename_translated.{ext}
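
The naming scheme above can be sketched as a small helper. This assumes that `filename` is the uploaded name without its audio extension, which the source does not state explicitly; `output_names` is a hypothetical function.

```python
from pathlib import Path
from typing import Optional, Tuple

FORMATS = {"txt", "vtt", "srt"}


def output_names(upload_name: str, ext: str, translate: bool) -> Tuple[str, Optional[str]]:
    """Return (original_name, translated_name or None) for one upload."""
    if ext not in FORMATS:
        raise ValueError(f"unsupported format: {ext}")
    stem = Path(upload_name).stem  # "meeting.mp3" -> "meeting" (assumed behavior)
    original = f"{stem}_original.{ext}"
    translated = f"{stem}_translated.{ext}" if translate else None
    return original, translated
```

For example, `output_names("meeting.mp3", "srt", True)` yields the pair `meeting_original.srt` and `meeting_translated.srt`.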

File Handling

outputs/ Directory

All transcribed files are saved here. The directory:

  • Created automatically by setup.sh or api.py
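  • Must be writable by the API process (777 works but is world-writable; in production, prefer ownership by the service user, e.g. www-data, with a tighter mode)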
  • Files are named: {original_filename}_original.{ext} and {original_filename}_translated.{ext}
  • Not tracked by git (see .gitignore)

Temporary Files

  • Audio files are saved temporarily during processing
  • Cleaned up automatically after transcription (api.py lines 186-187)

Authentication & Security

Microsoft Azure AD SSO

  • OAuth2 with PKCE: Uses Proof Key for Code Exchange (RFC 7636)
  • No client secret needed: PKCE allows public clients to authenticate securely
  • Code verifier: 64-character random string generated for each auth request
  • Code challenge: SHA256 hash of code_verifier, sent to Azure AD
  • Token exchange: Authorization code + code_verifier exchanged for access token

Session-Based File Access Control

  • Session tracking: Files tracked in $_SESSION['user_files'] array
  • Upload tracking: When user transcribes audio, both original and translated filenames added to their session
  • Download validation: download.php checks if requested file is in user's session before serving
  • Session timeout: Configurable (default: 8 hours) - after timeout, user loses access to their files
  • Trade-off: Files remain in outputs/ directory but become inaccessible after session expires
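
The timeout behavior described above can be sketched as follows. This is an illustration in Python, not the PHP isAuthenticated() code; the `last_activity` key and the sliding (rather than absolute) expiry are assumptions.

```python
import time
from typing import Optional

SESSION_TIMEOUT = 28800  # seconds (8 hours), the documented default


def session_is_valid(session: dict, now: Optional[float] = None) -> bool:
    """True if the user is logged in and was active within SESSION_TIMEOUT."""
    now = time.time() if now is None else now
    if not session.get("authenticated"):
        return False
    last_activity = session.get("last_activity")  # hypothetical key
    if last_activity is None or now - last_activity > SESSION_TIMEOUT:
        session.clear()  # expired: the user_files list is dropped with it
        return False
    session["last_activity"] = now  # sliding expiry (assumption)
    return True
```

Because the file list lives in the session, clearing the session is what makes the files unreachable, even though they stay on disk.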

Session Security

  • httponly: Session cookies not accessible via JavaScript (XSS protection)
  • secure: Session cookies only transmitted over HTTPS (production)
  • samesite: Set to 'Lax' to mitigate CSRF attacks
  • strict_mode: Rejects uninitialized session IDs
  • Session regeneration: Session ID regenerated after login to prevent session fixation
  • CSRF protection: OAuth2 state parameter validates callback authenticity

File Security

  • basename(): Prevents directory traversal attacks in download.php
  • File size validation: 350MB limit enforced in both .htaccess and process.php
  • Ownership logging: Unauthorized download attempts logged with user ID
  • No file type validation: Relies on FFmpeg to handle/reject unsupported formats

Environment Variables

  • .env file: All sensitive credentials stored in .env (not in git)
  • API keys: DeepL and Azure credentials loaded from environment
  • .gitignore: .env explicitly excluded from version control

Production Considerations

  1. HTTPS required: Secure cookies require HTTPS in production
  2. File cleanup: Old files in outputs/ should be cleaned via cron job
  3. Session storage: Consider Redis/Memcached for multi-server deployments
  4. Rate limiting: No rate limiting currently - consider adding for production
  5. Logging: Unauthorized attempts logged - monitor for suspicious activity
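
For item 2, a cleanup job could look like the sketch below, run from cron. The 24-hour retention and the function name are assumptions; any retention at least as long as SESSION_TIMEOUT avoids deleting files a logged-in user can still download.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_old_outputs(directory: str = "outputs", max_age_hours: float = 24,
                       now: Optional[float] = None) -> List[str]:
    """Delete regular files older than max_age_hours; return their names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    removed = []
    for path in Path(directory).iterdir():
        # Only plain files are touched; subdirectories are left alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Returning the removed names makes it easy to log what each run deleted.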

Session-Only File Tracking

How It Works

Files are tracked in the PHP session ($_SESSION['user_files'] array) rather than a database. This approach was chosen for simplicity.

File Lifecycle

  1. User uploads audio → process.php forwards it to the Python API for transcription → adds the returned filenames to $_SESSION['user_files']
  2. User can download files as long as session is active
  3. Session expires or user logs out → files become inaccessible
  4. Files remain in outputs/ directory but cannot be downloaded

Trade-offs

Pros:

  • Simple implementation - no database needed
  • Automatic "expiration" via session timeout
  • Works well for temporary transcription tasks

Cons:

  • Files inaccessible after session expires
  • Can't access files across multiple devices/browsers
  • Orphaned files accumulate in outputs/ directory

Future Upgrades

To implement persistent file ownership:

  1. Add SQLite/MySQL database with users and files tables
  2. Store file ownership in database instead of session
  3. Modify download.php to check database ownership
  4. Consider filename-based ownership (encode user_id in filename)
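
Steps 1 to 3 could start from a schema like the one below. This is a sketch using Python's built-in sqlite3 module; the table layout, column names, and helper functions are all assumptions, and the PHP side would need equivalent queries in download.php.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id    TEXT PRIMARY KEY,   -- e.g. the user_id stored in the session
    name  TEXT,
    email TEXT
);
CREATE TABLE IF NOT EXISTS files (
    filename   TEXT PRIMARY KEY,
    user_id    TEXT NOT NULL REFERENCES users(id),
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_file(conn: sqlite3.Connection, user_id: str, filename: str) -> None:
    """Replace the append to $_SESSION['user_files'] with a database row."""
    conn.execute("INSERT INTO files (filename, user_id) VALUES (?, ?)", (filename, user_id))
    conn.commit()


def user_owns_file(conn: sqlite3.Connection, user_id: str, filename: str) -> bool:
    """Ownership check that download.php would perform instead of the session lookup."""
    row = conn.execute(
        "SELECT 1 FROM files WHERE filename = ? AND user_id = ?", (filename, user_id)
    ).fetchone()
    return row is not None
```

With ownership stored this way, files stay downloadable across sessions and devices, and the `created_at` column gives a cleanup job a reliable age to work from.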

Common Development Tasks

Changing Whisper Model

Edit api.py line 26 and restart the API:

# After editing
./start_api.sh

Adjusting File Size Limits

Edit each of the following:

  1. .htaccess - PHP upload limits
  2. process.php line 12 - PHP validation
  3. If using production Apache: /etc/php/.../php.ini

Testing Authentication Flow

  1. Clear your browser cookies
  2. Visit the application root
  3. Should redirect to login.php
  4. Click "Sign in with Microsoft"
  5. Authenticate with Azure AD
  6. Should redirect back to index.php with user header visible

Testing Transcription

Via Web UI:

  1. Log in via login.php
  2. Upload a test audio file
  3. Check that files appear in test_download.php

Via API directly (bypasses auth):

curl -X POST http://localhost:5010/transcribe \
  -F "audio=@test.mp3" \
  -F "format=txt" \
  -F "translate=0"

Testing File Access Control

  1. Upload a file while logged in
  2. Note the filename from the download link
  3. Log out
  4. Try to access download.php?file=filename directly
  5. Should receive 401 Unauthorized

Adding New Languages

Edit the language selector in index.php (lines 41-73) to add DeepL-supported languages.

Production Deployment

See README.md sections:

  • "Production Deployment (Apache)" for full Apache setup
  • "Setup Python API as Systemd Service" for running API as a service
  • "Monitoring and Maintenance" for logs and cleanup

Key production considerations:

  1. Set up systemd service for Python API (voice2text-api.service)
  2. Configure Apache virtual host
  3. Set proper file permissions (www-data:www-data)
  4. Set up log rotation
  5. Configure cron job to clean old files in outputs/
  6. Keep API keys in environment variables (.env), never in source files

Debugging

API Not Responding

  1. Check if API is running: curl http://localhost:5010/health
  2. Check process: ps aux | grep python
  3. Test Python directly: source venv/bin/activate && python api.py
  4. Visit check_api.php in browser for diagnostic info

Upload Fails

  1. Check outputs/ directory exists and is writable
  2. Verify file size is under 350MB
  3. Check Apache/PHP error logs
  4. Verify FFmpeg is installed: which ffmpeg

Transcription Errors

  1. Check Python API logs (stdout/stderr)
  2. Verify audio file format is supported by FFmpeg
  3. Test with a small sample file first
  4. Check available disk space in /tmp

Code Style Notes

  • Python: Uses Flask conventions, logging via Python logging module
  • PHP: Uses procedural style, cURL for HTTP requests
  • JavaScript: jQuery-based, uses AJAX for async file upload
  • CSS: BEM-like naming, black/gold theme with animations