CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
Voice to Text is a two-tier web application that transcribes audio files using OpenAI Whisper and optionally translates them using DeepL API. The application consists of:
- Python Flask API (backend): Handles transcription and translation
- PHP web interface (frontend): User interface, authentication, and request handling
- Microsoft Azure AD SSO: OAuth2 with PKCE flow for authentication
Development Setup
Initial Setup
# 1. Configure authentication
cp .env.example .env
# Edit .env with your Azure AD credentials
# 2. Install PHP dependencies
composer install
# 3. Install Python dependencies and create virtual environment
./setup.sh
# 4. Start the Python API server
./start_api.sh
# Or manually:
source venv/bin/activate
python api.py
The API runs on http://localhost:5010 by default. The PHP frontend should be served via MAMP, Apache, or any PHP-enabled web server with HTTPS enabled for production.
Testing the Application
# Check if Python API is running
curl http://localhost:5010/health
# Or use the PHP diagnostic page
# Visit: check_api.php in browser
# Test downloads
# Visit: test_download.php in browser
Python Version Compatibility
- Recommended: Python 3.10 or 3.11
- Supported: Python 3.8+
- Warning: Python 3.12+ may have compatibility issues with some dependencies
Architecture
Three-Layer Design
The application separates concerns into three layers:
- Authentication Layer: Microsoft Azure AD SSO with OAuth2 PKCE flow
- Python API (api.py): Computation-heavy tasks (Whisper transcription, DeepL translation)
- PHP Frontend: User interface, session management, file handling, and proxying requests to Python API
Authentication Flow
User Browser
↓
login.php (landing page)
↓ (clicks "Sign in with Microsoft")
auth.php
↓ (generates PKCE code_verifier & code_challenge)
Azure AD OAuth2 Authorization Endpoint
↓ (user authenticates)
auth.php (callback)
↓ (exchanges code + code_verifier for token)
Microsoft Graph API (/me)
↓ (retrieves user info)
Session initialized:
- $_SESSION['authenticated'] = true
- $_SESSION['user_id'], ['user_name'], ['user_email']
- $_SESSION['user_files'] = []
↓
index.php (main app)
Request Flow (After Authentication)
User Browser (index.php)
↓ (jQuery AJAX + FormData)
process.php
↓ (auth check via isAuthenticated())
↓ (cURL to Python API)
api.py (Flask)
↓ (Whisper transcription)
↓ (Optional: DeepL translation)
outputs/ directory
↓ (files tracked in $_SESSION['user_files'])
download.php
↓ (auth + ownership check)
User Browser (download)
Key Components
Authentication & Configuration Files:
auth_config.php (Authentication & Environment Configuration):
- Loads environment variables from .env using vlucas/phpdotenv
- Defines Azure AD configuration constants (CLIENT_ID, AUTHORITY, REDIRECT_URI)
- Configures secure session settings (httponly, secure, samesite)
- Provides helper functions:
  - isAuthenticated(): Check if user is logged in and session is valid
  - requireAuth(): Redirect to login.php if not authenticated
  - getCurrentUser(): Get current user info from session
login.php (Landing Page):
- First page users see when not authenticated
- Displays "Sign in with Microsoft" button with Microsoft logo
- Matches black/gold theme of main application
- Redirects to index.php if already authenticated
auth.php (OAuth2 PKCE Handler):
- Implements OAuth2 Authorization Code flow with PKCE
- Step 1: Generates code_verifier (64-char random string) and code_challenge (base64url-encoded SHA-256 hash of the verifier)
- Step 2: Redirects to Azure AD with PKCE parameters
- Step 3: Handles callback, verifies state (CSRF protection)
- Step 4: Exchanges authorization code + code_verifier for access token
- Step 5: Calls Microsoft Graph API to get user info
- Step 6: Initializes session with user data and empty file list
- Step 7: Redirects to index.php
logout.php (Session Destruction):
- Clears all session variables
- Destroys session cookie
- Destroys session
- Redirects to login.php
config.php (Configuration Loader):
- Requires auth_config.php
- Starts session if not already started
- All configuration now loaded from .env via auth_config.php
API & Core Files:
api.py (Flask REST API - Port 5010):
- /health: Health check endpoint
- /transcribe: Main endpoint - accepts audio file, format (txt/vtt/srt), translation settings
- /download/<filename>: Serves transcribed files
- Whisper model loaded once at startup and kept in memory
- DeepL translator initialized at startup
- Generates both original and translated files when translation is enabled
process.php (PHP request handler):
- Auth check: Calls isAuthenticated() - returns error if not authenticated
- Receives multipart/form-data from frontend
- Validates file size (350MB limit)
- Forwards to Python API via cURL
- File tracking: Adds original and translated filenames to $_SESSION['user_files']
- Returns formatted HTML for display (truncated at 10,000 chars for preview)
- Provides download links for full files
index.php (Main UI):
- Auth required: Calls requireAuth() at top - redirects to login.php if not authenticated
- Displays user header with name, email, and logout button
- jQuery-based AJAX file upload
- Format selector (txt/vtt/srt)
- Translation toggle with language selector (30+ languages)
- Real-time progress bar during processing
- In-page preview of transcriptions
- Download buttons for original and translated files
download.php (File server):
- Auth required: Calls isAuthenticated() - returns 401 if not authenticated
- Ownership check: Verifies requested file is in $_SESSION['user_files']
- Returns 403 Forbidden if user doesn't own the file
- Logs unauthorized download attempts
- Serves files from outputs/ directory
- Security: Uses basename() to prevent directory traversal
- Sets proper Content-Type headers based on file extension
.env (Environment Variables):
- AZURE_CLIENT_ID: Azure AD application client ID
- AZURE_AUTHORITY: Azure AD authority URL with tenant ID
- AZURE_REDIRECT_URI: OAuth2 redirect URI (must match Azure AD config)
- DEEPL_API_KEY: DeepL API key for translation
- PYTHON_API_URL: Python Flask API endpoint (default: http://localhost:5010)
- SESSION_TIMEOUT: Session timeout in seconds (default: 28800 = 8 hours)
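For orientation, here is a minimal Python sketch of reading some of these variables with their documented defaults. The `AppConfig` dataclass and `load_config` helper are hypothetical names used only for illustration. In the application itself the loading is done in PHP by auth_config.php via vlucas/phpdotenv, and api.py may read its settings differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical container for a few of the settings documented above."""
    python_api_url: str
    session_timeout: int  # seconds
    deepl_api_key: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> AppConfig:
    """Read settings from an environment mapping, applying the documented defaults."""
    return AppConfig(
        python_api_url=env.get("PYTHON_API_URL", "http://localhost:5010"),
        session_timeout=int(env.get("SESSION_TIMEOUT", "28800")),  # 8 hours
        deepl_api_key=env.get("DEEPL_API_KEY"),  # None if the variable is unset
    )
```

Passing the mapping as a parameter keeps the helper easy to exercise without touching the real environment.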
Output Formats
Text (.txt)
Plain text transcription - full text of audio
VTT (.vtt)
WebVTT subtitle format with timestamps:
WEBVTT
00:00:00.000 --> 00:00:05.123
First segment text
00:00:05.123 --> 00:00:10.456
Second segment text
SRT (.srt)
SubRip subtitle format with timestamps:
1
00:00:00,000 --> 00:00:05,123
First segment text
2
00:00:05,123 --> 00:00:10,456
Second segment text
Key Difference: VTT uses period (.) for milliseconds, SRT uses comma (,)
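The two formats can be produced from the same segment list. The following is a minimal sketch, not the code in api.py: `format_timestamp`, `to_vtt`, and `to_srt` are hypothetical helpers, and the segment dictionaries with `start`, `end`, and `text` keys are assumed to mirror the segments Whisper returns.

```python
from typing import Dict, List


def format_timestamp(seconds: float, separator: str) -> str:
    """Format seconds as HH:MM:SS<separator>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def to_vtt(segments: List[Dict]) -> str:
    """WebVTT: 'WEBVTT' header, period before milliseconds, no cue numbers."""
    cues = []
    for seg in segments:
        start = format_timestamp(seg["start"], ".")
        end = format_timestamp(seg["end"], ".")
        cues.append(f"{start} --> {end}\n{seg['text'].strip()}")
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


def to_srt(segments: List[Dict]) -> str:
    """SubRip: numbered cues, comma before milliseconds."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"], ",")
        end = format_timestamp(seg["end"], ",")
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(cues) + "\n"
```

Both writers separate cues with a blank line, and only the separator passed to `format_timestamp` and the cue numbering differ.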
Whisper Models
Available models (edit api.py line 26 to change):
- tiny: Fastest, least accurate
- base: Default - good balance
- small: Better accuracy, slower
- medium: High accuracy, much slower
- large: Best accuracy, very slow
Changing the model:
# In api.py line 26:
model = whisper.load_model("small") # Change from "base" to desired model
File Size and Timeout Limits
- Maximum file size: 350MB (configured in .htaccess and process.php)
- Processing timeout: 5 minutes (300 seconds in process.php)
- PHP settings (.htaccess):
- upload_max_filesize: 350M
- post_max_size: 350M
- max_execution_time: 1200 seconds
- memory_limit: 512M
Translation
Translation is powered by DeepL API:
- Supports 30+ languages
- Translation happens after transcription
- Original language is auto-detected by Whisper
- Both original and translated files are saved with suffixes:
  - filename_original.{ext}
  - filename_translated.{ext}
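As a rough illustration of the suffix scheme, the hypothetical helper below derives both output names from an uploaded filename. It is a sketch under the assumption that the upload's extension is replaced by the chosen output format; api.py may build the names differently.

```python
from pathlib import Path
from typing import Tuple


def output_names(uploaded_name: str, fmt: str) -> Tuple[str, str]:
    """Return (original, translated) output filenames for an upload and output format."""
    stem = Path(uploaded_name).stem  # "interview.mp3" -> "interview"
    return f"{stem}_original.{fmt}", f"{stem}_translated.{fmt}"
```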
File Handling
outputs/ Directory
All transcribed files are saved here. The directory:
- Created automatically by setup.sh or api.py
- Should have write permissions (777 in production)
- Files are named {original_filename}_original.{ext} and {original_filename}_translated.{ext}
- Not tracked by git (see .gitignore)
Temporary Files
- Audio files are saved temporarily during processing
- Cleaned up automatically after transcription (api.py lines 186-187)
Authentication & Security
Microsoft Azure AD SSO
- OAuth2 with PKCE: Uses Proof Key for Code Exchange (RFC 7636)
- No client secret needed: PKCE allows public clients to authenticate securely
- Code verifier: 64-character random string generated for each auth request
- Code challenge: Base64url-encoded SHA-256 hash of code_verifier (the S256 method), sent to Azure AD
- Token exchange: Authorization code + code_verifier exchanged for access token
Session-Based File Access Control
- Session tracking: Files tracked in $_SESSION['user_files'] array
- Upload tracking: When user transcribes audio, both original and translated filenames added to their session
- Download validation: download.php checks if requested file is in user's session before serving
- Session timeout: Configurable (default: 8 hours) - after timeout, user loses access to their files
- Trade-off: Files remain in outputs/ directory but become inaccessible after session expires
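The download decision described above can be modelled as a small pure function. This Python sketch is illustrative only: download.php is the real implementation, `check_download` is a hypothetical name, and the 404 branch for a missing file is an assumption not stated in this document.

```python
import os
from typing import List, Optional, Tuple


def check_download(
    authenticated: bool,
    user_files: List[str],
    requested: str,
    outputs_dir: str = "outputs",
) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, path to serve or None) for a download request."""
    if not authenticated:
        return 401, None
    filename = os.path.basename(requested)  # drop any directory components
    if filename not in user_files:
        return 403, None  # file is not tracked in this user's session
    path = os.path.join(outputs_dir, filename)
    if not os.path.isfile(path):
        return 404, None  # assumption: a missing file yields 404
    return 200, path
```

Because the ownership list lives in the session, an expired session fails the first check and the file is never looked up.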
Session Security
- httponly: Session cookies not accessible via JavaScript (XSS protection)
- secure: Session cookies only transmitted over HTTPS (production)
- samesite: Set to 'Lax' to mitigate CSRF attacks
- strict_mode: Rejects uninitialized session IDs
- Session regeneration: Session ID regenerated after login to prevent session fixation
- CSRF protection: OAuth2 state parameter validates callback authenticity
File Security
- basename(): Prevents directory traversal attacks in download.php
- File size validation: 350MB limit enforced in both .htaccess and process.php
- Ownership logging: Unauthorized download attempts logged with user ID
- No file type validation: Relies on FFmpeg to handle/reject unsupported formats
Environment Variables
- .env file: All sensitive credentials stored in .env (not in git)
- API keys: DeepL and Azure credentials loaded from environment
- .gitignore: .env explicitly excluded from version control
Production Considerations
- HTTPS required: Secure cookies require HTTPS in production
- File cleanup: Old files in outputs/ should be cleaned via cron job
- Session storage: Consider Redis/Memcached for multi-server deployments
- Rate limiting: No rate limiting currently - consider adding for production
- Logging: Unauthorized attempts logged - monitor for suspicious activity
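One way to approach the file cleanup item is a small script run from cron. The sketch below is a suggestion rather than part of the repository: `remove_old_files` is a hypothetical helper, and the retention period is left to the caller.

```python
import os
import time
from typing import List, Optional


def remove_old_files(
    directory: str, max_age_seconds: float, now: Optional[float] = None
) -> List[str]:
    """Delete regular files in `directory` older than `max_age_seconds`.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot the listing before deleting anything
    removed = []
    for entry in entries:
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Choosing a `max_age_seconds` larger than SESSION_TIMEOUT avoids deleting files that a live session can still download.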
Session-Only File Tracking
How It Works
Files are tracked in the PHP session ($_SESSION['user_files'] array) rather than a database. This approach was chosen for simplicity.
File Lifecycle
- User uploads audio → process.php transcribes → adds filenames to $_SESSION['user_files']
- User can download files as long as session is active
- Session expires or user logs out → files become inaccessible
- Files remain in outputs/ directory but cannot be downloaded
Trade-offs
Pros:
- Simple implementation - no database needed
- Automatic "expiration" via session timeout
- Works well for temporary transcription tasks
Cons:
- Files inaccessible after session expires
- Can't access files across multiple devices/browsers
- Orphaned files accumulate in outputs/ directory
Future Upgrades
To implement persistent file ownership:
- Add SQLite/MySQL database with users and files tables
- Store file ownership in database instead of session
- Modify download.php to check database ownership
- Consider filename-based ownership (encode user_id in filename)
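A database-backed ownership check could look roughly like the following. This is a sketch of the idea using Python's sqlite3 module. The schema, column names, and helper functions are all hypothetical, and a PHP implementation in download.php would use PDO or similar instead.

```python
import sqlite3

# Hypothetical schema: one row per user, one row per generated output file.
SCHEMA = """
CREATE TABLE IF NOT EXISTS users (
    id    TEXT PRIMARY KEY,
    name  TEXT,
    email TEXT
);
CREATE TABLE IF NOT EXISTS files (
    filename   TEXT PRIMARY KEY,
    user_id    TEXT NOT NULL REFERENCES users(id),
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
"""


def init_db(conn: sqlite3.Connection) -> None:
    """Create the tables if they do not exist yet."""
    conn.executescript(SCHEMA)


def record_file(conn: sqlite3.Connection, user_id: str, filename: str) -> None:
    """Register a generated file as owned by `user_id`."""
    conn.execute("INSERT OR IGNORE INTO users (id) VALUES (?)", (user_id,))
    conn.execute(
        "INSERT INTO files (filename, user_id) VALUES (?, ?)", (filename, user_id)
    )
    conn.commit()


def user_owns_file(conn: sqlite3.Connection, user_id: str, filename: str) -> bool:
    """Return True if `filename` is recorded as belonging to `user_id`."""
    row = conn.execute(
        "SELECT 1 FROM files WHERE filename = ? AND user_id = ?", (filename, user_id)
    ).fetchone()
    return row is not None
```

With ownership stored this way, files stay downloadable across sessions and devices, at the cost of needing an explicit retention policy.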
Common Development Tasks
Changing Whisper Model
Edit api.py line 26 and restart the API:
# After editing
./start_api.sh
Adjusting File Size Limits
Edit both:
- .htaccess - PHP upload limits
- process.php line 12 - PHP validation
- If using production Apache: /etc/php/.../php.ini
Testing Authentication Flow
- Clear your browser cookies
- Visit the application root
- Should redirect to login.php
- Click "Sign in with Microsoft"
- Authenticate with Azure AD
- Should redirect back to index.php with user header visible
Testing Transcription
Via Web UI:
- Log in via login.php
- Upload a test audio file
- Check that files appear in test_download.php
Via API directly (bypasses auth):
curl -X POST http://localhost:5010/transcribe \
-F "audio=@test.mp3" \
-F "format=txt" \
-F "translate=0"
Testing File Access Control
- Upload a file while logged in
- Note the filename from the download link
- Log out
- Try to access download.php?file=filename directly
- Should receive 401 Unauthorized
Adding New Languages
Edit the language selector in index.php (lines 41-73) to add DeepL-supported languages.
Production Deployment
See README.md sections:
- "Production Deployment (Apache)" for full Apache setup
- "Setup Python API as Systemd Service" for running API as a service
- "Monitoring and Maintenance" for logs and cleanup
Key production considerations:
- Set up systemd service for Python API (voice2text-api.service)
- Configure Apache virtual host
- Set proper file permissions (www-data:www-data)
- Set up log rotation
- Configure cron job to clean old files in outputs/
- Confirm API keys are loaded from environment variables (.env) and not hard-coded
Debugging
API Not Responding
- Check if API is running: curl http://localhost:5010/health
- Check process: ps aux | grep python
- Test Python directly: source venv/bin/activate && python api.py
- Visit check_api.php in browser for diagnostic info
Upload Fails
- Check outputs/ directory exists and is writable
- Verify file size is under 350MB
- Check Apache/PHP error logs
- Verify FFmpeg is installed: which ffmpeg
Transcription Errors
- Check Python API logs (stdout/stderr)
- Verify audio file format is supported by FFmpeg
- Test with a small sample file first
- Check available disk space in /tmp
Code Style Notes
- Python: Uses Flask conventions, logging via Python logging module
- PHP: Uses procedural style, cURL for HTTP requests
- JavaScript: jQuery-based, uses AJAX for async file upload
- CSS: BEM-like naming, black/gold theme with animations