Video Accessibility Processing Platform - Software Specification

1. Executive Summary

The Video Accessibility Processing Platform is a web application that automatically generates closed captions and audio descriptions for video content using artificial intelligence. It covers the full workflow: video upload, AI processing, human quality control, multi-language translation, and final content delivery.

Core Capabilities:

  • Automated generation of closed captions and audio descriptions using Google Gemini 2.5 Pro
  • Multi-language translation and transcreation services
  • Professional quality control workflow for reviewers
  • Text-to-speech generation for audio descriptions
  • Role-based access control for clients, reviewers, and administrators
  • Real-time job status updates via WebSocket connections
  • Secure file storage and signed URL download system

Target Users:

  • Clients: Organizations needing video accessibility services
  • Reviewers: Professional accessibility specialists who review and approve content
  • Administrators: System administrators managing users and system operations

2. System Architecture

2.1 Technology Stack

Frontend:

  • React 18 with TypeScript
  • Vite for build tooling
  • TanStack Query (React Query) for server-state management and data caching
  • React Router for navigation
  • Tailwind CSS for styling

Backend:

  • FastAPI (Python 3.11+) for REST API
  • Celery with Redis for background task processing
  • MongoDB Atlas for data storage
  • JWT authentication with HttpOnly refresh cookies

External Services:

  • Google Cloud Storage for file storage
  • Google Gemini 2.5 Pro for AI processing
  • Google Cloud Translate for language translation
  • ElevenLabs for text-to-speech synthesis

Infrastructure:

  • Docker containerization
  • Redis for caching and task queues
  • WebSocket support for real-time updates

2.2 System Components

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   React SPA     │    │   FastAPI       │    │   Celery        │
│   Frontend      │◄──►│   Backend       │◄──►│   Workers       │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                       │
                                ▼                       ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │   MongoDB       │    │   Redis         │
                       │   Database      │    │   Queue/Cache   │
                       └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │ Google Cloud    │
                       │ Storage         │
                       └─────────────────┘

3. User Roles and Access Control

3.1 Role Definitions

Client Role:

  • Upload videos and create processing jobs
  • View own job status and progress
  • Download completed accessibility assets
  • Limited to own content only

Reviewer Role:

  • Access quality control dashboard
  • Review AI-generated content for accuracy
  • Edit VTT files (captions and audio descriptions)
  • Approve or reject English content
  • Perform final review of completed jobs
  • Access to all jobs in the system

Admin Role:

  • Full system access including all reviewer capabilities
  • User management (create, edit, deactivate users)
  • System monitoring and health checks
  • Bulk operations and maintenance tasks
  • Access to audit logs and system statistics
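
The role rules above can be sketched as two small permission checks. This is a minimal illustration, not the platform's implementation: the `User` dataclass and the function names are hypothetical, and denying access to deactivated accounts is an assumption based on the `is_active` field in the user data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: str
    role: str  # "client" | "reviewer" | "admin"
    is_active: bool = True


def can_view_job(user: User, job_client_id: str) -> bool:
    """Clients see only their own jobs; reviewers and admins see all jobs."""
    if not user.is_active:  # assumption: deactivated accounts lose all access
        return False
    if user.role in ("reviewer", "admin"):
        return True
    return user.role == "client" and user.id == job_client_id


def can_manage_users(user: User) -> bool:
    """User management is reserved for administrators."""
    return user.is_active and user.role == "admin"
```

Keeping these checks in plain functions makes them easy to reuse from both REST handlers and WebSocket subscriptions.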

3.2 Authentication System

JWT Token Management:

  • Access tokens stored in memory (15-minute expiry)
  • Refresh tokens stored in HttpOnly cookies (7-day expiry)
  • Automatic token refresh for active sessions
  • Secure logout with cookie clearing
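
As an illustration of the refresh cookie described above, the following sketch builds the `Set-Cookie` header value with Python's standard `http.cookies` module. The cookie name `refresh_token`, the `SameSite=Strict` policy, and the `/api/v1/auth` path scope are assumptions made for the example; only the HttpOnly flag and the 7-day lifetime come from this specification.

```python
from http.cookies import SimpleCookie

REFRESH_TOKEN_MAX_AGE_S = 7 * 24 * 60 * 60  # 7-day expiry from the spec
COOKIE_NAME = "refresh_token"               # assumed cookie name
COOKIE_PATH = "/api/v1/auth"                # assumed scope: auth endpoints only


def refresh_cookie_header(token: str) -> str:
    """Build the Set-Cookie header value that stores a refresh token."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = token
    morsel = cookie[COOKIE_NAME]
    morsel["httponly"] = True       # not readable from JavaScript
    morsel["secure"] = True         # sent over HTTPS only
    morsel["samesite"] = "Strict"   # assumed policy
    morsel["path"] = COOKIE_PATH
    morsel["max-age"] = REFRESH_TOKEN_MAX_AGE_S
    return morsel.OutputString()


def clear_refresh_cookie_header() -> str:
    """Build the Set-Cookie header value that clears the cookie on logout."""
    cookie = SimpleCookie()
    cookie[COOKIE_NAME] = ""
    morsel = cookie[COOKIE_NAME]
    morsel["httponly"] = True
    morsel["path"] = COOKIE_PATH
    morsel["max-age"] = 0           # expire immediately
    return morsel.OutputString()
```

The clearing cookie has to use the same name and path as the original, otherwise the browser treats it as a different cookie.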

Security Features:

  • Password hashing using bcrypt
  • CORS protection with configurable origins
  • Rate limiting on authentication endpoints
  • Session security through refresh token rotation

4. Job Processing Workflow

4.1 Job Status State Machine

Each job moves through the following state machine as it is processed:

created → ingesting → ai_processing → pending_qc → approved_english → translating → tts_generating → pending_final_review → completed
                                         ↓
                                      rejected → (manual intervention required)
                                         ↓
                                   qc_feedback → (back to pending_qc after fixes)

Status Definitions:

  • created: Job record created, video uploaded to storage
  • ingesting: Video being processed for metadata extraction
  • ai_processing: AI analyzing video content and generating captions/audio descriptions
  • pending_qc: Awaiting human quality control review
  • approved_english: English content approved, ready for translation
  • rejected: Content rejected, requires client revision
  • qc_feedback: Reviewer provided feedback, awaiting fixes
  • translating: Processing multi-language translations
  • tts_generating: Generating audio files from text descriptions
  • pending_final_review: All content ready, awaiting final approval
  • completed: Job finished, all assets available for download
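
The state machine above can be expressed as an enum plus a transition table. This is a sketch of one reading of the diagram: it assumes a reviewer can move a job from pending_qc to approved_english, rejected, or qc_feedback, that qc_feedback returns to pending_qc, and that rejected has no automatic successor. The function names are hypothetical, and the target state of the reject_final action is not specified here, so it is left out.

```python
from enum import Enum


class JobStatus(str, Enum):
    CREATED = "created"
    INGESTING = "ingesting"
    AI_PROCESSING = "ai_processing"
    PENDING_QC = "pending_qc"
    APPROVED_ENGLISH = "approved_english"
    REJECTED = "rejected"
    QC_FEEDBACK = "qc_feedback"
    TRANSLATING = "translating"
    TTS_GENERATING = "tts_generating"
    PENDING_FINAL_REVIEW = "pending_final_review"
    COMPLETED = "completed"


S = JobStatus
# Allowed transitions as read from the diagram (assumed interpretation).
ALLOWED_TRANSITIONS: dict[JobStatus, frozenset[JobStatus]] = {
    S.CREATED: frozenset({S.INGESTING}),
    S.INGESTING: frozenset({S.AI_PROCESSING}),
    S.AI_PROCESSING: frozenset({S.PENDING_QC}),
    S.PENDING_QC: frozenset({S.APPROVED_ENGLISH, S.REJECTED, S.QC_FEEDBACK}),
    S.QC_FEEDBACK: frozenset({S.PENDING_QC}),
    S.REJECTED: frozenset(),  # manual intervention required
    S.APPROVED_ENGLISH: frozenset({S.TRANSLATING}),
    S.TRANSLATING: frozenset({S.TTS_GENERATING}),
    S.TTS_GENERATING: frozenset({S.PENDING_FINAL_REVIEW}),
    S.PENDING_FINAL_REVIEW: frozenset({S.COMPLETED}),
    S.COMPLETED: frozenset(),
}


def can_transition(current: JobStatus, target: JobStatus) -> bool:
    """Return True if the diagram allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, frozenset())


def transition(current: JobStatus, target: JobStatus) -> JobStatus:
    """Return the new status, or raise if the move is not allowed."""
    if not can_transition(current, target):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Centralizing the table means API actions and background tasks can share one guard instead of each checking status strings on their own.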

4.2 Processing Pipeline

Phase 1: Upload and Ingestion

  1. Client uploads MP4 video file through web interface
  2. File stored in Google Cloud Storage with unique job ID path
  3. Job record created in MongoDB with metadata
  4. Background Celery task queued for processing

Phase 2: AI Content Generation

  1. Video file sent to Google Gemini 2.5 Pro API
  2. AI generates:
    • Plain text transcript
    • Closed captions in WebVTT format
    • Audio description script in WebVTT format
    • Confidence score for generated content
  3. Generated content stored in GCS and linked to job
  4. Job status updated to pending_qc

Phase 3: Quality Control Review

  1. Reviewer accesses job through QC dashboard
  2. Side-by-side video player with generated captions/audio descriptions
  3. Inline VTT editor for making corrections
  4. Timing adjustment tools for synchronization
  5. Approve or reject with reviewer notes
  6. If approved, job moves to translation phase

Phase 4: Translation and Localization

  1. Automatic translation of approved English content
  2. Support for standard translation and cultural transcreation
  3. Available target languages: Spanish, French, German (expandable)
  4. Translated VTT files stored per language

Phase 5: Audio Generation

  1. Text-to-speech synthesis using ElevenLabs API
  2. MP3 files generated for each audio description track
  3. Language-specific voice selection
  4. Audio files stored alongside VTT content

Phase 6: Final Review and Delivery

  1. Final review by authorized reviewer
  2. Asset validation to ensure all requested outputs present
  3. Client notification of job completion
  4. Signed URL generation for secure downloads
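
The asset validation step in Phase 6 can be sketched as a comparison between the requested outputs and the stored output locations. The dictionary keys follow the Job data structure in section 6.1; the function name is hypothetical, and the sketch assumes English ("en") is always produced in addition to the requested target languages.

```python
# Maps each request flag to the output field that must be populated.
FLAG_TO_FIELD = {
    "captions_vtt": "captions_vtt_gcs",
    "audio_description_vtt": "ad_vtt_gcs",
    "audio_description_mp3": "ad_mp3_gcs",
}


def missing_assets(requested: dict, outputs: dict) -> list[str]:
    """Return "language/field" entries that were requested but not produced."""
    # Assumption: English is always generated; dict.fromkeys removes duplicates
    # while keeping order.
    languages = list(dict.fromkeys(["en", *requested.get("languages", [])]))
    missing = []
    for language in languages:
        produced = outputs.get(language, {})
        for flag, field in FLAG_TO_FIELD.items():
            if requested.get(flag) and not produced.get(field):
                missing.append(f"{language}/{field}")
    return missing
```

A job would be allowed to move to completed only when this list is empty.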

5. User Interface and Experience

5.1 Client Workflow

Dashboard:

  • Overview of all jobs with status indicators
  • Quick actions for creating new jobs
  • Real-time status updates via WebSocket
  • Notification system for job completion

Job Creation Process:

  1. Video Upload: Drag-and-drop interface with progress tracking
  2. Job Configuration:
    • Descriptive title
    • Source language selection
    • Output format selection (captions VTT, audio description VTT, audio MP3)
    • Target languages for translation
  3. Processing Initiation: Automatic background processing begins
  4. Confirmation: Success page with job tracking link

Job Monitoring:

  • Detailed status view with progress indicators
  • Processing history timeline
  • Real-time updates without page refresh
  • Error notifications with context

Content Download:

  • Secure download links for completed assets
  • Organized by language (en/, es/, fr/, de/)
  • File format options (VTT, MP3)
  • Source video access

5.2 Reviewer Workflow

Quality Control Dashboard:

  • Queue view of jobs pending review
  • Priority sorting by creation date
  • Job metadata preview
  • Quick status filtering

Review Interface:

  • Video Player: HTML5 player with custom controls
  • VTT Editor: Syntax-highlighted editor with validation
  • Side-by-Side View: Simultaneous video and text editing
  • Timing Tools: Bulk timing adjustment with offset controls
  • Review Controls: Approve/reject with mandatory notes
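
The bulk timing adjustment can be illustrated with a function that shifts every cue timestamp in a WebVTT document by a fixed offset. This is a sketch using only the standard library; the function name is hypothetical, and clamping negative results to zero is an assumption about the desired behavior.

```python
import re

# Matches WebVTT timestamps in either hh:mm:ss.ttt or mm:ss.ttt form.
_TIMESTAMP = re.compile(r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})")


def _to_ms(match: re.Match) -> int:
    hours, minutes, seconds, millis = match.groups()
    return ((int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def _format_ms(total_ms: int) -> str:
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def shift_vtt(text: str, offset_ms: int) -> str:
    """Shift all cue timings by offset_ms, clamping at 00:00:00.000."""
    shifted = []
    for line in text.splitlines():
        if "-->" in line:  # only cue timing lines carry timestamps to move
            line = _TIMESTAMP.sub(
                lambda m: _format_ms(max(0, _to_ms(m) + offset_ms)), line
            )
        shifted.append(line)
    return "\n".join(shifted)
```

Cue text and cue settings on the timing line are left untouched, so the edit can be applied repeatedly without altering content.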

Advanced Features:

  • Keyboard shortcuts for efficient workflow (A=Approve, R=Reject, S=Save)
  • View mode switching (side-by-side, video-only, editor-only)
  • Real-time VTT validation and error highlighting
  • Unsaved changes warnings

Final Review Process:

  • Asset validation before completion
  • Final quality checks
  • Client notification triggering
  • Completion workflow

5.3 Administrator Interface

User Management:

  • Create users with role assignment
  • Password reset functionality
  • User activation/deactivation
  • Role-based permission enforcement

System Monitoring:

  • Health check dashboard with component status
  • Job processing statistics and metrics
  • Queue monitoring for background tasks
  • Performance analytics

Audit and Security:

  • Comprehensive audit logging
  • Security event monitoring
  • User activity tracking
  • System maintenance tools

6. Data Models and Storage

6.1 Job Data Structure

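type JobStatus =
  | "created"
  | "ingesting"
  | "ai_processing"
  | "pending_qc"
  | "approved_english"
  | "rejected"
  | "qc_feedback"
  | "translating"
  | "tts_generating"
  | "pending_final_review"
  | "completed";

interface Job {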
  id: string;                    // Unique job identifier
  client_id: string;            // Owner client ID
  title: string;                // Human-readable job name
  status: JobStatus;            // Current processing status
  
  source: {
    filename: string;           // Storage path
    original_filename: string;  // User's original filename
    gcs_uri: string;           // Google Cloud Storage URI
    duration_s: number;        // Video duration in seconds
    language: string;          // Source language code
  };
  
  requested_outputs: {
    captions_vtt: boolean;          // Closed captions requested
    audio_description_vtt: boolean; // Audio description script requested
    audio_description_mp3: boolean; // Audio voiceover requested
    languages: string[];            // Target languages
    transcreation: string[];        // Languages requiring cultural adaptation
  };
  
  outputs: {
    [language: string]: {
      captions_vtt_gcs?: string;      // VTT file location
      ad_vtt_gcs?: string;           // Audio description VTT location
      ad_mp3_gcs?: string;           // Audio MP3 file location
      origin: "translate" | "transcreate"; // Processing method
      qa_notes?: string;             // Quality assurance notes
    };
  };
  
  ai: {
    ingestion_json: object;     // Full AI response data
    confidence: number;         // AI confidence score (0-1)
  };
  
  review: {
    notes: string;              // Current reviewer notes
    reviewer_id?: string;       // Last reviewer ID
    history: ReviewHistoryItem[]; // Complete review history
  };
  
  created_at: Date;
  updated_at: Date;
  error?: ErrorInfo;            // Processing error details
}

6.2 User Data Structure

interface User {
  id: string;
  email: string;               // Unique login identifier
  hashed_password: string;     // Bcrypt hashed password
  full_name: string;           // Display name
  role: "client" | "reviewer" | "admin";
  is_active: boolean;          // Account status
  created_at: Date;
  updated_at: Date;
}

6.3 File Storage Organization

Google Cloud Storage Bucket Structure:

gs://accessible-video/
├── {jobId}/
│   ├── source.mp4                    # Original video
│   ├── en/
│   │   ├── captions.vtt             # English captions
│   │   ├── ad.vtt                   # English audio description
│   │   └── ad.mp3                   # English audio file
│   ├── es/
│   │   ├── captions.vtt             # Spanish captions
│   │   ├── ad.vtt                   # Spanish audio description
│   │   └── ad.mp3                   # Spanish audio file
│   └── [other languages]/
└── health_check_dummy               # System health verification
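
The bucket layout above can be captured in a pair of path helpers. This is a sketch: the helper names and the asset keys are hypothetical, and only the bucket name and file names come from the structure shown.

```python
BUCKET = "accessible-video"

# Asset key -> file name inside a language folder (keys are illustrative).
ASSET_FILENAMES = {
    "captions_vtt": "captions.vtt",
    "ad_vtt": "ad.vtt",
    "ad_mp3": "ad.mp3",
}


def _segment(value: str, label: str) -> str:
    """Reject values that would escape the job's folder."""
    if not value or "/" in value or value in (".", ".."):
        raise ValueError(f"invalid {label}: {value!r}")
    return value


def source_uri(job_id: str) -> str:
    """URI of the original uploaded video."""
    return f"gs://{BUCKET}/{_segment(job_id, 'job_id')}/source.mp4"


def asset_uri(job_id: str, language: str, asset: str) -> str:
    """URI of a generated asset for one language."""
    if asset not in ASSET_FILENAMES:
        raise ValueError(f"unknown asset: {asset!r}")
    job = _segment(job_id, "job_id")
    lang = _segment(language, "language")
    return f"gs://{BUCKET}/{job}/{lang}/{ASSET_FILENAMES[asset]}"
```

For example, `asset_uri("job123", "es", "ad_mp3")` yields `gs://accessible-video/job123/es/ad.mp3`.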

Security Features:

  • Signed URLs with 24-hour expiration
  • Role-based access control
  • Automatic cleanup on job deletion
  • Secure upload with content-type validation

7. API Design

7.1 Authentication Endpoints

POST /api/v1/auth/login
POST /api/v1/auth/refresh
POST /api/v1/auth/logout

7.2 Job Management Endpoints

POST   /api/v1/jobs              # Create new job
GET    /api/v1/jobs              # List jobs (filtered by role)
GET    /api/v1/jobs/{id}         # Get job details
DELETE /api/v1/jobs/{id}         # Delete job
DELETE /api/v1/jobs/bulk         # Bulk delete (admin only)

# Job Actions
POST   /api/v1/jobs/{id}/actions/approve_english
POST   /api/v1/jobs/{id}/actions/reject
POST   /api/v1/jobs/{id}/actions/complete
POST   /api/v1/jobs/{id}/actions/reject_final

# Content Management
GET    /api/v1/jobs/{id}/vtt                    # Get VTT content
PATCH  /api/v1/jobs/{id}/vtt                    # Update VTT content
POST   /api/v1/jobs/{id}/vtt/adjust-timing      # Adjust timing
GET    /api/v1/jobs/{id}/downloads              # Get download URLs
GET    /api/v1/jobs/{id}/validate               # Validate assets

7.3 Administrative Endpoints

# User Management
GET    /api/v1/admin/users
POST   /api/v1/admin/users
GET    /api/v1/admin/users/{id}
PATCH  /api/v1/admin/users/{id}
DELETE /api/v1/admin/users/{id}

# System Monitoring
GET    /api/v1/admin/stats
GET    /api/v1/admin/health/detailed
GET    /api/v1/admin/jobs/stats
GET    /api/v1/admin/audit-logs

7.4 File Management

GET    /api/v1/files/signed-url/{path}    # Generate signed download URL
POST   /api/v1/files/upload               # Direct file upload endpoint

7.5 Real-time Updates

WebSocket Endpoints:

  • /ws/jobs - General job status updates
  • /ws/jobs/{job_id} - Job-specific status updates

WebSocket Message Format:

{
  "job_id": "string",
  "status": "string",
  "updated_at": "ISO8601",
  "job_title": "string",
  "message": "string",
  "progress": "number"
}
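
A message in this format could be built on the server as follows. This is a sketch: the `JobUpdate` dataclass and `build_update` helper are hypothetical names, and treating `progress` as a plain number without a fixed range is an assumption, since the range is not specified here.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class JobUpdate:
    job_id: str
    status: str
    updated_at: str   # ISO 8601 timestamp
    job_title: str
    message: str
    progress: float   # range not specified in the spec


def build_update(job_id: str, status: str, job_title: str, message: str,
                 progress: float, now: Optional[datetime] = None) -> str:
    """Serialize a job status update to the JSON text sent over the socket."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    update = JobUpdate(
        job_id=job_id,
        status=status,
        updated_at=timestamp,
        job_title=job_title,
        message=message,
        progress=progress,
    )
    return json.dumps(asdict(update))
```

The optional `now` parameter exists so tests can pin the timestamp.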

8. AI Services Integration

8.1 Google Gemini 2.5 Pro Integration

Content Generation Capabilities:

  • Video content analysis and understanding
  • Automatic transcript generation
  • Closed caption creation with proper timing
  • Audio description generation for visual elements
  • Content confidence scoring

Processing Flow:

  1. Video upload to Gemini Files API
  2. Content generation using multimodal prompt
  3. Structured JSON response parsing
  4. Error handling and self-healing for invalid responses
  5. Automatic file cleanup after processing

Quality Assurance:

  • VTT format validation
  • Timestamp accuracy verification
  • Content completeness checks
  • Fallback content generation for missing elements
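
The format and timestamp checks listed above might look like the following validator. It is a deliberately small sketch, not a full WebVTT parser: it checks the header, the shape of each cue timing line, that each cue ends after it starts, and that cue start times do not go backwards. The function name and error wording are hypothetical.

```python
import re

_STAMP = r"(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
_CUE_TIMING = re.compile(rf"^({_STAMP})\s+-->\s+({_STAMP})(?:\s|$)")


def _seconds(stamp: str) -> float:
    parts = stamp.split(":")
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + int(parts[-2]) * 60 + float(parts[-1])


def validate_vtt(text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    lines = text.splitlines()
    if not lines or not lines[0].lstrip("\ufeff").startswith("WEBVTT"):
        errors.append("line 1: file must start with WEBVTT")
    previous_start = 0.0
    cue_count = 0
    for number, line in enumerate(lines, start=1):
        if "-->" not in line:
            continue
        match = _CUE_TIMING.match(line.strip())
        if match is None:
            errors.append(f"line {number}: malformed cue timing")
            continue
        start, end = _seconds(match.group(1)), _seconds(match.group(2))
        if end <= start:
            errors.append(f"line {number}: cue must end after it starts")
        if start < previous_start:
            errors.append(f"line {number}: cue starts before the previous cue")
        previous_start = start
        cue_count += 1
    if cue_count == 0:
        errors.append("no cues found")
    return errors
```

Returning line-numbered messages lets the same function back both the AI output check and the editor's error highlighting.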

8.2 Translation Services

Google Cloud Translate:

  • High-quality machine translation for standard content
  • Support for multiple target languages
  • VTT format preservation during translation
  • Batch processing for efficiency

Transcreation via Gemini:

  • Cultural adaptation for marketing content
  • Context-aware translation with brand guidelines
  • Maintained timing synchronization
  • Creative adaptation while preserving meaning

8.3 Text-to-Speech Integration

ElevenLabs TTS Service:

  • High-quality voice synthesis
  • Language-specific voice selection
  • MP3 output format
  • Proper pronunciation for accessibility terms

Audio Processing:

  • Per-cue synthesis for precise timing
  • Audio quality optimization
  • File format standardization
  • Integration with VTT timing
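
Per-cue synthesis needs the audio description script split into individual cues with their timing. The sketch below extracts that list from a WebVTT document; the `Cue` type and `parse_cues` name are hypothetical, and the call to the TTS provider is intentionally left out.

```python
import re
from typing import NamedTuple


class Cue(NamedTuple):
    start_s: float
    end_s: float
    text: str


_STAMP = r"(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
_TIMING = re.compile(rf"^({_STAMP})\s+-->\s+({_STAMP})")


def _seconds(stamp: str) -> float:
    parts = stamp.split(":")
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + int(parts[-2]) * 60 + float(parts[-1])


def parse_cues(vtt: str) -> list[Cue]:
    """Split a WebVTT document into cues, joining multi-line cue text."""
    cues = []
    # Blocks are separated by blank lines; blocks without a timing line
    # (the WEBVTT header, NOTE comments) are skipped.
    for block in re.split(r"\n\s*\n", vtt.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = _TIMING.match(line.strip())
            if match:
                text = " ".join(p.strip() for p in lines[index + 1:] if p.strip())
                cues.append(Cue(_seconds(match.group(1)), _seconds(match.group(2)), text))
                break
    return cues
```

Each cue's `end_s - start_s` gives the window the synthesized clip has to fit into, which is what ties the audio back to the VTT timing.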

9. Quality Control Features

9.1 Review Workflow

Content Review Process:

  1. Initial Review: AI-generated content assessment
  2. Content Editing: Direct VTT file modification
  3. Synchronization Check: Video timing validation
  4. Quality Verification: Accessibility standards compliance
  5. Final Approval: Content ready for translation

Review Tools:

  • Integrated video player with caption overlay
  • Syntax-highlighted VTT editor
  • Real-time content validation
  • Timing adjustment utilities
  • Review history tracking

9.2 Quality Metrics

AI Confidence Scoring:

  • Content generation confidence (the 0-1 score from the job record, shown as 0-100%)
  • Quality indicators for reviewer guidance
  • Threshold-based workflow routing
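
Threshold-based routing could work along the following lines. Everything specific in this sketch is an assumption made for illustration: the thresholds (0.9 and 0.6), the route names, and the function names are hypothetical and are not defined by this specification.

```python
def as_percent(confidence: float) -> int:
    """Convert a 0-1 confidence score to a whole percentage for display."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return round(confidence * 100)


def review_route(confidence: float, *, spot_check_at: float = 0.9,
                 full_review_below: float = 0.6) -> str:
    """Pick a review depth from the AI confidence score (hypothetical routes)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= spot_check_at:
        return "spot_check"
    if confidence < full_review_below:
        return "full_review"
    return "standard_review"
```

Whatever the route, the job still passes through pending_qc; the route only suggests how much reviewer attention it is likely to need.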

Review Analytics:

  • Processing time tracking
  • Reviewer performance metrics
  • Quality score trending
  • Error rate monitoring

10. Security and Compliance

10.1 Data Security

Authentication Security:

  • JWT token-based authentication
  • HttpOnly cookie refresh tokens
  • Automatic token rotation
  • Secure password hashing (bcrypt)

File Security:

  • Signed URL access control
  • Time-limited download permissions
  • Secure file upload validation
  • Automatic cleanup procedures

API Security:

  • CORS protection
  • Rate limiting
  • Input validation and sanitization
  • NoSQL injection prevention for MongoDB queries

10.2 Privacy Protection

Data Handling:

  • Client data isolation
  • Role-based access enforcement
  • Audit trail maintenance
  • Secure data deletion

Content Protection:

  • Temporary file processing
  • Secure cloud storage
  • Access logging
  • Data retention policies

10.3 Audit and Compliance

Audit Logging:

  • User action tracking
  • System event logging
  • Security event monitoring
  • Performance metric collection

Compliance Features:

  • Data export capabilities
  • User consent management
  • Access control documentation
  • Security incident tracking

11. Performance and Scalability

11.1 System Performance

Backend Performance:

  • Async request handling with FastAPI
  • Background task processing via Celery
  • Database query optimization
  • Caching strategy with Redis

Frontend Performance:

  • TanStack Query (React Query) for data caching
  • Lazy loading of components
  • Optimized bundle splitting
  • Progressive web app features

11.2 Scalability Architecture

Horizontal Scaling:

  • Stateless API servers
  • Independent worker processes
  • Load balancing ready
  • Database connection pooling

Resource Optimization:

  • File compression and optimization
  • CDN integration ready
  • Memory-efficient processing
  • Garbage collection optimization

11.3 Monitoring and Observability

Health Monitoring:

  • Component health checks
  • Service dependency monitoring
  • Performance metric collection
  • Error rate tracking

Logging and Debugging:

  • Structured logging with correlation IDs
  • Error tracking and alerting
  • Performance profiling
  • Debug mode capabilities
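
Structured logging with correlation IDs can be sketched with the standard `logging` and `contextvars` modules. The JSON field names and the `correlation_id` variable are assumptions chosen for the example.

```python
import contextvars
import json
import logging

# Holds the ID of the request or task currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def configure_logging() -> None:
    """Send all log output through the JSON formatter."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler], force=True)
```

A request middleware or task wrapper would call `correlation_id.set(...)` once at the start of its work, after which every log line emitted in that context carries the same ID.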

12. Deployment and Infrastructure

12.1 Containerization

Docker Configuration:

  • Multi-stage builds for optimization
  • Health check integration
  • Environment-based configuration
  • Security-hardened images

12.2 Environment Configuration

Development Environment:

  • Local Docker Compose setup
  • Hot-reload development servers
  • Test database seeding
  • Mock external services

Production Environment:

  • Cloud-native deployment
  • SSL/TLS termination
  • Environment variable management
  • Secret management integration

12.3 Database Management

MongoDB Configuration:

  • Document schema validation
  • Index optimization
  • Replica set support
  • Backup and recovery procedures

Migration System:

  • Schema version tracking
  • Safe migration procedures
  • Rollback capabilities
  • Data integrity validation

13. Testing Strategy

13.1 Testing Levels

Unit Testing:

  • Service layer testing
  • Utility function testing
  • Component testing
  • Mock external dependencies

Integration Testing:

  • API endpoint testing
  • Database integration testing
  • File storage integration
  • Authentication flow testing

End-to-End Testing:

  • Complete user workflow testing
  • Cross-browser compatibility
  • Mobile responsiveness
  • Performance testing

13.2 Testing Tools

Backend Testing:

  • pytest for unit and integration tests
  • Factory Boy for test data generation
  • Async test support
  • Mock external services

Frontend Testing:

  • Jest for unit testing
  • React Testing Library
  • Playwright for E2E testing
  • Visual regression testing

14. Error Handling and Recovery

14.1 Error Classification

User Errors:

  • Invalid file formats
  • Insufficient permissions
  • Validation failures
  • Authentication errors

System Errors:

  • External service failures
  • Database connection issues
  • File storage problems
  • Processing timeouts

Recovery Strategies:

  • Automatic retry mechanisms
  • Graceful degradation
  • User-friendly error messages
  • Administrative error resolution

14.2 Reliability Features

Fault Tolerance:

  • Circuit breaker patterns
  • Timeout configurations
  • Retry logic with exponential backoff
  • Fallback procedures
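
Retry logic with exponential backoff can be sketched as a small helper around any call to an external service. The defaults (four attempts, 0.5 s base delay, 8 s cap) and the added jitter are assumptions for illustration, not values taken from the platform's configuration.

```python
import random
import time


def retry_with_backoff(func, *, attempts=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt: base, 2*base, 4*base, ... up to the cap.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter (up to +50%) keeps many workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, and `retry_on` lets callers limit retries to transient errors such as timeouts.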

Data Integrity:

  • Transaction management
  • Consistent state handling
  • Backup and recovery
  • Data validation

15. Configuration and Customization

15.1 System Configuration

Application Settings:

  • Environment-specific configurations
  • Feature flag support
  • Service endpoint configuration
  • Security parameter tuning

Processing Configuration:

  • AI model parameters
  • Translation service options
  • File size limits
  • Processing timeouts

15.2 User Customization

Client Settings:

  • Language preferences
  • Notification preferences
  • Default job settings
  • Download preferences

Reviewer Settings:

  • Workflow preferences
  • Editor configurations
  • Keyboard shortcuts
  • Quality thresholds

16. Future Enhancements

16.1 Planned Features

Enhanced AI Capabilities:

  • Multi-modal content analysis
  • Improved accuracy metrics
  • Custom model training
  • Advanced quality scoring

Extended Language Support:

  • Additional target languages
  • Regional dialect support
  • Custom transcreation workflows
  • Cultural adaptation tools

Advanced Workflow Features:

  • Batch processing capabilities
  • Template-based job creation
  • Advanced approval workflows
  • Custom review stages

16.2 Integration Opportunities

Third-Party Integrations:

  • Content management systems
  • Video hosting platforms
  • Accessibility testing tools
  • Quality assurance services

API Extensions:

  • Webhook support for job events
  • Advanced reporting APIs
  • Bulk operation endpoints
  • Custom integration points

17. Conclusion

The Video Accessibility Processing Platform represents a comprehensive solution for automated video accessibility content generation. Built with modern web technologies and integrated with leading AI services, the platform provides an end-to-end workflow from video upload to final content delivery.

The system's architecture supports scalability, security, and reliability while maintaining a focus on user experience and content quality. The role-based access control ensures appropriate separation of concerns between content creators, quality reviewers, and system administrators.

With its robust API design, real-time updates, and comprehensive error handling, the platform serves as a professional-grade solution for organizations requiring high-quality accessibility content at scale.


This specification document serves as the comprehensive technical and functional guide for the Video Accessibility Processing Platform, detailing all implemented features, workflows, and system capabilities as of the current release.