Enterprise PDF Accessibility Checker - System Architecture

🏗️ System Overview

This document describes the technical architecture of the Enterprise PDF Accessibility Checker.


Component Architecture

┌─────────────────────────────────────────────────────────────┐
│                        USER LAYER                            │
├─────────────────────────────────────────────────────────────┤
│  • Web Browser (Drag & Drop Interface)                      │
│  • Command Line Interface                                   │
│  • REST API Clients                                         │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                     WEB SERVER LAYER                         │
├─────────────────────────────────────────────────────────────┤
│  PHP Backend (api.php)                                      │
│  • Upload Management                                        │
│  • Job Queue                                                │
│  • Result Storage                                           │
│  • Authentication (optional)                                │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                  PROCESSING ENGINE                           │
├─────────────────────────────────────────────────────────────┤
│  Python Script (enterprise_pdf_checker.py)                  │
│                                                              │
│  ┌────────────────────────────────────────────────┐        │
│  │         Core Checking Engine                   │        │
│  │  • PDF parsing (pypdf, pdfplumber)            │        │
│  │  • Structure analysis                          │        │
│  │  • Text extraction                             │        │
│  │  • Issue detection                             │        │
│  └────────────────────────────────────────────────┘        │
│                                                              │
│  ┌────────────────────────────────────────────────┐        │
│  │         Analysis Modules                       │        │
│  │  • Color Contrast Checker                     │        │
│  │  • Readability Analyzer                       │        │
│  │  • OCR Quality Checker                        │        │
│  │  • Link Validator                             │        │
│  │  • Form Field Analyzer                        │        │
│  └────────────────────────────────────────────────┘        │
│                                                              │
│  ┌────────────────────────────────────────────────┐        │
│  │         Cache Manager                          │        │
│  │  • API response caching                       │        │
│  │  • Cost optimization                          │        │
│  └────────────────────────────────────────────────┘        │
└────────────┬───────────────────────┬───────────────────────┘
             │                       │
             ▼                       ▼
┌──────────────────────┐   ┌──────────────────────────────────┐
│  EXTERNAL SERVICES   │   │    LOCAL PROCESSING               │
├──────────────────────┤   ├──────────────────────────────────┤
│  Anthropic Claude    │   │  • Tesseract OCR                  │
│  • Image analysis    │   │  • PIL/Pillow (image processing) │
│  • Alt text validate │   │  • TextBlob (NLP)                │
│  • Content quality   │   │  • NumPy (calculations)          │
│                      │   │  • pdf2image (rendering)         │
│  Google Cloud        │   └──────────────────────────────────┘
│  • Vision API        │
│  • Document AI       │
│  • OCR + analysis    │
└──────────────────────┘

Data Flow

1. Web Interface Flow

User uploads PDF
      ↓
index.html (JavaScript)
      ↓
POST /api.php?action=upload
      ↓
api.php saves to /uploads/
      ↓
Returns job_id
      ↓
POST /api.php?action=check (with job_id)
      ↓
api.php spawns Python process
      ↓
enterprise_pdf_checker.py processes PDF
      ↓
Calls Anthropic & Google APIs
      ↓
Writes results to /results/
      ↓
JavaScript polls /api.php?action=status
      ↓
GET /api.php?action=result
      ↓
Display results in browser
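
The polling step in this flow can be sketched from a client's point of view. This is a minimal sketch, not the project's JavaScript: `fetch_status` is a hypothetical callable standing in for a request to /api.php?action=status, and the "processing" and "failed" status values are assumptions (only "completed" appears in this document).

```python
import time
from typing import Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict],
    interval: float = 2.0,
    timeout: float = 600.0,
) -> Dict:
    """Call fetch_status() until the job reaches a terminal state.

    fetch_status is a hypothetical stand-in for a request to
    /api.php?action=status; the "failed" value is an assumption.
    """
    deadline = time.monotonic() + timeout
    while True:
        meta = fetch_status()
        if meta.get("status") in ("completed", "failed"):
            return meta
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        time.sleep(interval)


# Usage with a fake status source that completes on the third poll.
_responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "completed"},
])
final = poll_until_done(lambda: next(_responses), interval=0.0)
print(final["status"])
```

Injecting the status function keeps the loop testable without a running server; a real client would wrap an HTTP call in that callable.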

2. Command Line Flow

User runs: python3 enterprise_pdf_checker.py doc.pdf
      ↓
Script loads PDF with pypdf/pdfplumber
      ↓
Runs all checking modules sequentially
      ↓
For each image:
  • Extract image bytes
  • Check cache
  • If not cached:
      - Call Claude Vision API
      - Call Google Vision API
      - Cache results
  • Process analysis
      ↓
For each page:
  • Extract text
  • Check readability
  • Analyze color contrast
  • Validate structure
      ↓
Aggregate all issues
      ↓
Calculate accessibility score
      ↓
Generate JSON report
      ↓
Output to file or stdout

Module Details

1. EnterprisePDFChecker (Main Class)

Responsibilities:

  • Orchestrate all checks
  • Manage API clients
  • Track statistics
  • Generate reports

Key Methods:

  • check_all() - Run all accessibility checks
  • _check_basic_structure() - Verify PDF tagging
  • _check_images_comprehensive() - AI-powered image analysis
  • _check_color_contrast() - WCAG contrast validation
  • _check_readability() - Content quality analysis
  • generate_json_report() - Create output

2. ColorContrastChecker

Responsibilities:

  • Calculate luminance values
  • Compute contrast ratios
  • Validate WCAG compliance

Algorithm:

1. Convert PDF page to image
2. Sample N random pixel pairs
3. For each pair:
     • Calculate relative luminance (WCAG formula)
     • Compute contrast ratio: (L1 + 0.05) / (L2 + 0.05), where L1 is the lighter and L2 the darker luminance
     • Compare to WCAG thresholds:
         - AA Normal: 4.5:1
         - AA Large: 3.0:1
         - AAA Normal: 7.0:1
4. Report percentage failing standards
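
The luminance and ratio steps above can be sketched with the formulas from the WCAG 2.x definition of relative luminance. The function names here are illustrative and are not taken from the project's ColorContrastChecker.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]


def relative_luminance(rgb: RGB) -> float:
    """WCAG 2.x relative luminance of an 8-bit sRGB color."""
    def linearize(channel: int) -> float:
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a: RGB, color_b: RGB) -> float:
    """(L1 + 0.05) / (L2 + 0.05), with L1 the lighter luminance."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def wcag_levels(ratio: float) -> Dict[str, bool]:
    """Compare a ratio against the thresholds listed above."""
    return {
        "AA_normal": ratio >= 4.5,
        "AA_large": ratio >= 3.0,
        "AAA_normal": ratio >= 7.0,
    }


white_on_black = contrast_ratio((255, 255, 255), (0, 0, 0))
print(round(white_on_black, 1), wcag_levels(white_on_black))
```

Sorting the two luminances before dividing makes the ratio independent of argument order, so callers do not need to know which pixel is lighter.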

3. ReadabilityAnalyzer

Responsibilities:

  • Calculate reading difficulty
  • Identify complex content
  • Provide grade-level estimates

Metrics:

  • Flesch Reading Ease (0-100, higher = easier)
  • Flesch-Kincaid Grade Level (US school grade)
  • Average sentence length
  • Complex word percentage
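
As a rough illustration of these metrics, the sketch below applies the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The syllable counter is a crude vowel-group heuristic, and treating words of three or more syllables as "complex" is an assumption; the project's ReadabilityAnalyzer may count both differently.

```python
import re
from typing import Dict


def count_syllables(word: str) -> int:
    """Heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> Dict[str, float]:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = [count_syllables(w) for w in words]
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    return {
        "flesch_reading_ease":
            206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid_grade":
            0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "avg_sentence_length": words_per_sentence,
        # Assumption: a "complex" word has three or more syllables.
        "complex_word_pct":
            100.0 * sum(s >= 3 for s in syllables) / len(words),
    }


report = readability("The cat sat on the mat. The dog ran.")
print(report)
```

The raw formulas are not clamped, so very simple text can score above 100 on Reading Ease or below zero on grade level.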

4. CacheManager

Responsibilities:

  • Store API responses
  • Reduce duplicate calls
  • Control costs

Strategy:

# Cache key = SHA256(image_bytes) + prefix
# Cache hit: Return stored result (free)
# Cache miss: Call API → Cache → Return

Savings:

  • Repeat document check: ~$0.10 → $0.00
  • Similar images across documents: Cached automatically
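
The hit/miss strategy above might look like the following disk-backed sketch. The class name, the `prefix_hash` ordering of the key, and the `get_or_compute` method are all hypothetical; the actual CacheManager may be organized differently.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, Tuple


class DiskCache:
    """Hypothetical sketch of a content-addressed JSON cache."""

    def __init__(self, cache_dir: str = ".cache") -> None:
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def key(prefix: str, image_bytes: bytes) -> str:
        # Assumption: the prefix names the API, the hash names the image.
        return f"{prefix}_{hashlib.sha256(image_bytes).hexdigest()}"

    def get_or_compute(
        self, prefix: str, image_bytes: bytes, compute: Callable[[bytes], dict]
    ) -> Tuple[dict, bool]:
        """Return (result, was_cached); call compute only on a miss."""
        path = self.dir / f"{self.key(prefix, image_bytes)}.json"
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8")), True
        result = compute(image_bytes)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result, False


# Usage with a fake API call, to show that the second lookup is free.
calls = []

def fake_api(data: bytes) -> dict:
    calls.append(len(data))
    return {"alt_text": "a chart"}

with tempfile.TemporaryDirectory() as tmp:
    cache = DiskCache(tmp)
    first = cache.get_or_compute("claude", b"image-bytes", fake_api)
    second = cache.get_or_compute("claude", b"image-bytes", fake_api)
print(first, second, len(calls))
```

Keying on the image bytes rather than the filename is what lets identical images in different documents share one cached result.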

API Integration

Anthropic Claude 3.5 Sonnet

Endpoint: https://api.anthropic.com/v1/messages

Request:

{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 1024,
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image", "source": {...}},
      {"type": "text", "text": "Analyze for accessibility..."}
    ]
  }]
}
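
For illustration, a request body of this shape could be assembled as below. The image `source` object follows the base64 format of Anthropic's Messages API; because the original elides that part, treat the media type and the exact fields as an assumption about what the project sends. `build_claude_request` is a hypothetical helper name.

```python
import base64
import json


def build_claude_request(image_bytes: bytes, media_type: str = "image/png") -> dict:
    """Build a Messages API payload with one image and one text block."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,  # assumption: PNG input
                        "data": encoded,
                    },
                },
                {"type": "text", "text": "Analyze for accessibility..."},
            ],
        }],
    }


payload = build_claude_request(b"\x89PNG-example-bytes")
print(json.dumps(payload)[:80])
```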

Response Parsing:

# Claude returns JSON with:
{
  "alt_text": "...",
  "has_text": true/false,
  "type": "decorative|informational|complex",
  "concerns": [...],
  "quality_rating": 1-10
}
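
Because a model reply is free text, it helps to validate it before trusting the fields listed above. The sketch below is one possible approach, assuming the reply text contains a single JSON object, possibly surrounded by prose; `parse_image_analysis` is a hypothetical name.

```python
import json

ALLOWED_TYPES = {"decorative", "informational", "complex"}


def parse_image_analysis(reply_text: str) -> dict:
    """Extract and validate the JSON object from the model's reply text."""
    start, end = reply_text.find("{"), reply_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply_text[start:end + 1])

    if not isinstance(data.get("alt_text"), str):
        raise ValueError("alt_text must be a string")
    if not isinstance(data.get("has_text"), bool):
        raise ValueError("has_text must be a boolean")
    if data.get("type") not in ALLOWED_TYPES:
        raise ValueError(f"unexpected type: {data.get('type')!r}")
    rating = data.get("quality_rating")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int) or not 1 <= rating <= 10:
        raise ValueError("quality_rating must be an integer from 1 to 10")
    data.setdefault("concerns", [])
    return data


reply = (
    'Here is the analysis:\n'
    '{"alt_text": "Bar chart of sales", "has_text": true, '
    '"type": "informational", "quality_rating": 8}'
)
analysis = parse_image_analysis(reply)
print(analysis["type"], analysis["quality_rating"])
```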

Used For:

  • Alt text quality validation
  • Image content description
  • Text-in-image detection
  • Color-only information checks
  • Content quality analysis

Google Cloud Vision API

Endpoint: https://vision.googleapis.com/v1/images:annotate

Features Used:

  • TEXT_DETECTION - OCR for text in images
  • LABEL_DETECTION - Image content classification
  • IMAGE_PROPERTIES - Dominant colors
  • OBJECT_LOCALIZATION - Object identification

Used For:

  • Detecting text in images (WCAG 1.4.5)
  • Cross-validating Claude's analysis
  • OCR quality assessment
  • Object recognition
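
A request body for the images:annotate endpoint that asks for the four features above could be built as follows. This is a sketch of the REST payload shape only; `build_vision_request` is a hypothetical helper, and the project may use the Google client library instead of raw JSON.

```python
import base64
import json

FEATURES = (
    "TEXT_DETECTION",
    "LABEL_DETECTION",
    "IMAGE_PROPERTIES",
    "OBJECT_LOCALIZATION",
)


def build_vision_request(image_bytes: bytes) -> dict:
    """Build one annotate request covering all features in FEATURES."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": name} for name in FEATURES],
        }]
    }


vision_body = build_vision_request(b"example-image-bytes")
print(json.dumps(vision_body)[:80])
```

Requesting all features in one call means each image is uploaded once rather than once per feature.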

Google Document AI (Optional)

Endpoint: https://documentai.googleapis.com/v1/projects/*/locations/*/processors/*:process

Used For:

  • High-quality OCR on scanned PDFs
  • Complex document layout analysis
  • Production workloads where Tesseract's accuracy falls short

Database Schema

File Storage Structure

project/
├── uploads/
│   └── {job_id}.pdf               # Uploaded files (job_id already includes the pdf_ prefix)
├── results/
│   ├── {job_id}.meta.json         # Job metadata
│   └── {job_id}.result.json       # Check results
└── .cache/
    └── {hash}.json                # Cached API responses

Job Metadata (*.meta.json)

{
  "job_id": "pdf_67890abcdef",
  "original_filename": "document.pdf",
  "uploaded_at": "2025-01-20 10:00:00",
  "file_size": 2048576,
  "status": "completed",
  "filepath": "/uploads/pdf_67890abcdef.pdf",
  "started_at": "2025-01-20 10:00:05",
  "completed_at": "2025-01-20 10:03:20"
}
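
As a small example of consuming this metadata, the sketch below derives the processing time from the two timestamps. It assumes the "YYYY-MM-DD HH:MM:SS" format shown above and that both timestamps share a timezone; `processing_seconds` is a hypothetical helper.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def processing_seconds(meta: dict) -> float:
    """Seconds between started_at and completed_at in a job's metadata."""
    started = datetime.strptime(meta["started_at"], TIMESTAMP_FORMAT)
    completed = datetime.strptime(meta["completed_at"], TIMESTAMP_FORMAT)
    return (completed - started).total_seconds()


meta = {
    "job_id": "pdf_67890abcdef",
    "status": "completed",
    "started_at": "2025-01-20 10:00:05",
    "completed_at": "2025-01-20 10:03:20",
}
elapsed = processing_seconds(meta)
print(elapsed)  # 195.0
```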

Check Results (*.result.json)

{
  "filename": "document.pdf",
  "total_pages": 10,
  "accessibility_score": 75,
  "severity_counts": {
    "critical": 0,
    "error": 3,
    "warning": 5,
    "info": 2,
    "success": 8
  },
  "stats": {
    "total_checks": 16,
    "api_calls": 5,
    "cached_calls": 3,
    "total_cost_estimate": 0.08,
    "duration": 125.5
  },
  "issues": [...]
}
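
The severity_counts block can be derived from the issues list. The sketch below assumes each issue is a dict with a "severity" key; the original elides the issue structure, so that field name is hypothetical.

```python
from collections import Counter
from typing import Dict, List

SEVERITIES = ("critical", "error", "warning", "info", "success")


def count_severities(issues: List[dict]) -> Dict[str, int]:
    """Tally issues per severity, reporting zero for unused levels."""
    counts = Counter(issue["severity"] for issue in issues)
    return {level: counts.get(level, 0) for level in SEVERITIES}


sample_issues = [
    {"severity": "error"},
    {"severity": "warning"},
    {"severity": "warning"},
    {"severity": "success"},
]
severity_counts = count_severities(sample_issues)
print(severity_counts)
```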

Security Considerations

1. Input Validation

  • File type whitelist (PDF only)
  • File size limit (50MB default)
  • Malware scanning (recommended)
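
The first two checks can be sketched as below. The backend described here is PHP, so this Python version only illustrates the logic; `validate_pdf_upload` is a hypothetical name, and looking for the "%PDF-" header within the first 1024 bytes is an assumption about how strict the check should be.

```python
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50MB default from the list above


def validate_pdf_upload(
    filename: str, data: bytes, max_bytes: int = MAX_UPLOAD_BYTES
) -> None:
    """Raise ValueError if the upload is not an acceptable PDF."""
    if not filename.lower().endswith(".pdf"):
        raise ValueError("only .pdf files are accepted")
    if len(data) > max_bytes:
        raise ValueError("file exceeds the size limit")
    # The extension alone is easy to fake, so also check the file header.
    if b"%PDF-" not in data[:1024]:
        raise ValueError("file does not contain a PDF header")


validate_pdf_upload("report.pdf", b"%PDF-1.7\n%example body")
print("upload accepted")
```

Checking content as well as the extension matters because the filename is supplied by the client and cannot be trusted on its own.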

2. API Key Protection

  • Stored in environment variables
  • Never in version control
  • Rotated regularly

3. File Access Control

# .htaccess
<FilesMatch "\.(json|meta)$">
    Require all denied
</FilesMatch>

4. Rate Limiting

  • Implement per-IP limits
  • Prevent API abuse
  • Monitor costs
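
One way to implement per-IP limits is a sliding-window counter, sketched below. The class name, the limits, and the in-memory storage are all assumptions; a multi-server deployment would need shared storage instead.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Usage with a fake clock so the behavior is deterministic.
fake_now = [0.0]
limiter = SlidingWindowLimiter(2, 60.0, clock=lambda: fake_now[0])
decisions = [limiter.allow("203.0.113.7") for _ in range(3)]
fake_now[0] = 60.0
decisions.append(limiter.allow("203.0.113.7"))
print(decisions)
```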

5. HTTPS

  • Required for production
  • Protects API keys in transit
  • Secures file uploads

Performance Optimization

1. Caching Strategy

# Multi-level caching
L1: In-memory (Python dict)
L2: Disk (.cache/ directory)
L3: API response (if cache miss)
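
The three levels can be read as a lookup order: memory first, then disk, then the API. The sketch below shows that order with the disk layer and the API call passed in as callables; all names are hypothetical.

```python
from typing import Any, Callable, Dict, Optional, Tuple


def layered_lookup(
    key: str,
    memory: Dict[str, Any],
    disk_get: Callable[[str], Optional[Any]],
    disk_put: Callable[[str, Any], None],
    call_api: Callable[[], Any],
) -> Tuple[Any, str]:
    """Return (value, source), where source is 'memory', 'disk' or 'api'."""
    if key in memory:
        return memory[key], "memory"
    cached = disk_get(key)
    if cached is not None:
        memory[key] = cached  # promote to the faster level
        return cached, "disk"
    result = call_api()
    disk_put(key, result)
    memory[key] = result
    return result, "api"


# Usage: a plain dict stands in for the .cache/ directory.
memory, disk, api_calls = {}, {}, []

def fake_api() -> dict:
    api_calls.append(1)
    return {"labels": ["chart"]}

sources = [
    layered_lookup("k", memory, disk.get, disk.__setitem__, fake_api)[1]
    for _ in range(2)
]
memory.clear()  # simulate a process restart
sources.append(layered_lookup("k", memory, disk.get, disk.__setitem__, fake_api)[1])
print(sources, len(api_calls))
```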

2. Parallel Processing

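# Process multiple PDFs concurrently
# (check_pdf and pdf_files are assumed to be defined elsewhere)
from multiprocessing import Pool

if __name__ == "__main__":  # required where workers are spawned (Windows, macOS)
    with Pool(4) as pool:
        results = pool.map(check_pdf, pdf_files)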

3. Image Optimization

# Reduce API costs
- Resize images to max 2048px
- Use JPEG compression (quality=85)
- Cache results by hash
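
The resize and compression steps might be implemented with Pillow roughly as follows. This is a sketch under the limits listed above; `scaled_size` and `optimize_image` are hypothetical names, and converting to RGB before saving is an assumption made because JPEG cannot store transparency.

```python
import io
from typing import Tuple


def scaled_size(width: int, height: int, max_side: int = 2048) -> Tuple[int, int]:
    """Shrink (width, height) so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def optimize_image(image_bytes: bytes, max_side: int = 2048, quality: int = 85) -> bytes:
    """Downscale and re-encode an image as JPEG before sending it to an API."""
    # Imported lazily so scaled_size stays usable without Pillow installed.
    from PIL import Image

    with Image.open(io.BytesIO(image_bytes)) as original:
        img = original.convert("RGB")  # JPEG has no alpha channel
    img = img.resize(scaled_size(img.width, img.height, max_side), Image.LANCZOS)
    out = io.BytesIO()
    img.save(out, format="JPEG", quality=quality)
    return out.getvalue()


print(scaled_size(4096, 2048))  # (2048, 1024)
print(scaled_size(800, 600))    # (800, 600)
```

Images already within the limit keep their dimensions, so small images are only re-encoded, never upscaled.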

4. Lazy Loading

# Don't load entire PDF into memory
# Process page-by-page (pdf_path and process_page are defined elsewhere)
import pdfplumber

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        process_page(page)

Scalability

Horizontal Scaling

Load Balancer
      │
      ├─→ Web Server 1 (api.php)
      │        ↓
      │   Processing Queue
      │
      ├─→ Web Server 2 (api.php)
      │        ↓
      │   Processing Queue
      │
      └─→ Web Server N (api.php)
               ↓
          Processing Queue
               ↓
      ┌───────┴───────┐
      ▼               ▼
  Worker 1        Worker N
  (Python)        (Python)

Queue-Based Architecture

# Use Redis or RabbitMQ
1. api.php → Push job to queue
2. Worker processes → Pull from queue
3. Process PDF
4. Store results
5. Notify completion (webhook/polling)
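
The push/pull pattern above can be illustrated with Python's standard library. Here `queue.Queue` and threads stand in for Redis or RabbitMQ and separate worker processes, so this shows the control flow only; `run_jobs` and `process_pdf` are hypothetical names.

```python
import queue
import threading
from typing import Callable, Dict, Iterable


def worker(jobs: "queue.Queue", results: Dict[str, object],
           process_pdf: Callable[[str], object]) -> None:
    """Pull job IDs until a None sentinel arrives."""
    while True:
        job_id = jobs.get()
        try:
            if job_id is None:  # sentinel: no more work
                return
            results[job_id] = process_pdf(job_id)
        finally:
            jobs.task_done()


def run_jobs(job_ids: Iterable[str], process_pdf: Callable[[str], object],
             num_workers: int = 2) -> Dict[str, object]:
    jobs: "queue.Queue" = queue.Queue()
    results: Dict[str, object] = {}
    threads = [
        threading.Thread(target=worker, args=(jobs, results, process_pdf))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for job_id in job_ids:      # step 1: push jobs
        jobs.put(job_id)
    for _ in threads:           # one sentinel per worker
        jobs.put(None)
    jobs.join()                 # wait until every item is processed
    for t in threads:
        t.join()
    return results


done = run_jobs(["pdf_a", "pdf_b", "pdf_c"], lambda job_id: job_id.upper())
print(sorted(done.items()))
```

With an external broker the producer and workers run on different machines, but the shape is the same: producers only enqueue, and workers own all processing.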

Cloud Deployment

AWS:

  • EC2 for web servers
  • S3 for file storage
  • SQS for job queue
  • Lambda for workers

Google Cloud:

  • Compute Engine for servers
  • Cloud Storage for files
  • Cloud Tasks for queue
  • Cloud Functions for workers

Monitoring & Logging

Key Metrics

  • Processing Time: Average duration per check
  • API Costs: Daily/monthly spend
  • Cache Hit Rate: Percentage of cached results
  • Error Rate: Failed checks per day
  • Queue Length: Pending jobs

Logging Strategy

import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('checker.log'),
        logging.StreamHandler()
    ]
)

# Log important events (create a module-level logger first)
logger = logging.getLogger(__name__)
logger.info(f"Processing: {filename}")
logger.warning(f"Low contrast detected: page {page_num}")
logger.error(f"API error: {error}")

Testing Strategy

Unit Tests

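import unittest

# Assumes ColorContrastChecker is importable from enterprise_pdf_checker.py
from enterprise_pdf_checker import ColorContrastChecker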

class TestColorContrast(unittest.TestCase):
    def test_contrast_calculation(self):
        ratio = ColorContrastChecker.calculate_contrast_ratio(
            (255, 255, 255),  # White
            (0, 0, 0)         # Black
        )
        self.assertAlmostEqual(ratio, 21.0, places=1)

Integration Tests

# Test full pipeline
python3 enterprise_pdf_checker.py test_pdfs/sample.pdf
# Verify: results match expectations

API Tests

# Test Claude integration
def test_claude_api():
    result = analyze_image_with_claude(test_image_bytes)
    assert 'alt_text' in result
    assert len(result['alt_text']) < 125

Deployment Checklist

  • Install all dependencies
  • Configure API keys
  • Set up web server (Apache/Nginx)
  • Configure HTTPS
  • Set file permissions
  • Enable error logging
  • Test with sample PDFs
  • Configure backups
  • Set up monitoring
  • Document runbook

Future Enhancements

Planned Features

  1. User Authentication - Multi-user support
  2. Report History - Track changes over time
  3. Batch Upload - Multiple PDFs at once
  4. PDF Remediation - Auto-fix some issues
  5. Custom Rules - Organization-specific checks
  6. Webhooks - Completion notifications
  7. PDF Comparison - Before/after analysis
  8. API Rate Limiting - Per-user quotas
  9. Advanced Caching - Redis integration
  10. Machine Learning - Pattern detection

Technical Requirements Summary

Component       Version   Purpose
Python          3.8+      Core processing
PHP             7.4+      Web API
Tesseract       4.0+      OCR
Poppler         0.86+     PDF rendering
pypdf           4.0+      PDF parsing
Anthropic SDK   0.18+     Claude API
Google Cloud    3.4+      Vision API

Support & Maintenance

Regular Maintenance

  • Daily: Check logs for errors
  • Weekly: Review API costs
  • Monthly: Update dependencies
  • Quarterly: Security audit

Backup Strategy

  • Files: uploads/, results/ → Daily
  • Cache: .cache/ → Weekly
  • Code: Git repository → Continuous

Conclusion

This architecture provides:

  • High Quality: Best-in-class AI models
  • Scalability: Horizontal scaling support
  • Reliability: Caching + error handling
  • Maintainability: Modular design
  • Cost-Effective: Smart caching reduces API costs
  • Secure: Multiple security layers
  • Extensible: Easy to add new checks

The system is production-ready and can handle enterprise workloads while maintaining a quality-first approach to accessibility validation.