Aimpress/PDF-accessibility-saas

Vadym Samoilenko cfa7eeeeac Initial commit: PDF Accessibility SaaS (forked from Oliver/pdf-accessibility)

2026-05-19 14:34:12 +01:00

Enterprise PDF Accessibility Checker - System Architecture

🏗️ System Overview

This document describes the technical architecture of the Enterprise PDF Accessibility Checker.

Component Architecture

┌─────────────────────────────────────────────────────────────┐
│                        USER LAYER                            │
├─────────────────────────────────────────────────────────────┤
│  • Web Browser (Drag & Drop Interface)                      │
│  • Command Line Interface                                   │
│  • REST API Clients                                         │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                     WEB SERVER LAYER                         │
├─────────────────────────────────────────────────────────────┤
│  PHP Backend (api.php)                                      │
│  • Upload Management                                        │
│  • Job Queue                                                │
│  • Result Storage                                           │
│  • Authentication (optional)                                │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                  PROCESSING ENGINE                           │
├─────────────────────────────────────────────────────────────┤
│  Python Script (enterprise_pdf_checker.py)                  │
│                                                              │
│  ┌────────────────────────────────────────────────┐        │
│  │         Core Checking Engine                   │        │
│  │  • PDF parsing (pypdf, pdfplumber)            │        │
│  │  • Structure analysis                          │        │
│  │  • Text extraction                             │        │
│  │  • Issue detection                             │        │
│  └────────────────────────────────────────────────┘        │
│                                                              │
│  ┌────────────────────────────────────────────────┐        │
│  │         Analysis Modules                       │        │
│  │  • Color Contrast Checker                     │        │
│  │  • Readability Analyzer                       │        │
│  │  • OCR Quality Checker                        │        │
│  │  • Link Validator                             │        │
│  │  • Form Field Analyzer                        │        │
│  └────────────────────────────────────────────────┘        │
│                                                              │
│  ┌────────────────────────────────────────────────┐        │
│  │         Cache Manager                          │        │
│  │  • API response caching                       │        │
│  │  • Cost optimization                          │        │
│  └────────────────────────────────────────────────┘        │
└────────────┬───────────────────────┬───────────────────────┘
             │                       │
             ▼                       ▼
┌──────────────────────┐   ┌──────────────────────────────────┐
│  EXTERNAL SERVICES   │   │    LOCAL PROCESSING               │
├──────────────────────┤   ├──────────────────────────────────┤
│  Anthropic Claude    │   │  • Tesseract OCR                  │
│  • Image analysis    │   │  • PIL/Pillow (image processing) │
│  • Alt text validate │   │  • TextBlob (NLP)                │
│  • Content quality   │   │  • NumPy (calculations)          │
│                      │   │  • pdf2image (rendering)         │
│  Google Cloud        │   └──────────────────────────────────┘
│  • Vision API        │
│  • Document AI       │
│  • OCR + analysis    │
└──────────────────────┘

Data Flow

1. Web Interface Flow

User uploads PDF
      ↓
index.html (JavaScript)
      ↓
POST /api.php?action=upload
      ↓
api.php saves to /uploads/
      ↓
Returns job_id
      ↓
POST /api.php?action=check (with job_id)
      ↓
api.php spawns Python process
      ↓
enterprise_pdf_checker.py processes PDF
      ↓
Calls Anthropic & Google APIs
      ↓
Writes results to /results/
      ↓
JavaScript polls /api.php?action=status
      ↓
GET /api.php?action=result
      ↓
Display results in browser
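
The polling step in this flow can be sketched from the client side. This is a minimal sketch, not the project's actual client: it assumes the status endpoint returns a JSON object with a `status` field, that `completed` (seen in the sample job metadata) is terminal, and that a `failed` state exists (hypothetical). `fetch_status` is an injected stand-in for `GET /api.php?action=status`.

```python
import time
from typing import Callable, Dict

# "completed" appears in the sample job metadata; "failed" is a
# hypothetical terminal state assumed for illustration.
TERMINAL_STATES = {"completed", "failed"}


def poll_until_done(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status(job_id) until the job reaches a terminal state.

    fetch_status is injected so the loop does not depend on any
    particular HTTP client; sleep is injected so tests run instantly.
    """
    waited = 0.0
    while True:
        status = fetch_status(job_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} still running after {timeout}s")
        sleep(interval)
        waited += interval
```

A fixed interval keeps the sketch short; a real client might back off between polls to reduce load on `api.php`.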

2. Command Line Flow

User runs: python3 enterprise_pdf_checker.py doc.pdf
      ↓
Script loads PDF with pypdf/pdfplumber
      ↓
Runs all checking modules sequentially
      ↓
For each image:
  • Extract image bytes
  • Check cache
  • If not cached:
      - Call Claude Vision API
      - Call Google Vision API
      - Cache results
  • Process analysis
      ↓
For each page:
  • Extract text
  • Check readability
  • Analyze color contrast
  • Validate structure
      ↓
Aggregate all issues
      ↓
Calculate accessibility score
      ↓
Generate JSON report
      ↓
Output to file or stdout

Module Details

1. EnterprisePDFChecker (Main Class)

Responsibilities:

Orchestrate all checks
Manage API clients
Track statistics
Generate reports

Key Methods:

check_all() - Run all accessibility checks
_check_basic_structure() - Verify PDF tagging
_check_images_comprehensive() - AI-powered image analysis
_check_color_contrast() - WCAG contrast validation
_check_readability() - Content quality analysis
generate_json_report() - Create output

2. ColorContrastChecker

Responsibilities:

Calculate luminance values
Compute contrast ratios
Validate WCAG compliance

Algorithm:

1. Convert PDF page to image
2. Sample N random pixel pairs
3. For each pair:
   • Calculate relative luminance (WCAG formula)
   • Compute contrast ratio: (L1 + 0.05) / (L2 + 0.05), where L1 is the lighter and L2 the darker luminance
   • Compare to WCAG thresholds:
     - AA Normal: 4.5:1
     - AA Large: 3.0:1
     - AAA Normal: 7.0:1
4. Report percentage failing standards
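
Step 3 of the algorithm can be sketched as follows, using the relative luminance and contrast ratio definitions from WCAG 2.x. The function names here are illustrative and are not taken from `ColorContrastChecker`; page rendering and pixel sampling are left out.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]


def relative_luminance(rgb: RGB) -> float:
    """WCAG 2.x relative luminance of an 8-bit sRGB color."""

    def linearize(channel: int) -> float:
        c = channel / 255.0
        # Piecewise sRGB transfer function from the WCAG 2.x definition.
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(a: RGB, b: RGB) -> float:
    """(L1 + 0.05) / (L2 + 0.05), with L1 the lighter of the two colors."""
    la, lb = relative_luminance(a), relative_luminance(b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def wcag_levels(ratio: float) -> Dict[str, bool]:
    """Compare a ratio against the thresholds listed above."""
    return {
        "AA_normal": ratio >= 4.5,
        "AA_large": ratio >= 3.0,
        "AAA_normal": ratio >= 7.0,
    }
```

Black on white gives the maximum ratio of 21:1, while mid-gray `(119, 119, 119)` on white falls just short of the 4.5:1 AA threshold for normal text.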

3. ReadabilityAnalyzer

Responsibilities:

Calculate reading difficulty
Identify complex content
Provide grade-level estimates

Metrics:

Flesch Reading Ease (0-100, higher = easier)
Flesch-Kincaid Grade Level (US school grade)
Average sentence length
Complex word percentage
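
The four metrics can be sketched with the standard Flesch formulas. This is an illustrative sketch rather than the `ReadabilityAnalyzer` code: the syllable counter is a rough vowel-group heuristic, and treating words of three or more syllables as "complex" is an assumption.

```python
import re
from typing import Dict


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def readability(text: str) -> Dict[str, float]:
    """Compute the metrics listed above; returns {} for empty input."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {}
    syllables = sum(count_syllables(w) for w in words)
    # Assumption: a "complex" word has three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    n_w, n_s = len(words), len(sentences)
    return {
        "flesch_reading_ease": flesch_reading_ease(n_w, n_s, syllables),
        "flesch_kincaid_grade": flesch_kincaid_grade(n_w, n_s, syllables),
        "avg_sentence_length": n_w / n_s,
        "complex_word_pct": 100.0 * complex_words / n_w,
    }
```

Very short, simple texts can score above 100 on Flesch Reading Ease, so a caller may want to clamp the value to the 0-100 range described above.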

4. CacheManager

Responsibilities:

Store API responses
Reduce duplicate calls
Control costs

Strategy:

# Cache key = SHA256(image_bytes) + prefix
# Cache hit: Return stored result (free)
# Cache miss: Call API → Cache → Return

Savings:

Repeat document check: ~$0.10 → $0.00
Similar images across documents: Cached automatically
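
The cache strategy can be sketched as a small content-addressed disk cache. This is a sketch under assumptions, not the `CacheManager` implementation: the class name, the `prefix_hash` key layout, and the hit/miss counters are illustrative.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


class DiskCache:
    """JSON cache keyed by a prefix plus the SHA-256 of the image bytes."""

    def __init__(self, directory: str = ".cache") -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(image_bytes: bytes, prefix: str) -> str:
        # The prefix separates results from different services for the same image.
        return f"{prefix}_{hashlib.sha256(image_bytes).hexdigest()}"

    def get_or_compute(
        self, image_bytes: bytes, prefix: str, compute: Callable[[bytes], Any]
    ) -> Any:
        path = self.directory / f"{self.key(image_bytes, prefix)}.json"
        if path.exists():
            # Cache hit: return the stored result without calling the API.
            self.hits += 1
            return json.loads(path.read_text(encoding="utf-8"))
        # Cache miss: call the API, store the result, then return it.
        self.misses += 1
        result = compute(image_bytes)
        path.write_text(json.dumps(result), encoding="utf-8")
        return result
```

Because the key depends only on the image bytes, the same image embedded in two different PDFs resolves to the same cache entry, which is what makes the cross-document savings possible.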

API Integration

Anthropic Claude 3.5 Sonnet

Endpoint: https://api.anthropic.com/v1/messages

Request:

{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 1024,
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image", "source": {...}},
      {"type": "text", "text": "Analyze for accessibility..."}
    ]
  }]
}

Response Parsing:

# Claude returns JSON with:
{
  "alt_text": "...",
  "has_text": true/false,
  "type": "decorative|informational|complex",
  "concerns": [...],
  "quality_rating": 1-10
}
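
Parsing that reply defensively might look like the sketch below. It assumes the model's text reply contains one JSON object with the fields listed above; the function name and the exact validation rules are illustrative, not the project's parser.

```python
import json
from typing import Any, Dict

ALLOWED_TYPES = {"decorative", "informational", "complex"}


def parse_image_analysis(reply_text: str) -> Dict[str, Any]:
    """Extract and validate the JSON object from the model's text reply."""
    # The reply may wrap the JSON in prose, so take the outermost
    # {...} span instead of parsing the whole string.
    start, end = reply_text.find("{"), reply_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply_text[start : end + 1])

    if not isinstance(data.get("alt_text"), str):
        raise ValueError("alt_text must be a string")
    if not isinstance(data.get("has_text"), bool):
        raise ValueError("has_text must be a boolean")
    if data.get("type") not in ALLOWED_TYPES:
        raise ValueError(f"unexpected type: {data.get('type')!r}")
    rating = data.get("quality_rating")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int) or not 1 <= rating <= 10:
        raise ValueError("quality_rating must be an integer from 1 to 10")
    data.setdefault("concerns", [])
    return data
```

Raising on malformed replies lets the caller decide whether to retry the request or record the image as unchecked, rather than caching a partial result.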

Used For:

Alt text quality validation
Image content description
Text-in-image detection
Color-only information checks
Content quality analysis

Google Cloud Vision API

Endpoint: https://vision.googleapis.com/v1/images:annotate

Features Used:

TEXT_DETECTION - OCR for text in images
LABEL_DETECTION - Image content classification
IMAGE_PROPERTIES - Dominant colors
OBJECT_LOCALIZATION - Object identification

Used For:

Detecting text in images (WCAG 1.4.5)
Cross-validating Claude's analysis
OCR quality assessment
Object recognition
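
A request body for the `images:annotate` endpoint with the four features above can be built as in this sketch. The function name is illustrative; authentication and the HTTP call itself are omitted.

```python
import base64
from typing import Dict, Sequence

DEFAULT_FEATURES = (
    "TEXT_DETECTION",
    "LABEL_DETECTION",
    "IMAGE_PROPERTIES",
    "OBJECT_LOCALIZATION",
)


def build_annotate_request(
    image_bytes: bytes, features: Sequence[str] = DEFAULT_FEATURES
) -> Dict:
    """Build the JSON body for one image, sent as base64-encoded content."""
    return {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": name} for name in features],
            }
        ]
    }
```

Requesting all four features in one call keeps the number of round trips per image at one, which also keeps the cache entry for that image self-contained.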

Google Document AI (Optional)

Endpoint: https://documentai.googleapis.com/v1/projects/*/locations/*/processors/*:process

Used For:

High-quality OCR on scanned PDFs
Complex document layout analysis
Generally preferable to Tesseract in production when OCR accuracy on scanned or complex layouts matters

Database Schema

File Storage Structure

project/
├── uploads/
│   └── {job_id}.pdf               # Uploaded files (job_id already starts with pdf_)
├── results/
│   ├── {job_id}.meta.json         # Job metadata
│   └── {job_id}.result.json       # Check results
└── .cache/
    └── {hash}.json                # Cached API responses

Job Metadata (*.meta.json)

{
  "job_id": "pdf_67890abcdef",
  "original_filename": "document.pdf",
  "uploaded_at": "2025-01-20 10:00:00",
  "file_size": 2048576,
  "status": "completed",
  "filepath": "/uploads/pdf_67890abcdef.pdf",
  "started_at": "2025-01-20 10:00:05",
  "completed_at": "2025-01-20 10:03:20"
}
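
The metadata fields suggest a simple job lifecycle, sketched below. The helper names are hypothetical, and `processing` is an assumed intermediate status; only `completed` appears in the sample above.

```python
from datetime import datetime
from typing import Dict, Optional

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches the sample timestamps above


def _stamp(now: Optional[datetime] = None) -> str:
    return (now or datetime.now()).strftime(TIMESTAMP_FORMAT)


def mark_started(meta: Dict, now: Optional[datetime] = None) -> Dict:
    """Return a copy of the metadata with the job marked as running."""
    # "processing" is a hypothetical status name used for illustration.
    return {**meta, "status": "processing", "started_at": _stamp(now)}


def mark_completed(meta: Dict, now: Optional[datetime] = None) -> Dict:
    """Return a copy of the metadata with the job marked as finished."""
    return {**meta, "status": "completed", "completed_at": _stamp(now)}


def processing_seconds(meta: Dict) -> float:
    """Elapsed time between started_at and completed_at, in seconds."""
    started = datetime.strptime(meta["started_at"], TIMESTAMP_FORMAT)
    completed = datetime.strptime(meta["completed_at"], TIMESTAMP_FORMAT)
    return (completed - started).total_seconds()
```

For the sample metadata above, the job started at 10:00:05 and completed at 10:03:20, so `processing_seconds` gives 195.0.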

Check Results (*.result.json)

{
  "filename": "document.pdf",
  "total_pages": 10,
  "accessibility_score": 75,
  "severity_counts": {
    "critical": 0,
    "error": 3,
    "warning": 5,
    "info": 2,
    "success": 8
  },
  "stats": {
    "total_checks": 16,
    "api_calls": 5,
    "cached_calls": 3,
    "total_cost_estimate": 0.08,
    "duration": 125.5
  },
  "issues": [...]
}
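
The `severity_counts` block could be derived from the issue list as in this sketch. The shape of an issue is not shown above, so the `severity` key on each issue is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

SEVERITIES = ("critical", "error", "warning", "info", "success")


def severity_counts(issues: Iterable[Mapping]) -> Dict[str, int]:
    """Count issues per severity, always reporting all five levels."""
    # Assumption: each issue is a mapping with a "severity" key.
    counts = Counter(issue["severity"] for issue in issues)
    # Severities outside the known set are ignored here.
    return {level: counts.get(level, 0) for level in SEVERITIES}
```

Reporting every level, including zeros, keeps the result shape stable for the web interface.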

Security Considerations

1. Input Validation

File type whitelist (PDF only)
File size limit (50MB default)
Malware scanning (recommended)
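
The first two checks can be sketched as below. This is an illustrative helper, not the validation in `api.php`: it checks the file content for the `%PDF-` header rather than trusting the extension, and searches the first 1024 bytes because some PDF files carry a few leading bytes before the header.

```python
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # 50 MB default from the list above
PDF_MAGIC = b"%PDF-"


def validate_upload(path: str, max_bytes: int = MAX_BYTES) -> None:
    """Raise ValueError if the file is empty, too large, or not a PDF."""
    file = Path(path)
    size = file.stat().st_size
    if size == 0:
        raise ValueError("file is empty")
    if size > max_bytes:
        raise ValueError(f"file is {size} bytes, limit is {max_bytes}")
    with file.open("rb") as handle:
        header = handle.read(1024)
    if PDF_MAGIC not in header:
        raise ValueError("missing %PDF- header")
```

A header check only confirms the file claims to be a PDF; it does not replace the recommended malware scan.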

2. API Key Protection

Stored in environment variables
Never in version control
Rotated regularly

3. File Access Control

# .htaccess
<FilesMatch "\.(json|meta)$">
    Require all denied
</FilesMatch>

4. Rate Limiting

Implement per-IP limits
Prevent API abuse
Monitor costs
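
A per-IP limit can be sketched as a sliding-window counter. This is a single-process illustration with hypothetical names; behind several web servers the counters would need to live in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        now = self.clock()
        recent = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

Rejected requests are not recorded, so a client that keeps retrying while blocked does not extend its own lockout.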

5. HTTPS

Required for production
Protects API keys in transit
Secures file uploads

Performance Optimization

1. Caching Strategy

# Multi-level caching
L1: In-memory (Python dict)
L2: Disk (.cache/ directory)
L3: API response (if cache miss)

2. Parallel Processing

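# Process multiple PDFs concurrently
from multiprocessing import Pool

# check_pdf(path) and pdf_files are assumed to be defined elsewhere in the module
if __name__ == "__main__":  # guard required on platforms that spawn worker processes
    with Pool(4) as pool:
        results = pool.map(check_pdf, pdf_files)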

3. Image Optimization

# Reduce API costs
- Resize images to max 2048px
- Use JPEG compression (quality=85)
- Cache results by hash
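
The resize rule can be sketched as a pure size calculation. The function name is illustrative, and the sketch assumes "max 2048px" refers to the longest side with the aspect ratio preserved.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 2048) -> Tuple[int, int]:
    """Scale (width, height) down so the longest side is at most max_side.

    Images that already fit are returned unchanged; they are never upscaled.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels and keep each side at least 1 pixel.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

The resulting size can then be passed to Pillow's `Image.resize`, and the image saved with `format="JPEG", quality=85` before it is sent to the APIs.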

4. Lazy Loading

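# Don't load entire PDF into memory
# Process page-by-page instead of extracting everything up front
import pdfplumber

# pdf_path and process_page are assumed to be defined elsewhere
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        process_page(page)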

Scalability

Horizontal Scaling

Load Balancer
      │
      ├─→ Web Server 1 (api.php)
      │        ↓
      │   Processing Queue
      │
      ├─→ Web Server 2 (api.php)
      │        ↓
      │   Processing Queue
      │
      └─→ Web Server N (api.php)
               ↓
          Processing Queue
               ↓
      ┌───────┴───────┐
      ▼               ▼
  Worker 1        Worker N
  (Python)        (Python)

Queue-Based Architecture

# Use Redis or RabbitMQ
1. api.php → Push job to queue
2. Worker processes → Pull from queue
3. Process PDF
4. Store results
5. Notify completion (webhook/polling)
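
The five steps can be sketched in a single process, with the standard library `queue.Queue` standing in for Redis or RabbitMQ and threads standing in for worker processes. All names are illustrative, and `process` is a placeholder for the PDF check.

```python
import queue
import threading
from typing import Callable, Dict, Iterable


def run_workers(
    job_ids: Iterable[str],
    process: Callable[[str], Dict],
    num_workers: int = 2,
) -> Dict[str, Dict]:
    """Push jobs to a queue, let workers pull and process them, collect results."""
    pending = queue.Queue()
    results: Dict[str, Dict] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                job_id = pending.get_nowait()  # step 2: pull from queue
            except queue.Empty:
                return
            try:
                outcome = process(job_id)  # step 3: process PDF
            except Exception as exc:
                # A failing job must not take the worker down with it.
                outcome = {"status": "failed", "error": str(exc)}
            with lock:
                results[job_id] = outcome  # step 4: store results

    for job_id in job_ids:
        pending.put(job_id)  # step 1: push job to queue
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results  # step 5: the caller can now notify or be polled
```

With an external broker the same loop runs in separate worker processes, and the `results` dictionary is replaced by the `results/` directory described earlier.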

Cloud Deployment

AWS:

EC2 for web servers
S3 for file storage
SQS for job queue
Lambda for workers

Google Cloud:

Compute Engine for servers
Cloud Storage for files
Cloud Tasks for queue
Cloud Functions for workers

Monitoring & Logging

Key Metrics

Processing Time: Average duration per check
API Costs: Daily/monthly spend
Cache Hit Rate: Percentage of cached results
Error Rate: Failed checks per day
Queue Length: Pending jobs

Logging Strategy

import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('checker.log'),
        logging.StreamHandler()
    ]
)

# Module-level logger used by the calls below
logger = logging.getLogger(__name__)

# Log important events
logger.info(f"Processing: {filename}")
logger.warning(f"Low contrast detected: page {page_num}")
logger.error(f"API error: {error}")

Testing Strategy

Unit Tests

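import unittest

# Assumption: ColorContrastChecker is importable from the main script module
from enterprise_pdf_checker import ColorContrastChecker


class TestColorContrast(unittest.TestCase):
    def test_contrast_calculation(self):
        ratio = ColorContrastChecker.calculate_contrast_ratio(
            (255, 255, 255),  # White
            (0, 0, 0)         # Black
        )
        # Black on white is the maximum WCAG contrast ratio, 21:1
        self.assertAlmostEqual(ratio, 21.0, places=1)


if __name__ == "__main__":
    unittest.main()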

Integration Tests

# Test full pipeline
python3 enterprise_pdf_checker.py test_pdfs/sample.pdf
# Verify: results match expectations

API Tests

# Test Claude integration
def test_claude_api():
    result = analyze_image_with_claude(test_image_bytes)
    assert 'alt_text' in result
    assert len(result['alt_text']) < 125

Deployment Checklist

Install all dependencies
Configure API keys
Set up web server (Apache/Nginx)
Configure HTTPS
Set file permissions
Enable error logging
Test with sample PDFs
Configure backups
Set up monitoring
Document runbook

Future Enhancements

Planned Features

User Authentication - Multi-user support
Report History - Track changes over time
Batch Upload - Multiple PDFs at once
PDF Remediation - Auto-fix some issues
Custom Rules - Organization-specific checks
Webhooks - Completion notifications
PDF Comparison - Before/after analysis
API Rate Limiting - Per-user quotas
Advanced Caching - Redis integration
Machine Learning - Pattern detection

Technical Requirements Summary

Component	Version	Purpose
Python	3.8+	Core processing
PHP	7.4+	Web API
Tesseract	4.0+	OCR
Poppler	0.86+	PDF rendering
pypdf	4.0+	PDF parsing
Anthropic SDK	0.18+	Claude API
Google Cloud	3.4+	Vision API

Support & Maintenance

Regular Maintenance

Daily: Check logs for errors
Weekly: Review API costs
Monthly: Update dependencies
Quarterly: Security audit

Backup Strategy

Files: uploads/, results/ → Daily
Cache: .cache/ → Weekly
Code: Git repository → Continuous

Conclusion

This architecture provides:

✅ High Quality: Best-in-class AI models
✅ Scalability: Horizontal scaling support
✅ Reliability: Caching + error handling
✅ Maintainability: Modular design
✅ Cost-Effective: Smart caching reduces API costs
✅ Secure: Multiple security layers
✅ Extensible: Easy to add new checks

The system is production-ready and can handle enterprise workloads while maintaining a quality-first approach to accessibility validation.
