Enterprise PDF Accessibility Checker - System Architecture
🏗️ System Overview
This document describes the technical architecture of the Enterprise PDF Accessibility Checker.
Component Architecture
┌─────────────────────────────────────────────────────────────┐
│ USER LAYER │
├─────────────────────────────────────────────────────────────┤
│ • Web Browser (Drag & Drop Interface) │
│ • Command Line Interface │
│ • REST API Clients │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ WEB SERVER LAYER │
├─────────────────────────────────────────────────────────────┤
│ PHP Backend (api.php) │
│ • Upload Management │
│ • Job Queue │
│ • Result Storage │
│ • Authentication (optional) │
└────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PROCESSING ENGINE │
├─────────────────────────────────────────────────────────────┤
│ Python Script (enterprise_pdf_checker.py) │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Core Checking Engine │ │
│ │ • PDF parsing (pypdf, pdfplumber) │ │
│ │ • Structure analysis │ │
│ │ • Text extraction │ │
│ │ • Issue detection │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Analysis Modules │ │
│ │ • Color Contrast Checker │ │
│ │ • Readability Analyzer │ │
│ │ • OCR Quality Checker │ │
│ │ • Link Validator │ │
│ │ • Form Field Analyzer │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Cache Manager │ │
│ │ • API response caching │ │
│ │ • Cost optimization │ │
│ └────────────────────────────────────────────────┘ │
└────────────┬───────────────────────┬───────────────────────┘
│ │
▼ ▼
┌──────────────────────┐ ┌──────────────────────────────────┐
│ EXTERNAL SERVICES │ │ LOCAL PROCESSING │
├──────────────────────┤ ├──────────────────────────────────┤
│ Anthropic Claude │ │ • Tesseract OCR │
│ • Image analysis │ │ • PIL/Pillow (image processing) │
│ • Alt text validate │ │ • TextBlob (NLP) │
│ • Content quality │ │ • NumPy (calculations) │
│ │ │ • pdf2image (rendering) │
│ Google Cloud │ └──────────────────────────────────┘
│ • Vision API │
│ • Document AI │
│ • OCR + analysis │
└──────────────────────┘
Data Flow
1. Web Interface Flow
User uploads PDF
↓
index.html (JavaScript)
↓
POST /api.php?action=upload
↓
api.php saves to /uploads/
↓
Returns job_id
↓
POST /api.php?action=check (with job_id)
↓
api.php spawns Python process
↓
enterprise_pdf_checker.py processes PDF
↓
Calls Anthropic & Google APIs
↓
Writes results to /results/
↓
JavaScript polls /api.php?action=status
↓
GET /api.php?action=result
↓
Display results in browser
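The polling step in this flow can be sketched as a small client-side loop. This is a minimal sketch, not the project's JavaScript: `fetch_status` is a hypothetical callable that would wrap `GET /api.php?action=status`. Only the `"completed"` status appears in this document, so `"queued"`, `"processing"`, and `"failed"` are assumed names.

```python
import time
from typing import Callable


def wait_for_job(fetch_status: Callable[[], str],
                 interval: float = 2.0,
                 timeout: float = 600.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_status() until the job reaches a terminal state.

    fetch_status is a hypothetical wrapper around the status endpoint.
    "completed" comes from the job metadata; "failed" is an assumed name.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish within the timeout")
        sleep(interval)


# Usage with a fake status source standing in for the HTTP call.
_responses = iter(["queued", "processing", "completed"])
final_status = wait_for_job(lambda: next(_responses), sleep=lambda _s: None)
```

Injecting `sleep` keeps the loop testable without real delays. A production client would likely also back off between polls.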
2. Command Line Flow
User runs: python3 enterprise_pdf_checker.py doc.pdf
↓
Script loads PDF with pypdf/pdfplumber
↓
Runs all checking modules sequentially
↓
For each image:
• Extract image bytes
• Check cache
• If not cached:
- Call Claude Vision API
- Call Google Vision API
- Cache results
• Process analysis
↓
For each page:
• Extract text
• Check readability
• Analyze color contrast
• Validate structure
↓
Aggregate all issues
↓
Calculate accessibility score
↓
Generate JSON report
↓
Output to file or stdout
Module Details
1. EnterprisePDFChecker (Main Class)
Responsibilities:
- Orchestrate all checks
- Manage API clients
- Track statistics
- Generate reports
Key Methods:
- check_all() - Run all accessibility checks
- _check_basic_structure() - Verify PDF tagging
- _check_images_comprehensive() - AI-powered image analysis
- _check_color_contrast() - WCAG contrast validation
- _check_readability() - Content quality analysis
- generate_json_report() - Create output
2. ColorContrastChecker
Responsibilities:
- Calculate luminance values
- Compute contrast ratios
- Validate WCAG compliance
Algorithm:
1. Convert PDF page to image
2. Sample N random pixel pairs
3. For each pair:
• Calculate relative luminance (WCAG formula)
• Compute contrast ratio: (L1 + 0.05) / (L2 + 0.05), where L1 is the lighter and L2 the darker luminance
• Compare to WCAG thresholds:
- AA Normal: 4.5:1
- AA Large: 3.0:1
- AAA Normal: 7.0:1
4. Report percentage failing standards
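Step 3 of the algorithm can be sketched with the WCAG 2.x luminance and contrast formulas. The function names here are illustrative standalone helpers, not the actual `ColorContrastChecker` methods, and the sketch leaves out page rendering and pixel sampling.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def relative_luminance(rgb: RGB) -> float:
    """WCAG 2.x relative luminance of an 8-bit sRGB color (0.0 to 1.0)."""
    def linearize(channel: int) -> float:
        c = channel / 255.0
        # Piecewise sRGB-to-linear conversion as given in WCAG 2.x.
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(a: RGB, b: RGB) -> float:
    """(lighter + 0.05) / (darker + 0.05); ranges from 1.0 to 21.0."""
    la, lb = relative_luminance(a), relative_luminance(b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def wcag_levels(ratio: float) -> dict:
    """Compare a ratio against the three thresholds listed above."""
    return {
        "AA_normal": ratio >= 4.5,
        "AA_large": ratio >= 3.0,
        "AAA_normal": ratio >= 7.0,
    }


ratio = contrast_ratio((255, 255, 255), (0, 0, 0))  # white on black
```

Ordering the two luminances inside `contrast_ratio` means callers do not need to know which color is lighter.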
3. ReadabilityAnalyzer
Responsibilities:
- Calculate reading difficulty
- Identify complex content
- Provide grade-level estimates
Metrics:
- Flesch Reading Ease (0-100, higher = easier)
- Flesch-Kincaid Grade Level (US school grade)
- Average sentence length
- Complex word percentage
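The four metrics can be sketched with the standard Flesch formulas. This standard-library version is for illustration only and may not match how `ReadabilityAnalyzer` computes them. The syllable counter is a rough heuristic, and treating words of three or more syllables as "complex" is an assumption.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def analyze(text: str) -> dict:
    """Compute the four metrics for a block of text; {} if text is empty."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {}
    n_w, n_s = len(words), len(sentences)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(count_syllables(w) >= 3 for w in words)  # assumed cutoff
    return {
        "flesch_reading_ease": flesch_reading_ease(n_w, n_s, syllables),
        "flesch_kincaid_grade": flesch_kincaid_grade(n_w, n_s, syllables),
        "avg_sentence_length": n_w / n_s,
        "complex_word_pct": 100.0 * complex_words / n_w,
    }


metrics = analyze("The cat sat. The dog ran.")
```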
4. CacheManager
Responsibilities:
- Store API responses
- Reduce duplicate calls
- Control costs
Strategy:
# Cache key = SHA256(image_bytes) + prefix
# Cache hit: Return stored result (free)
# Cache miss: Call API → Cache → Return
Savings:
- Repeat document check: ~$0.10 → $0.00
- Similar images across documents: Cached automatically
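The hit/miss strategy can be sketched as a small disk-backed cache. `DiskCache` is a hypothetical stand-in, not the real `CacheManager`. Putting the prefix before the hash is an assumption, as is storing one JSON file per key (which mirrors the `.cache/{hash}.json` layout).

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class DiskCache:
    """Hypothetical stand-in for CacheManager: one JSON file per cache key."""

    def __init__(self, cache_dir: str) -> None:
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def make_key(prefix: str, image_bytes: bytes) -> str:
        # The prefix keeps results from different APIs apart for the same image.
        return f"{prefix}_{hashlib.sha256(image_bytes).hexdigest()}"

    def get_or_call(self, prefix: str, image_bytes: bytes,
                    call_api: Callable[[bytes], dict]) -> dict:
        path = self.dir / f"{self.make_key(prefix, image_bytes)}.json"
        if path.exists():                    # cache hit: no API call
            return json.loads(path.read_text(encoding="utf-8"))
        result = call_api(image_bytes)       # cache miss: call, then store
        path.write_text(json.dumps(result), encoding="utf-8")
        return result


# Usage with a fake API function that records each invocation.
calls = []


def fake_api(data: bytes) -> dict:
    calls.append(len(data))
    return {"alt_text": "example"}


cache = DiskCache(tempfile.mkdtemp())
first = cache.get_or_call("claude", b"image-bytes", fake_api)
second = cache.get_or_call("claude", b"image-bytes", fake_api)
```

Because the key depends only on the image bytes and the prefix, the same image embedded in a different document hits the same cache entry.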
API Integration
Anthropic Claude 3.5 Sonnet
Endpoint: https://api.anthropic.com/v1/messages
Request:
{
"model": "claude-3-5-sonnet-20241022",
"max_tokens": 1024,
"messages": [{
"role": "user",
"content": [
{"type": "image", "source": {...}},
{"type": "text", "text": "Analyze for accessibility..."}
]
}]
}
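A request body of this shape can be built as follows. This is a sketch that uses the base64 image source format of the Messages API. `build_image_request` is a hypothetical helper name, and the sample bytes are a placeholder rather than a real image.

```python
import base64
import json


def build_image_request(image_bytes: bytes, media_type: str, prompt: str,
                        model: str = "claude-3-5-sonnet-20241022",
                        max_tokens: int = 1024) -> dict:
    """Build a Messages API body with one base64 image and one text block."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,  # e.g. "image/png"
                        "data": base64.b64encode(image_bytes).decode("ascii"),
                    },
                },
                {"type": "text", "text": prompt},
            ],
        }],
    }


# Placeholder bytes; a real call would pass the extracted image data.
body = build_image_request(b"\x89PNG...", "image/png",
                           "Analyze for accessibility...")
payload = json.dumps(body)  # JSON string to POST to the endpoint above
```

The request itself also needs the `x-api-key`, `anthropic-version`, and `content-type` headers, which the official SDK sets when used instead of raw HTTP.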
Response Parsing:
# Claude returns JSON with:
{
"alt_text": "...",
"has_text": true/false,
"type": "decorative|informational|complex",
"concerns": [...],
"quality_rating": 1-10
}
Used For:
- Alt text quality validation
- Image content description
- Text-in-image detection
- Color-only information checks
- Content quality analysis
Google Cloud Vision API
Endpoint: https://vision.googleapis.com/v1/images:annotate
Features Used:
- TEXT_DETECTION - OCR for text in images
- LABEL_DETECTION - Image content classification
- IMAGE_PROPERTIES - Dominant colors
- OBJECT_LOCALIZATION - Object identification
Used For:
- Detecting text in images (WCAG 1.4.5)
- Cross-validating Claude's analysis
- OCR quality assessment
- Object recognition
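The sketch below builds an `images:annotate` request body for the four features listed above and reads detected text from a response. It assumes the REST JSON interface rather than the client library. Both helper names are hypothetical, and the sample response is hand-written for illustration.

```python
import base64

FEATURES = ("TEXT_DETECTION", "LABEL_DETECTION",
            "IMAGE_PROPERTIES", "OBJECT_LOCALIZATION")


def build_annotate_request(image_bytes: bytes) -> dict:
    """Build a request body asking for all four features on one image."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": name} for name in FEATURES],
        }]
    }


def detected_text(response: dict) -> str:
    """Return the full detected text, or "" if the image contains none.

    The first textAnnotations entry holds the entire detected text block.
    """
    annotations = response["responses"][0].get("textAnnotations", [])
    return annotations[0]["description"] if annotations else ""


request_body = build_annotate_request(b"fake-image-bytes")
# Hand-written sample response, not real API output.
sample_response = {"responses": [{"textAnnotations": [
    {"description": "SALE 50% OFF"},
    {"description": "SALE"},
]}]}
text_in_image = detected_text(sample_response)
```

A non-empty result from `detected_text` is what would flag an image for the WCAG 1.4.5 (Images of Text) check.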
Google Document AI (Optional)
Endpoint: https://documentai.googleapis.com/v1/projects/*/locations/*/processors/*:process
Used For:
- High-quality OCR on scanned PDFs
- Complex document layout analysis
- Better than Tesseract for production use
Database Schema
File Storage Structure
project/
├── uploads/
│ └── pdf_{job_id}.pdf # Uploaded files
├── results/
│ ├── {job_id}.meta.json # Job metadata
│ └── {job_id}.result.json # Check results
└── .cache/
└── {hash}.json # Cached API responses
Job Metadata (*.meta.json)
{
"job_id": "pdf_67890abcdef",
"original_filename": "document.pdf",
"uploaded_at": "2025-01-20 10:00:00",
"file_size": 2048576,
"status": "completed",
"filepath": "/uploads/pdf_67890abcdef.pdf",
"started_at": "2025-01-20 10:00:05",
"completed_at": "2025-01-20 10:03:20"
}
Check Results (*.result.json)
{
"filename": "document.pdf",
"total_pages": 10,
"accessibility_score": 75,
"severity_counts": {
"critical": 0,
"error": 3,
"warning": 5,
"info": 2,
"success": 8
},
"stats": {
"total_checks": 16,
"api_calls": 5,
"cached_calls": 3,
"total_cost_estimate": 0.08,
"duration": 125.5
},
"issues": [...]
}
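The `severity_counts` block can be derived from the issue list with a simple tally. The structure of individual issues is elided above, so the `"severity"` field used here is an assumption, as are the sample issues.

```python
from collections import Counter
from typing import Dict, Iterable

SEVERITIES = ("critical", "error", "warning", "info", "success")


def count_severities(issues: Iterable[dict]) -> Dict[str, int]:
    """Tally issues per severity, reporting 0 for levels with no issues.

    Assumes each issue dict carries a "severity" key (not confirmed above).
    """
    counts = Counter(issue["severity"] for issue in issues)
    return {name: counts.get(name, 0) for name in SEVERITIES}


# Illustrative issues only; real entries would carry more fields.
sample_issues = [
    {"severity": "error", "message": "Image missing alt text"},
    {"severity": "warning", "message": "Low contrast on page 2"},
    {"severity": "error", "message": "Document is untagged"},
]
severity_counts = count_severities(sample_issues)
```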
Security Considerations
1. Input Validation
- File type whitelist (PDF only)
- File size limit (50MB default)
- Malware scanning (recommended)
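The first two checks can be sketched as below. In this system the upload is handled by `api.php`, so this Python version only illustrates the logic. `validate_upload` is a hypothetical name, and the header check is deliberately strict: it requires `%PDF-` at byte 0.

```python
from typing import List

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50 MB default from the list above


def validate_upload(filename: str, data: bytes,
                    max_bytes: int = MAX_UPLOAD_BYTES) -> List[str]:
    """Return a list of validation errors; an empty list means accepted."""
    errors = []
    if not filename.lower().endswith(".pdf"):
        errors.append("extension is not .pdf")
    # Check content, not just the name: PDF files begin with "%PDF-".
    if not data.startswith(b"%PDF-"):
        errors.append("missing %PDF- header")
    if len(data) > max_bytes:
        errors.append("file exceeds size limit")
    return errors


ok = validate_upload("report.pdf", b"%PDF-1.7\n...")
rejected = validate_upload("report.exe", b"MZ\x90\x00")
```

Checking the header as well as the extension catches files that were simply renamed to `.pdf`.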
2. API Key Protection
- Stored in environment variables
- Never in version control
- Rotated regularly
3. File Access Control
# .htaccess
<FilesMatch "\.(json|meta)$">
Require all denied
</FilesMatch>
4. Rate Limiting
- Implement per-IP limits
- Prevent API abuse
- Monitor costs
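A per-IP limit can be sketched as an in-memory sliding window. `SlidingWindowLimiter` is a hypothetical class, and the limit of 2 requests per 60 seconds is only an example value. With several web servers, the counters would need to live in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str) -> bool:
        now = self.clock()
        recent = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


# Usage with a controllable fake clock instead of real time.
fake_now = [0.0]
limiter = SlidingWindowLimiter(2, 60.0, clock=lambda: fake_now[0])
decisions = [limiter.allow("203.0.113.7") for _ in range(3)]
fake_now[0] = 61.0
after_window = limiter.allow("203.0.113.7")
```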
5. HTTPS
- Required for production
- Protects API keys in transit
- Secures file uploads
Performance Optimization
1. Caching Strategy
# Multi-level caching
L1: In-memory (Python dict)
L2: Disk (.cache/ directory)
L3: API response (if cache miss)
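The fall-through between the three levels can be sketched as follows. `LayeredCache` is a hypothetical class. The disk layer is passed in as two callables, so a plain dict stands in for the `.cache/` directory in the usage example.

```python
from typing import Callable, Dict, Optional


class LayeredCache:
    """Look up memory first, then disk, and call the API only on a full miss."""

    def __init__(self, disk_get: Callable[[str], Optional[dict]],
                 disk_put: Callable[[str, dict], None]) -> None:
        self.memory: Dict[str, dict] = {}   # L1
        self.disk_get = disk_get            # L2 read, returns None on a miss
        self.disk_put = disk_put            # L2 write

    def fetch(self, key: str, call_api: Callable[[], dict]) -> dict:
        if key in self.memory:
            return self.memory[key]
        value = self.disk_get(key)
        if value is None:
            value = call_api()              # L3: the real request
            self.disk_put(key, value)
        self.memory[key] = value            # promote to L1 for later lookups
        return value


# Usage: a dict plays the role of the disk layer, a counter tracks API calls.
disk: Dict[str, dict] = {}
api_calls = []


def fake_api() -> dict:
    api_calls.append(1)
    return {"ok": True}


cache = LayeredCache(disk.get, disk.__setitem__)
cache.fetch("k", fake_api)   # full miss: calls the API, fills L2 and L1
cache.memory.clear()         # simulate a process restart
cache.fetch("k", fake_api)   # L2 hit: no API call, value promoted to L1
```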
2. Parallel Processing
# Process multiple PDFs concurrently
from multiprocessing import Pool

if __name__ == "__main__":  # guard needed where worker processes are spawned
    with Pool(4) as pool:
        results = pool.map(check_pdf, pdf_files)
3. Image Optimization
# Reduce API costs
- Resize images to max 2048px
- Use JPEG compression (quality=85)
- Cache results by hash
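The resize rule can be sketched as a pure size calculation. `fit_within` is a hypothetical helper. Preserving aspect ratio and never upscaling are assumptions about the intended behavior.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 2048) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height          # already small enough; never upscale
    scale = max_side / longest
    # Keep the aspect ratio and make sure no side collapses to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = fit_within(4096, 2048)
portrait = fit_within(3000, 6000)
small = fit_within(800, 600)
```

With Pillow, the returned size could be passed to `Image.resize` and the result saved as JPEG with `quality=85`.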
4. Lazy Loading
# Don't extract the entire PDF up front;
# process it one page at a time
import pdfplumber

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        process_page(page)
Scalability
Horizontal Scaling
Load Balancer
│
├─→ Web Server 1 (api.php)
│ ↓
│ Processing Queue
│
├─→ Web Server 2 (api.php)
│ ↓
│ Processing Queue
│
└─→ Web Server N (api.php)
↓
Processing Queue
↓
┌───────┴───────┐
▼ ▼
Worker 1 Worker N
(Python) (Python)
Queue-Based Architecture
# Use Redis or RabbitMQ
1. api.php → Push job to queue
2. Worker processes → Pull from queue
3. Process PDF
4. Store results
5. Notify completion (webhook/polling)
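The five steps can be sketched in-process with the standard library's `queue.Queue` standing in for Redis or RabbitMQ and threads standing in for worker processes. `run_workers`, `process_pdf`, and the job IDs are hypothetical, and the result dictionary's status names are assumptions.

```python
import queue
import threading
from typing import Callable, Dict, Iterable


def run_workers(job_ids: Iterable[str], process_pdf: Callable[[str], object],
                num_workers: int = 2) -> Dict[str, dict]:
    """Push job IDs onto a queue and let worker threads drain it."""
    jobs: queue.Queue = queue.Queue()
    results: Dict[str, dict] = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            job_id = jobs.get()                    # step 2: pull from queue
            if job_id is None:                     # sentinel: no more work
                return
            try:
                outcome = {"status": "completed",
                           "result": process_pdf(job_id)}   # step 3
            except Exception as exc:               # keep the worker alive
                outcome = {"status": "failed", "error": str(exc)}
            with lock:
                results[job_id] = outcome          # step 4: store results

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job_id in job_ids:
        jobs.put(job_id)                           # step 1: push job
    for _ in threads:
        jobs.put(None)                             # one sentinel per worker
    for t in threads:
        t.join()                                   # step 5: wait for completion
    return results


results = run_workers(["pdf_a", "pdf_b", "pdf_c"], lambda job: job.upper())
```

Catching exceptions inside the worker means one bad PDF is recorded as failed and does not stop the remaining jobs.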
Cloud Deployment
AWS:
- EC2 for web servers
- S3 for file storage
- SQS for job queue
- Lambda for workers
Google Cloud:
- Compute Engine for servers
- Cloud Storage for files
- Cloud Tasks for queue
- Cloud Functions for workers
Monitoring & Logging
Key Metrics
- Processing Time: Average duration per check
- API Costs: Daily/monthly spend
- Cache Hit Rate: Percentage of cached results
- Error Rate: Failed checks per day
- Queue Length: Pending jobs
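The cache hit rate can be computed from the `stats` block of a result file. This sketch assumes `api_calls` counts only the requests that actually reached an API and `cached_calls` counts those served from cache. `cache_hit_rate` is a hypothetical helper.

```python
def cache_hit_rate(stats: dict) -> float:
    """Percentage of lookups served from cache; 0.0 when there were none.

    Assumes api_calls and cached_calls are disjoint counts.
    """
    total = stats["api_calls"] + stats["cached_calls"]
    return 100.0 * stats["cached_calls"] / total if total else 0.0


# Values taken from the sample result file in this document.
rate = cache_hit_rate({"api_calls": 5, "cached_calls": 3})
```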
Logging Strategy
import logging

# Configure logging to both a file and the console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('checker.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Log important events
logger.info(f"Processing: {filename}")
logger.warning(f"Low contrast detected: page {page_num}")
logger.error(f"API error: {error}")
Testing Strategy
Unit Tests
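import unittest

# Assumes ColorContrastChecker is defined in enterprise_pdf_checker.py
from enterprise_pdf_checker import ColorContrastChecker

class TestColorContrast(unittest.TestCase):
    def test_contrast_calculation(self):
        ratio = ColorContrastChecker.calculate_contrast_ratio(
            (255, 255, 255),  # White
            (0, 0, 0)         # Black
        )
        self.assertAlmostEqual(ratio, 21.0, places=1)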
Integration Tests
# Test full pipeline
python3 enterprise_pdf_checker.py test_pdfs/sample.pdf
# Verify: results match expectations
API Tests
# Test Claude integration
def test_claude_api():
    result = analyze_image_with_claude(test_image_bytes)
    assert 'alt_text' in result
    assert len(result['alt_text']) < 125
Deployment Checklist
- Install all dependencies
- Configure API keys
- Set up web server (Apache/Nginx)
- Configure HTTPS
- Set file permissions
- Enable error logging
- Test with sample PDFs
- Configure backups
- Set up monitoring
- Document runbook
Future Enhancements
Planned Features
- User Authentication - Multi-user support
- Report History - Track changes over time
- Batch Upload - Multiple PDFs at once
- PDF Remediation - Auto-fix some issues
- Custom Rules - Organization-specific checks
- Webhooks - Completion notifications
- PDF Comparison - Before/after analysis
- API Rate Limiting - Per-user quotas
- Advanced Caching - Redis integration
- Machine Learning - Pattern detection
Technical Requirements Summary
| Component | Version | Purpose |
|---|---|---|
| Python | 3.8+ | Core processing |
| PHP | 7.4+ | Web API |
| Tesseract | 4.0+ | OCR |
| Poppler | 0.86+ | PDF rendering |
| pypdf | 4.0+ | PDF parsing |
| Anthropic SDK | 0.18+ | Claude API |
| Google Cloud | 3.4+ | Vision API |
Support & Maintenance
Regular Maintenance
- Daily: Check logs for errors
- Weekly: Review API costs
- Monthly: Update dependencies
- Quarterly: Security audit
Backup Strategy
- Files: uploads/, results/ → Daily
- Cache: .cache/ → Weekly
- Code: Git repository → Continuous
Conclusion
This architecture provides:
- ✅ High Quality: Best-in-class AI models
- ✅ Scalability: Horizontal scaling support
- ✅ Reliability: Caching + error handling
- ✅ Maintainability: Modular design
- ✅ Cost-Effective: Smart caching reduces API costs
- ✅ Secure: Multiple security layers
- ✅ Extensible: Easy to add new checks
The system is production-ready and can handle enterprise workloads while maintaining a quality-first approach to accessibility validation.