# Enterprise PDF Accessibility Checker - System Architecture

## 🏗️ System Overview

This document describes the technical architecture of the Enterprise PDF Accessibility Checker.

---

## Component Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        USER LAYER                            │
├─────────────────────────────────────────────────────────────┤
│  • Web Browser (Drag & Drop Interface)                      │
│  • Command Line Interface                                   │
│  • REST API Clients                                         │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                     WEB SERVER LAYER                         │
├─────────────────────────────────────────────────────────────┤
│  PHP Backend (api.php)                                      │
│  • Upload Management                                        │
│  • Job Queue                                                │
│  • Result Storage                                           │
│  • Authentication (optional)                                │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                  PROCESSING ENGINE                           │
├─────────────────────────────────────────────────────────────┤
│  Python Script (enterprise_pdf_checker.py)                  │
│                                                              │
│  ┌────────────────────────────────────────────────┐        │
│  │         Core Checking Engine                   │        │
│  │  • PDF parsing (pypdf, pdfplumber)            │        │
│  │  • Structure analysis                          │        │
│  │  • Text extraction                             │        │
│  │  • Issue detection                             │        │
│  └────────────────────────────────────────────────┘        │
│                                                              │
│  ┌────────────────────────────────────────────────┐        │
│  │         Analysis Modules                       │        │
│  │  • Color Contrast Checker                     │        │
│  │  • Readability Analyzer                       │        │
│  │  • OCR Quality Checker                        │        │
│  │  • Link Validator                             │        │
│  │  • Form Field Analyzer                        │        │
│  └────────────────────────────────────────────────┘        │
│                                                              │
│  ┌────────────────────────────────────────────────┐        │
│  │         Cache Manager                          │        │
│  │  • API response caching                       │        │
│  │  • Cost optimization                          │        │
│  └────────────────────────────────────────────────┘        │
└────────────┬───────────────────────┬───────────────────────┘
             │                       │
             ▼                       ▼
┌──────────────────────┐   ┌──────────────────────────────────┐
│  EXTERNAL SERVICES   │   │    LOCAL PROCESSING               │
├──────────────────────┤   ├──────────────────────────────────┤
│  Anthropic Claude    │   │  • Tesseract OCR                  │
│  • Image analysis    │   │  • PIL/Pillow (image processing) │
│  • Alt text validate │   │  • TextBlob (NLP)                │
│  • Content quality   │   │  • NumPy (calculations)          │
│                      │   │  • pdf2image (rendering)         │
│  Google Cloud        │   └──────────────────────────────────┘
│  • Vision API        │
│  • Document AI       │
│  • OCR + analysis    │
└──────────────────────┘
```

---

## Data Flow

### 1. Web Interface Flow

```
User uploads PDF
      ↓
index.html (JavaScript)
      ↓
POST /api.php?action=upload
      ↓
api.php saves to /uploads/
      ↓
Returns job_id
      ↓
POST /api.php?action=check (with job_id)
      ↓
api.php spawns Python process
      ↓
enterprise_pdf_checker.py processes PDF
      ↓
Calls Anthropic & Google APIs
      ↓
Writes results to /results/
      ↓
JavaScript polls /api.php?action=status
      ↓
GET /api.php?action=result
      ↓
Display results in browser
```
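
For REST API clients, the polling step of this flow can be sketched in Python. This is a minimal sketch under assumptions: `wait_for_result`, its parameters, and the `"failed"` status are hypothetical (only `"completed"` appears in this document), and the two callables stand in for whatever HTTP calls the client makes to `action=status` and `action=result`.

```python
import time


def wait_for_result(get_status, get_result, job_id,
                    interval=2.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll a job until it finishes, then fetch its result.

    get_status(job_id) -> str and get_result(job_id) -> dict are supplied
    by the caller (hypothetical wrappers around the status/result endpoints).
    """
    deadline = clock() + timeout
    while True:
        status = get_status(job_id)
        if status == "completed":
            return get_result(job_id)
        if status == "failed":  # assumed status name, not confirmed by the API
            raise RuntimeError(f"job {job_id} failed")
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in {timeout}s")
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting.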

### 2. Command Line Flow

```
User runs: python3 enterprise_pdf_checker.py doc.pdf
      ↓
Script loads PDF with pypdf/pdfplumber
      ↓
Runs all checking modules sequentially
      ↓
For each image:
  • Extract image bytes
  • Check cache
  • If not cached:
      - Call Claude Vision API
      - Call Google Vision API
      - Cache results
  • Process analysis
      ↓
For each page:
  • Extract text
  • Check readability
  • Analyze color contrast
  • Validate structure
      ↓
Aggregate all issues
      ↓
Calculate accessibility score
      ↓
Generate JSON report
      ↓
Output to file or stdout
```

---

## Module Details

### 1. EnterprisePDFChecker (Main Class)

**Responsibilities:**
- Orchestrate all checks
- Manage API clients
- Track statistics
- Generate reports

**Key Methods:**
- `check_all()` - Run all accessibility checks
- `_check_basic_structure()` - Verify PDF tagging
- `_check_images_comprehensive()` - AI-powered image analysis
- `_check_color_contrast()` - WCAG contrast validation
- `_check_readability()` - Content quality analysis
- `generate_json_report()` - Create output

### 2. ColorContrastChecker

**Responsibilities:**
- Calculate luminance values
- Compute contrast ratios
- Validate WCAG compliance

**Algorithm:**
```
1. Render the PDF page to an image
2. Sample N random pixel pairs
3. For each pair:
   • Calculate the relative luminance of each pixel (WCAG formula)
   • Compute contrast ratio: (L1 + 0.05) / (L2 + 0.05), where L1 is the lighter luminance
   • Compare to WCAG thresholds:
     - AA Normal: 4.5:1
     - AA Large: 3.0:1
     - AAA Normal: 7.0:1
4. Report the percentage of pairs failing each standard
```
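
The luminance and ratio steps follow the published WCAG 2.x definitions and can be sketched as below. The function names here are illustrative and are not necessarily those used inside `ColorContrastChecker`; the page rendering and pixel sampling steps are left out.

```python
def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB color given as 0-255 ints."""
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """Contrast ratio; the lighter color always goes in the numerator."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def wcag_levels(ratio):
    """Map a ratio onto the three thresholds listed above."""
    return {
        "AA_normal": ratio >= 4.5,
        "AA_large": ratio >= 3.0,
        "AAA_normal": ratio >= 7.0,
    }
```

Because the lighter value is chosen with `max`, the argument order does not matter, which avoids ratios below 1 when a sampled pair arrives dark-first.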

### 3. ReadabilityAnalyzer

**Responsibilities:**
- Calculate reading difficulty
- Identify complex content
- Provide grade-level estimates

**Metrics:**
- **Flesch Reading Ease** (0-100, higher = easier)
- **Flesch-Kincaid Grade Level** (US school grade)
- **Average sentence length**
- **Complex word percentage**
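
The two Flesch metrics are standard formulas over word, sentence, and syllable counts. The sketch below is illustrative only: the syllable counter is a crude vowel-group heuristic, and the function names are hypothetical rather than the module's actual API.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per vowel group, minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_scores(words, sentences, syllables):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) from counts."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level


def analyze(text):
    """Compute the metrics for a block of text, or None if it has no words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    ease, grade = flesch_scores(len(words), len(sentences), syllables)
    return {
        "flesch_reading_ease": ease,
        "flesch_kincaid_grade": grade,
        "avg_sentence_length": len(words) / len(sentences),
    }
```

Returning `None` for empty text lets a caller skip image-only pages instead of dividing by zero.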

### 4. CacheManager

**Responsibilities:**
- Store API responses
- Reduce duplicate calls
- Control costs

**Strategy:**
```python
# Cache key = SHA256(image_bytes) + prefix
# Cache hit: Return stored result (free)
# Cache miss: Call API → Cache → Return
```

**Savings:**
- Repeat document check: ~$0.10 → $0.00
- Similar images across documents: Cached automatically
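
A disk-backed version of this strategy might look like the following sketch. The class name, the `prefix_digest.json` file naming, and the `compute` callback are assumptions made for illustration; the project's `CacheManager` may be organized differently.

```python
import hashlib
import json
from pathlib import Path


class DiskCache:
    """Stores one JSON file per (prefix, image hash) pair."""

    def __init__(self, cache_dir=".cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def _path(self, prefix, image_bytes):
        digest = hashlib.sha256(image_bytes).hexdigest()
        return self.cache_dir / f"{prefix}_{digest}.json"

    def get_or_compute(self, prefix, image_bytes, compute):
        """Return the cached result, or call compute(image_bytes) and store it."""
        path = self._path(prefix, image_bytes)
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text(encoding="utf-8"))
        self.misses += 1
        result = compute(image_bytes)  # the paid API call happens only here
        path.write_text(json.dumps(result), encoding="utf-8")
        return result
```

Keying on the image bytes rather than the file name is what lets identical images in different documents share one cached response.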

---

## API Integration

### Anthropic Claude 3.5 Sonnet

**Endpoint:** `https://api.anthropic.com/v1/messages`

**Request:**
```python
{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 1024,
  "messages": [{
    "role": "user",
    "content": [
      {"type": "image", "source": {...}},
      {"type": "text", "text": "Analyze for accessibility..."}
    ]
  }]
}
```

**Response Parsing:**
```
# Claude returns JSON with:
{
  "alt_text": "...",
  "has_text": true/false,
  "type": "decorative|informational|complex",
  "concerns": [...],
  "quality_rating": 1-10
}
```
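
Model replies can wrap the JSON in extra prose or omit a field, so it is worth validating before use. The sketch below assumes the five fields listed above; `parse_image_analysis` is a hypothetical helper name, not a function confirmed to exist in the checker.

```python
import json

REQUIRED_KEYS = {"alt_text", "has_text", "type", "concerns", "quality_rating"}
ALLOWED_TYPES = {"decorative", "informational", "complex"}


def parse_image_analysis(reply_text):
    """Extract and validate the JSON object in a reply; raise ValueError if bad."""
    start, end = reply_text.find("{"), reply_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    try:
        data = json.loads(reply_text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in reply: {exc}") from exc

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unexpected type: {data['type']!r}")
    rating = data["quality_rating"]
    if not isinstance(rating, (int, float)) or not 1 <= rating <= 10:
        raise ValueError(f"quality_rating out of range: {rating!r}")
    return data
```

Raising a single exception type keeps the caller's error handling simple: any `ValueError` can be logged and reported as an analysis failure for that image.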

**Used For:**
- Alt text quality validation
- Image content description
- Text-in-image detection
- Color-only information checks
- Content quality analysis

### Google Cloud Vision API

**Endpoint:** `https://vision.googleapis.com/v1/images:annotate`

**Features Used:**
- **TEXT_DETECTION** - OCR for text in images
- **LABEL_DETECTION** - Image content classification
- **IMAGE_PROPERTIES** - Dominant colors
- **OBJECT_LOCALIZATION** - Object identification

**Used For:**
- Detecting text in images (WCAG 1.4.5)
- Cross-validating Claude's analysis
- OCR quality assessment
- Object recognition
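
For reference, a request body for `images:annotate` that asks for the four features above could be assembled as in this sketch. `build_annotate_request` is a hypothetical helper; authentication and the HTTP call itself are omitted.

```python
import base64

# Feature types listed in "Features Used" above.
FEATURES = (
    "TEXT_DETECTION",
    "LABEL_DETECTION",
    "IMAGE_PROPERTIES",
    "OBJECT_LOCALIZATION",
)


def build_annotate_request(image_bytes, features=FEATURES):
    """Build the JSON-serializable body for one image and several features."""
    return {
        "requests": [
            {
                # The REST API expects the image as base64-encoded text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": name} for name in features],
            }
        ]
    }
```

Requesting all features in one call keeps the number of round trips per image at one, which matters when a PDF contains many images.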

### Google Document AI (Optional)

**Endpoint:** `https://documentai.googleapis.com/v1/projects/*/locations/*/processors/*:process`

**Used For:**
- High-quality OCR on scanned PDFs
- Complex document layout analysis
- Better than Tesseract for production use

---

## Storage Schema

### File Storage Structure

```
project/
├── uploads/
│   └── pdf_{job_id}.pdf           # Uploaded files
├── results/
│   ├── {job_id}.meta.json         # Job metadata
│   └── {job_id}.result.json       # Check results
└── .cache/
    └── {hash}.json                # Cached API responses
```

### Job Metadata (*.meta.json)
```json
{
  "job_id": "pdf_67890abcdef",
  "original_filename": "document.pdf",
  "uploaded_at": "2025-01-20 10:00:00",
  "file_size": 2048576,
  "status": "completed",
  "filepath": "/uploads/pdf_67890abcdef.pdf",
  "started_at": "2025-01-20 10:00:05",
  "completed_at": "2025-01-20 10:03:20"
}
```

### Check Results (*.result.json)
```json
{
  "filename": "document.pdf",
  "total_pages": 10,
  "accessibility_score": 75,
  "severity_counts": {
    "critical": 0,
    "error": 3,
    "warning": 5,
    "info": 2,
    "success": 8
  },
  "stats": {
    "total_checks": 16,
    "api_calls": 5,
    "cached_calls": 3,
    "total_cost_estimate": 0.08,
    "duration": 125.5
  },
  "issues": [...]
}
```
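
The `severity_counts` block can be derived from the `issues` list. The sketch below assumes each issue is a dict with a `severity` key holding one of the five names shown; the real issue structure is elided in this document, so treat that shape as hypothetical.

```python
from collections import Counter

SEVERITIES = ("critical", "error", "warning", "info", "success")


def count_severities(issues):
    """Tally issues per severity, always returning all five keys."""
    counts = Counter(issue["severity"] for issue in issues)
    return {name: counts.get(name, 0) for name in SEVERITIES}
```

Returning every key, including zeros, keeps the report shape stable for the front end.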

---

## Security Considerations

### 1. Input Validation
- File type whitelist (PDF only)
- File size limit (50MB default)
- Malware scanning (recommended)
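
The first two checks are cheap to express in code. The sketch below is written in Python for consistency with the processing engine, although the upload handler is PHP; `validate_upload` is a hypothetical name, and the header test is only a first-line defense, not a substitute for malware scanning.

```python
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # the 50MB default mentioned above


def validate_upload(data, filename, max_bytes=MAX_UPLOAD_BYTES):
    """Return a list of problems; an empty list means the upload looks acceptable."""
    problems = []
    if not filename.lower().endswith(".pdf"):
        problems.append("filename does not end in .pdf")
    if len(data) > max_bytes:
        problems.append(f"file exceeds {max_bytes} bytes")
    if not data.startswith(b"%PDF-"):
        # Checks file content, since the extension and MIME type are client-controlled.
        problems.append("missing %PDF- header")
    return problems
```

Collecting all problems rather than stopping at the first gives the user one complete error message per upload.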

### 2. API Key Protection
- Stored in environment variables
- Never in version control
- Rotated regularly
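
Reading keys from the environment and failing fast when one is missing might look like the sketch below. `require_env` is a hypothetical helper, and the variable name in the usage note is an assumption about how this project is configured.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Name the variable, but never echo secret values into logs.
        raise RuntimeError(f"environment variable {name} is not set")
    return value
```

A call such as `require_env("ANTHROPIC_API_KEY")` at startup surfaces a misconfiguration before any PDF is processed, rather than at the first API call.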

### 3. File Access Control
```apache
# .htaccess
<FilesMatch "\.(json|meta)$">
    Require all denied
</FilesMatch>
```

### 4. Rate Limiting
- Implement per-IP limits
- Prevent API abuse
- Monitor costs
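
A per-IP limit can be sketched with a sliding window, shown here in Python and in memory only. The class name and parameters are illustrative; a multi-server deployment would need shared state, which this sketch does not cover.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    def allow(self, client_ip):
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Rejected requests are not recorded, so a client that keeps retrying while blocked does not extend its own lockout.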

### 5. HTTPS
- Required for production
- Protects API keys in transit
- Secures file uploads

---

## Performance Optimization

### 1. Caching Strategy
```python
# Multi-level lookup order
# L1: in-memory (Python dict)
# L2: disk (.cache/ directory)
# Miss at both levels: call the API, then store the response in L1 and L2
```

### 2. Parallel Processing
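```python
# Process multiple PDFs concurrently.
# check_pdf (one PDF path in, one report out) and pdf_files are defined elsewhere.
from multiprocessing import Pool

if __name__ == "__main__":  # guard needed where workers are spawned (Windows, macOS)
    with Pool(4) as pool:
        reports = pool.map(check_pdf, pdf_files)
```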

### 3. Image Optimization
```python
# Reduce API costs before sending an image:
# - Resize images to max 2048px
# - Use JPEG compression (quality=85)
# - Cache results by hash
```

### 4. Lazy Loading
```python
# Process page-by-page instead of extracting the whole document up front
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        process_page(page)  # per-page checks, defined elsewhere
```

---

## Scalability

### Horizontal Scaling

```
Load Balancer
      │
      ├─→ Web Server 1 (api.php)
      │        ↓
      │   Processing Queue
      │
      ├─→ Web Server 2 (api.php)
      │        ↓
      │   Processing Queue
      │
      └─→ Web Server N (api.php)
               ↓
          Processing Queue
               ↓
      ┌───────┴───────┐
      ▼               ▼
  Worker 1        Worker N
  (Python)        (Python)
```

### Queue-Based Architecture

```
# Use Redis or RabbitMQ
1. api.php → Push job to queue
2. Worker processes → Pull from queue
3. Process PDF
4. Store results
5. Notify completion (webhook/polling)
```
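
The worker side of this pattern can be illustrated with the standard library alone. In the sketch below, `queue.Queue` and threads stand in for Redis or RabbitMQ and separate worker processes; `run_workers` and `process_job` are hypothetical names, and a real deployment would replace the in-memory queue with the broker's client.

```python
import queue
import threading


def run_workers(job_ids, process_job, num_workers=2):
    """Push jobs onto a queue, let workers drain it, and collect outcomes."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            job_id = jobs.get()
            if job_id is None:  # sentinel: no more work for this worker
                jobs.task_done()
                return
            try:
                outcome = {"status": "completed", "result": process_job(job_id)}
            except Exception as exc:  # one bad PDF must not kill the worker
                outcome = {"status": "failed", "error": str(exc)}
            with lock:
                results[job_id] = outcome
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for job_id in job_ids:
        jobs.put(job_id)
    for _ in threads:
        jobs.put(None)
    jobs.join()
    for thread in threads:
        thread.join()
    return results
```

Recording failures as outcomes rather than letting them propagate mirrors step 5: the client learns about the failure through the same status channel it polls for success.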

### Cloud Deployment

**AWS:**
- EC2 for web servers
- S3 for file storage
- SQS for job queue
- Lambda for workers

**Google Cloud:**
- Compute Engine for servers
- Cloud Storage for files
- Cloud Tasks for queue
- Cloud Functions for workers

---

## Monitoring & Logging

### Key Metrics
- **Processing Time**: Average duration per check
- **API Costs**: Daily/monthly spend
- **Cache Hit Rate**: Percentage of cached results
- **Error Rate**: Failed checks per day
- **Queue Length**: Pending jobs
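
Several of these metrics can be computed from the `stats` block of stored result files. The sketch below assumes `api_calls` counts only requests that actually reached an external API, so that hit rate is `cached_calls / (api_calls + cached_calls)`; that interpretation and the `summarize_runs` name are assumptions.

```python
def summarize_runs(results):
    """Aggregate the `stats` blocks of several result documents."""
    api_calls = sum(r["stats"]["api_calls"] for r in results)
    cached_calls = sum(r["stats"]["cached_calls"] for r in results)
    lookups = api_calls + cached_calls
    total_duration = sum(r["stats"]["duration"] for r in results)
    return {
        "checks": len(results),
        "avg_duration": total_duration / len(results) if results else 0.0,
        "total_cost_estimate": sum(r["stats"]["total_cost_estimate"] for r in results),
        "cache_hit_rate": cached_calls / lookups if lookups else 0.0,
    }
```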

### Logging Strategy
```python
import logging

# Configure logging to both a file and the console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('checker.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Log important events (filename, page_num, and error come from the running check)
logger.info("Processing: %s", filename)
logger.warning("Low contrast detected: page %s", page_num)
logger.error("API error: %s", error)
```

---

## Testing Strategy

### Unit Tests
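```python
import unittest

# Assumes the class is importable from the checker script's module
from enterprise_pdf_checker import ColorContrastChecker

class TestColorContrast(unittest.TestCase):
    def test_contrast_calculation(self):
        ratio = ColorContrastChecker.calculate_contrast_ratio(
            (255, 255, 255),  # White
            (0, 0, 0)         # Black
        )
        # White on black is the maximum possible WCAG ratio, 21:1
        self.assertAlmostEqual(ratio, 21.0, places=1)

if __name__ == "__main__":
    unittest.main()
```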

### Integration Tests
```bash
# Test full pipeline
python3 enterprise_pdf_checker.py test_pdfs/sample.pdf
# Verify: results match expectations
```

### API Tests
```python
# Test Claude integration
def test_claude_api():
    result = analyze_image_with_claude(test_image_bytes)
    assert 'alt_text' in result
    assert len(result['alt_text']) < 125
```

---

## Deployment Checklist

- [ ] Install all dependencies
- [ ] Configure API keys
- [ ] Set up web server (Apache/Nginx)
- [ ] Configure HTTPS
- [ ] Set file permissions
- [ ] Enable error logging
- [ ] Test with sample PDFs
- [ ] Configure backups
- [ ] Set up monitoring
- [ ] Document runbook

---

## Future Enhancements

### Planned Features
1. **User Authentication** - Multi-user support
2. **Report History** - Track changes over time
3. **Batch Upload** - Multiple PDFs at once
4. **PDF Remediation** - Auto-fix some issues
5. **Custom Rules** - Organization-specific checks
6. **Webhooks** - Completion notifications
7. **PDF Comparison** - Before/after analysis
8. **API Rate Limiting** - Per-user quotas
9. **Advanced Caching** - Redis integration
10. **Machine Learning** - Pattern detection

---

## Technical Requirements Summary

| Component | Version | Purpose |
|-----------|---------|---------|
| Python | 3.8+ | Core processing |
| PHP | 7.4+ | Web API |
| Tesseract | 4.0+ | OCR |
| Poppler | 0.86+ | PDF rendering |
| pypdf | 4.0+ | PDF parsing |
| Anthropic SDK | 0.18+ | Claude API |
| Google Cloud | 3.4+ | Vision API |

---

## Support & Maintenance

### Regular Maintenance
- **Daily**: Check logs for errors
- **Weekly**: Review API costs
- **Monthly**: Update dependencies
- **Quarterly**: Security audit

### Backup Strategy
- **Files**: uploads/, results/ → Daily
- **Cache**: .cache/ → Weekly
- **Code**: Git repository → Continuous

---

## Conclusion

This architecture provides:
- ✅ **High Quality**: Best-in-class AI models
- ✅ **Scalability**: Horizontal scaling support
- ✅ **Reliability**: Caching + error handling
- ✅ **Maintainability**: Modular design
- ✅ **Cost-Effective**: Smart caching reduces API costs
- ✅ **Secure**: Multiple security layers
- ✅ **Extensible**: Easy to add new checks

The system is production-ready and can handle enterprise workloads while maintaining a quality-first approach to accessibility validation.