name: "PDF Accessibility Checker"
client: Oliver
status: active
tech: [Python, PHP, JavaScript, PostgreSQL, Redis, Docker, Anthropic Claude, Google Cloud Vision]
local_path: /Users/ai_leed/Documents/Projects/Oliver/pdf-accessibility
deploy: docker-compose up -d (production: docker-compose -f docker-compose.prod.yml up -d)
url: https://ai-sandbox.oliver.solutions/pdf-accessibility
server: optical-web-1
tags: [oliver, pdf, accessibility, wcag, ai, php, redis, postgresql]
created: 2026-04-14
last_commit: 2026-03-18
commits: 69
port: 8000
db: PostgreSQL
Overview
pdf-accessibility is an AI-powered PDF accessibility validation system that checks documents against WCAG 2.1 Level A and AA standards, achieving ~95% automated coverage. It combines traditional PDF analysis (pypdf, pdfplumber) with AI models (Anthropic Claude 3.5 Sonnet, Google Cloud Vision) to validate accessibility across 30+ criteria. The system serves enterprise users through a web UI with drag-and-drop uploads, an authenticated REST API, and a CLI tool for batch processing. The UI is branded for Oliver with the Montserrat font and a black/#FFC407 color palette.
Tech Stack
- Frontend: Vanilla JavaScript, HTML5/CSS3 (drag-drop file upload, visual inspector, dark mode, responsive design)
- Backend: Python 3 (core engine), PHP (REST API and authentication layer)
- Database: PostgreSQL 16 (job tracking, audit logging, results storage)
- Infrastructure: Docker Compose (development & production stacks), Redis (job queue), structured logging with rotation
- AI/ML: Anthropic Claude 3.5 Sonnet (image alt text validation), Google Cloud Vision (OCR, text-in-images detection)
- Key libraries: pypdf (PDF parsing), pdfplumber (text extraction), pdf2image (rasterization), Pillow (image processing), pytesseract (OCR), textblob (readability), weasyprint (PDF report generation), veraPDF (PDF/UA-1 validation)
Architecture
The system uses a three-interface architecture with a centralized asynchronous job queue backend:
┌─────────────────────────────────────────────────────────────┐
│ Three User Interfaces │
├─────────────────────┬──────────────────┬────────────────────┤
│ Web UI │ REST API │ CLI │
│ (index.html) │ (api.php) │ (enterprise_pdf_ │
│ Vanilla JS │ PHP endpoints │ checker.py) │
│ Drag-drop │ Bearer/Key auth │ Direct Python │
└─────────────────────┴──────────────────┴────────────────────┘
│
▼
┌──────────────┐
│ api.php │
│ (Router) │
└──────┬───────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌────────┐ ┌──────────────┐ ┌──────────────┐
│ File │ │ Redis Queue │ │ PostgreSQL │
│ Upload │ │ (pdf:queue) │ │ (jobs, audit)│
│uploads/│ │ │ │ │
└────────┘ └──────┬───────┘ └──────────────┘
│
▼
┌──────────────────┐
│ worker.py │
│ (Daemon process) │
└──────┬───────────┘
│
▼
┌──────────────────────────────┐
│ EnterprisePDFChecker │
│ • 30+ WCAG checks │
│ • AI image analysis (Claude) │
│ • OCR (GCV) │
│ • Contrast analysis │
│ • Readability metrics │
└──────┬───────────────────────┘
│
▼
┌──────────────────┐
│ results/ │
│ {job_id}. │
│ result.json │
└──────────────────┘
Request Flow (Production Docker Stack):
- User uploads PDF via web UI, REST API, or CLI
- `api.php` validates auth (Bearer token or API key via `auth.php`), saves file to `uploads/`
- Job created in PostgreSQL, queued to Redis (`pdf:queue`)
- `worker.py` daemon (background process) pops job, invokes `EnterprisePDFChecker.check_all()`
- All external API calls (Claude, GCV) wrapped with exponential backoff retry logic (`retry_helper.py`)
- Results written to `results/{job_id}.result.json`, PostgreSQL updated with completion status
- Client polls `api.php?action=status` for progress, fetches final results when ready
- Automatic cleanup (`cleanup.py`) removes uploads after 24h, results after 30 days
Key Source Files:
| File | Purpose |
|---|---|
| `enterprise_pdf_checker.py` | Core validation engine: 30+ WCAG checks, AI image analysis, scoring logic |
| `api.php` | REST API router: upload/check/status/result/remediate/download endpoints, CORS headers |
| `auth.php` | Authentication middleware: Bearer token, X-API-Key, dev mode localhost bypass |
| `worker.py` | Background daemon: Redis queue consumer, graceful shutdown on signals |
| `db_manager.py` | PostgreSQL ORM: jobs CRUD, audit logging, connection pooling |
| `redis_queue.py` | Redis operations: job enqueue/dequeue, status tracking, rate limiting |
| `pdf_remediation.py` | Auto-remediation: metadata fixing, language tagging, structural repairs |
| `retry_helper.py` | Exponential backoff: retries for Claude API, GCV API, transient failures |
| `report_generator.py` | Result formatting: JSON reports, HTML export, compliance summaries |
| `logger_config.py` | Structured logging: JSON output, file rotation (10MB max), syslog integration |
| `cleanup.py` | Scheduled task: file retention enforcement (24h uploads, 30d results) |
| `index.html` | Web UI root: loads CSS/JS, sets up drag-drop zone, result viewer |
| `js/app.js` | Frontend logic: API calls, progress polling, DOM updates, dark mode |
| `css/style.css` | Branding: Oliver palette (black, #FFC407), Montserrat font, responsive layout |
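
For readers unfamiliar with the exponential backoff pattern attributed to `retry_helper.py`, here is a minimal sketch of the general technique. It does not reproduce that file: the function name `retry_with_backoff`, its parameters and defaults, the choice of retryable exceptions, and the 10% jitter are all assumptions picked for illustration.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying transient failures with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("unreachable")
```

A call to an external API would be wrapped as `retry_with_backoff(lambda: client_call(args))`, where `client_call` is a placeholder for whatever function performs the request. Injecting `sleep` makes the helper easy to test without real waiting.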
Dev Commands
# Activate virtual environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run all tests (31 automated tests)
pytest tests/ -v
# Run with coverage report
pytest tests/ --cov=. --cov-report=html
# Run single test file
pytest tests/test_checker.py -v
# Skip integration tests (faster local runs)
pytest tests/ -m "not integration"
# Start development server (PHP)
php -S localhost:8000
# CLI: Full accessibility check
python enterprise_pdf_checker.py document.pdf --output report.json
# CLI: Quick check (skip AI image analysis)
python enterprise_pdf_checker.py document.pdf --quick
# CLI: Auto-remediate all fixable issues
python pdf_remediation.py document.pdf --output fixed.pdf --all
# Docker development stack (all services)
docker-compose up
# Docker production stack
docker-compose -f docker-compose.prod.yml up -d
# Run tests inside Docker container
docker-compose exec worker pytest tests/ -v
# View worker logs
docker-compose logs -f worker
Deployment
- Server: optical-web-1
- Deploy:
  - Development: `docker-compose up`
  - Production: `docker-compose -f docker-compose.prod.yml up -d` OR via `deploy.sh` (runs git pull, restarts Apache)
  - Manual: Push to `git@bitbucket.org:zlalani/pdf-accessibility.git` main branch; server auto-deploys via webhook
- URL: https://ai-sandbox.oliver.solutions/pdf-accessibility
- Port: 8000 (development), 80/443 (production via Apache reverse proxy)
- Service: None (Docker Compose manages container lifecycle; Apache may use systemd)
- Local path: /Users/ai_leed/Documents/Projects/Oliver/pdf-accessibility
Environment Variables
All configured in .env (copy from .env.example):
- `ANTHROPIC_API_KEY`: Anthropic Claude API key (required; get from https://console.anthropic.com/)
- `GOOGLE_API_KEY`: Google Cloud Vision API key (optional; for OCR and text-in-images detection)
- `GOOGLE_APPLICATION_CREDENTIALS`: Path to GCP service account JSON (alternative to API key)
- `DEV_MODE`: Set to `true` for localhost auth bypass (development only)
- `DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`: PostgreSQL connection (docker-compose provides defaults)
- `CLOUD_RUN_URL`: Optional Cloud Run service URL for distributed PDF processing; defaults to local Python
- `GCP_SA_KEY_PATH`: Path to GCP service account key for Cloud Run authentication
- `GCS_BUCKET_NAME`: Google Cloud Storage bucket for page images (default: `optical-pdf-images`)
- `RETENTION_HOURS`: Keep uploaded PDFs for N hours before deletion (default: 24)
- `RESULTS_RETENTION_HOURS`: Keep result/meta JSON for N hours before deletion (default: 720 = 30 days)
- `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `AZURE_
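
To illustrate how the two retention settings could drive cleanup, the sketch below reads `RETENTION_HOURS` and `RESULTS_RETENTION_HOURS` with the defaults listed above and selects files older than the cutoff. It is an assumption-laden example rather than the contents of `cleanup.py`: the helper names `retention_hours` and `expired_files`, and the use of file modification time as the age signal, are hypothetical.

```python
import os
import time
from pathlib import Path
from typing import List, Mapping, Optional


def retention_hours(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Read a retention window in hours from the environment, falling back to default."""
    return int(env.get(name, default))


def expired_files(directory: Path, max_age_hours: int, now: Optional[float] = None) -> List[Path]:
    """Return files in directory whose modification time is older than the window."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )
```

A cleanup pass could then call `expired_files(Path("uploads"), retention_hours("RETENTION_HOURS", 24))` and the equivalent for `results/` with a default of 720, deleting each returned path with `Path.unlink()`.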
Timeline / Git History
| Date | Change |
|---|---|
| 2026-03-18 | Fix CP14 heading detection via RoleMap + manual pass support |
| 2026-03-18 | Persist adjusted score to server on Recalculate |
| 2026-03-18 | Address client feedback: WCAG badges, table grouping, history UX |
| 2026-03-16 | PDF report reflects adjusted score + manual pass |
| 2026-03-13 | Move document history to separate history.html page |
| 2026-03-13 | Fix history: read jobs from data.data.jobs |
Sessions
2026-04-14 – Project catalogued
Done: Added to Obsidian second brain with full details.
Change Log
| Date | Requested | Changed | Files |
|---|---|---|---|
| 2026-03-18 | Fix heading detection | CP14 via RoleMap + manual pass | enterprise_pdf_checker.py |
| 2026-03-13 | Separate history page | Move to history.html | history.html, api.php |