obsidian/01 Projects/pdf-accessibility/PDF Accessibility Checker.md
2026-04-29 14:50:31 +01:00


name: PDF Accessibility Checker
client: Oliver Internal
status: active
tech: Python, PHP, JavaScript, PostgreSQL, Redis, Docker, Claude API, Google Cloud Vision
local_path: /Users/ai_leed/Documents/Projects/Oliver/pdf-accessibility
deploy: docker-compose up or docker-compose -f docker-compose.prod.yml up -d
url: http://localhost:8000
server: local
tags: oliver, pdf, accessibility, wcag, ai, php, redis, postgresql
created: 2026-04-14
last_commit: 2026-03-18
commits: 69
port: 8000
db: PostgreSQL

Overview

PDF Accessibility Checker is an AI-powered validation tool that checks PDF documents against WCAG 2.1 Level A and AA standards. It uses Claude 3.5 Sonnet for image and alt-text analysis and Google Cloud Vision for OCR, combining traditional PDF analysis with machine learning to reach roughly 95% automated coverage of accessibility requirements. It also offers auto-remediation, a visual page inspector, and three interfaces: Web UI, REST API, and CLI. Built for Oliver Internal, it lets organizations audit and fix PDF accessibility issues at scale with minimal manual intervention.

Tech Stack

  • Frontend: Vanilla JavaScript + HTML5/CSS3, drag-drop UI with visual inspector, dark mode support
  • Backend: PHP (REST API) + Python (core engine, worker daemon, CLI)
  • Database: PostgreSQL 16 (jobs tracking, audit logs), Redis (job queue)
  • Infrastructure: Docker Compose (dev & prod stacks), Azure AD/MSAL for SSO auth
  • AI/ML: Anthropic Claude 3.5 Sonnet (image analysis, alt-text validation), Google Cloud Vision (OCR, text detection), veraPDF (PDF/UA-1 validation)
  • Key libraries: pypdf, pdfplumber, Pillow, pdf2image, pytesseract, textblob, anthropic, google-cloud-vision

Architecture

The system comprises three parallel interfaces (Web UI, REST API, CLI) feeding into a unified core engine:

┌──────────────────────────────────────────────────────────────┐
│                    THREE INTERFACES                          │
├──────────────────┬───────────────────────┬──────────────────┤
│   Web UI         │     REST API          │   CLI            │
│ (index.html)     │     (api.php)         │ (Python direct)  │
│ Drag-drop        │ POST /upload          │ python ...py     │
│ Visual inspector │ GET /status           │ --output JSON    │
└────────┬─────────┴──────────┬────────────┴────────┬─────────┘
         │                    │                     │
         └────────────────────┼─────────────────────┘
                              ▼
                    ┌──────────────────┐
                    │  Job Orchestration│
                    ├──────────────────┤
                    │ auth.php          │ Auth gate (Bearer/X-API-Key)
                    │ db_manager.py     │ PostgreSQL ORM
                    │ redis_queue.py    │ Job queue (pdf:queue)
                    └────────┬──────────┘
                             ▼
         ┌───────────────────────────────────┐
         │   worker.py (Background Daemon)   │
         │  Pops jobs → Runs checks → Stores │
         │  results/{job_id}.result.json     │
         └────────┬────────────────────────┘
                  ▼
      ╔═════════════════════════════════════════╗
      ║ enterprise_pdf_checker.py               ║
      ║ CORE ENGINE: 30+ WCAG checks           ║
      ╠═════════════════════════════════════════╣
      │ 1. Metadata & structure (pypdf)        │
      │ 2. Text extraction & readability       │
      │ 3. Color contrast (Pillow analysis)    │
      │ 4. Image alt-text (Claude 3.5 Sonnet) │
      │ 5. OCR for text-in-images (GCV)       │
      │ 6. Heading hierarchy & tagging         │
      │ 7. Form field labels                   │
      │ 8. PDF/UA-1 validation (veraPDF)      │
      ╚═════════════════════════════════════════╝
                  │
                  ├─→ pdf_remediation.py (Auto-fix)
                  ├─→ report_generator.py (Format results)
                  └─→ retry_helper.py (Exponential backoff)
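
The diagram lists retry_helper.py as providing exponential backoff, but its contents are not shown in this note. The following is a minimal sketch of the technique only. The function name `retry_with_backoff`, its parameters, and the delay values are all assumptions made for illustration, not the project's actual helper.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (capped), then retry.

    Hypothetical helper for illustration; not the project's retry_helper.py.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

A wrapper like this would typically sit around the Claude or Google Cloud Vision calls, where transient rate-limit or network errors are the most likely failure mode. The `sleep` parameter is injectable so the behaviour can be tested without real waiting.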

Data Flow (Production/Docker):

  1. Client uploads PDF via Web UI → api.php validates auth & saves to uploads/
  2. Job pushed to Redis queue; PostgreSQL job record created with status pending
  3. worker.py daemon polls Redis, pops job, runs EnterprisePDFChecker.check_all()
  4. Results written to results/{job_id}.result.json; DB updated with status complete
  5. Client polls api.php?action=status for progress; retrieves results when ready
  6. Optional: User triggers api.php?action=remediate for auto-fix → new PDF created

Key Design Decisions:

  • Async queue: Redis + background worker allows long-running AI checks without blocking HTTP
  • PostgreSQL: Persistent job history, audit logging, user isolation (MSAL integration)
  • Caching: AI responses cached by content hash in .cache/ to reduce API costs (~$0.015/image)
  • Graceful degradation: CLI mode works standalone; API mode requires Docker stack
  • Environment-aware: DEV_MODE=true bypasses auth for localhost; production requires valid keys
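
The caching decision above (AI responses keyed by content hash in .cache/) could look roughly like the sketch below. The note does not say which hash or file layout the project uses, so SHA-256, one JSON file per hash, and the `cached_analysis` name are all assumptions.

```python
import hashlib
import json
from pathlib import Path


def cached_analysis(image_bytes, analyze, cache_dir=".cache"):
    """Return analyze(image_bytes), reusing a stored result for identical bytes.

    Hypothetical sketch: SHA-256 keys and per-hash JSON files are assumed.
    """
    key = hashlib.sha256(image_bytes).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = analyze(image_bytes)  # the paid API call happens only on a miss
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Keying on the image bytes rather than the file name means the same logo or chart repeated across many PDFs is analysed once, which is where the per-image cost saving would come from.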

Dev Commands

# Setup
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
brew install php tesseract poppler verapdf  # macOS system deps

# Development (PHP dev server + Python worker in separate terminals)
source venv/bin/activate
php -S localhost:8000                        # Terminal 1: Start dev server
python worker.py                             # Terminal 2: Start background worker

# Docker development stack (all services)
docker-compose up

# Docker production stack (detached)
docker-compose -f docker-compose.prod.yml up -d

# Testing
pytest tests/ -v                             # All 31 tests
pytest tests/ --cov=. --cov-report=html     # With coverage (34% current)
pytest tests/test_checker.py -v              # Single file
pytest tests/ -m "not integration"           # Skip slow tests

# CLI Usage (direct Python)
python enterprise_pdf_checker.py document.pdf --output report.json  # Full check
python enterprise_pdf_checker.py document.pdf --quick               # Skip AI (faster)
python pdf_remediation.py document.pdf --output fixed.pdf --all     # Auto-remediate

Deployment

  • Server: Local development (can deploy to optical-web-1 or Cloud Run)
  • Deploy: docker-compose -f docker-compose.prod.yml up -d (Docker) or bash deploy.sh (Git-based)
  • URL: http://localhost:8000 (dev) | Production URL in .env (CLOUD_RUN_URL)
  • Port: 8000 (HTTP), 1220 (Redis in prod), 1221 (PostgreSQL in prod)
  • Service: None (containerized); can be wrapped in systemd if needed
  • Local path: /Users/ai_leed/Documents/Projects/Oliver/pdf-accessibility

Production Notes:

  • Set DEV_MODE=false before deploying
  • Replace the default API key dev_key_12345 with a securely generated key
  • Store ANTHROPIC_API_KEY and GOOGLE_API_KEY in a secrets manager
  • A backup strategy for PostgreSQL is recommended
  • veraPDF requires Java; included in Docker image
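
For the key rotation noted above, one way to produce a replacement key is Python's standard `secrets` module. This is a sketch only: the note does not specify a required key format or length, so the 32-byte size and the `generate_api_key` name are assumptions.

```python
import secrets


def generate_api_key(num_bytes=32):
    """Return a URL-safe random token suitable for use as an API key."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(generate_api_key())
```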

Environment Variables

  • ANTHROPIC_API_KEY — Anthropic Claude API key (required for AI image analysis). Get from https://console.anthropic.com/
  • GOOGLE_API_KEY — Google Cloud Vision API key (optional; for enhanced OCR). Alternative: GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
  • GOOGLE_CLOUD_PROJECT — GCP project ID (if using service account file)
  • DEV_MODE=true|false — If true, localhost bypasses authentication. Never enable in production.
  • DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD — PostgreSQL connection (Docker: use service name postgres)
  • REDIS_URL — Redis connection string (Docker: redis://redis:6379)
  • RETENTION_HOURS — Delete uploaded PDFs after N hours (default 24)
  • RESULTS_RETENTION_HOURS — Keep result JSON files for N hours (default 720 = 30 days)
  • CLOUD_RUN_URL — Deployed Cloud Run endpoint for remote PDF processing
  • GCP_SA_KEY_PATH — Path to GCP service account key JSON
  • GCS_BUCKET_NAME — Google Cloud Storage bucket for page images
  • AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_REDIRECT_URI — Azure AD (MSAL) SSO configuration
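
A minimal sketch of how a few of these variables might be read in Python, using the defaults stated above (24 and 720 hours). The `Settings` class and `load_settings` function are hypothetical, and defaulting REDIS_URL to the Docker value is an assumption. The project's actual loading code is not shown in this note.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    dev_mode: bool
    redis_url: str
    retention_hours: int
    results_retention_hours: int


def load_settings(env=os.environ):
    """Build Settings from environment variables, applying the noted defaults."""
    return Settings(
        # Anything other than an explicit "true" keeps authentication on.
        dev_mode=env.get("DEV_MODE", "false").strip().lower() == "true",
        # Assumed default: the Docker service URL given in the list above.
        redis_url=env.get("REDIS_URL", "redis://redis:6379"),
        retention_hours=int(env.get("RETENTION_HOURS", "24")),
        results_retention_hours=int(env.get("RESULTS_RETENTION_HOURS", "720")),
    )
```

Treating every value except "true" as false is a deliberately conservative choice, so a typo in DEV_MODE cannot accidentally disable authentication.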

API / Endpoints

REST API (api.php base URL: http://localhost:8000)

| Method | Endpoint | Auth | Body/Params | Returns |
|--------|----------|------|-------------|---------|
| POST | /api.php (action=upload) | Bearer/X-API-Key | file (multipart) | {"job_id": "...", "status": "pending"} |
| GET | /api.php?action=status&job_id=... | Bearer/X-API-Key | N/A | {"status": "processing\|complete\|failed", "progress": 45} |
| GET | /api.php?action=result&job_id=... | Bearer/X-API-Key | N/A | {"result": {...}} (full WC |

Timeline / Git History

| Date | Change |
|------|--------|
| 2026-03-18 | Fix CP14 heading detection via RoleMap + manual pass support |
| 2026-03-18 | Persist adjusted score to server on Recalculate |
| 2026-03-18 | Address client feedback: WCAG badges, table grouping, history UX |
| 2026-03-16 | PDF report reflects adjusted score + manual pass |
| 2026-03-13 | Move document history to separate history.html page |
| 2026-03-13 | Fix history: read jobs from data.data.jobs |

Sessions

2026-04-14 Project catalogued

Done: Added to Obsidian second brain with full details.


Change Log

| Date | Requested | Changed | Files |
|------|-----------|---------|-------|
| 2026-03-18 | Fix heading detection | CP14 via RoleMap + manual pass | enterprise_pdf_checker.py |
| 2026-03-13 | Separate history page | Move to history.html | history.html, api.php |