obsidian/01 Projects/pdf-accessibility/PDF Accessibility Checker.md
2026-04-29 13:24:32 +01:00



name: "PDF Accessibility Checker"
client: Oliver
status: active
tech: [Python, PHP, JavaScript, PostgreSQL, Redis, Docker, Anthropic Claude, Google Cloud Vision]
local_path: /Users/ai_leed/Documents/Projects/Oliver/pdf-accessibility
deploy: docker-compose up -d (production: docker-compose -f docker-compose.prod.yml up -d)
url: https://ai-sandbox.oliver.solutions/pdf-accessibility
server: optical-web-1
tags: [oliver, pdf, accessibility, wcag, ai, php, redis, postgresql]
created: 2026-04-14
last_commit: 2026-03-18
commits: 69
port: 8000
db: PostgreSQL

Overview

pdf-accessibility is an AI-powered PDF accessibility validation system that checks documents against WCAG 2.1 Level A and AA, achieving ~95% automated coverage. It combines conventional PDF analysis (pypdf, pdfplumber) with AI models (Anthropic Claude 3.5 Sonnet, Google Cloud Vision) to validate accessibility across 30+ criteria. Enterprise users reach it through a web UI with drag-and-drop uploads, an authenticated RESTful API, and a CLI tool for batch processing. The UI is branded for Oliver: Montserrat font and a black/#FFC407 color palette.

Tech Stack

  • Frontend: Vanilla JavaScript, HTML5/CSS3 (drag-drop file upload, visual inspector, dark mode, responsive design)
  • Backend: Python 3 (core engine), PHP (REST API and authentication layer)
  • Database: PostgreSQL 16 (job tracking, audit logging, results storage)
  • Infrastructure: Docker Compose (development & production stacks), Redis (job queue), Structured logging with rotation
  • AI/ML: Anthropic Claude 3.5 Sonnet (image alt text validation), Google Cloud Vision (OCR, text-in-images detection)
  • Key libraries: pypdf (PDF parsing), pdfplumber (text extraction), pdf2image (rasterization), Pillow (image processing), pytesseract (OCR), textblob (readability), weasyprint (PDF report generation), veraPDF (PDF/UA-1 validation)

Architecture

The system uses a three-interface architecture with a centralized asynchronous job queue backend:

┌─────────────────────────────────────────────────────────────┐
│  Three User Interfaces                                      │
├─────────────────────┬──────────────────┬────────────────────┤
│  Web UI             │  REST API        │  CLI               │
│  (index.html)       │  (api.php)       │ (enterprise_pdf_   │
│  Vanilla JS         │  PHP endpoints   │  checker.py)       │
│  Drag-drop          │  Bearer/Key auth │  Direct Python     │
└─────────────────────┴──────────────────┴────────────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │  api.php     │
                    │  (Router)    │
                    └──────┬───────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
    ┌────────┐      ┌──────────────┐    ┌──────────────┐
    │ File   │      │ Redis Queue  │    │ PostgreSQL   │
    │ Upload │      │ (pdf:queue)  │    │ (jobs, audit)│
    │uploads/│      │              │    │              │
    └────────┘      └──────┬───────┘    └──────────────┘
                           │
                           ▼
                    ┌──────────────────┐
                    │ worker.py        │
                    │ (Daemon process) │
                    └──────┬───────────┘
                           │
                           ▼
            ┌──────────────────────────────┐
            │ EnterprisePDFChecker         │
            │ • 30+ WCAG checks            │
            │ • AI image analysis (Claude) │
            │ • OCR (GCV)                  │
            │ • Contrast analysis          │
            │ • Readability metrics        │
            └──────┬───────────────────────┘
                   │
                   ▼
            ┌──────────────────┐
            │ results/         │
            │ {job_id}.        │
            │ result.json      │
            └──────────────────┘
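
The queue-to-results path in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's worker.py: the function name `process_next`, the JSON payload shape (`job_id`, `path`), and the `checker` callable are assumptions made for the example. Only the queue name `pdf:queue` and the `results/{job_id}.result.json` layout come from the notes above. The `client` argument is expected to behave like a redis-py client, of which only `brpop` is used.

```python
import json
from pathlib import Path

QUEUE_NAME = "pdf:queue"  # queue name taken from the architecture notes


def process_next(client, checker, results_dir, timeout=5):
    """Pop one job, run the checker, and write the result file.

    `client` is assumed to expose redis-py's brpop(); `checker` is a
    hypothetical callable standing in for EnterprisePDFChecker.check_all().
    Returns the job id, or None if the queue stayed empty until the timeout.
    """
    item = client.brpop(QUEUE_NAME, timeout=timeout)
    if item is None:
        return None
    _queue, raw = item  # brpop returns a (queue_name, payload) pair
    job = json.loads(raw)  # assumed payload: {"job_id": ..., "path": ...}

    result = checker(job["path"])

    # Write results/{job_id}.result.json, creating the directory if needed.
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{job['job_id']}.result.json"
    out_file.write_text(json.dumps(result, indent=2))
    return job["job_id"]
```

A daemon would call this in a loop and stop on a shutdown signal. Passing the client and checker in as arguments keeps the function testable without a running Redis instance.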

Request Flow (Production Docker Stack):

  1. User uploads PDF via web UI, REST API, or CLI
  2. api.php validates auth (Bearer token or API key via auth.php), saves file to uploads/
  3. A job record is created in PostgreSQL and the job is queued in Redis (pdf:queue)
  4. The worker.py daemon (background process) pops the job and invokes EnterprisePDFChecker.check_all()
  5. All external API calls (Claude, GCV) are wrapped in exponential-backoff retry logic (retry_helper.py)
  6. Results are written to results/{job_id}.result.json and PostgreSQL is updated with the completion status
  7. Client polls api.php?action=status for progress, fetches final results when ready
  8. Automatic cleanup (cleanup.py) removes uploads after 24h, results after 30 days
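
Step 5 mentions exponential backoff around the external API calls. The contents of retry_helper.py are not shown in this note, so the following is only a generic sketch of the technique. The function name `retry_with_backoff`, its parameters, and the 10% jitter are all assumptions made for the example.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func() and retry on failure with exponentially growing waits.

    The wait before retry number k (counting from 0) is
    min(max_delay, base_delay * 2**k) plus a little random jitter.
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

In practice `retry_on` would be narrowed to the transient error types of the client library in use (timeouts, rate limits), so that permanent failures such as an invalid API key are not retried.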

Key Source Files:

| File | Purpose |
|---|---|
| enterprise_pdf_checker.py | Core validation engine: 30+ WCAG checks, AI image analysis, scoring logic |
| api.php | REST API router: upload/check/status/result/remediate/download endpoints, CORS headers |
| auth.php | Authentication middleware: Bearer token, X-API-Key, dev mode localhost bypass |
| worker.py | Background daemon: Redis queue consumer, graceful shutdown on signals |
| db_manager.py | PostgreSQL ORM: jobs CRUD, audit logging, connection pooling |
| redis_queue.py | Redis operations: job enqueue/dequeue, status tracking, rate limiting |
| pdf_remediation.py | Auto-remediation: metadata fixing, language tagging, structural repairs |
| retry_helper.py | Exponential backoff: retries for Claude API, GCV API, transient failures |
| report_generator.py | Result formatting: JSON reports, HTML export, compliance summaries |
| logger_config.py | Structured logging: JSON output, file rotation (10MB max), syslog integration |
| cleanup.py | Scheduled task: file retention enforcement (24h uploads, 30d results) |
| index.html | Web UI root: loads CSS/JS, sets up drag-drop zone, result viewer |
| js/app.js | Frontend logic: API calls, progress polling, DOM updates, dark mode |
| css/style.css | Branding: Oliver palette (black, #FFC407), Montserrat font, responsive layout |
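
The three authentication paths listed for auth.php (Bearer token, X-API-Key, dev mode localhost bypass) can be illustrated with a small decision function. The real middleware is PHP and is not reproduced here. This Python sketch only shows one plausible ordering of the checks. The function name `is_authorized`, the shared key list for both header types, and the loopback address list are assumptions.

```python
import hmac


def is_authorized(headers, remote_addr, api_keys, dev_mode=False):
    """Return True if the request passes any of the three checks.

    `headers` is a plain dict keyed by canonical header name (assumed to be
    normalised by the caller); `api_keys` is the list of accepted secrets.
    """
    # 1. Dev-mode bypass applies only to loopback clients.
    if dev_mode and remote_addr in ("127.0.0.1", "::1"):
        return True

    # 2. Take the credential from either supported header.
    candidate = None
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        candidate = auth[len("Bearer "):].strip()
    elif headers.get("X-API-Key"):
        candidate = headers["X-API-Key"].strip()
    if not candidate:
        return False

    # 3. Compare against each known key using a constant-time comparison.
    return any(
        hmac.compare_digest(candidate.encode(), key.encode())
        for key in api_keys
    )
```

Restricting the bypass to loopback addresses means that leaving DEV_MODE enabled by accident does not open the API to remote clients, although it should still be off in production.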

Dev Commands

# Activate virtual environment
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run all tests (31 automated tests)
pytest tests/ -v

# Run with coverage report
pytest tests/ --cov=. --cov-report=html

# Run single test file
pytest tests/test_checker.py -v

# Skip integration tests (faster local runs)
pytest tests/ -m "not integration"

# Start development server (PHP)
php -S localhost:8000

# CLI: Full accessibility check
python enterprise_pdf_checker.py document.pdf --output report.json

# CLI: Quick check (skip AI image analysis)
python enterprise_pdf_checker.py document.pdf --quick

# CLI: Auto-remediate all fixable issues
python pdf_remediation.py document.pdf --output fixed.pdf --all

# Docker development stack (all services)
docker-compose up

# Docker production stack
docker-compose -f docker-compose.prod.yml up -d

# Run tests inside Docker container
docker-compose exec worker pytest tests/ -v

# View worker logs
docker-compose logs -f worker

Deployment

  • Server: optical-web-1
  • Deploy:
    • Development: docker-compose up
    • Production: docker-compose -f docker-compose.prod.yml up -d, or run deploy.sh (runs git pull, restarts Apache)
    • Manual: push to the main branch of git@bitbucket.org:zlalani/pdf-accessibility.git; the server then auto-deploys via webhook
  • URL: https://ai-sandbox.oliver.solutions/pdf-accessibility
  • Port: 8000 (development), 80/443 (production via Apache reverse proxy)
  • Service: None (Docker Compose manages container lifecycle; Apache may use systemd)
  • Local path: /Users/ai_leed/Documents/Projects/Oliver/pdf-accessibility

Environment Variables

All configured in .env (copy from .env.example):

  • ANTHROPIC_API_KEY — Anthropic Claude API key (required; get from https://console.anthropic.com/)
  • GOOGLE_API_KEY — Google Cloud Vision API key (optional; for OCR and text-in-images detection)
  • GOOGLE_APPLICATION_CREDENTIALS — Path to GCP service account JSON (alternative to API key)
  • DEV_MODE — Set to true for localhost auth bypass (development only)
  • DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD — PostgreSQL connection (docker-compose provides defaults)
  • CLOUD_RUN_URL — Optional Cloud Run service URL for distributed PDF processing; defaults to local Python
  • GCP_SA_KEY_PATH — Path to GCP service account key for Cloud Run authentication
  • GCS_BUCKET_NAME — Google Cloud Storage bucket for page images (default: optical-pdf-images)
  • RETENTION_HOURS — Keep uploaded PDFs for N hours before deletion (default: 24)
  • RESULTS_RETENTION_HOURS — Keep result/meta JSON for N hours before deletion (default: 720 = 30 days)
  • AZURE_TENANT_ID, AZURE_CLIENT_ID, `AZURE_
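
The two retention variables above suggest how a cleanup pass might work. The following is a minimal sketch under stated assumptions, not the project's cleanup.py: the helper names `retention_hours` and `purge_older_than` are hypothetical, and file age is judged by modification time, which the note does not confirm.

```python
import os
import time
from pathlib import Path


def retention_hours(name, default):
    """Read a retention window in hours from the environment."""
    return float(os.environ.get(name, default))


def purge_older_than(directory, hours, now=None):
    """Delete regular files in `directory` whose mtime is older than `hours`.

    Returns the sorted names of the deleted files.
    """
    now = time.time() if now is None else now
    cutoff = now - hours * 3600
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

A scheduled run could then call `purge_older_than("uploads", retention_hours("RETENTION_HOURS", 24))` and the same for the results directory with `RESULTS_RETENTION_HOURS` and a default of 720.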

Timeline / Git History

| Date | Change |
|---|---|
| 2026-03-18 | Fix CP14 heading detection via RoleMap + manual pass support |
| 2026-03-18 | Persist adjusted score to server on Recalculate |
| 2026-03-18 | Address client feedback: WCAG badges, table grouping, history UX |
| 2026-03-16 | PDF report reflects adjusted score + manual pass |
| 2026-03-13 | Move document history to separate history.html page |
| 2026-03-13 | Fix history: read jobs from data.data.jobs |

Sessions

2026-04-14 Project catalogued

Done: Added to Obsidian second brain with full details.


Change Log

| Date | Requested | Changed | Files |
|---|---|---|---|
| 2026-03-18 | Fix heading detection | CP14 via RoleMap + manual pass | enterprise_pdf_checker.py |
| 2026-03-13 | Separate history page | Move to history.html | history.html, api.php |