obsidian/01 Projects/pdf-accessibility/PDF Accessibility Checker.md at fa7785ff7839a03c19b81c4aa2d1ff03bfbb38e3

Vadym Samoilenko affc6a8353 vault backup: 2026-04-29 13:24:32

2026-04-29 13:24:32 +01:00


name: "PDF Accessibility Checker"
client: Oliver
status: active
tech: [Python, PHP, JavaScript, PostgreSQL, Redis, Docker, Anthropic Claude, Google Cloud Vision]
local_path: /Users/ai_leed/Documents/Projects/Oliver/pdf-accessibility
deploy: docker-compose up -d (production: docker-compose -f docker-compose.prod.yml up -d)
url: https://ai-sandbox.oliver.solutions/pdf-accessibility
server: optical-web-1
tags: [oliver, pdf, accessibility, wcag, ai, php, redis, postgresql]
created: 2026-04-14
last_commit: 2026-03-18
commits: 69
port: 8000
db: PostgreSQL

Overview

pdf-accessibility is an AI-powered PDF accessibility validation system that checks documents against WCAG 2.1 Level A and AA standards, reaching ~95% automated coverage. It combines conventional PDF analysis (pypdf, pdfplumber) with AI models (Anthropic Claude 3.5 Sonnet, Google Cloud Vision) to validate accessibility across 30+ criteria. Enterprise users reach it through a web UI with drag-and-drop uploads, an authenticated RESTful API, and a CLI tool for batch processing. The UI is branded for Oliver, with the Montserrat font and a black/#FFC407 color palette.

Tech Stack

Frontend: Vanilla JavaScript, HTML5/CSS3 (drag-drop file upload, visual inspector, dark mode, responsive design)
Backend: Python 3 (core engine), PHP (REST API and authentication layer)
Database: PostgreSQL 16 (job tracking, audit logging, results storage)
Infrastructure: Docker Compose (development & production stacks), Redis (job queue), structured logging with rotation
AI/ML: Anthropic Claude 3.5 Sonnet (image alt text validation), Google Cloud Vision (OCR, text-in-images detection)
Key libraries: pypdf (PDF parsing), pdfplumber (text extraction), pdf2image (rasterization), Pillow (image processing), pytesseract (OCR), textblob (readability), weasyprint (PDF report generation), veraPDF (PDF/UA-1 validation)

Architecture

The system uses a three-interface architecture with a centralized asynchronous job queue backend:

┌─────────────────────────────────────────────────────────────┐
│  Three User Interfaces                                      │
├─────────────────────┬──────────────────┬────────────────────┤
│  Web UI             │  REST API        │  CLI               │
│  (index.html)       │  (api.php)       │ (enterprise_pdf_   │
│  Vanilla JS         │  PHP endpoints   │  checker.py)       │
│  Drag-drop          │  Bearer/Key auth │  Direct Python     │
└─────────────────────┴──────────────────┴────────────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │  api.php     │
                    │  (Router)    │
                    └──────┬───────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
    ┌────────┐      ┌──────────────┐    ┌──────────────┐
    │ File   │      │ Redis Queue  │    │ PostgreSQL   │
    │ Upload │      │ (pdf:queue)  │    │ (jobs, audit)│
    │uploads/│      │              │    │              │
    └────────┘      └──────┬───────┘    └──────────────┘
                           │
                           ▼
                    ┌──────────────────┐
                    │ worker.py        │
                    │ (Daemon process) │
                    └──────┬───────────┘
                           │
                           ▼
            ┌──────────────────────────────┐
            │ EnterprisePDFChecker         │
            │ • 30+ WCAG checks            │
            │ • AI image analysis (Claude) │
            │ • OCR (GCV)                  │
            │ • Contrast analysis          │
            │ • Readability metrics        │
            └──────┬───────────────────────┘
                   │
                   ▼
            ┌──────────────────┐
            │ results/         │
            │ {job_id}.        │
            │ result.json      │
            └──────────────────┘
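
The "Contrast analysis" step in the diagram can be illustrated with the contrast-ratio formula from WCAG 2.1. The block below is a minimal sketch of that formula only; it is not taken from `enterprise_pdf_checker.py`, and the function names are hypothetical. It assumes colors arrive as 6-digit hex strings.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.1 definition)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#RRGGBB' or 'RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, always >= 1 regardless of argument order."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """Level AA requires 4.5:1 for normal text and 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

For example, `passes_aa("#FFC407", "#000000")` evaluates the Oliver palette's yellow on black, while the same yellow on a white background falls well below the AA threshold.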

Request Flow (Production Docker Stack):

1. The user uploads a PDF via the web UI, REST API, or CLI.
2. api.php validates auth (Bearer token or API key via auth.php) and saves the file to uploads/.
3. A job is created in PostgreSQL and queued to Redis (pdf:queue).
4. The worker.py daemon (background process) pops the job and invokes EnterprisePDFChecker.check_all().
5. All external API calls (Claude, GCV) are wrapped with exponential backoff retry logic (retry_helper.py).
6. Results are written to results/{job_id}.result.json, and PostgreSQL is updated with the completion status.
7. The client polls api.php?action=status for progress and fetches the final results when they are ready.
8. Automatic cleanup (cleanup.py) removes uploads after 24h and results after 30 days.
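
The queue-and-worker part of this flow can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: an in-memory `deque` and `dict` stand in for Redis (`pdf:queue`) and the PostgreSQL jobs table, `run_checks` is a placeholder for `EnterprisePDFChecker.check_all()`, and the status names are hypothetical.

```python
import json
import uuid
from collections import deque
from pathlib import Path


def enqueue_job(queue: deque, jobs: dict, pdf_path: str) -> str:
    """Record a new job and push its id onto the queue (stand-in for Redis)."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "file": pdf_path}
    queue.append(job_id)
    return job_id


def run_checks(pdf_path: str) -> dict:
    """Placeholder for the real checker; returns an empty report."""
    return {"file": pdf_path, "issues": []}


def process_next(queue: deque, jobs: dict, results_dir: Path):
    """Pop one job, run the checks, and write {job_id}.result.json."""
    if not queue:
        return None  # nothing to do
    job_id = queue.popleft()
    jobs[job_id]["status"] = "processing"
    try:
        report = run_checks(jobs[job_id]["file"])
        (results_dir / f"{job_id}.result.json").write_text(json.dumps(report))
        jobs[job_id]["status"] = "completed"
    except Exception as exc:  # record the failure so pollers see it
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["error"] = str(exc)
    return job_id


def get_status(jobs: dict, job_id: str) -> str:
    """What a status poll would return for this job."""
    return jobs[job_id]["status"]
```

Keeping the job record separate from the queue entry is what lets a client poll for status while the worker is still running: the queue only carries ids, and the record carries state.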

Key Source Files:

| File | Purpose |
|------|---------|
| `enterprise_pdf_checker.py` | Core validation engine: 30+ WCAG checks, AI image analysis, scoring logic |
| `api.php` | REST API router: upload/check/status/result/remediate/download endpoints, CORS headers |
| `auth.php` | Authentication middleware: Bearer token, X-API-Key, dev mode localhost bypass |
| `worker.py` | Background daemon: Redis queue consumer, graceful shutdown on signals |
| `db_manager.py` | PostgreSQL ORM: jobs CRUD, audit logging, connection pooling |
| `redis_queue.py` | Redis operations: job enqueue/dequeue, status tracking, rate limiting |
| `pdf_remediation.py` | Auto-remediation: metadata fixing, language tagging, structural repairs |
| `retry_helper.py` | Exponential backoff: retries for Claude API, GCV API, transient failures |
| `report_generator.py` | Result formatting: JSON reports, HTML export, compliance summaries |
| `logger_config.py` | Structured logging: JSON output, file rotation (10MB max), syslog integration |
| `cleanup.py` | Scheduled task: file retention enforcement (24h uploads, 30d results) |
| `index.html` | Web UI root: loads CSS/JS, sets up drag-drop zone, result viewer |
| `js/app.js` | Frontend logic: API calls, progress polling, DOM updates, dark mode |
| `css/style.css` | Branding: Oliver palette (black, #FFC407), Montserrat font, responsive layout |
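
As an illustration of the exponential backoff that `retry_helper.py` is described as providing, here is a minimal generic sketch. The function name, parameters, defaults, and the choice of retryable exceptions are all assumptions made for the example; the actual helper may differ.

```python
import random
import time


def retry_with_backoff(
    func,
    *,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call func(); on a retryable error wait base_delay * 2**n, then retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay += random.uniform(0, delay * 0.1)  # small jitter
            sleep(delay)
```

A caller would wrap each external request in a zero-argument callable, for example `retry_with_backoff(lambda: call_vision_api(image))`, where `call_vision_api` is a hypothetical function. The `sleep` parameter is injectable so the delays can be observed in tests without actually waiting.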

Dev Commands

# Activate virtual environment
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run all tests (31 automated tests)
pytest tests/ -v

# Run with coverage report
pytest tests/ --cov=. --cov-report=html

# Run single test file
pytest tests/test_checker.py -v

# Skip integration tests (faster local runs)
pytest tests/ -m "not integration"

# Start development server (PHP)
php -S localhost:8000

# CLI: Full accessibility check
python enterprise_pdf_checker.py document.pdf --output report.json

# CLI: Quick check (skip AI image analysis)
python enterprise_pdf_checker.py document.pdf --quick

# CLI: Auto-remediate all fixable issues
python pdf_remediation.py document.pdf --output fixed.pdf --all

# Docker development stack (all services)
docker-compose up

# Docker production stack
docker-compose -f docker-compose.prod.yml up -d

# Run tests inside Docker container
docker-compose exec worker pytest tests/ -v

# View worker logs
docker-compose logs -f worker

Deployment

Server: optical-web-1
Deploy:
- Development: docker-compose up
- Production: docker-compose -f docker-compose.prod.yml up -d, or via deploy.sh (runs git pull, then restarts Apache)
- Manual: push to the main branch of git@bitbucket.org:zlalani/pdf-accessibility.git; the server auto-deploys via webhook
URL: https://ai-sandbox.oliver.solutions/pdf-accessibility
Port: 8000 (development), 80/443 (production via Apache reverse proxy)
Service: None (Docker Compose manages container lifecycle; Apache may use systemd)
Local path: /Users/ai_leed/Documents/Projects/Oliver/pdf-accessibility

Environment Variables

All configured in .env (copy from .env.example):

ANTHROPIC_API_KEY — Anthropic Claude API key (required; get from https://console.anthropic.com/)
GOOGLE_API_KEY — Google Cloud Vision API key (optional; for OCR and text-in-images detection)
GOOGLE_APPLICATION_CREDENTIALS — Path to GCP service account JSON (alternative to API key)
DEV_MODE — Set to true for localhost auth bypass (development only)
DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PASSWORD — PostgreSQL connection (docker-compose provides defaults)
CLOUD_RUN_URL — Optional Cloud Run service URL for distributed PDF processing; defaults to local Python
GCP_SA_KEY_PATH — Path to GCP service account key for Cloud Run authentication
GCS_BUCKET_NAME — Google Cloud Storage bucket for page images (default: optical-pdf-images)
RETENTION_HOURS — Keep uploaded PDFs for N hours before deletion (default: 24)
RESULTS_RETENTION_HOURS — Keep result/meta JSON for N hours before deletion (default: 720 = 30 days)
AZURE_TENANT_ID, AZURE_CLIENT_ID, `AZURE_
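
The two retention variables above could drive a cleanup pass along the following lines. This is a sketch under assumptions, not the contents of `cleanup.py`: the function names are hypothetical, and file age is judged by modification time.

```python
import os
import time
from pathlib import Path


def retention_seconds(var_name: str, default_hours: float, env=os.environ) -> float:
    """Read a retention period in hours from the environment, as seconds."""
    return float(env.get(var_name, default_hours)) * 3600


def remove_expired(directory, max_age_seconds: float, now=None) -> list:
    """Delete files in directory older than max_age_seconds; return their paths."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path)
    return removed
```

A scheduled run would then call `remove_expired("uploads", retention_seconds("RETENTION_HOURS", 24))` and `remove_expired("results", retention_seconds("RESULTS_RETENTION_HOURS", 720))`, matching the documented defaults.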

Timeline / Git History

| Date | Change |
|------|--------|
| 2026-03-18 | Fix CP14 heading detection via RoleMap + manual pass support |
| 2026-03-18 | Persist adjusted score to server on Recalculate |
| 2026-03-18 | Address client feedback: WCAG badges, table grouping, history UX |
| 2026-03-16 | PDF report reflects adjusted score + manual pass |
| 2026-03-13 | Move document history to separate history.html page |
| 2026-03-13 | Fix history: read jobs from data.data.jobs |

Sessions

2026-04-14 – Project catalogued

Done: Added to Obsidian second brain with full details.

Change Log

| Date | Requested | Changed | Files |
|------|-----------|---------|-------|
| 2026-03-18 | Fix heading detection | CP14 via RoleMap + manual pass | enterprise_pdf_checker.py |
| 2026-03-13 | Separate history page | Move to history.html | history.html, api.php |
