| name | client | status | tech | local_path | deploy | url | server | tags | created | last_commit | commits | port | db |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PDF Accessibility Checker | Oliver Internal | active | | /Users/ai_leed/Documents/Projects/Oliver/pdf-accessibility | docker-compose up or docker-compose -f docker-compose.prod.yml up -d | http://localhost:8000 | local | | 2026-04-14 | 2026-03-18 | 69 | 8000 | PostgreSQL |
Overview
PDF Accessibility is an AI-powered validation tool that checks PDF documents against WCAG 2.1 Level A & AA standards using Claude 3.5 Sonnet for image/alt-text analysis and Google Cloud Vision for OCR. It combines traditional PDF analysis with machine learning to achieve ~95% automated coverage of accessibility requirements, offering auto-remediation, a visual page inspector, and three interfaces: Web UI, REST API, and CLI. Built for Oliver Internal, it enables organizations to audit and fix PDF accessibility issues at scale with minimal manual intervention.
Tech Stack
- Frontend: Vanilla JavaScript + HTML5/CSS3, drag-drop UI with visual inspector, dark mode support
- Backend: PHP (REST API) + Python (core engine, worker daemon, CLI)
- Database: PostgreSQL 16 (jobs tracking, audit logs), Redis (job queue)
- Infrastructure: Docker Compose (dev & prod stacks), Azure AD/MSAL for SSO auth
- AI/ML: Anthropic Claude 3.5 Sonnet (image analysis, alt-text validation), Google Cloud Vision (OCR, text detection), veraPDF (PDF/UA-1 validation)
- Key libraries: pypdf, pdfplumber, Pillow, pdf2image, pytesseract, textblob, anthropic, google-cloud-vision
Architecture
The system comprises three parallel interfaces (Web UI, REST API, CLI) feeding into a unified core engine:
┌─────────────────────────────────────────────────────────────┐
│                      THREE INTERFACES                       │
├──────────────────┬───────────────────────┬──────────────────┤
│ Web UI           │ REST API              │ CLI              │
│ (index.html)     │ (api.php)             │ (Python direct)  │
│ Drag-drop        │ POST /upload          │ python ...py     │
│ Visual inspector │ GET /status           │ --output JSON    │
└────────┬─────────┴───────────┬───────────┴────────┬─────────┘
         │                     │                    │
         └─────────────────────┼────────────────────┘
                               ▼
                     ┌──────────────────┐
                     │ Job Orchestration│
                     ├──────────────────┤
                     │ auth.php         │  Auth gate (Bearer/X-API-Key)
                     │ db_manager.py    │  PostgreSQL ORM
                     │ redis_queue.py   │  Job queue (pdf:queue)
                     └─────────┬────────┘
                               ▼
             ┌──────────────────────────────────┐
             │ worker.py (Background Daemon)    │
             │ Pops jobs → Runs checks → Stores │
             │ results/{job_id}.result.json     │
             └─────────────────┬────────────────┘
                               ▼
          ╔═════════════════════════════════════════╗
          ║ enterprise_pdf_checker.py               ║
          ║ CORE ENGINE: 30+ WCAG checks            ║
          ╠═════════════════════════════════════════╣
          ║ 1. Metadata & structure (pypdf)         ║
          ║ 2. Text extraction & readability        ║
          ║ 3. Color contrast (Pillow analysis)     ║
          ║ 4. Image alt-text (Claude 3.5 Sonnet)   ║
          ║ 5. OCR for text-in-images (GCV)         ║
          ║ 6. Heading hierarchy & tagging          ║
          ║ 7. Form field labels                    ║
          ║ 8. PDF/UA-1 validation (veraPDF)        ║
          ╚═════════════════════════════════════════╝
                               │
                               ├─→ pdf_remediation.py (Auto-fix)
                               ├─→ report_generator.py (Format results)
                               └─→ retry_helper.py (Exponential backoff)
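The diagram names `retry_helper.py` as the exponential-backoff helper, but its contents are not shown in this note. The following is a minimal sketch of the general technique, not the project's actual helper; `retry_with_backoff` and all of its parameters and defaults are hypothetical.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is exhausted.

    The wait doubles after each failure (base_delay, 2x, 4x, ...), is capped
    at max_delay, and gets up to 10% random jitter so that parallel workers
    do not retry in lockstep. Names and defaults are illustrative only.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
```

A wrapper like this would typically surround the calls to external AI and OCR services, where transient rate-limit or network errors are most likely.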
Data Flow (Production/Docker):
- Client uploads PDF via Web UI → `api.php` validates auth & saves to `uploads/`
- Job pushed to Redis queue; PostgreSQL job record created with status `pending`
- `worker.py` daemon polls Redis, pops job, runs `EnterprisePDFChecker.check_all()`
- Results written to `results/{job_id}.result.json`; DB updated with status `complete`
- Client polls `api.php?action=status` for progress; retrieves results when ready
- Optional: User triggers `api.php?action=remediate` for auto-fix → new PDF created
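The queue-and-worker steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: an in-memory `queue.Queue` stands in for the Redis `pdf:queue` list, a dict stands in for the PostgreSQL job table, `run_checks` is a hypothetical placeholder for `EnterprisePDFChecker.check_all()`, and `enqueue_job` and `process_next_job` are invented names.

```python
import json
import queue
import tempfile
import uuid
from pathlib import Path

job_queue = queue.Queue()   # stand-in for the Redis list pdf:queue
job_table = {}              # stand-in for the PostgreSQL jobs table


def enqueue_job(pdf_path):
    """Upload step: record the job as pending and push it onto the queue."""
    job_id = uuid.uuid4().hex
    job_table[job_id] = {"status": "pending", "pdf_path": str(pdf_path)}
    job_queue.put(job_id)
    return job_id


def run_checks(pdf_path):
    """Hypothetical placeholder for EnterprisePDFChecker.check_all()."""
    return {"file": Path(pdf_path).name, "issues": []}


def process_next_job(results_dir):
    """Worker step: pop one job, run the checks, persist the result.

    Raises queue.Empty if no job is waiting.
    """
    job_id = job_queue.get_nowait()
    job = job_table[job_id]
    job["status"] = "processing"
    try:
        result = run_checks(job["pdf_path"])
        out_path = Path(results_dir) / f"{job_id}.result.json"
        out_path.write_text(json.dumps(result))
        job["status"] = "complete"
    except Exception:
        job["status"] = "failed"
        raise
    return job_id


# Usage: one upload followed by one worker iteration.
results_dir = tempfile.mkdtemp()
job_id = enqueue_job("document.pdf")
process_next_job(results_dir)
print(job_table[job_id]["status"])  # complete
```

Keeping the status transitions (`pending`, `processing`, `complete`, `failed`) in one place is what lets the client poll a single status field while the long-running checks happen outside the HTTP request.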
Key Design Decisions:
- Async queue: Redis + background worker allows long-running AI checks without blocking HTTP
- PostgreSQL: Persistent job history, audit logging, user isolation (MSAL integration)
- Caching: AI responses cached by content hash in `.cache/` to reduce API costs (~$0.015/image)
- Graceful degradation: CLI mode works standalone; API mode requires Docker stack
- Environment-aware: `DEV_MODE=true` bypasses auth for localhost; production requires valid keys
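The caching decision above can be illustrated with a short sketch. The note only says that responses are cached by content hash in `.cache/`; the choice of SHA-256, the one-JSON-file-per-hash layout, and the names `cached_analysis` and `analyze` are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def cached_analysis(image_bytes, analyze, cache_dir=".cache"):
    """Return analyze(image_bytes), reusing a stored result for identical bytes.

    analyze is any callable that returns a JSON-serialisable result; in the
    real system it would be the paid AI call. The hash algorithm and file
    layout are assumptions, not taken from the project.
    """
    key = hashlib.sha256(image_bytes).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())   # cache hit: no API call
    result = analyze(image_bytes)                    # cache miss: paid call
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result
```

Because the key depends only on the bytes, the same image embedded in several PDFs, or re-checked after remediation, is analysed once.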
Dev Commands
# Setup
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
brew install php tesseract poppler verapdf # macOS system deps
# Development (PHP dev server + Python worker in separate terminals)
source venv/bin/activate
php -S localhost:8000 # Terminal 1: Start dev server
python worker.py # Terminal 2: Start background worker
# Docker development stack (all services)
docker-compose up
# Docker production stack (detached)
docker-compose -f docker-compose.prod.yml up -d
# Testing
pytest tests/ -v # All 31 tests
pytest tests/ --cov=. --cov-report=html # With coverage (34% current)
pytest tests/test_checker.py -v # Single file
pytest tests/ -m "not integration" # Skip slow tests
# CLI Usage (direct Python)
python enterprise_pdf_checker.py document.pdf --output report.json # Full check
python enterprise_pdf_checker.py document.pdf --quick # Skip AI (faster)
python pdf_remediation.py document.pdf --output fixed.pdf --all # Auto-remediate
Deployment
- Server: Local development (can deploy to optical-web-1 or Cloud Run)
- Deploy: `docker-compose -f docker-compose.prod.yml up -d` (Docker) or `bash deploy.sh` (Git-based)
- URL: `http://localhost:8000` (dev) | Production URL in `.env` (`CLOUD_RUN_URL`)
- Port: 8000 (HTTP), 1220 (Redis in prod), 1221 (PostgreSQL in prod)
- Service: None (containerized); can be wrapped in systemd if needed
- Local path: `/Users/ai_leed/Documents/Projects/Oliver/pdf-accessibility`
Production Notes:
- Set `DEV_MODE=false` before deploying
- Rotate default API key `dev_key_12345` → generate secure key
- Store `ANTHROPIC_API_KEY` and `GOOGLE_API_KEY` in secrets manager
- A backup strategy for PostgreSQL is recommended
- veraPDF requires Java; included in Docker image
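For the key-rotation note above, one way to produce a replacement key is Python's standard `secrets` module. This is a sketch only: how and where the project stores its API keys is not described here, and `generate_api_key`, `is_safe_for_production`, and the 32-character minimum are hypothetical.

```python
import secrets

DEFAULT_DEV_KEY = "dev_key_12345"  # the default key named in the notes above


def generate_api_key(num_bytes=32):
    """Return a URL-safe random token to use as a replacement API key."""
    return secrets.token_urlsafe(num_bytes)


def is_safe_for_production(api_key):
    """Reject the shipped default and anything shorter than 32 characters.

    The length threshold is an illustrative assumption.
    """
    return api_key != DEFAULT_DEV_KEY and len(api_key) >= 32
```

A check like `is_safe_for_production` could run at startup whenever `DEV_MODE` is `false`, so a forgotten default key fails fast instead of silently shipping.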
Environment Variables
- `ANTHROPIC_API_KEY`: Anthropic Claude API key (required for AI image analysis). Get from https://console.anthropic.com/
- `GOOGLE_API_KEY`: Google Cloud Vision API key (optional; for enhanced OCR). Alternative: `GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json`
- `GOOGLE_CLOUD_PROJECT`: GCP project ID (if using service account file)
- `DEV_MODE=true|false`: If `true`, localhost bypasses authentication. Never enable in production.
- `DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`: PostgreSQL connection (Docker: use service name `postgres`)
- `REDIS_URL`: Redis connection string (Docker: `redis://redis:6379`)
- `RETENTION_HOURS`: Delete uploaded PDFs after N hours (default 24)
- `RESULTS_RETENTION_HOURS`: Keep result JSON files for N hours (default 720 = 30 days)
- `CLOUD_RUN_URL`: Deployed Cloud Run endpoint for remote PDF processing
- `GCP_SA_KEY_PATH`: Path to GCP service account key JSON
- `GCS_BUCKET_NAME`: Google Cloud Storage bucket for page images
- `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `AZURE_REDIRECT_URI`: Azure AD (MSAL) SSO configuration
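As an illustration of how a few of these variables might be read with their documented defaults, here is a minimal sketch. The `Settings` class and `load_settings` function are hypothetical; defaulting `DEV_MODE` to `false` and `REDIS_URL` to the Docker value are assumptions, while the two retention defaults (24 and 720) come from the list above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    dev_mode: bool
    redis_url: str
    retention_hours: int
    results_retention_hours: int


def load_settings(env=None):
    """Build Settings from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    return Settings(
        # Assumption: anything other than "true" (case-insensitive) is false.
        dev_mode=env.get("DEV_MODE", "false").strip().lower() == "true",
        # Assumption: fall back to the Docker service URL.
        redis_url=env.get("REDIS_URL", "redis://redis:6379"),
        retention_hours=int(env.get("RETENTION_HOURS", "24")),
        results_retention_hours=int(env.get("RESULTS_RETENTION_HOURS", "720")),
    )
```

Treating an unset `DEV_MODE` as `false` keeps the auth bypass opt-in, which matches the warning never to enable it in production.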
API / Endpoints
REST API (api.php base URL: http://localhost:8000)
| Method | Endpoint | Auth | Body/Params | Returns |
|---|---|---|---|---|
| POST | `/api.php` (action=upload) | Bearer/X-API-Key | `file` (multipart) | `{"job_id": "...", "status": "pending"}` |
| GET | `/api.php?action=status&job_id=...` | Bearer/X-API-Key | N/A | `{"status": "processing\|complete\|failed", "progress": 45}` |
| GET | `/api.php?action=result&job_id=...` | Bearer/X-API-Key | N/A | `{"result": {...}}` (full WC |
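A client can poll the status endpoint listed in the table above until a job finishes. The sketch below uses only the standard library and takes the URL shape, the `X-API-Key` header, and the `status` values from this note; the function names, the 2-second interval, the poll limit, and the injectable `fetch` parameter are assumptions for illustration.

```python
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8000/api.php"


def status_url(job_id, base_url=BASE_URL):
    """Build the status URL for a job, URL-encoding the parameters."""
    query = urllib.parse.urlencode({"action": "status", "job_id": job_id})
    return f"{base_url}?{query}"


def fetch_json(url, api_key):
    """GET a URL with the X-API-Key header and decode the JSON body."""
    request = urllib.request.Request(url, headers={"X-API-Key": api_key})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_job(job_id, api_key, fetch=fetch_json, interval=2.0,
                 max_polls=150, sleep=time.sleep):
    """Poll until the job reports complete or failed, then return the payload."""
    for _ in range(max_polls):
        payload = fetch(status_url(job_id), api_key)
        if payload.get("status") in ("complete", "failed"):
            return payload
        sleep(interval)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

Passing `fetch` and `sleep` as parameters keeps the polling logic testable without a running server.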
Timeline / Git History
| Date | Change |
|---|---|
| 2026-03-18 | Fix CP14 heading detection via RoleMap + manual pass support |
| 2026-03-18 | Persist adjusted score to server on Recalculate |
| 2026-03-18 | Address client feedback: WCAG badges, table grouping, history UX |
| 2026-03-16 | PDF report reflects adjusted score + manual pass |
| 2026-03-13 | Move document history to separate history.html page |
| 2026-03-13 | Fix history: read jobs from data.data.jobs |
Sessions
2026-04-14 – Project catalogued
Done: Added to Obsidian second brain with full details.
Change Log
| Date | Requested | Changed | Files |
|---|---|---|---|
| 2026-03-18 | Fix heading detection | CP14 via RoleMap + manual pass | enterprise_pdf_checker.py |
| 2026-03-13 | Separate history page | Move to history.html | history.html, api.php |