Replace the Redis queue + Python worker daemon with a synchronous HTTP call to a Cloud Run service, eliminating Redis and simplifying the infrastructure from 4 containers (web, worker, redis, postgres) to just web + postgres (with Cloud Run handling processing).

- Add cloudrun_service.py: Flask app wrapping EnterprisePDFChecker with POST /check and GET /health endpoints, GCS image upload
- Add Dockerfile.cloudrun + requirements-cloudrun.txt for Cloud Run image
- Add cloudbuild.yaml for Cloud Build with custom Dockerfile
- Rewrite api.php: remove all Redis code, add Cloud Run OIDC auth (getCloudRunToken), synchronous processing in handleCheck(), file-based rate limiting, GCS redirect in handleImage(), DB helper updateJobInDatabase()
- Update js/upload.js: handle synchronous completed response from Cloud Run, increase poll timeout to 15 minutes
- Update js/page-viewer.js: use GCS URLs directly for page images
- Simplify docker-compose.yml and docker-compose.prod.yml: remove worker and redis services
- Remove PHP Redis extension from Dockerfile.web
- Set 900s timeouts across nginx, PHP-FPM, gunicorn, curl, and Cloud Run
- Update cleanup.py: remove result_images pattern (now on GCS), add rate_limits cleanup
- Update .env.example: replace Redis vars with Cloud Run/GCS config

Cloud Run service deployed to: https://pdf-checker-bcb6ipdqka-uc.a.run.app
GCS bucket: gs://optical-pdf-images (7-day lifecycle, public read)
GCP project: optical-414516

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
AI-powered PDF accessibility checker that validates documents against WCAG 2.1 Level A & AA standards. Combines traditional PDF analysis (pypdf, pdfplumber) with AI models (Anthropic Claude, Google Cloud Vision) for ~95% automated WCAG coverage. Branded for "Oliver" (Montserrat font, black/#FFC407 palette).
Commands
Testing
source venv/bin/activate
pytest tests/ -v # Run all tests (31 tests)
pytest tests/ --cov=. --cov-report=html # With coverage report
pytest tests/test_checker.py -v # Single test file
pytest tests/ -m "not integration" # Skip integration tests
Running Locally
source venv/bin/activate
php -S localhost:8000 # Start PHP dev server
Docker
docker-compose up # Development stack
docker-compose -f docker-compose.prod.yml up -d # Production stack
docker-compose exec worker pytest tests/ -v # Tests in container
CLI Usage
python enterprise_pdf_checker.py document.pdf --output report.json # Full check
python enterprise_pdf_checker.py document.pdf --quick # Skip AI checks
python pdf_remediation.py document.pdf --output fixed.pdf --all # Auto-remediate
Architecture
Three Interfaces
- Web UI (index.html + js/ + css/): vanilla JS, drag-drop upload, visual inspector
- REST API (api.php): PHP endpoints for upload, check, status, result, remediate, download
- CLI (enterprise_pdf_checker.py): direct Python execution
Request Flow (Docker/Production)
1. api.php receives upload, validates via auth.php, saves to uploads/
2. Job pushed to Redis queue (pdf:queue) and tracked in PostgreSQL
3. worker.py daemon pops jobs, runs EnterprisePDFChecker.check_all()
4. Results written to results/{job_id}.result.json, DB updated
5. Client polls api.php?action=status then fetches results
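The queue-based flow above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's worker.py: a `deque` and a plain dict stand in for the Redis queue and status store, `check_pdf` is a hypothetical stand-in for `EnterprisePDFChecker.check_all()`, and the job payload fields (`job_id`, `pdf_path`) are assumed.

```python
import json
from collections import deque
from pathlib import Path
from typing import Optional


def check_pdf(pdf_path: str) -> dict:
    """Hypothetical stand-in for EnterprisePDFChecker.check_all()."""
    return {"file": pdf_path, "score": 100, "issues": []}


def process_next_job(queue: deque, statuses: dict, results_dir: Path) -> Optional[str]:
    """Pop one job, run the check, write the result file, and record status.

    Returns the processed job id, or None when the queue is empty.
    """
    if not queue:
        return None

    # Jobs are assumed to be JSON strings with job_id and pdf_path fields.
    job = json.loads(queue.popleft())
    job_id = job["job_id"]
    statuses[job_id] = "processing"

    try:
        result = check_pdf(job["pdf_path"])
        # Mirrors the results/{job_id}.result.json naming described above.
        out_file = results_dir / f"{job_id}.result.json"
        out_file.write_text(json.dumps(result))
        statuses[job_id] = "completed"
    except Exception as exc:
        # A failed check should not stop the loop from taking the next job.
        statuses[job_id] = f"failed: {exc}"

    return job_id
```

A real daemon would call a function like this in a loop with a blocking pop and would also update the PostgreSQL job row; both are left out here to keep the sketch self-contained.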
Key Source Files
| File | Purpose |
|---|---|
| enterprise_pdf_checker.py | Core engine: 30+ WCAG checks, AI image analysis, scoring |
| api.php | REST API: file handling, job queue integration, CORS |
| auth.php | Authentication: Bearer/X-API-Key, dev mode localhost bypass |
| worker.py | Background daemon: Redis queue consumer, graceful shutdown |
| db_manager.py | PostgreSQL ORM: jobs CRUD, audit logging |
| redis_queue.py | Redis operations: job queue, status tracking, rate limiting |
| pdf_remediation.py | Auto-fix: metadata, tagging, language tags |
| retry_helper.py | Exponential backoff for external API calls |
| report_generator.py | Result formatting and report generation |
| logger_config.py | Structured logging with rotation (10MB max) |
| cleanup.py | File retention cleanup (24h for uploads/results) |
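For readers unfamiliar with the exponential backoff that retry_helper.py is described as providing, the general technique looks roughly like the sketch below. The function name, parameters, and default values are assumptions for illustration and are not taken from the repository.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,      # assumed default, not from the repo
    base_delay: float = 0.5,    # seconds before the first retry
    max_delay: float = 8.0,     # upper bound on the un-jittered delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. A production version would usually retry only on specific transient errors (timeouts, HTTP 429/5xx) rather than on every exception.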
Data Layer
- PostgreSQL: jobs table (status, score, grade, result JSON), audit_log table. Schema in db/init.sql
- Redis: job queue (pdf:queue), status tracking (pdf:status:*), rate limiting (pdf:rate:*)
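To make the Redis key layout concrete, here is a small sketch of how the three key families could be built and how a fixed-window rate limit might sit on top of `pdf:rate:*`. Only the key prefixes come from the text above. What follows each prefix (job id, client id), the limit and window values, and the in-memory `FakeStore` (used instead of a real Redis connection) are all assumptions.

```python
import time
from typing import Dict, Optional, Tuple

QUEUE_KEY = "pdf:queue"


def status_key(job_id: str) -> str:
    """Key for one job's status; the job-id suffix is an assumption."""
    return f"pdf:status:{job_id}"


def rate_key(client_id: str) -> str:
    """Key for one client's request counter; the suffix is an assumption."""
    return f"pdf:rate:{client_id}"


class FakeStore:
    """In-memory stand-in for a counter that expires after a TTL."""

    def __init__(self) -> None:
        self._counters: Dict[str, Tuple[int, float]] = {}  # key -> (count, expires_at)

    def incr_with_ttl(self, key: str, ttl: float, now: float) -> int:
        count, expires_at = self._counters.get(key, (0, 0.0))
        if now >= expires_at:
            # Window elapsed (or first request): start a fresh counter.
            count, expires_at = 0, now + ttl
        count += 1
        self._counters[key] = (count, expires_at)
        return count


def allow_request(
    store: FakeStore,
    client_id: str,
    limit: int = 10,        # hypothetical requests per window
    window: float = 60.0,   # hypothetical window length in seconds
    now: Optional[float] = None,
) -> bool:
    """Fixed-window limiter: allow while the window's count is within limit."""
    current = time.time() if now is None else now
    return store.incr_with_ttl(rate_key(client_id), window, current) <= limit
```

Against a real Redis server the same idea is commonly expressed as an increment followed by setting an expiry on the first hit of each window.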
External APIs
- Anthropic Claude 3.5 Sonnet — alt text validation, image classification, text-in-images
- Google Cloud Vision — OCR, text detection
- veraPDF (optional) — PDF/UA-1 compliance validation
Frontend Structure
js/app.js (controller), js/upload.js (drag-drop), js/api.js (HTTP client), js/results.js (display), js/page-viewer.js (PDF inspector), js/batch.js (batch processing), js/utils.js (helpers)
Tech Stack
- Backend: Python 3.11 (processing), PHP 8.2 (API)
- Frontend: Vanilla HTML/CSS/JS
- Database: PostgreSQL 16, Redis 7
- Infrastructure: Docker, Nginx/Apache, PHP-FPM
- System deps: Tesseract OCR, Poppler, Ghostscript
Configuration
Environment variables via .env (see .env.example). Key settings:
- ANTHROPIC_API_KEY / GOOGLE_API_KEY: AI API credentials
- DEV_MODE=true: bypasses auth for localhost requests
- DB_HOST, DB_PORT, REDIS_HOST, REDIS_PORT: infrastructure endpoints
- Production uses ports 1220 (Redis) and 1221 (PostgreSQL) to avoid host conflicts
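As an illustration of how the Python side might read these settings, the sketch below loads them into a typed object. The `Settings` class and `load_settings` function are hypothetical. The fallback values (localhost, 5432, 6379) are the stock PostgreSQL and Redis defaults and are assumed here, not taken from .env.example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    anthropic_api_key: str
    google_api_key: str
    dev_mode: bool
    db_host: str
    db_port: int
    redis_host: str
    redis_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, with assumed fallbacks."""
    return Settings(
        anthropic_api_key=env.get("ANTHROPIC_API_KEY", ""),
        google_api_key=env.get("GOOGLE_API_KEY", ""),
        # Only the literal string "true" (any case) enables dev mode.
        dev_mode=env.get("DEV_MODE", "false").strip().lower() == "true",
        db_host=env.get("DB_HOST", "localhost"),
        db_port=int(env.get("DB_PORT", "5432")),
        redis_host=env.get("REDIS_HOST", "localhost"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
    )
```

For the production layout described above, the environment would carry `REDIS_PORT=1220` and `DB_PORT=1221`. Taking the environment as a mapping argument lets tests pass a plain dict instead of mutating `os.environ`.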
Testing
- pytest with markers: integration, slow, api
- Config in pytest.ini
- Fixtures in tests/conftest.py
- Sample PDFs in Test_files/
- No linter currently configured