michael 4080638856 Migrate PDF processing from Redis worker to Google Cloud Run

Replace the Redis queue + Python worker daemon with a synchronous HTTP
call to a Cloud Run service, eliminating Redis and simplifying the
infrastructure from 4 containers (web, worker, redis, postgres) to just
web + postgres (with Cloud Run handling processing).

- Add cloudrun_service.py: Flask app wrapping EnterprisePDFChecker with
  POST /check and GET /health endpoints, GCS image upload
- Add Dockerfile.cloudrun + requirements-cloudrun.txt for Cloud Run image
- Add cloudbuild.yaml for Cloud Build with custom Dockerfile
- Rewrite api.php: remove all Redis code, add Cloud Run OIDC auth
  (getCloudRunToken), synchronous processing in handleCheck(), file-based
  rate limiting, GCS redirect in handleImage(), DB helper updateJobInDatabase()
- Update js/upload.js: handle synchronous completed response from Cloud Run,
  increase poll timeout to 15 minutes
- Update js/page-viewer.js: use GCS URLs directly for page images
- Simplify docker-compose.yml and docker-compose.prod.yml: remove worker
  and redis services
- Remove PHP Redis extension from Dockerfile.web
- Set 900s timeouts across nginx, PHP-FPM, gunicorn, curl, and Cloud Run
- Update cleanup.py: remove result_images pattern (now on GCS), add
  rate_limits cleanup
- Update .env.example: replace Redis vars with Cloud Run/GCS config

Cloud Run service deployed to:
  https://pdf-checker-bcb6ipdqka-uc.a.run.app
GCS bucket: gs://optical-pdf-images (7-day lifecycle, public read)
GCP project: optical-414516

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-25 14:50:38 -06:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

AI-powered PDF accessibility checker that validates documents against WCAG 2.1 Level A & AA standards. Combines traditional PDF analysis (pypdf, pdfplumber) with AI models (Anthropic Claude, Google Cloud Vision) for ~95% automated WCAG coverage. Branded for "Oliver" (Montserrat font, black/#FFC407 palette).

Commands

Testing

source venv/bin/activate
pytest tests/ -v                          # Run all tests (31 tests)
pytest tests/ --cov=. --cov-report=html   # With coverage report
pytest tests/test_checker.py -v           # Single test file
pytest tests/ -m "not integration"        # Skip integration tests

Running Locally

source venv/bin/activate
php -S localhost:8000                     # Start PHP dev server

Docker

docker-compose up                                      # Development stack
docker-compose -f docker-compose.prod.yml up -d        # Production stack
docker-compose exec worker pytest tests/ -v            # Tests in container

CLI Usage

python enterprise_pdf_checker.py document.pdf --output report.json   # Full check
python enterprise_pdf_checker.py document.pdf --quick                # Skip AI checks
python pdf_remediation.py document.pdf --output fixed.pdf --all      # Auto-remediate

Architecture

Three Interfaces

Web UI (index.html + js/ + css/) — vanilla JS, drag-drop upload, visual inspector
REST API (api.php) — PHP endpoints: upload, check, status, result, remediate, download
CLI (enterprise_pdf_checker.py) — direct Python execution

Request Flow (Docker/Production)

api.php receives upload, validates via auth.php, saves to uploads/
Job pushed to Redis queue (pdf:queue) and tracked in PostgreSQL
worker.py daemon pops jobs, runs EnterprisePDFChecker.check_all()
Results written to results/{job_id}.result.json, DB updated
Client polls api.php?action=status then fetches results
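
The flow above can be sketched as a single worker step. This is a minimal illustration under stated assumptions, not the project's worker.py: an in-memory `queue.Queue` stands in for the Redis list `pdf:queue`, a plain dict stands in for the status tracking and the PostgreSQL update, `fake_check_all` is a hypothetical placeholder for `EnterprisePDFChecker.check_all()`, and the job dict shape (`job_id`, `pdf_path`) is assumed.

```python
import json
import queue
from pathlib import Path
from typing import Optional


def fake_check_all(pdf_path: str) -> dict:
    """Hypothetical stand-in for EnterprisePDFChecker.check_all()."""
    return {"file": pdf_path, "score": 100, "issues": []}


def process_next_job(jobs: queue.Queue, results_dir: Path, statuses: dict) -> Optional[str]:
    """Pop one job, run the check, and write {job_id}.result.json.

    Returns the processed job id, or None if the queue was empty.
    """
    try:
        job = jobs.get_nowait()  # a real daemon would block on the Redis queue
    except queue.Empty:
        return None

    job_id = job["job_id"]
    statuses[job_id] = "processing"
    try:
        result = fake_check_all(job["pdf_path"])
        out_path = results_dir / f"{job_id}.result.json"
        out_path.write_text(json.dumps(result), encoding="utf-8")
        statuses[job_id] = "completed"  # the real flow also updates the database
    except Exception:
        statuses[job_id] = "failed"  # record the failure, then let the caller see it
        raise
    return job_id
```

The client-side polling step then only needs to read the status for a job id and fetch the result file once the status is `completed`.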

Key Source Files

| File | Purpose |
| --- | --- |
| `enterprise_pdf_checker.py` | Core engine: 30+ WCAG checks, AI image analysis, scoring |
| `api.php` | REST API: file handling, job queue integration, CORS |
| `auth.php` | Authentication: Bearer/X-API-Key, dev mode localhost bypass |
| `worker.py` | Background daemon: Redis queue consumer, graceful shutdown |
| `db_manager.py` | PostgreSQL ORM: jobs CRUD, audit logging |
| `redis_queue.py` | Redis operations: job queue, status tracking, rate limiting |
| `pdf_remediation.py` | Auto-fix: metadata, tagging, language tags |
| `retry_helper.py` | Exponential backoff for external API calls |
| `report_generator.py` | Result formatting and report generation |
| `logger_config.py` | Structured logging with rotation (10MB max) |
| `cleanup.py` | File retention cleanup (24h for uploads/results) |
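
As an illustration of the exponential backoff that `retry_helper.py` is described as providing, here is a minimal sketch. It is not the contents of that file: the function names, the default base delay, growth factor, and cap are all assumptions, and jitter is left out to keep the delays deterministic.

```python
import time
from typing import Callable, List, Tuple, Type


def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> List[float]:
    """Delay before each retry: base * factor**attempt, never above cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def call_with_retry(func: Callable[[], object], retries: int = 3,
                    base: float = 1.0, factor: float = 2.0, cap: float = 30.0,
                    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
                    sleep: Callable[[float], None] = time.sleep) -> object:
    """Call func(); on a listed exception, wait and retry up to `retries` times."""
    delays = backoff_delays(retries, base, factor, cap)
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(delays[attempt])
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the requested delays instead of actually waiting.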

Data Layer

PostgreSQL — jobs table (status, score, grade, result JSON), audit_log table. Schema in db/init.sql
Redis — Job queue (pdf:queue), status tracking (pdf:status:*), rate limiting (pdf:rate:*)
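
To make the `pdf:rate:*` keys concrete, the sketch below shows one common way such keys are used: a fixed-window request counter. It is an assumption-laden illustration, not the logic in redis_queue.py. A dict stands in for Redis `INCR` plus `EXPIRE`, only the `pdf:rate:` prefix comes from this document, and the key suffix, limit, and window length are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Fixed-window counter; a dict stands in for Redis INCR + EXPIRE."""

    def __init__(self, limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._counters: Dict[str, Tuple[float, int]] = {}  # key -> (window_start, count)

    @staticmethod
    def key_for(client_id: str) -> str:
        # Only the "pdf:rate:" prefix is from the document; the suffix is assumed.
        return f"pdf:rate:{client_id}"

    def allow(self, client_id: str) -> bool:
        """Return True and count the request if the client is under its limit."""
        key = self.key_for(client_id)
        now = self.clock()
        start, count = self._counters.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window expired: start a new one
        if count >= self.limit:
            self._counters[key] = (start, count)
            return False
        self._counters[key] = (start, count + 1)
        return True
```

With `limit=2` and a 60-second window, a client's third request inside the window is rejected, and the counter resets once 60 seconds have passed since the window started.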

External APIs

Anthropic Claude 3.5 Sonnet — alt text validation, image classification, text-in-images
Google Cloud Vision — OCR, text detection
veraPDF (optional) — PDF/UA-1 compliance validation

Frontend Structure

js/app.js (controller), js/upload.js (drag-drop), js/api.js (HTTP client), js/results.js (display), js/page-viewer.js (PDF inspector), js/batch.js (batch processing), js/utils.js (helpers)

Tech Stack

Backend: Python 3.11 (processing), PHP 8.2 (API)
Frontend: Vanilla HTML/CSS/JS
Database: PostgreSQL 16, Redis 7
Infrastructure: Docker, Nginx/Apache, PHP-FPM
System deps: Tesseract OCR, Poppler, Ghostscript

Configuration

Environment variables via .env (see .env.example). Key settings:

ANTHROPIC_API_KEY / GOOGLE_API_KEY — AI API credentials
DEV_MODE=true — bypasses auth for localhost requests
DB_HOST, DB_PORT, REDIS_HOST, REDIS_PORT — infrastructure endpoints
Production uses ports 1220 (Redis) and 1221 (PostgreSQL) to avoid host conflicts
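
A Python process could read these settings roughly as follows. This is a sketch only: the `Settings` dataclass and `load_settings` function are hypothetical names, and the fallback values (localhost, and the standard 5432/6379 ports) are assumptions rather than project defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    db_host: str
    db_port: int
    redis_host: str
    redis_port: int
    dev_mode: bool
    anthropic_api_key: str
    google_api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables listed above; every default here is an assumption."""
    return Settings(
        db_host=env.get("DB_HOST", "localhost"),
        db_port=int(env.get("DB_PORT", "5432")),
        redis_host=env.get("REDIS_HOST", "localhost"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        # DEV_MODE is treated as on only for the literal value "true"
        dev_mode=env.get("DEV_MODE", "false").strip().lower() == "true",
        anthropic_api_key=env.get("ANTHROPIC_API_KEY", ""),
        google_api_key=env.get("GOOGLE_API_KEY", ""),
    )
```

For the production ports mentioned above, `load_settings({"DB_PORT": "1221", "REDIS_PORT": "1220"})` yields `db_port == 1221` and `redis_port == 1220`.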

Testing

pytest with markers: integration, slow, api
Config in pytest.ini
Fixtures in tests/conftest.py
Sample PDFs in Test_files/
No linter currently configured
