Migrate PDF processing from Redis worker to Google Cloud Run
Replace the Redis queue + Python worker daemon with a synchronous HTTP call
to a Cloud Run service, eliminating Redis and simplifying the infrastructure
from 4 containers (web, worker, redis, postgres) to just web + postgres
(with Cloud Run handling processing).

- Add cloudrun_service.py: Flask app wrapping EnterprisePDFChecker with
  POST /check and GET /health endpoints, GCS image upload
- Add Dockerfile.cloudrun + requirements-cloudrun.txt for Cloud Run image
- Add cloudbuild.yaml for Cloud Build with custom Dockerfile
- Rewrite api.php: remove all Redis code, add Cloud Run OIDC auth
  (getCloudRunToken), synchronous processing in handleCheck(), file-based
  rate limiting, GCS redirect in handleImage(), DB helper
  updateJobInDatabase()
- Update js/upload.js: handle synchronous completed response from Cloud Run,
  increase poll timeout to 15 minutes
- Update js/page-viewer.js: use GCS URLs directly for page images
- Simplify docker-compose.yml and docker-compose.prod.yml: remove worker
  and redis services
- Remove PHP Redis extension from Dockerfile.web
- Set 900s timeouts across nginx, PHP-FPM, gunicorn, curl, and Cloud Run
- Update cleanup.py: remove result_images pattern (now on GCS), add
  rate_limits cleanup
- Update .env.example: replace Redis vars with Cloud Run/GCS config

Cloud Run service deployed to: https://pdf-checker-bcb6ipdqka-uc.a.run.app
GCS bucket: gs://optical-pdf-images (7-day lifecycle, public read)
GCP project: optical-414516

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 463b504d67
commit 4080638856
16 changed files with 722 additions and 223 deletions
13
.env.example
@@ -27,12 +27,13 @@ DB_NAME=pdf_checker
 DB_USER=pdf_checker
 DB_PASSWORD=change_me_in_production
 
-# Redis - used for job queue in Docker setup
-REDIS_HOST=redis
-REDIS_PORT=6379
-
-# Worker configuration
-WORKER_COUNT=2
+# Cloud Run - PDF processing service
+# Set this to your deployed Cloud Run URL (leave empty for local Python fallback)
+CLOUD_RUN_URL=https://pdf-checker-bcb6ipdqka-uc.a.run.app
+# Path to GCP service account key for authenticating to Cloud Run
+GCP_SA_KEY_PATH=./pdf-api-invoker-key.json
+# GCS bucket for page images
+GCS_BUCKET_NAME=optical-pdf-images
 
 # Azure AD / MSAL Authentication
 AZURE_TENANT_ID=e519c2e6-bc6d-4fdf-8d9c-923c2f002385
|||
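Review note: the three new variables act as a mode switch, since an empty `CLOUD_RUN_URL` means "use the local Python fallback". The sketch below shows one way a Python-side tool could read them. `CloudRunConfig` and `load_cloud_run_config` are hypothetical names that do not appear in this commit, and the default key path simply mirrors the value in `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class CloudRunConfig:
    """Hypothetical container for the settings added to .env.example."""
    url: str
    sa_key_path: str
    bucket: str

    @property
    def use_cloud_run(self) -> bool:
        # An empty URL selects the local Python fallback.
        return bool(self.url)


def load_cloud_run_config(env: Optional[Mapping[str, str]] = None) -> CloudRunConfig:
    """Read the Cloud Run/GCS settings from an environment mapping."""
    env = os.environ if env is None else env
    return CloudRunConfig(
        # Trailing slashes are dropped so that url + "/check" stays well-formed.
        url=env.get("CLOUD_RUN_URL", "").strip().rstrip("/"),
        sa_key_path=env.get("GCP_SA_KEY_PATH", "./pdf-api-invoker-key.json"),
        bucket=env.get("GCS_BUCKET_NAME", ""),
    )
```

Passing an explicit mapping instead of relying on `os.environ` keeps the function easy to exercise in tests.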
8
.gitignore
vendored
@@ -33,7 +33,13 @@ Thumbs.db
 
 # Docker volumes (local data)
 pg-data/
-redis-data/
+
+# GCP service account keys
+*-key.json
+*-credentials.json
+
+# Rate limit data
+rate_limits/
 
 # Coverage
 .coverage
100
CLAUDE.md
Normal file
@@ -0,0 +1,100 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+AI-powered PDF accessibility checker that validates documents against WCAG 2.1 Level A & AA standards. Combines traditional PDF analysis (pypdf, pdfplumber) with AI models (Anthropic Claude, Google Cloud Vision) for ~95% automated WCAG coverage. Branded for "Oliver" (Montserrat font, black/#FFC407 palette).
+
+## Commands
+
+### Testing
+```bash
+source venv/bin/activate
+pytest tests/ -v # Run all tests (31 tests)
+pytest tests/ --cov=. --cov-report=html # With coverage report
+pytest tests/test_checker.py -v # Single test file
+pytest tests/ -m "not integration" # Skip integration tests
+```
+
+### Running Locally
+```bash
+source venv/bin/activate
+php -S localhost:8000 # Start PHP dev server
+```
+
+### Docker
+```bash
+docker-compose up # Development stack
+docker-compose -f docker-compose.prod.yml up -d # Production stack
+docker-compose exec worker pytest tests/ -v # Tests in container
+```
+
+### CLI Usage
+```bash
+python enterprise_pdf_checker.py document.pdf --output report.json # Full check
+python enterprise_pdf_checker.py document.pdf --quick # Skip AI checks
+python pdf_remediation.py document.pdf --output fixed.pdf --all # Auto-remediate
+```
+
+## Architecture
+
+### Three Interfaces
+- **Web UI** (`index.html` + `js/` + `css/`) — vanilla JS, drag-drop upload, visual inspector
+- **REST API** (`api.php`) — PHP endpoints: upload, check, status, result, remediate, download
+- **CLI** (`enterprise_pdf_checker.py`) — direct Python execution
+
+### Request Flow (Docker/Production)
+1. `api.php` receives upload, validates via `auth.php`, saves to `uploads/`
+2. Job pushed to Redis queue (`pdf:queue`) and tracked in PostgreSQL
+3. `worker.py` daemon pops jobs, runs `EnterprisePDFChecker.check_all()`
+4. Results written to `results/{job_id}.result.json`, DB updated
+5. Client polls `api.php?action=status` then fetches results
+
+### Key Source Files
+| File | Purpose |
+|------|---------|
+| `enterprise_pdf_checker.py` | Core engine — 30+ WCAG checks, AI image analysis, scoring |
+| `api.php` | REST API — file handling, job queue integration, CORS |
+| `auth.php` | Authentication — Bearer/X-API-Key, dev mode localhost bypass |
+| `worker.py` | Background daemon — Redis queue consumer, graceful shutdown |
+| `db_manager.py` | PostgreSQL ORM — jobs CRUD, audit logging |
+| `redis_queue.py` | Redis operations — job queue, status tracking, rate limiting |
+| `pdf_remediation.py` | Auto-fix — metadata, tagging, language tags |
+| `retry_helper.py` | Exponential backoff for external API calls |
+| `report_generator.py` | Result formatting and report generation |
+| `logger_config.py` | Structured logging with rotation (10MB max) |
+| `cleanup.py` | File retention cleanup (24h for uploads/results) |
+
+### Data Layer
+- **PostgreSQL** — `jobs` table (status, score, grade, result JSON), `audit_log` table. Schema in `db/init.sql`
+- **Redis** — Job queue (`pdf:queue`), status tracking (`pdf:status:*`), rate limiting (`pdf:rate:*`)
+
+### External APIs
+- **Anthropic Claude 3.5 Sonnet** — alt text validation, image classification, text-in-images
+- **Google Cloud Vision** — OCR, text detection
+- **veraPDF** (optional) — PDF/UA-1 compliance validation
+
+### Frontend Structure
+`js/app.js` (controller), `js/upload.js` (drag-drop), `js/api.js` (HTTP client), `js/results.js` (display), `js/page-viewer.js` (PDF inspector), `js/batch.js` (batch processing), `js/utils.js` (helpers)
+
+## Tech Stack
+- **Backend**: Python 3.11 (processing), PHP 8.2 (API)
+- **Frontend**: Vanilla HTML/CSS/JS
+- **Database**: PostgreSQL 16, Redis 7
+- **Infrastructure**: Docker, Nginx/Apache, PHP-FPM
+- **System deps**: Tesseract OCR, Poppler, Ghostscript
+
+## Configuration
+Environment variables via `.env` (see `.env.example`). Key settings:
+- `ANTHROPIC_API_KEY` / `GOOGLE_API_KEY` — AI API credentials
+- `DEV_MODE=true` — bypasses auth for localhost requests
+- `DB_HOST`, `DB_PORT`, `REDIS_HOST`, `REDIS_PORT` — infrastructure endpoints
+- Production uses ports 1220 (Redis) and 1221 (PostgreSQL) to avoid host conflicts
+
+## Testing
+- pytest with markers: `integration`, `slow`, `api`
+- Config in `pytest.ini`
+- Fixtures in `tests/conftest.py`
+- Sample PDFs in `Test_files/`
+- No linter currently configured
29
Dockerfile.cloudrun
Normal file
@@ -0,0 +1,29 @@
+FROM python:3.11-slim
+
+# Install system dependencies for PDF processing
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    tesseract-ocr \
+    tesseract-ocr-eng \
+    poppler-utils \
+    ghostscript \
+    libgl1 \
+    libglib2.0-0 \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /app
+
+# Install Python dependencies
+COPY requirements-cloudrun.txt .
+RUN pip install --no-cache-dir -r requirements-cloudrun.txt
+
+# Copy application code (no worker, redis_queue, or db_manager)
+COPY cloudrun_service.py .
+COPY enterprise_pdf_checker.py .
+COPY pdf_remediation.py .
+COPY logger_config.py .
+COPY retry_helper.py .
+
+# Cloud Run sets $PORT; gunicorn binds to it
+# --workers 1 --threads 1: Cloud Run concurrency=1, one request at a time
+# --timeout 900: allow up to 15 minutes for large PDFs
+CMD exec gunicorn --bind :$PORT --workers 1 --threads 1 --timeout 900 cloudrun_service:app
Dockerfile.web
@@ -4,12 +4,6 @@ FROM php:8.2-fpm-alpine
 RUN apk add --no-cache nginx python3 postgresql-dev && \
     docker-php-ext-install pdo pdo_pgsql
 
-# Install php-redis via PECL
-RUN apk add --no-cache --virtual .build-deps $PHPIZE_DEPS && \
-    pecl install redis && \
-    docker-php-ext-enable redis && \
-    apk del .build-deps
-
 # Copy Nginx config
 COPY nginx.conf /etc/nginx/http.d/default.conf
 
475
api.php
@@ -1,8 +1,10 @@
 <?php
 /**
  * Enterprise PDF Accessibility Checker - API Backend
- *
- * Handles file uploads, job processing, and result retrieval
+ *
+ * Handles file uploads, sends PDFs to Cloud Run for processing,
+ * and serves results. No Redis dependency — uses Cloud Run for
+ * processing and file-based rate limiting.
  */
 
 // Load .env file if getenv doesn't work (Apache doesn't set env vars by default)
@@ -29,45 +31,53 @@ define('PYTHON_SCRIPT', __DIR__ . '/enterprise_pdf_checker.py');
 define('MAX_FILE_SIZE', 50 * 1024 * 1024); // 50MB
 define('ALLOWED_EXTENSIONS', ['pdf']);
 
-// Redis configuration
-define('REDIS_HOST', getenv('REDIS_HOST') ?: 'localhost');
-define('REDIS_PORT', intval(getenv('REDIS_PORT') ?: 6379));
-define('REDIS_QUEUE', 'pdf:queue');
-define('REDIS_STATUS_PREFIX', 'pdf:status:');
-define('REDIS_RATE_PREFIX', 'pdf:rate:');
+// Cloud Run configuration
+define('CLOUD_RUN_URL', getenv('CLOUD_RUN_URL') ?: '');
+define('CLOUD_RUN_TIMEOUT', 900); // 15 minutes
+define('GCP_SA_KEY_PATH', getenv('GCP_SA_KEY_PATH') ?: __DIR__ . '/pdf-api-invoker-key.json');
+define('RATE_LIMIT_DIR', __DIR__ . '/rate_limits');
 
+// Database configuration
+define('DB_HOST', getenv('DB_HOST') ?: 'localhost');
+define('DB_PORT', intval(getenv('DB_PORT') ?: 5432));
+define('DB_NAME', getenv('DB_NAME') ?: 'pdf_checker');
+define('DB_USER', getenv('DB_USER') ?: 'pdf_checker');
+define('DB_PASSWORD', getenv('DB_PASSWORD') ?: 'dev_password');
+
 // Create directories if they don't exist
 if (!is_dir(UPLOAD_DIR)) mkdir(UPLOAD_DIR, 0755, true);
 if (!is_dir(RESULTS_DIR)) mkdir(RESULTS_DIR, 0755, true);
+if (!is_dir(RATE_LIMIT_DIR)) mkdir(RATE_LIMIT_DIR, 0755, true);
 
 /**
- * Get Redis connection (lazy singleton)
- */
-function getRedis() {
-    static $redis = null;
-    if ($redis === null) {
-        $redis = new Redis();
-        $redis->connect(REDIS_HOST, REDIS_PORT);
-    }
-    return $redis;
-}
-
-/**
- * Check rate limit via Redis. Returns true if allowed.
+ * Check rate limit via filesystem. Returns true if allowed.
+ * Stores timestamps in JSON files per IP+action.
  */
 function checkRateLimit($action, $limit, $window) {
-    try {
-        $redis = getRedis();
-        $ip = $_SERVER['REMOTE_ADDR'] ?? 'unknown';
-        $key = REDIS_RATE_PREFIX . $ip . ':' . $action;
-        $current = $redis->incr($key);
-        if ($current === 1) {
-            $redis->expire($key, $window);
+    $ip = $_SERVER['REMOTE_ADDR'] ?? 'unknown';
+    $key = preg_replace('/[^a-zA-Z0-9_-]/', '_', $ip . '_' . $action);
+    $file = RATE_LIMIT_DIR . '/' . $key . '.json';
+
+    $now = time();
+    $timestamps = [];
+
+    if (file_exists($file)) {
+        $data = json_decode(file_get_contents($file), true);
+        if (is_array($data)) {
+            // Filter to only timestamps within the window
+            $timestamps = array_filter($data, function($ts) use ($now, $window) {
+                return ($now - $ts) < $window;
+            });
         }
-        return $current <= $limit;
-    } catch (Exception $e) {
-        return true; // Allow if Redis is down
     }
+
+    if (count($timestamps) >= $limit) {
+        return false;
+    }
+
+    $timestamps[] = $now;
+    file_put_contents($file, json_encode(array_values($timestamps)));
+    return true;
 }
 
 /**
|
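Review note: the new `checkRateLimit()` is a sliding-window limiter that keeps one JSON file of timestamps per IP+action. The Python sketch below mirrors that logic so it can be exercised in isolation. It is an illustration under assumptions, not code from this commit: the function name, the `rate_dir` parameter, and the injectable `now` are hypothetical.

```python
import json
import re
import time
from pathlib import Path
from typing import List, Optional


def check_rate_limit(rate_dir: Path, ip: str, action: str,
                     limit: int, window: int,
                     now: Optional[float] = None) -> bool:
    """Return True if the request is allowed, recording it on disk if so."""
    now = time.time() if now is None else now

    # Same sanitisation as the PHP version: anything outside [a-zA-Z0-9_-]
    # becomes "_" so the key is safe to use as a file name.
    key = re.sub(r"[^a-zA-Z0-9_-]", "_", f"{ip}_{action}")
    path = rate_dir / f"{key}.json"

    timestamps: List[float] = []
    if path.exists():
        try:
            data = json.loads(path.read_text())
        except json.JSONDecodeError:
            data = None  # treat a corrupt file as "no history"
        if isinstance(data, list):
            # Keep only the timestamps that still fall inside the window.
            timestamps = [ts for ts in data if now - ts < window]

    if len(timestamps) >= limit:
        return False

    timestamps.append(now)
    path.write_text(json.dumps(timestamps))
    return True
```

Like the PHP function, this sketch does not lock the file, so two concurrent requests could both read the same history and both be allowed. If that matters, an exclusive lock around the read-modify-write (for example `fcntl.flock` on POSIX) would be one option.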
@ -80,6 +90,171 @@ function sanitizeJobId($job_id) {
|
|||
return $job_id;
|
||||
}
|
||||
|
||||
/**
|
||||
* Get an OIDC identity token for authenticating to Cloud Run.
|
||||
* Uses a GCP service account key to create a self-signed JWT,
|
||||
* then exchanges it for an identity token via Google's OAuth endpoint.
|
||||
*/
|
||||
function getCloudRunToken() {
|
||||
static $cachedToken = null;
|
||||
static $cachedExpiry = 0;
|
||||
|
||||
// Return cached token if still valid (with 5-min buffer)
|
||||
if ($cachedToken && time() < ($cachedExpiry - 300)) {
|
||||
return $cachedToken;
|
||||
}
|
||||
|
||||
$keyPath = GCP_SA_KEY_PATH;
|
||||
if (!file_exists($keyPath)) {
|
||||
throw new Exception("GCP service account key not found: $keyPath");
|
||||
}
|
||||
|
||||
$sa = json_decode(file_get_contents($keyPath), true);
|
||||
if (!$sa || !isset($sa['client_email']) || !isset($sa['private_key'])) {
|
||||
throw new Exception("Invalid service account key file");
|
||||
}
|
||||
|
||||
$now = time();
|
||||
$expiry = $now + 3600;
|
||||
|
||||
// Build JWT header and claims
|
||||
$header = base64url_encode(json_encode(['alg' => 'RS256', 'typ' => 'JWT']));
|
||||
$claims = base64url_encode(json_encode([
|
||||
'iss' => $sa['client_email'],
|
||||
'sub' => $sa['client_email'],
|
||||
'aud' => 'https://oauth2.googleapis.com/token',
|
||||
'iat' => $now,
|
||||
'exp' => $expiry,
|
||||
'target_audience' => CLOUD_RUN_URL,
|
||||
]));
|
||||
|
||||
// Sign with RSA-SHA256
|
||||
$signingInput = "$header.$claims";
|
||||
$signature = '';
|
||||
$privateKey = openssl_pkey_get_private($sa['private_key']);
|
||||
if (!$privateKey) {
|
||||
throw new Exception("Failed to parse service account private key");
|
||||
}
|
||||
openssl_sign($signingInput, $signature, $privateKey, OPENSSL_ALGO_SHA256);
|
||||
$jwt = $signingInput . '.' . base64url_encode($signature);
|
||||
|
||||
// Exchange JWT for identity token
|
||||
$ch = curl_init('https://oauth2.googleapis.com/token');
|
||||
curl_setopt_array($ch, [
|
||||
CURLOPT_POST => true,
|
||||
CURLOPT_POSTFIELDS => http_build_query([
|
||||
'grant_type' => 'urn:ietf:params:oauth:grant-type:jwt-bearer',
|
||||
'assertion' => $jwt,
|
||||
]),
|
||||
CURLOPT_RETURNTRANSFER => true,
|
||||
CURLOPT_TIMEOUT => 10,
|
||||
]);
|
||||
$response = curl_exec($ch);
|
||||
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
|
||||
curl_close($ch);
|
||||
|
||||
if ($httpCode !== 200) {
|
||||
throw new Exception("Failed to get identity token: HTTP $httpCode - $response");
|
||||
}
|
||||
|
||||
$tokenData = json_decode($response, true);
|
||||
if (!isset($tokenData['id_token'])) {
|
||||
throw new Exception("No id_token in response: $response");
|
||||
}
|
||||
|
||||
$cachedToken = $tokenData['id_token'];
|
||||
$cachedExpiry = $expiry;
|
||||
|
||||
return $cachedToken;
|
||||
}
|
||||
|
||||
/**
|
||||
* Base64url encode (no padding, URL-safe)
|
||||
*/
|
||||
function base64url_encode($data) {
|
||||
return rtrim(strtr(base64_encode($data), '+/', '-_'), '=');
|
||||
}
|
||||
|
||||
/**
|
||||
* Get PostgreSQL PDO connection (lazy singleton)
|
||||
*/
|
||||
function getDB() {
|
||||
static $pdo = null;
|
||||
if ($pdo === null) {
|
||||
$dsn = sprintf('pgsql:host=%s;port=%d;dbname=%s', DB_HOST, DB_PORT, DB_NAME);
|
||||
$pdo = new PDO($dsn, DB_USER, DB_PASSWORD, [
|
||||
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
|
||||
]);
|
||||
}
|
||||
return $pdo;
|
||||
}
|
||||
|
||||
/**
|
||||
* Insert or update a job record in PostgreSQL
|
||||
*/
|
||||
function updateJobInDatabase($job_id, $filename, $status, $results = null) {
|
||||
try {
|
||||
$pdo = getDB();
|
||||
|
||||
$score = null;
|
||||
$grade = null;
|
||||
$total_issues = null;
|
||||
$critical_count = null;
|
||||
$error_count = null;
|
||||
$warning_count = null;
|
||||
$result_json = null;
|
||||
$processing_time = null;
|
||||
|
||||
if ($results) {
|
||||
$score = $results['accessibility_score'] ?? null;
|
||||
$grade = $results['grade'] ?? null;
|
||||
$issues = $results['issues'] ?? [];
|
||||
$total_issues = count($issues);
|
||||
$critical_count = count(array_filter($issues, fn($i) => ($i['severity'] ?? '') === 'CRITICAL'));
|
||||
$error_count = count(array_filter($issues, fn($i) => ($i['severity'] ?? '') === 'ERROR'));
|
||||
$warning_count = count(array_filter($issues, fn($i) => ($i['severity'] ?? '') === 'WARNING'));
|
||||
$result_json = json_encode($results);
|
||||
$processing_time = $results['stats']['processing_time'] ?? null;
|
||||
}
|
||||
|
||||
$sql = "INSERT INTO jobs (job_id, filename, status, score, grade, total_issues,
|
||||
critical_count, error_count, warning_count, result_json, processing_time,
|
||||
completed_at)
|
||||
VALUES (:job_id, :filename, :status, :score, :grade, :total_issues,
|
||||
:critical_count, :error_count, :warning_count, :result_json::jsonb, :processing_time,
|
||||
CASE WHEN :status2 = 'completed' THEN NOW() ELSE NULL END)
|
||||
ON CONFLICT (job_id) DO UPDATE SET
|
||||
status = EXCLUDED.status,
|
||||
score = COALESCE(EXCLUDED.score, jobs.score),
|
||||
grade = COALESCE(EXCLUDED.grade, jobs.grade),
|
||||
total_issues = COALESCE(EXCLUDED.total_issues, jobs.total_issues),
|
||||
critical_count = COALESCE(EXCLUDED.critical_count, jobs.critical_count),
|
||||
error_count = COALESCE(EXCLUDED.error_count, jobs.error_count),
|
||||
warning_count = COALESCE(EXCLUDED.warning_count, jobs.warning_count),
|
||||
result_json = COALESCE(EXCLUDED.result_json, jobs.result_json),
|
||||
processing_time = COALESCE(EXCLUDED.processing_time, jobs.processing_time),
|
||||
completed_at = CASE WHEN EXCLUDED.status = 'completed' THEN NOW() ELSE jobs.completed_at END";
|
||||
|
||||
$stmt = $pdo->prepare($sql);
|
||||
$stmt->execute([
|
||||
':job_id' => $job_id,
|
||||
':filename' => $filename,
|
||||
':status' => $status,
|
||||
':score' => $score,
|
||||
':grade' => $grade,
|
||||
':total_issues' => $total_issues,
|
||||
':critical_count' => $critical_count,
|
||||
':error_count' => $error_count,
|
||||
':warning_count' => $warning_count,
|
||||
':result_json' => $result_json,
|
||||
':processing_time' => $processing_time,
|
||||
':status2' => $status,
|
||||
]);
|
||||
} catch (Exception $e) {
|
||||
error_log("DB update failed for $job_id: " . $e->getMessage());
|
||||
}
|
||||
}
|
||||
|
||||
// CORS headers for API
|
||||
$allowed_origins = [
|
||||
'https://ai-sandbox.oliver.solutions',
|
||||
|
|
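Review note: `getCloudRunToken()` builds a self-signed JWT whose `target_audience` claim is the Cloud Run URL, then exchanges it at Google's token endpoint for an `id_token`. The sketch below reproduces only the unsigned part (the base64url encoding and the header and claims) in Python using the standard library. The function names are hypothetical, and the RS256 signing step is deliberately left out because it needs a third-party crypto library.

```python
import base64
import json
import time
from typing import Optional

TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token"


def base64url_encode(data: bytes) -> str:
    """URL-safe base64 without '=' padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_signing_input(client_email: str, target_audience: str,
                        now: Optional[int] = None,
                        lifetime: int = 3600) -> str:
    """Return '<header>.<claims>', the string that gets RS256-signed."""
    now = int(time.time()) if now is None else now
    header = {"alg": "RS256", "typ": "JWT"}
    claims = {
        "iss": client_email,
        "sub": client_email,
        "aud": TOKEN_ENDPOINT,               # the JWT is addressed to the token endpoint
        "iat": now,
        "exp": now + lifetime,
        "target_audience": target_audience,  # the Cloud Run URL the id_token is for
    }

    def encode(obj: dict) -> str:
        return base64url_encode(json.dumps(obj, separators=(",", ":")).encode("utf-8"))

    return f"{encode(header)}.{encode(claims)}"
```

One point that may be worth checking in the PHP version: `static` variables live only for the duration of a single request, so the token cache helps only if `getCloudRunToken()` is called more than once within that request.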
@ -173,18 +348,18 @@ function handleUpload() {
|
|||
if (!isset($_FILES['pdf'])) {
|
||||
error('No file uploaded');
|
||||
}
|
||||
|
||||
|
||||
$file = $_FILES['pdf'];
|
||||
|
||||
|
||||
// Validate file
|
||||
if ($file['error'] !== UPLOAD_ERR_OK) {
|
||||
error('Upload error: ' . $file['error']);
|
||||
}
|
||||
|
||||
|
||||
if ($file['size'] > MAX_FILE_SIZE) {
|
||||
error('File too large. Max size: ' . (MAX_FILE_SIZE / 1024 / 1024) . 'MB');
|
||||
}
|
||||
|
||||
|
||||
$ext = strtolower(pathinfo($file['name'], PATHINFO_EXTENSION));
|
||||
if (!in_array($ext, ALLOWED_EXTENSIONS)) {
|
||||
error('Invalid file type. Only PDF files allowed.');
|
||||
|
|
@ -200,12 +375,12 @@ function handleUpload() {
|
|||
$job_id = 'pdf_' . bin2hex(random_bytes(16));
|
||||
$filename = $job_id . '.pdf';
|
||||
$filepath = UPLOAD_DIR . '/' . $filename;
|
||||
|
||||
|
||||
// Move file
|
||||
if (!move_uploaded_file($file['tmp_name'], $filepath)) {
|
||||
error('Failed to save file');
|
||||
}
|
||||
|
||||
|
||||
// Create job metadata
|
||||
$job_data = [
|
||||
'job_id' => $job_id,
|
||||
|
|
@ -215,12 +390,12 @@ function handleUpload() {
|
|||
'status' => 'uploaded',
|
||||
'filepath' => $filepath
|
||||
];
|
||||
|
||||
|
||||
file_put_contents(
|
||||
RESULTS_DIR . '/' . $job_id . '.meta.json',
|
||||
json_encode($job_data, JSON_PRETTY_PRINT)
|
||||
);
|
||||
|
||||
|
||||
success([
|
||||
'job_id' => $job_id,
|
||||
'filename' => $file['name'],
|
||||
|
|
@ -229,9 +404,11 @@ function handleUpload() {
|
|||
}
|
||||
|
||||
/**
|
||||
* Handle PDF accessibility check — push job to Redis queue
|
||||
* Handle PDF accessibility check — send PDF to Cloud Run synchronously
|
||||
*/
|
||||
function handleCheck() {
|
||||
set_time_limit(900); // Allow up to 15 minutes
|
||||
|
||||
$job_id = $_POST['job_id'] ?? '';
|
||||
|
||||
if (empty($job_id)) {
|
||||
|
|
@ -253,32 +430,98 @@ function handleCheck() {
|
|||
}
|
||||
|
||||
$job_data = json_decode(file_get_contents($meta_file), true);
|
||||
|
||||
$quick_mode = $_POST['quick_mode'] ?? false;
|
||||
|
||||
// Push job to Redis queue for worker processing
|
||||
try {
|
||||
$redis = getRedis();
|
||||
$payload = json_encode([
|
||||
'job_id' => $job_id,
|
||||
'pdf_path' => $job_data['filepath'],
|
||||
'original_filename' => $job_data['original_filename'] ?? '',
|
||||
'options' => [
|
||||
'quick_mode' => (bool)$quick_mode,
|
||||
],
|
||||
'queued_at' => time()
|
||||
]);
|
||||
$redis->lPush(REDIS_QUEUE, $payload);
|
||||
// Update meta to processing
|
||||
$job_data['status'] = 'processing';
|
||||
$job_data['started_at'] = date('Y-m-d H:i:s');
|
||||
file_put_contents($meta_file, json_encode($job_data, JSON_PRETTY_PRINT));
|
||||
|
||||
// Set initial status in Redis
|
||||
$redis->setex(REDIS_STATUS_PREFIX . $job_id, 86400, json_encode([
|
||||
'status' => 'queued',
|
||||
'progress' => 0,
|
||||
'message' => 'Waiting in queue',
|
||||
'updated_at' => time()
|
||||
]));
|
||||
} catch (Exception $e) {
|
||||
// Fallback to direct exec if Redis is unavailable (local dev without Docker)
|
||||
// If Cloud Run URL is configured, send to Cloud Run
|
||||
if (!empty(CLOUD_RUN_URL)) {
|
||||
try {
|
||||
$token = getCloudRunToken();
|
||||
$pdf_path = $job_data['filepath'];
|
||||
|
||||
if (!file_exists($pdf_path)) {
|
||||
error('PDF file not found on server');
|
||||
}
|
||||
|
||||
// Build multipart POST to Cloud Run
|
||||
$ch = curl_init(CLOUD_RUN_URL . '/check');
|
||||
$postFields = [
|
||||
'pdf' => new CURLFile($pdf_path, 'application/pdf', basename($pdf_path)),
|
||||
'job_id' => $job_id,
|
||||
'quick_mode' => $quick_mode ? 'true' : 'false',
|
||||
'original_filename' => $job_data['original_filename'] ?? '',
|
||||
];
|
||||
|
||||
curl_setopt_array($ch, [
|
||||
CURLOPT_POST => true,
|
||||
CURLOPT_POSTFIELDS => $postFields,
|
||||
CURLOPT_RETURNTRANSFER => true,
|
||||
CURLOPT_TIMEOUT => CLOUD_RUN_TIMEOUT,
|
||||
CURLOPT_HTTPHEADER => [
|
||||
'Authorization: Bearer ' . $token,
|
||||
],
|
||||
]);
|
||||
|
||||
$response = curl_exec($ch);
|
||||
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
|
||||
$curlError = curl_error($ch);
|
||||
curl_close($ch);
|
||||
|
||||
if ($curlError) {
|
||||
throw new Exception("Cloud Run request failed: $curlError");
|
||||
}
|
||||
|
||||
if ($httpCode !== 200) {
|
||||
$errorBody = json_decode($response, true);
|
||||
$errorMsg = $errorBody['error'] ?? "HTTP $httpCode";
|
||||
throw new Exception("Cloud Run returned error: $errorMsg");
|
||||
}
|
||||
|
||||
$result = json_decode($response, true);
|
||||
if (!$result || !isset($result['success'])) {
|
||||
throw new Exception("Invalid response from Cloud Run");
|
||||
}
|
||||
|
||||
if (!$result['success']) {
|
||||
throw new Exception($result['error'] ?? 'Unknown Cloud Run error');
|
||||
}
|
||||
|
||||
$checkResult = $result['data'];
|
||||
|
||||
// Write result JSON to disk
|
||||
$result_file = RESULTS_DIR . '/' . $job_id . '.result.json';
|
||||
file_put_contents($result_file, json_encode($checkResult, JSON_PRETTY_PRINT));
|
||||
|
||||
// Update meta
|
||||
$job_data['status'] = 'completed';
|
||||
$job_data['completed_at'] = date('Y-m-d H:i:s');
|
||||
file_put_contents($meta_file, json_encode($job_data, JSON_PRETTY_PRINT));
|
||||
|
||||
// Update PostgreSQL
|
||||
updateJobInDatabase($job_id, $job_data['original_filename'] ?? '', 'completed', $checkResult);
|
||||
|
||||
success([
|
||||
'job_id' => $job_id,
|
||||
'status' => 'completed',
|
||||
'message' => 'Check completed'
|
||||
]);
|
||||
|
||||
} catch (Exception $e) {
|
||||
// Mark as failed
|
||||
$job_data['status'] = 'failed';
|
||||
$job_data['error'] = $e->getMessage();
|
||||
file_put_contents($meta_file, json_encode($job_data, JSON_PRETTY_PRINT));
|
||||
|
||||
updateJobInDatabase($job_id, $job_data['original_filename'] ?? '', 'failed');
|
||||
|
||||
error('Processing failed: ' . $e->getMessage());
|
||||
}
|
||||
} else {
|
||||
// Fallback to local exec (development without Cloud Run)
|
||||
$pdf_path = $job_data['filepath'];
|
||||
$output_path = RESULTS_DIR . '/' . $job_id . '.result.json';
|
||||
$venv_python = __DIR__ . '/venv/bin/python3';
|
||||
|
|
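Review note: the Cloud Run branch of `handleCheck()` treats the response as an envelope with `success`, `data`, and `error` keys and raises on anything else. The Python sketch below restates those checks in the same order so that each failure path is easy to see. `CloudRunError` and `parse_check_response` are hypothetical names, and the explicit check for a missing `data` key is an addition that the PHP code does not make.

```python
import json
from typing import Any, Dict


class CloudRunError(Exception):
    """Raised when the Cloud Run response cannot be used (hypothetical type)."""


def parse_check_response(http_code: int, body: str) -> Dict[str, Any]:
    """Validate the {success, data, error} envelope and return the data part."""
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        payload = None

    if http_code != 200:
        # Prefer the service's own error message, else fall back to the status.
        message = payload.get("error") if isinstance(payload, dict) else None
        detail = message or f"HTTP {http_code}"
        raise CloudRunError(f"Cloud Run returned error: {detail}")

    if not isinstance(payload, dict) or "success" not in payload:
        raise CloudRunError("Invalid response from Cloud Run")

    if not payload["success"]:
        raise CloudRunError(payload.get("error") or "Unknown Cloud Run error")

    if "data" not in payload:
        # Assumption: a successful response without data is treated as invalid.
        raise CloudRunError("Cloud Run response has no data")

    return payload["data"]
```

Keeping the validation in one function means the caller only has to handle a single exception type before marking the job as failed.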
@ -312,22 +555,17 @@ function handleCheck() {
|
|||
$error_log = RESULTS_DIR . '/' . $job_id . '.error.log';
|
||||
$cmd .= ' > ' . escapeshellarg($error_log) . ' 2>&1 &';
|
||||
exec($cmd, $output, $return_code);
|
||||
|
||||
success([
|
||||
'job_id' => $job_id,
|
||||
'status' => 'processing',
|
||||
'message' => 'Check started (local mode)'
|
||||
]);
|
||||
}
|
||||
|
||||
// Update meta file
|
||||
$job_data['status'] = 'queued';
|
||||
$job_data['started_at'] = date('Y-m-d H:i:s');
|
||||
file_put_contents($meta_file, json_encode($job_data, JSON_PRETTY_PRINT));
|
||||
|
||||
success([
|
||||
'job_id' => $job_id,
|
||||
'status' => 'queued',
|
||||
'message' => 'Check queued for processing'
|
||||
]);
|
||||
}
|
||||
|
||||
/**
|
||||
* Check job status — reads from Redis (real-time) with file fallback
|
||||
* Check job status — pure file-based
|
||||
*/
|
||||
function handleStatus() {
|
||||
$job_id = $_GET['job_id'] ?? '';
|
||||
|
|
@ -347,30 +585,15 @@ function handleStatus() {
|
|||
|
||||
$job_data = json_decode(file_get_contents($meta_file), true);
|
||||
|
||||
// Try Redis first for real-time progress
|
||||
try {
|
||||
$redis = getRedis();
|
||||
$redis_status = $redis->get(REDIS_STATUS_PREFIX . $job_id);
|
||||
if ($redis_status) {
|
||||
$status_data = json_decode($redis_status, true);
|
||||
$job_data['status'] = $status_data['status'];
|
||||
$job_data['progress'] = $status_data['progress'] ?? 0;
|
||||
$job_data['status_message'] = $status_data['message'] ?? '';
|
||||
}
|
||||
} catch (Exception $e) {
|
||||
// Redis unavailable — fall through to file-based check
|
||||
}
|
||||
|
||||
// File-based fallback: check if result exists
|
||||
// Check if result file exists (definitive completion signal)
|
||||
if (file_exists($result_file)) {
|
||||
$job_data['status'] = 'completed';
|
||||
$job_data['completed_at'] = date('Y-m-d H:i:s', filemtime($result_file));
|
||||
file_put_contents($meta_file, json_encode($job_data, JSON_PRETTY_PRINT));
|
||||
} else if (file_exists($error_log) && $job_data['status'] === 'processing') {
|
||||
$job_data['completed_at'] = $job_data['completed_at'] ?? date('Y-m-d H:i:s', filemtime($result_file));
|
||||
} else if (file_exists($error_log) && in_array($job_data['status'], ['processing', 'queued'])) {
|
||||
$error_content = file_get_contents($error_log);
|
||||
if (!empty($error_content)) {
|
||||
$started = strtotime($job_data['started_at'] ?? 'now');
|
||||
if (time() - $started > 300) {
|
||||
if (time() - $started > 900) {
|
||||
$job_data['status'] = 'failed';
|
||||
$job_data['error'] = 'Process timeout or error';
|
||||
$job_data['error_log'] = substr($error_content, -1000);
|
||||
|
|
@ -391,15 +614,15 @@ function handleResult() {
|
|||
error('Job ID required');
|
||||
}
|
||||
$job_id = sanitizeJobId($job_id);
|
||||
|
||||
|
||||
$result_file = RESULTS_DIR . '/' . $job_id . '.result.json';
|
||||
|
||||
|
||||
if (!file_exists($result_file)) {
|
||||
error('Results not found. Check may still be processing.');
|
||||
}
|
||||
|
||||
|
||||
$result = json_decode(file_get_contents($result_file), true);
|
||||
|
||||
|
||||
success($result);
|
||||
}
|
||||
|
||||
|
|
@ -408,26 +631,26 @@ function handleResult() {
|
|||
*/
|
||||
function handleList() {
|
||||
$jobs = [];
|
||||
|
||||
|
||||
$files = glob(RESULTS_DIR . '/*.meta.json');
|
||||
|
||||
|
||||
foreach ($files as $file) {
|
||||
$job_data = json_decode(file_get_contents($file), true);
|
||||
|
||||
|
||||
// Check if completed
|
||||
$result_file = str_replace('.meta.json', '.result.json', $file);
|
||||
if (file_exists($result_file)) {
|
||||
$job_data['status'] = 'completed';
|
||||
}
|
||||
|
||||
|
||||
$jobs[] = $job_data;
|
||||
}
|
||||
|
||||
|
||||
// Sort by upload time (newest first)
|
||||
usort($jobs, function($a, $b) {
|
||||
return strtotime($b['uploaded_at']) - strtotime($a['uploaded_at']);
|
||||
});
|
||||
|
||||
|
||||
success(['jobs' => $jobs]);
|
||||
}
|
||||
|
||||
|
|
@ -441,20 +664,20 @@ function handleDelete() {
|
|||
error('Job ID required');
|
||||
}
|
||||
$job_id = sanitizeJobId($job_id);
|
||||
|
||||
|
||||
$meta_file = RESULTS_DIR . '/' . $job_id . '.meta.json';
|
||||
|
||||
|
||||
if (!file_exists($meta_file)) {
|
||||
error('Job not found');
|
||||
}
|
||||
|
||||
|
||||
$job_data = json_decode(file_get_contents($meta_file), true);
|
||||
|
||||
|
||||
// Delete files
|
||||
@unlink($job_data['filepath']);
|
||||
@unlink($meta_file);
|
||||
@unlink(RESULTS_DIR . '/' . $job_id . '.result.json');
|
||||
|
||||
|
||||
success(['message' => 'Job deleted']);
|
||||
}
|
||||
|
||||
|
|
@ -484,6 +707,7 @@ function handleDebug() {
|
|||
'meta_exists' => file_exists($meta_file),
|
||||
'result_exists' => file_exists($result_file),
|
||||
'error_log_exists' => file_exists($error_log),
|
||||
'cloud_run_url' => CLOUD_RUN_URL ?: '(not configured — local mode)',
|
||||
'files' => []
|
||||
];
|
||||
|
||||
|
|
@ -508,7 +732,7 @@ function handleDebug() {
|
|||
}
|
||||
|
||||
/**
|
||||
* Serve page images
|
||||
* Serve page images — redirect to GCS URL or serve local file
|
||||
*/
|
||||
function handleImage() {
|
||||
$job_id = $_GET['job_id'] ?? '';
|
||||
|
|
@ -518,10 +742,28 @@ function handleImage() {
|
|||
error('Job ID and page number required');
|
||||
}
|
||||
$job_id = sanitizeJobId($job_id);
|
||||
$page_num = intval($page_num);
|
||||
|
||||
// Find the image file
|
||||
// Check result JSON for GCS URLs
|
||||
$result_file = RESULTS_DIR . '/' . $job_id . '.result.json';
|
||||
if (file_exists($result_file)) {
|
||||
$result = json_decode(file_get_contents($result_file), true);
|
||||
$page_images = $result['page_images'] ?? [];
|
||||
|
||||
// Check if the page image value is a URL (GCS)
|
||||
$image_value = $page_images[$page_num] ?? $page_images[strval($page_num)] ?? null;
|
||||
if ($image_value && (strpos($image_value, 'http://') === 0 || strpos($image_value, 'https://') === 0)) {
|
||||
// Redirect to GCS URL
|
||||
header('HTTP/1.1 302 Found');
|
||||
header('Location: ' . $image_value);
|
||||
header('Cache-Control: public, max-age=86400');
|
||||
exit;
|
||||
}
|
||||
}
|
||||
|
||||
// Fallback: serve local image file
|
||||
$images_dir = RESULTS_DIR . '/' . $job_id . '.result_images';
|
||||
$image_file = $images_dir . '/page_' . intval($page_num) . '.png';
|
||||
$image_file = $images_dir . '/page_' . $page_num . '.png';
|
||||
|
||||
if (!file_exists($image_file)) {
|
||||
http_response_code(404);
|
||||
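Note on the page lookup in this hunk: handleImage() tries both `$page_images[$page_num]` and `$page_images[strval($page_num)]`. One plausible reason is that cloudrun_service.py builds `page_images` with integer page numbers, while JSON object keys are always strings once serialized. The sketch below illustrates that round trip in Python. The bucket name, job ID, and the helper names `lookup_page` and `is_remote` are hypothetical and only mirror the logic shown in the diff.

```python
import json

# Hypothetical mapping shaped like the one upload_images_to_gcs() returns.
page_images = {1: "https://storage.googleapis.com/example-bucket/job123/page_1.png"}

# Serializing turns the integer key 1 into the string key "1".
decoded = json.loads(json.dumps(page_images))


def lookup_page(images: dict, page_num: int):
    """Try the integer key first, then its string form (like PHP's `??` chain)."""
    value = images.get(page_num)
    return value if value is not None else images.get(str(page_num))


def is_remote(value) -> bool:
    """True when the stored value is an absolute http(s) URL rather than a local filename."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))


url = lookup_page(decoded, 1)
print(list(decoded.keys()), is_remote(url))
```

A value that fails the `is_remote` check corresponds to the local-file fallback path in handleImage().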
|
|
@@ -657,7 +899,6 @@ function handleStats() {
|
|||
'completed' => 0,
|
||||
'failed' => 0,
|
||||
'processing' => 0,
|
||||
'queue_length' => 0
|
||||
];
|
||||
|
||||
// Count jobs from meta files
|
||||
|
|
@@ -675,14 +916,6 @@ function handleStats() {
|
|||
}
|
||||
}
|
||||
|
||||
// Get queue length from Redis
|
||||
try {
|
||||
$redis = getRedis();
|
||||
$stats['queue_length'] = $redis->lLen(REDIS_QUEUE);
|
||||
} catch (Exception $e) {
|
||||
// Redis unavailable
|
||||
}
|
||||
|
||||
success($stats);
|
||||
}
|
||||
|
||||
|
|
|
|||
15
cleanup.py
|
|
@@ -2,8 +2,9 @@
|
|||
"""
|
||||
PDF Accessibility Checker — File Cleanup
|
||||
|
||||
Deletes uploaded PDFs, result JSON files, result images, and error logs
|
||||
older than RETENTION_HOURS (default 24h).
|
||||
Deletes uploaded PDFs, result JSON files, error logs, and rate limit files
|
||||
older than RETENTION_HOURS (default 24h). Page images are on GCS with
|
||||
a 7-day lifecycle policy.
|
||||
|
||||
Usage:
|
||||
python cleanup.py # dry-run (show what would be deleted)
|
||||
|
|
@@ -28,6 +29,7 @@ logger = logging.getLogger('cleanup')
|
|||
|
||||
UPLOADS_DIR = Path(os.getenv('UPLOADS_DIR', '/opt/pdf-accessibility/uploads'))
|
||||
RESULTS_DIR = Path(os.getenv('RESULTS_DIR', '/opt/pdf-accessibility/results'))
|
||||
RATE_LIMIT_DIR = Path(os.getenv('RATE_LIMIT_DIR', '/opt/pdf-accessibility/rate_limits'))
|
||||
RETENTION_HOURS = int(os.getenv('RETENTION_HOURS', '24'))
|
||||
|
||||
|
||||
|
|
@@ -109,8 +111,13 @@ def main():
|
|||
total_deleted += d
|
||||
total_freed += f
|
||||
|
||||
# Clean results (JSON, error logs, image directories)
|
||||
d, f = cleanup_directory(RESULTS_DIR, ['*.result.json', '*.error.log', '*.result_images'], dry_run)
|
||||
# Clean results (JSON, error logs — page images are on GCS with 7-day lifecycle)
|
||||
d, f = cleanup_directory(RESULTS_DIR, ['*.result.json', '*.error.log', '*.meta.json'], dry_run)
|
||||
total_deleted += d
|
||||
total_freed += f
|
||||
|
||||
# Clean rate limit files
|
||||
d, f = cleanup_directory(RATE_LIMIT_DIR, ['*.json'], dry_run)
|
||||
total_deleted += d
|
||||
total_freed += f
|
||||
|
||||
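The body of `cleanup_directory()` is not part of this diff. The hunk only shows that it takes a directory, a list of glob patterns, and a dry-run flag, and returns a `(deleted, freed)` pair. A minimal sketch of what such a retention-based helper might look like follows. The name `cleanup_directory_sketch`, the `retention_hours` and `now` parameters, and the files-only behavior are assumptions, not the project's actual implementation.

```python
import time
from pathlib import Path


def cleanup_directory_sketch(directory: Path, patterns, dry_run: bool,
                             retention_hours: int = 24, now=None):
    """Return (file_count, bytes) for files older than the retention window.

    In dry-run mode nothing is deleted; the counts describe what would be removed.
    """
    if not directory.is_dir():
        return 0, 0

    now = time.time() if now is None else now
    cutoff = now - retention_hours * 3600
    deleted = 0
    freed = 0

    for pattern in patterns:
        for path in directory.glob(pattern):
            if not path.is_file():
                continue  # this sketch ignores directories
            stat = path.stat()
            if stat.st_mtime >= cutoff:
                continue  # still inside the retention window
            if not dry_run:
                path.unlink()
            deleted += 1
            freed += stat.st_size

    return deleted, freed
```

Called with the patterns from this hunk, for example `cleanup_directory_sketch(Path('/opt/pdf-accessibility/rate_limits'), ['*.json'], dry_run=True)`, it would report how many rate limit files are past retention without touching them.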
|
|
|
|||
14
cloudbuild.yaml
Normal file
|
|
@@ -0,0 +1,14 @@
|
|||
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '-t'
      - 'us-central1-docker.pkg.dev/optical-414516/pdf-accessibility/checker:latest'
      - '-f'
      - 'Dockerfile.cloudrun'
      - '.'

images:
  - 'us-central1-docker.pkg.dev/optical-414516/pdf-accessibility/checker:latest'

timeout: '600s'
|
||||
136
cloudrun_service.py
Normal file
|
|
@@ -0,0 +1,136 @@
|
|||
#!/usr/bin/env python3
"""
PDF Accessibility Checker — Cloud Run HTTP Service

Flask app wrapping EnterprisePDFChecker for serverless execution.
Receives PDF via multipart POST, runs checks, uploads page images to GCS,
returns full result JSON.
"""

import os
import json
import tempfile
import logging
from pathlib import Path

from flask import Flask, request, jsonify
from google.cloud import storage

from enterprise_pdf_checker import EnterprisePDFChecker

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [cloudrun] %(levelname)s: %(message)s'
)
logger = logging.getLogger('cloudrun')

app = Flask(__name__)

GCS_BUCKET_NAME = os.getenv('GCS_BUCKET_NAME', 'optical-pdf-images')


def upload_images_to_gcs(images_dir: Path, job_id: str) -> dict:
    """Upload page images to GCS and return {page_num: public_url} mapping."""
    client = storage.Client()
    bucket = client.bucket(GCS_BUCKET_NAME)
    page_images = {}

    for image_file in sorted(images_dir.glob('page_*.png')):
        # Extract page number from filename (page_1.png -> 1)
        page_num = int(image_file.stem.split('_')[1])
        blob_name = f"{job_id}/{image_file.name}"
        blob = bucket.blob(blob_name)
        blob.upload_from_filename(str(image_file), content_type='image/png')
        # Bucket has uniform bucket-level access with allUsers objectViewer,
        # so objects are public by default — no need for blob.make_public()
        public_url = f"https://storage.googleapis.com/{GCS_BUCKET_NAME}/{blob_name}"
        page_images[page_num] = public_url
        logger.info("Uploaded %s -> %s", image_file.name, public_url)

    return page_images


@app.route('/check', methods=['POST'])
def check_pdf():
    """Accept multipart PDF upload, run accessibility checks, return results."""
    pdf_file = request.files.get('pdf')
    if not pdf_file:
        return jsonify({'success': False, 'error': 'No PDF file provided'}), 400

    job_id = request.form.get('job_id', 'unknown')
    quick_mode = request.form.get('quick_mode', 'false').lower() in ('true', '1', 'yes')
    original_filename = request.form.get('original_filename', pdf_file.filename or 'document.pdf')

    logger.info("Received job %s: %s (quick=%s)", job_id, original_filename, quick_mode)

    tmp_pdf = None
    images_dir = None

    try:
        # Save uploaded PDF to temp file
        tmp_pdf = tempfile.NamedTemporaryFile(suffix='.pdf', delete=False)
        pdf_file.save(tmp_pdf)
        tmp_pdf.close()

        # Run accessibility checks
        config = {
            'anthropic_api_key': os.getenv('ANTHROPIC_API_KEY'),
            'google_api_key': os.getenv('GOOGLE_API_KEY'),
        }

        checker = EnterprisePDFChecker(tmp_pdf.name, config, quick_mode=quick_mode)
        checker.check_all()

        # Generate page images to a temp directory
        images_dir = tempfile.mkdtemp(prefix='pdf_images_')
        images_path = Path(images_dir)
        checker._generate_page_images(images_path)

        # Get results before uploading images (page_images has local filenames)
        results = checker.to_dict()

        # Upload images to GCS and replace local filenames with public URLs
        if checker.page_images:
            gcs_urls = upload_images_to_gcs(images_path, job_id)
            results['page_images'] = gcs_urls

        # Add grade based on score
        score = results.get('accessibility_score', 0)
        if score >= 90:
            results['grade'] = 'A'
        elif score >= 80:
            results['grade'] = 'B'
        elif score >= 70:
            results['grade'] = 'C'
        elif score >= 60:
            results['grade'] = 'D'
        else:
            results['grade'] = 'F'

        logger.info("Job %s completed: score=%s grade=%s issues=%d",
                    job_id, results['accessibility_score'],
                    results['grade'], results['total_issues'])

        return jsonify({'success': True, 'data': results})

    except Exception as e:
        logger.error("Job %s failed: %s", job_id, str(e), exc_info=True)
        return jsonify({'success': False, 'error': str(e)}), 500

    finally:
        # Clean up temp files
        if tmp_pdf and os.path.exists(tmp_pdf.name):
            os.unlink(tmp_pdf.name)
        if images_dir and os.path.exists(images_dir):
            import shutil
            shutil.rmtree(images_dir, ignore_errors=True)


@app.route('/health', methods=['GET'])
def health():
    return jsonify({'status': 'ok'})


if __name__ == '__main__':
    port = int(os.getenv('PORT', 8080))
    app.run(host='0.0.0.0', port=port, debug=False)
|
||||
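The score-to-grade ladder inside `check_pdf()` can also be read as a small lookup table. The sketch below restates the same thresholds from the diff (90, 80, 70, 60) as a standalone function, which might make them easier to unit test. The names `GRADE_THRESHOLDS` and `score_to_grade` are hypothetical and do not appear in the commit.

```python
# Thresholds copied from the if/elif chain in check_pdf(); names are illustrative.
GRADE_THRESHOLDS = ((90, "A"), (80, "B"), (70, "C"), (60, "D"))


def score_to_grade(score: float) -> str:
    """Map an accessibility score to a letter grade; anything below 60 is an F."""
    for minimum, grade in GRADE_THRESHOLDS:
        if score >= minimum:
            return grade
    return "F"


for sample in (95, 80, 79.5, 60, 12):
    print(sample, score_to_grade(sample))
```

Keeping the thresholds in one tuple means a boundary change touches a single line, at the cost of one more name to maintain.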
|
|
@@ -1,50 +1,9 @@
|
|||
# Production Docker Compose — worker + Redis + PostgreSQL
|
||||
# Production Docker Compose — PostgreSQL only
|
||||
# Apache/Nginx on host serves PHP + frontend files natively
|
||||
# Redis on 1220, PostgreSQL on 1221 to avoid host conflicts
|
||||
# PDF processing handled by Cloud Run (no local worker)
|
||||
# PostgreSQL on 1221 to avoid host conflicts
|
||||
|
||||
services:
|
||||
worker:
|
||||
build:
|
||||
context: .
|
||||
dockerfile: Dockerfile.worker
|
||||
volumes:
|
||||
- ${WEB_DIR:-/opt/pdf-accessibility}/uploads:${WEB_DIR:-/opt/pdf-accessibility}/uploads
|
||||
- ${WEB_DIR:-/opt/pdf-accessibility}/results:${WEB_DIR:-/opt/pdf-accessibility}/results
|
||||
- ./logs:/app/logs
|
||||
depends_on:
|
||||
redis:
|
||||
condition: service_healthy
|
||||
postgres:
|
||||
condition: service_healthy
|
||||
environment:
|
||||
- REDIS_HOST=redis
|
||||
- REDIS_PORT=6379
|
||||
- DB_HOST=postgres
|
||||
- DB_PORT=5432
|
||||
- DB_NAME=${DB_NAME:-pdf_checker}
|
||||
- DB_USER=${DB_USER:-pdf_checker}
|
||||
- DB_PASSWORD=${DB_PASSWORD:-dev_password}
|
||||
- RESULTS_DIR=${WEB_DIR:-/opt/pdf-accessibility}/results
|
||||
- UPLOADS_DIR=${WEB_DIR:-/opt/pdf-accessibility}/uploads
|
||||
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY:-}
|
||||
- GOOGLE_API_KEY=${GOOGLE_API_KEY:-}
|
||||
deploy:
|
||||
replicas: ${WORKER_COUNT:-2}
|
||||
restart: unless-stopped
|
||||
|
||||
redis:
|
||||
image: redis:7-alpine
|
||||
ports:
|
||||
- "127.0.0.1:1220:6379"
|
||||
volumes:
|
||||
- redis-data:/data
|
||||
healthcheck:
|
||||
test: ["CMD", "redis-cli", "ping"]
|
||||
interval: 10s
|
||||
timeout: 3s
|
||||
retries: 3
|
||||
restart: unless-stopped
|
||||
|
||||
postgres:
|
||||
image: postgres:16-alpine
|
||||
ports:
|
||||
|
|
@@ -64,5 +23,4 @@ services:
|
|||
restart: unless-stopped
|
||||
|
||||
volumes:
|
||||
redis-data:
|
||||
pg-data:
|
||||
|
|
|
|||
|
|
@@ -9,42 +9,11 @@ services:
|
|||
- pdf-uploads:/app/uploads
|
||||
- pdf-results:/app/results
|
||||
depends_on:
|
||||
redis:
|
||||
condition: service_healthy
|
||||
postgres:
|
||||
condition: service_healthy
|
||||
env_file: .env
|
||||
restart: unless-stopped
|
||||
|
||||
worker:
|
||||
build:
|
||||
context: .
|
||||
dockerfile: Dockerfile.worker
|
||||
volumes:
|
||||
- pdf-uploads:/app/uploads
|
||||
- pdf-results:/app/results
|
||||
- pdf-logs:/app/logs
|
||||
depends_on:
|
||||
redis:
|
||||
condition: service_healthy
|
||||
postgres:
|
||||
condition: service_healthy
|
||||
env_file: .env
|
||||
deploy:
|
||||
replicas: ${WORKER_COUNT:-2}
|
||||
restart: unless-stopped
|
||||
|
||||
redis:
|
||||
image: redis:7-alpine
|
||||
volumes:
|
||||
- redis-data:/data
|
||||
healthcheck:
|
||||
test: ["CMD", "redis-cli", "ping"]
|
||||
interval: 10s
|
||||
timeout: 3s
|
||||
retries: 3
|
||||
restart: unless-stopped
|
||||
|
||||
postgres:
|
||||
image: postgres:16-alpine
|
||||
volumes:
|
||||
|
|
@@ -64,6 +33,4 @@ services:
|
|||
volumes:
|
||||
pdf-uploads:
|
||||
pdf-results:
|
||||
pdf-logs:
|
||||
redis-data:
|
||||
pg-data:
|
||||
|
|
|
|||
|
|
@@ -5,6 +5,9 @@ set -e
|
|||
# By default PHP-FPM clears the environment; this disables that behavior
|
||||
echo 'clear_env = no' >> /usr/local/etc/php-fpm.d/www.conf
|
||||
|
||||
# 15-minute timeout for Cloud Run PDF processing
|
||||
echo 'request_terminate_timeout = 900' >> /usr/local/etc/php-fpm.d/www.conf
|
||||
|
||||
# Start PHP-FPM in background
|
||||
php-fpm -D
|
||||
|
||||
|
|
|
|||
|
|
@@ -47,7 +47,13 @@ function loadVisualPage(pageNum) {
|
|||
|
||||
const img = document.getElementById('pageImage');
|
||||
img.onload = () => drawMarkers(pageNum);
|
||||
img.src = `api.php?action=image&job_id=${currentJobId}&page=${pageNum}`;
|
||||
// Use GCS URL directly if available, otherwise fall back to api.php
|
||||
const imageUrl = currentPageData.page_images[pageNum];
|
||||
if (imageUrl && (imageUrl.startsWith('http://') || imageUrl.startsWith('https://'))) {
|
||||
img.src = imageUrl;
|
||||
} else {
|
||||
img.src = `api.php?action=image&job_id=${currentJobId}&page=${pageNum}`;
|
||||
}
|
||||
}
|
||||
|
||||
function drawMarkers(pageNum) {
|
||||
|
|
|
|||
20
js/upload.js
|
|
@@ -78,13 +78,21 @@ async function beginCheck() {
|
|||
if (quickMode) addLog('Quick mode enabled — skipping expensive checks', 'info');
|
||||
|
||||
try {
|
||||
updateProgress(30, 'Starting analysis...');
|
||||
updateProgress(30, 'Analyzing PDF (this may take a few minutes)...');
|
||||
const result = await startCheck(currentJobId, quickMode);
|
||||
|
||||
if (result.success) {
|
||||
updateProgress(35, 'Analysis queued');
|
||||
addLog('Job queued for processing', 'success');
|
||||
pollJobStatus();
|
||||
if (result.data && result.data.status === 'completed') {
|
||||
// Synchronous Cloud Run response — results are ready
|
||||
updateProgress(98, 'Loading results...');
|
||||
addLog('Analysis complete!', 'success');
|
||||
loadResults();
|
||||
} else {
|
||||
// Async/local mode fallback — poll for status
|
||||
updateProgress(35, 'Analysis started');
|
||||
addLog('Job processing...', 'success');
|
||||
pollJobStatus();
|
||||
}
|
||||
} else {
|
||||
addLog('Check failed: ' + result.error, 'error');
|
||||
alert('Check failed: ' + result.error);
|
||||
|
|
@@ -142,9 +150,9 @@ async function pollJobStatus() {
|
|||
if (data.error_log) addLog('Error: ' + data.error_log.substring(0, 500), 'error');
|
||||
document.getElementById('progressContainer').style.display = 'none';
|
||||
alert('Analysis failed. Check the error log for details.');
|
||||
} else if (pollCount > 150) {
|
||||
} else if (pollCount > 450) {
|
||||
clearInterval(pollInterval);
|
||||
addLog('Analysis timed out after 5 minutes', 'error');
|
||||
addLog('Analysis timed out after 15 minutes', 'error');
|
||||
addLog('Try using Quick Mode for faster results', 'info');
|
||||
document.getElementById('progressContainer').style.display = 'none';
|
||||
}
|
||||
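The poll limit in this hunk moves from 150 to 450 while the message changes from 5 to 15 minutes. Both pairs are consistent with a 2-second poll interval, although the interval itself is not visible in this hunk and is inferred here. A quick arithmetic check, using a hypothetical helper name:

```python
def polls_for_timeout(timeout_seconds: int, interval_seconds: int) -> int:
    """Number of polls that fit in a timeout at a fixed interval."""
    return timeout_seconds // interval_seconds


# Assumption: the client polls every 2 seconds (inferred from 150 polls == 5 minutes).
POLL_INTERVAL_SECONDS = 2

old_limit = polls_for_timeout(5 * 60, POLL_INTERVAL_SECONDS)
new_limit = polls_for_timeout(15 * 60, POLL_INTERVAL_SECONDS)
print(old_limit, new_limit)
```

If the interval in pollJobStatus() differs from 2 seconds, the 450 limit would no longer correspond to the 900-second timeouts set elsewhere in this commit.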
|
|
|
|||
|
|
@@ -17,6 +17,10 @@ server {
|
|||
fastcgi_index index.php;
|
||||
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
|
||||
include fastcgi_params;
|
||||
|
||||
# 15-minute timeout for Cloud Run PDF processing
|
||||
fastcgi_read_timeout 900s;
|
||||
fastcgi_send_timeout 900s;
|
||||
}
|
||||
|
||||
# Serve page images from results
|
||||
|
|
|
|||
33
requirements-cloudrun.txt
Normal file
|
|
@@ -0,0 +1,33 @@
|
|||
# Cloud Run PDF Accessibility Checker - Python Dependencies

# Core PDF processing
pypdf>=4.0.0
pdfplumber>=0.11.0

# Image processing
Pillow>=10.0.0
pdf2image>=1.16.0

# OCR
pytesseract>=0.3.10

# Scientific computing
numpy>=1.24.0

# NLP and readability
textblob>=0.17.1

# Google Cloud APIs
google-cloud-vision>=3.4.0
google-cloud-documentai>=2.20.0

# Anthropic Claude API
anthropic>=0.18.0

# Additional utilities
python-dotenv>=1.0.0

# Cloud Run specific
flask>=3.0.0
gunicorn>=21.2.0
google-cloud-storage>=2.14.0
|
||||