GCP Deployment & 30s Load Balancer Timeout

GCP's HTTP(S) Load Balancer closes backend connections after 30 seconds by default. Long-running AI tasks that relied on WebSocket streaming were breaking in production because the connection was cut before the result arrived.

Key Takeaways

GCP HTTP(S) LB applies a 30-second backend service timeout by default (configurable, but only with infra access)
WebSocket connections through GCP LB are killed at 30s, even when active → client sees disconnect
Fix: replace WebSocket delivery with HTTP polling (client polls /status endpoint)
This affects both Mod Comms and Semblance — same root cause, same fix
Task results must be stored server-side (DB or Redis) so the client can retrieve them on poll

The Problem

Client → WebSocket → GCP LB → FastAPI/Quart backend
                              [30 seconds pass]
           GCP LB KILLS CONNECTION ← AI task not done yet
Client sees error, no result delivered

The Fix: HTTP Polling

1. Client POSTs task → backend returns { task_id }
2. Client polls GET /tasks/{task_id}/status every 2s
3. Backend stores result in DB/Redis when task completes
4. On 'complete' status, client fetches full result

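# FastAPI endpoint pattern
from uuid import uuid4
from fastapi import APIRouter, BackgroundTasks, HTTPException, UploadFile

router = APIRouter()

@router.post("/analyze")
async def start_analysis(proof: UploadFile, background_tasks: BackgroundTasks):
    task_id = str(uuid4())
    # Read the upload now: the UploadFile may already be closed when the background task runs
    contents = await proof.read()
    background_tasks.add_task(run_ai_analysis, task_id, contents)
    return {"task_id": task_id}

@router.get("/tasks/{task_id}/status")
async def get_status(task_id: str):
    result = await db.get_task(task_id)
    if result is None:
        raise HTTPException(status_code=404, detail="Unknown task_id")
    return {"status": result.status, "data": result.data if result.done else None}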
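The `db.get_task` call in the endpoint above needs a store that records each task's status and result. A minimal sketch of what that layer might look like, using an in-process dict as a stand-in for the DB or Redis. `TaskRecord`, `InMemoryTaskStore`, `create_task`, and `complete_task` are hypothetical names chosen for illustration, not taken from either project:

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class TaskRecord:
    status: str = "pending"      # stays "pending" until the worker finishes
    data: Optional[Any] = None   # result payload, set on completion

    @property
    def done(self) -> bool:
        return self.status == "complete"


class InMemoryTaskStore:
    """Hypothetical stand-in for the DB/Redis layer behind `db.get_task`."""

    def __init__(self) -> None:
        self._tasks: Dict[str, TaskRecord] = {}

    async def create_task(self, task_id: str) -> None:
        # Register the task before the background work starts
        self._tasks[task_id] = TaskRecord()

    async def complete_task(self, task_id: str, data: Any) -> None:
        # Called by the worker once the AI task has produced a result
        self._tasks[task_id] = TaskRecord(status="complete", data=data)

    async def get_task(self, task_id: str) -> Optional[TaskRecord]:
        # None for unknown IDs, so callers can tell "unknown" from "pending"
        return self._tasks.get(task_id)


async def demo() -> None:
    store = InMemoryTaskStore()
    await store.create_task("abc")
    print((await store.get_task("abc")).status)
    await store.complete_task("abc", {"verdict": "ok"})
    print((await store.get_task("abc")).data)


if __name__ == "__main__":
    asyncio.run(demo())
```

An in-process dict only works while a single backend process serves every request. With several workers or instances behind the LB, the poll may land on a process that never saw the task, which is why the result needs to live in a shared DB or Redis.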

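// Frontend polling (`api` is the app's HTTP client)
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function pollTask(taskId) {
  // Loop rather than recurse so long tasks don't build an ever-deeper promise chain
  while (true) {
    const { status, data } = await api.get(`/tasks/${taskId}/status`)
    if (status === 'complete') return data
    await sleep(2000)
  }
}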
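The same polling loop can be useful from Python, for example in a script or integration test run against the deployed backend. The sketch below adds an overall deadline so a task that never completes does not poll forever. The `poll_task` name, the 300-second default, and the injectable `fetch_status`, `sleep`, and `clock` parameters are assumptions made for illustration. `fetch_status` stands for whatever function issues the `GET /tasks/{task_id}/status` request and returns the decoded JSON.

```python
import time
from typing import Any, Callable, Dict


def poll_task(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,  # assumed overall budget, not a value from the projects
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Poll until the task reports 'complete' or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        # Each call is a short, independent HTTP request, so no single
        # connection has to outlive the load balancer's 30s timeout.
        payload = fetch_status()
        if payload["status"] == "complete":
            return payload["data"]
        if clock() >= deadline:
            raise TimeoutError(f"task not complete after {timeout_s}s")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting. In normal use only `fetch_status` needs to be supplied.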

Projects Affected

01 Projects/modcomms/Mod Comms — switched WebSocket → REST polling (2026-03-18, critical fix)
01 Projects/semblance/Semblance — migrated all async LLM routes (2026-03-23, critical fix)

Server Details

Both Mod Comms and Semblance deploy to GCP
Sandbox NotebookLM and Cinema Studio Pro are on ai-sandbox.oliver.solutions (optical-web-1) — different infrastructure, different rules

Secondary GCP Notes

WebSocket keepalive heartbeat should be ≤25s (Mod Comms tuned to 10s)
Consider Cloud Run for serverless deployments — has configurable timeout up to 3600s
GCP LB timeout is configurable in Backend Service settings, but changing it requires infra access

Gotchas & Lessons

Shortening the WebSocket heartbeat interval (25s → 10s) was not enough: the LB timeout applies to the connection regardless of activity, so we had to drop WebSocket entirely
Don't assume WebSocket works through GCP LB without testing against a real LB (not just locally)
Any future GCP project with AI tasks >5s should default to HTTP polling architecture
Socket.IO fallback (polling mode) is NOT the same as HTTP polling for results — Socket.IO still goes through LB

Related

wiki/architecture/multi-agent-ai-systems — the pattern that triggers long tasks
wiki/tech-patterns/redis-celery-worker-queue — storing results for polling
wiki/tech-patterns/python-ai-agents — the long-running AI calls
