obsidian/wiki/architecture/adr-log.md
2026-04-27 18:22:09 +01:00

title: Architecture Decision Records (ADR)
description: Why specific tech choices were made at Oliver Agency; prevents relitigating decisions and documents constraints
tags: architecture, decisions, adr
created: 2026-04-27
updated: 2026-04-27

Architecture Decision Records

Decisions made and why. Prevents relitigating the same choices. Each record: decision, context, alternatives considered, rationale.

Key Takeaways

  • Most Oliver stack choices are driven by server constraints (GCP 30s LB timeout) and team familiarity
  • Docker Compose is deliberately chosen over k8s for operational simplicity at this scale
  • FastAPI over Django/Flask: async performance + auto-generated OpenAPI docs are worth the smaller ecosystem
  • HTTP polling over WebSockets is a hard constraint, not a preference

ADR-001: HTTP Polling over WebSockets

Date: 2026-03 (from Mod Comms incident) Status: Active — applies to ALL Oliver projects

Decision: Never use WebSockets for long-running task communication. Use HTTP polling with a job table.

Context: Mod Comms was deployed on GCP behind a load balancer. WebSocket connections were dropped after exactly 30 seconds. The LB timeout is not configurable without GCP support escalation.

Pattern:

POST /api/jobs → {job_id}
GET  /api/jobs/{id} → {status: pending|done, result?}
Frontend polls every 2s
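
A minimal sketch of the job-table pattern above, using only the Python standard library. The in-memory dict stands in for the job table, and the function names (create_job, get_job, poll_until_done) are hypothetical, not taken from any Oliver codebase. Error handling (for example a "failed" status) is left out because the pattern only lists pending and done.

```python
import threading
import time
import uuid
from typing import Any, Callable, Dict, Optional

# In-memory stand-in for the job table; a real deployment would use a database.
_jobs: Dict[str, Dict[str, Any]] = {}
_lock = threading.Lock()


def create_job(task: Callable[[], Any]) -> str:
    """Equivalent of POST /api/jobs: register the job, run it in the background."""
    job_id = uuid.uuid4().hex
    with _lock:
        _jobs[job_id] = {"status": "pending", "result": None}

    def run() -> None:
        result = task()
        with _lock:
            _jobs[job_id] = {"status": "done", "result": result}

    threading.Thread(target=run, daemon=True).start()
    return job_id


def get_job(job_id: str) -> Optional[Dict[str, Any]]:
    """Equivalent of GET /api/jobs/{id}: return a snapshot of the job row."""
    with _lock:
        job = _jobs.get(job_id)
        return dict(job) if job else None


def poll_until_done(job_id: str, interval: float = 2.0, timeout: float = 600.0) -> Any:
    """Client-side loop: every check is a short request, so nothing outlives the LB timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_job(job_id)
        if job is None:
            raise KeyError(job_id)
        if job["status"] == "done":
            return job["result"]
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

In a FastAPI service, create_job and get_job would sit behind the two routes, and the frontend would run the equivalent of poll_until_done with the 2s interval.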

Applies to: All projects on optical-dev (Apache) and GCP. optical-web-1 (direct systemd) is less affected but polling is still safer.

See wiki/architecture/gcp-deployment-lb-timeout.


ADR-002: Docker Compose over Kubernetes

Date: ~2025 Status: Active

Decision: Single-server Docker Compose for all Oliver project deployments.

Context: Oliver Agency projects are internal tools and client portals, not public-scale services. Each project runs on one server with 13 services.

Alternatives: k8s (Minikube, GKE), Docker Swarm, bare systemd.

Rationale:

  • k8s adds ~3 days of ops overhead per project for no benefit at this scale
  • Docker Compose is understood by the entire team
  • Rollbacks: re-run docker compose up -d with the previous image tag
  • optical-dev already runs 15+ Compose projects without issues

Exceptions: Hotfolder daemons on box-cli-01 use plain systemd (CentOS 7, no Docker).


ADR-003: FastAPI over Django/Flask

Date: ~2024 Status: Active

Decision: FastAPI as the default Python backend framework.

Rationale:

  • Async-first: handles concurrent AI API calls without blocking
  • Auto-generated OpenAPI docs (/docs) — zero effort API documentation
  • Pydantic models: input validation + serialization in one place
  • Performance: competitive with Node.js for I/O-bound workloads
  • Type hints throughout → fewer runtime errors
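
The async-first point can be illustrated with plain asyncio, which is what FastAPI route handlers run on. This is a sketch only: call_ai_api is a hypothetical stand-in for a provider call, with a sleep simulating network wait.

```python
import asyncio
import time
from typing import List


async def call_ai_api(prompt: str, latency: float = 0.1) -> str:
    """Hypothetical stand-in for an AI provider call; the sleep simulates network wait."""
    await asyncio.sleep(latency)
    return f"response to {prompt!r}"


async def run_concurrently(prompts: List[str]) -> List[str]:
    # gather() schedules all calls at once; results come back in input order.
    return await asyncio.gather(*(call_ai_api(p) for p in prompts))


def main() -> None:
    start = time.perf_counter()
    results = asyncio.run(run_concurrently(["a", "b", "c"]))
    elapsed = time.perf_counter() - start
    # Total time is close to one call's latency, not the sum of all three.
    print(results, f"{elapsed:.2f}s")


if __name__ == "__main__":
    main()
```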

When to deviate:

  • Admin CRUD with lots of forms → Django (but Oliver doesn't have these)
  • Very simple one-endpoint proxy → Flask is fine

ADR-004: React + Vite over Vue / Angular / SvelteKit

Date: ~2024 Status: Active

Decision: React 18 + Vite as the standard frontend stack.

Rationale:

  • Team familiar with React; no training cost
  • Vite: fast HMR, simple base config for subpath deploys
  • React ecosystem: Shadcn/UI, Zustand, React Query all solid
  • TypeScript + Vite: first-class support

When to deviate:

  • No interactivity needed → plain HTML/JS (3M Portal, Ferrero AC Tool)
  • SSR, image optimization, or complex routing required → Next.js

ADR-005: Azure AD / MSAL as Auth Standard

Date: ~2024 Status: Active

Decision: Azure AD SSO for all Oliver internal authenticated tools.

Context: Oliver Agency has a Microsoft 365 tenant. All employees have Azure AD accounts.

Pattern: MSAL.js PKCE in frontend (delegated flow) + JWKS token validation in FastAPI backend.

Local dev bypass: DISABLE_AUTH=true env var skips auth middleware. Never in production.
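
One way to make "never in production" enforceable is to fail at startup when the bypass is requested in a production environment. The sketch below assumes a hypothetical APP_ENV variable to mark the environment; only DISABLE_AUTH comes from this ADR.

```python
import os
from typing import Mapping


def auth_disabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when DISABLE_AUTH=true is set outside production.

    APP_ENV is a hypothetical variable name; an unset APP_ENV is treated
    as production so the check fails closed.
    """
    requested = env.get("DISABLE_AUTH", "").strip().lower() == "true"
    in_production = env.get("APP_ENV", "production").strip().lower() == "production"
    if requested and in_production:
        # Fail at startup rather than silently serving unauthenticated traffic.
        raise RuntimeError("DISABLE_AUTH=true is not allowed in production")
    return requested
```

The auth middleware would call auth_disabled() once at startup and skip token validation only when it returns True.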

Alternatives: Auth0 (cost, external dependency), custom JWT (reinventing the wheel), Keycloak (infra overhead).

See wiki/tech-patterns/azure-ad-msal-auth.


ADR-006: Cost Tracker on Every AI Project

Date: 2026-04 (ai-cost-tracker launch) Status: Active

Decision: Every Oliver project making AI API calls must integrate ai-cost-tracker with preflight + record.

Context: AI API costs (Gemini, Claude, OpenAI) can spike unpredictably. Without tracking, budget overruns are only discovered on the monthly bill.

Integration cost: ~30 minutes per project (3 env vars + 2 HTTP calls).

Enforcement: preflight() returns allow: false if budget exceeded — prevents runaway costs.
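
A sketch of the preflight + record flow around a single AI call. The real ai-cost-tracker client is not shown in this page, so both callables, the BudgetExceeded exception, and the cost_usd field are hypothetical placeholders for the two HTTP calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class BudgetExceeded(Exception):
    """Raised when preflight reports allow: false."""


@dataclass
class CostTrackedCall:
    # Both callables are hypothetical wrappers around the tracker's two HTTP calls.
    preflight: Callable[[str], Dict]      # project -> {"allow": bool, ...}
    record: Callable[[str, float], None]  # project, cost in USD

    def run(self, project: str, ai_call: Callable[[], Dict]) -> Dict:
        check = self.preflight(project)
        if not check.get("allow", False):
            # Stop before spending anything; a missing flag counts as a denial.
            raise BudgetExceeded(project)
        response = ai_call()
        # cost_usd is an assumed field name on the AI response.
        self.record(project, float(response.get("cost_usd", 0.0)))
        return response
```

Treating a missing allow flag as a denial is a design choice in this sketch: a tracker outage then blocks spending rather than letting it through.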

See wiki/tech-patterns/cost-tracker-integration.


ADR-007: Apache Single-Vhost Subpath Pattern on optical-dev

Date: 2026-04 (documented from Barclays Banner Builder) Status: Active

Decision: All projects on optical-dev share one Apache vhost. Each project gets a subpath (/project-name/), not a subdomain.

Context: optical-dev has one public IP. Subdomain-per-project requires DNS management and SSL certificates. Subpath requires only Apache config fragments.

Constraints:

  • React apps must use VITE_BASE_PATH and React Router basename
  • All API calls must include the subpath prefix
  • Include directive order matters — specific paths before catch-alls

See wiki/architecture/optical-dev-server-deploy.


ADR-008: Gemini over GPT for Barclays / GCP Projects

Date: 2026-03 (Mod Comms) Status: Active for GCP-deployed projects

Decision: Prefer Google Gemini as AI provider for projects deployed on GCP.

Rationale: Google-to-Google latency advantage. GCP service account auth is simpler than API key rotation. Gemini Pro + Flash fallback gives cost/quality control.
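
One reading of "Pro + Flash fallback" is: try the stronger model first and fall back to the cheaper one on failure. The sketch below shows that ordering with injected callables; it does not use the Gemini SDK, and generate_with_fallback is a hypothetical helper name.

```python
from typing import Callable, Sequence, Tuple


def generate_with_fallback(
    prompt: str,
    models: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (model name, output) from the first success."""
    errors = []
    for name, call in models:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any provider error triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the output makes it possible to log which tier actually served each request.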

When to use Claude/OpenAI instead: Client specifies it (PIMCO uses Claude API), or task requires better coding ability, or project is on optical-web-1 / optical-dev (neutral infrastructure).


ADR-009: Node.js Proxy for One2Edit / Simple Portals

Date: ~2024 Status: Active

Decision: Use Node.js + vanilla JS (no framework, no build step) for simple CORS proxy portals.

Context: One2Edit API doesn't support CORS. H&M and 3M portals need to proxy requests to oliver.one2edit.com.

Rationale: No build pipeline means easier deployment and debugging. Vanilla JS works fine for 3-page portals. A Node.js Express proxy is about 30 lines.

Pattern: Static files served by Node + /api/* proxied to external API. See wiki/tech-patterns/nodejs-vanilla-proxy.