Add Dev/Prod migration + SSO plan

Captures the five-phase plan (Phases 0 to 4) to move HM QC from the shared sandbox to
dedicated Dev/Prod servers with Azure AD SSO, mirroring the AI QC sibling
project's pattern. Includes locked-in decisions (URL path, branch strategy,
shared Entra app, fresh-start data), file-by-file lift list from AI QC,
phased checklist, and the IT ticket text. Action deferred to late April.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nickviljoen 2026-04-28 21:09:37 +02:00
parent fc11a98a95
commit e69f077c79


@ -0,0 +1,165 @@
# Dev/Prod Migration + SSO Plan
**Status:** Drafted 2026-04-28. IT ticket submitted to add redirect URIs to the shared Entra app. Action deferred to later in the week.
**Goal:** Move the HM AI QC tool from the shared sandbox onto dedicated Dev and Prod servers, replace local username/password auth with Azure AD SSO, and put a proper deploy script in place.
## Decisions (locked in)
| | Value |
|---|---|
| URL path on both servers | `hm-aiqc` |
| Dev URL | `https://optical-dev.oliver.solutions/hm-aiqc/` |
| Prod URL | `https://optical-prod.oliver.solutions/hm-aiqc/` |
| Branch strategy | `develop` → Dev, `main` (tagged) → Prod |
| Auth flow | Browser MSAL.js PKCE popup → server validates JWT → httpOnly cookie (mirrors AI QC) |
| Entra app | Shared with AI QC. Tenant `e519c2e6-bc6d-4fdf-8d9c-923c2f002385`, Client `9079054c-9620-4757-a256-23413042f1ef` |
| Authorization | Replicate AI QC's `user_access.json` pattern (admins + allowed users), simplified to drop AI QC's "clients" multi-tenancy |
| Bootstrap admin | `nick.viljoen@brandtech.plus` (same as AI QC) |
| Database migrations | Bootstrap Flask-Migrate / Alembic before any server cutover |
| Sandbox→Dev/Prod data | Fresh start everywhere |
| Box folder IDs / LLM keys | Shared across environments (no separate creds available) |
| Apache config | Placed manually by Nick via SSH on each server |
| Local auth fallback | **Removed entirely** — if MS login is down, the app is unreachable. Acceptable trade-off; matches AI QC. |
## Reference: sibling project to copy from
`/Users/nickviljoen/Desktop/AI_QC_Bitbucket/ai_qc/` — already deployed to the new servers with this exact auth/deploy pattern. Key files to lift:
| Source | Target | Notes |
|---|---|---|
| `backend/auth_middleware.py` (245 lines) | `core/auth/middleware.py` | Replace cookie name `ai_qc_auth_token` → `hm_aiqc_auth_token` |
| `backend/jwt_validator.py` (196 lines) | `core/auth/jwt_validator.py` | Lift verbatim |
| `backend/user_access.py` (243 lines) | `core/auth/user_access.py` | Strip out `clients` machinery — HM QC has no multi-tenancy |
| `backend/scripts/deploy.sh` (174 lines) | `deploy.sh` | Adapt for Docker (compose build/up + flask db upgrade instead of pip + systemctl) |
| `backend/scripts/rollback.sh` | `deploy/rollback.sh` | Adapt similarly |
| `backend/scripts/health-check.sh` | `deploy/health-check.sh` | Tiny — copy as-is |
| `frontend/index.html:2174-2287` (MSAL JS) | `templates/login.html` + `templates/base.html` | Lift the JS, place in templates |
| MSAL CDN script tag from `frontend/index.html:7-15` | `templates/base.html` head | `<script src="https://alcdn.msauth.net/browser/2.35.0/js/msal-browser.min.js">` |
| Auth routes from `backend/api_server.py:5313-5366` | `core/auth/routes.py` | `/auth/login`, `/auth/logout`, `/auth/status` |
**Key differences AI QC → HM QC** (so don't blind-copy):
- AI QC = bare-metal venv + systemd + single `index.html` SPA at `/var/www/html/ai_qc/`
- HM QC = Docker Compose + gunicorn + Flask templates (no separate frontend, no `/var/www/html/` involvement)
The IT-supplied "delete frontend files / build frontend / copy to web directory" steps are SPA-specific and **don't apply to HM QC** — they were generic Oliver-standard guidance that fit AI QC, not us.
## Phases
### Phase 0 — Repo prep (`develop` branch)
- [ ] `git checkout -b develop` from `main`, push to Bitbucket
- [ ] Add `core/health/` blueprint with `GET /health` returning `{status: "ok", db: <bool>}` — deploy script smoke-tests this
- [ ] Bootstrap Flask-Migrate (Alembic):
- `flask db init`
- Generate initial migration matching current schema (`flask db migrate -m "Initial schema"`)
- Review the autogenerated migration; manually fix anything Alembic missed (the ad-hoc `ALTER TABLE usage_logs ADD COLUMN input_tokens` etc. should already be reflected in models)
- Replace `db.create_all()` and the manual ALTER patches in `app.py` with `flask db upgrade` invoked from container entrypoint
- [ ] Add `wait_for_db.py` helper (a no-op for SQLite, but a useful hook if we move to Postgres later)
- [ ] Smoke-test locally: drop the SQLite file, run `flask db upgrade`, confirm tables match
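The `GET /health` contract from the checklist above can be sketched as a small framework-free helper. This is a minimal sketch, not the project's code: `check_db` and `health_payload` are hypothetical names, and answering `"degraded"` with HTTP 503 when the database is unreachable is an assumption (the plan only specifies the `{status, db}` shape). A Flask view in `core/health/` would presumably return `jsonify(body), status`.

```python
import sqlite3


def check_db(db_path: str) -> bool:
    """Return True if the SQLite file opens and answers a trivial query."""
    try:
        # mode=rw avoids silently creating an empty database when the file is missing.
        conn = sqlite3.connect(f"file:{db_path}?mode=rw", uri=True, timeout=2)
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return True
    except sqlite3.Error:
        return False


def health_payload(db_path: str) -> tuple[dict, int]:
    """Build the JSON body and HTTP status for GET /health (status codes are an assumption)."""
    db_ok = check_db(db_path)
    body = {"status": "ok" if db_ok else "degraded", "db": db_ok}
    return body, (200 if db_ok else 503)
```

Returning a non-2xx status on a database failure is what would let a `curl -sf` smoke test treat a broken database as a failed deploy.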
### Phase 1 — SSO (still on `develop`)
- [ ] **Replace** `core/auth/jwt_validator.py` with AI QC's
- [ ] **Replace** `core/auth/middleware.py` with AI QC's `auth_middleware.py` (rename cookie)
- [ ] **New** `core/auth/user_access.py` — simplified version of AI QC's (admins + allowed users only, no `clients`). Stored at `database/user_access.json` so it lives with the SQLite DB and gets backed up together
- [ ] **Replace** `core/auth/routes.py` body with three POST endpoints (`/auth/login`, `/auth/logout`, `/auth/status`) lifted from AI QC's `api_server.py:5313-5366`
- [ ] **Replace** `templates/login.html` with MSAL.js popup version (lift JS from AI QC)
- [ ] **Update** `templates/base.html` — add MSAL CDN script tag, sign-in/out button, auth status check on load
- [ ] **Update** `app.py` — wire `AuthMiddleware(app)`, switch `@require_auth` decorators across all blueprints from local middleware to `@auth.require_auth`
- [ ] **Update** usage logging — confirm `g.user.email` flows through to `usage_logs.username`
- [ ] **Update** `.env.example`:
- Remove: `AUTH_USERS`, `SESSION_COOKIE_PATH`
- Add: `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `ENVIRONMENT=development|production`, `BOOTSTRAP_ADMIN_EMAIL`
- [ ] **Delete** `auth_middleware.py` (legacy top-level), `jwt_validator.py` (legacy top-level), `deploy/generate_password.py`, the local-login form code in `core/auth/routes.py`
- [ ] Smoke-test locally — does the MSAL popup work against `http://localhost:5050/`? (Needs a localhost redirect URI registered in Entra; AI QC handles this with a `window.location.hostname === 'localhost'` branch)
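The simplified authorization model (admins plus allowed users, no `clients`) might look roughly like the following. This is a sketch under assumptions: the JSON keys `admins` and `allowed_users` and the function names are hypothetical, since AI QC's actual `user_access.json` layout is not shown here, and seeding the admin set from `BOOTSTRAP_ADMIN_EMAIL` is one plausible reading of the plan.

```python
import json
from pathlib import Path


def load_access(path: str, bootstrap_admin: str = "") -> dict:
    """Load the access file; a missing file yields empty lists plus the bootstrap admin."""
    p = Path(path)
    data = json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}
    # Emails are compared case-insensitively, so normalise once on load.
    admins = {e.strip().lower() for e in data.get("admins", [])}
    users = {e.strip().lower() for e in data.get("allowed_users", [])}
    if bootstrap_admin:
        admins.add(bootstrap_admin.strip().lower())
    return {"admins": admins, "allowed_users": users}


def is_admin(access: dict, email: str) -> bool:
    return email.strip().lower() in access["admins"]


def is_allowed(access: dict, email: str) -> bool:
    # Admins are implicitly allowed to use the tool.
    return is_admin(access, email) or email.strip().lower() in access["allowed_users"]
```

With this shape, a fresh server with no `database/user_access.json` still lets the bootstrap admin sign in and add everyone else.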
### Phase 2 — Deploy machinery (still on `develop`)
- [ ] **`deploy.sh`** — Docker-aware adaptation of AI QC's:
```
Usage:
  deploy.sh dev                    → fetch origin/develop, deploy HEAD
  deploy.sh prod <tag>             → fetch tag, deploy
  deploy.sh {dev|prod} --dry-run
Steps:
   1. Validate APP_DIR is a git repo and .env exists
   2. git fetch --tags --prune
   3. Compute current vs target rev; bail if same
   4. Show changelog of commits about to apply; confirm
   5. Write .last_deploy_rollback (current rev)
   6. git reset --hard <target>
   7. docker compose build (uses cache — only changed layers rebuild)
   8. docker compose run --rm web flask db upgrade
   9. docker compose up -d
  10. Poll http://127.0.0.1:5050/health for up to 30s
  11. On health fail: git reset --hard <previous>, docker compose up -d, verify, exit 1
```
The IT spec's "delete frontend files / build frontend / copy to /var/www/html/" steps are explicitly skipped (with comment explaining why — Flask templates, not SPA).
- [ ] **`deploy/rollback.sh`** — `rollback.sh last` reverts to checkpoint, `rollback.sh <hash>` to specific commit. Same Docker compose dance.
- [ ] **`deploy/health-check.sh`** — tiny `curl -sf http://127.0.0.1:5050/health` wrapper.
- [ ] **`deploy/apache-dev.conf`** — Apache `Location` block for `https://optical-dev.oliver.solutions/hm-aiqc/`. Sets `X-Script-Name: /hm-aiqc` (for `wsgi.py`'s `ReverseProxied` middleware) and proxies to gunicorn on `127.0.0.1:5050`.
- [ ] **`deploy/apache-prod.conf`** — same, for `optical-prod`.
- [ ] **`.env.dev.example`** and **`.env.prod.example`** — starter envs to copy and fill on each server.
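Steps 10 and 11 of the `deploy.sh` outline hinge on a bounded health poll. The real script will be shell, but the logic can be sketched in Python as below. Everything here is illustrative: `http_ok` and `wait_healthy` are hypothetical names, and the 2-second interval is an assumption (the plan only fixes the 30-second budget). The probe, clock and sleep are injectable so the loop can be exercised without a running server.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts are all OSError subclasses.
        return False


def wait_healthy(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    deadline_s: float = 30.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Probe until healthy or until another wait would overrun the deadline."""
    end = clock() + deadline_s
    while True:
        if probe(url):
            return True
        if clock() + interval_s > end:
            return False  # caller should roll back to the saved revision
        sleep(interval_s)
```

A caller would run `wait_healthy("http://127.0.0.1:5050/health")` after `docker compose up -d` and, on `False`, reset to the revision recorded in `.last_deploy_rollback` and exit non-zero.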
### Phase 3 — Dev cutover
1. [ ] Confirm IT ticket has been actioned (redirect URIs added to Entra app)
2. [ ] Merge any outstanding work into `develop` (via PR or by pushing the final commits directly); `develop` HEAD is the dev deploy target
3. [ ] **On Dev server (`optical-dev.oliver.solutions`)**:
```
sudo mkdir -p /opt/hm-aiqc
sudo chown $(whoami):$(whoami) /opt/hm-aiqc
git clone git@bitbucket.org:zlalani/hm_ai_qc_report_tool.git /opt/hm-aiqc
cd /opt/hm-aiqc
git checkout develop
cp deploy/.env.dev.example .env # and fill in keys/secrets
cp /path/to/box_config.json config/box_config.json # transfer Box creds
./deploy.sh dev
```
4. [ ] Place `deploy/apache-dev.conf` block in the Apache vhost; `sudo systemctl reload apache2`
5. [ ] Test `https://optical-dev.oliver.solutions/hm-aiqc/` — login flow, then run a sample HM QC, Video QC, Master Match
6. [ ] Verify Usage tab logs `nick.viljoen@oliver.agency` (or whichever email signs in)
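Testing under `/hm-aiqc/` in steps 4 and 5 depends on the app honouring the `X-Script-Name` header that the Apache block sets. The project's `wsgi.py` is not shown in this plan, so the following is only a sketch of how a `ReverseProxied` WSGI middleware of this kind is commonly written; the real one may differ in detail.

```python
class ReverseProxied:
    """WSGI middleware that mounts the app under the prefix sent in X-Script-Name."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # WSGI exposes the X-Script-Name request header as HTTP_X_SCRIPT_NAME.
        script_name = environ.get("HTTP_X_SCRIPT_NAME", "").rstrip("/")
        if script_name:
            # SCRIPT_NAME makes generated URLs (redirects, url_for) carry the prefix.
            environ["SCRIPT_NAME"] = script_name
            path_info = environ.get("PATH_INFO", "")
            # If the proxy forwards the full path, strip the prefix so routes still match.
            if path_info == script_name or path_info.startswith(script_name + "/"):
                environ["PATH_INFO"] = path_info[len(script_name):] or "/"
        return self.app(environ, start_response)
```

In a Flask app this would typically be wired as `app.wsgi_app = ReverseProxied(app.wsgi_app)`. A quick check on Dev is that redirects after login land on `/hm-aiqc/...` rather than on the server root.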
### Phase 4 — Prod cutover
1. [ ] Once Dev has been green for a while, merge `develop` → `main`
2. [ ] Tag `v3.0.0` on `main` (major bump — auth is breaking; `v2.x` was local-auth)
3. [ ] **On Prod server**: same clone steps as Phase 3, but `./deploy.sh prod v3.0.0`
4. [ ] Place `deploy/apache-prod.conf`; reload
5. [ ] Test `https://optical-prod.oliver.solutions/hm-aiqc/`
6. [ ] **Sandbox decommission** — leave running ~1 week as fallback, then `docker compose down` on `optical-web-1`. Don't delete data immediately; archive `database/` and `storage/` to Box first
## Things to handle during/after Phase 4
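- [ ] **Database backups** — cron a daily `sqlite3 <db-file> ".backup 'database/backups/qc_platform_$(date +%F).db'"` on Prod (the database file must be the first argument; escape `%` as `\%` inside a crontab entry), retain 30 days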
- [ ] **Disk monitoring** — `storage/` grows monotonically; add a weekly cron that warns if `> 10 GB`
- [ ] **Log rotation** — Docker JSON log driver max-size + max-file in `docker-compose.yml`
- [ ] **CSRF protection** — currently absent on POST forms; add Flask-WTF or manual tokens once auth is real
- [ ] **Cookie flags** — confirm `SESSION_COOKIE_SECURE=True`, `SESSION_COOKIE_SAMESITE=Lax`, `SESSION_COOKIE_HTTPONLY=True` set in production env
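The daily backup with 30-day retention could also be done from Python using SQLite's online backup API, which copies a consistent snapshot even while the app holds the database open. This is a sketch only: `backup_db` and `prune_backups` are hypothetical helpers, the source database path is left to the caller because the plan does not name the file, and pruning by file modification time is an assumption.

```python
import sqlite3
import time
from datetime import date
from pathlib import Path
from typing import List, Optional


def backup_db(db_path: str, backup_dir: str, today: Optional[date] = None) -> Path:
    """Snapshot db_path into backup_dir/qc_platform_YYYY-MM-DD.db."""
    out_dir = Path(backup_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = (today or date.today()).isoformat()  # same format as `date +%F`
    target = out_dir / f"qc_platform_{stamp}.db"
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(target)
    try:
        src.backup(dst)  # online backup; safe while other connections are active
    finally:
        dst.close()
        src.close()
    return target


def prune_backups(backup_dir: str, keep_days: int = 30, now: Optional[float] = None) -> List[Path]:
    """Delete backups whose modification time is older than keep_days; return what was removed."""
    cutoff = (time.time() if now is None else now) - keep_days * 86400
    removed = []
    for f in sorted(Path(backup_dir).glob("qc_platform_*.db")):
        if f.stat().st_mtime < cutoff:
            f.unlink()
            removed.append(f)
    return removed
```

A cron job would call `backup_db(...)` followed by `prune_backups("database/backups")`. Archiving the same directory to Box would cover the case where the Prod disk itself is lost.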
## IT ticket text (already submitted)
```
Subject: Add redirect URIs to existing Entra app for HM AI QC tool
Please add the following redirect URIs to the existing app registration:
- Tenant: e519c2e6-bc6d-4fdf-8d9c-923c2f002385
- Client ID: 9079054c-9620-4757-a256-23413042f1ef (the AI QC app)
Platform: Single-page application (SPA) — same as the existing AI QC URIs
Add these redirect URIs:
- https://optical-dev.oliver.solutions/hm-aiqc/
- https://optical-prod.oliver.solutions/hm-aiqc/
No changes to permissions, scopes, or platform type are required —
these are siblings of the existing
https://ai-sandbox.oliver.solutions/hm-ai-qc-report/ URI.
Both URIs use the same MSAL.js PKCE popup flow as AI QC.
```