diff --git a/deploy/DEV_CUTOVER_RUNBOOK.md b/deploy/DEV_CUTOVER_RUNBOOK.md
new file mode 100644
index 0000000..e48f16f
--- /dev/null
+++ b/deploy/DEV_CUTOVER_RUNBOOK.md
@@ -0,0 +1,236 @@
+# Dev cutover runbook — `optical-dev.oliver.solutions`
+
+Self-contained "SSH in and paste these" instructions for the first
+deploy of HM AI QC to the new Dev server. Phase 3 of
+`deploy/DEV_PROD_MIGRATION_PLAN.md`.
+
+**Estimated time:** 20 minutes if everything works first try.
+
+---
+
+## 0 — Prereqs (verify before starting)
+
+| Check | How |
+|---|---|
+| Entra redirect URIs added | Confirmed 2026-05-09 — `https://optical-dev.oliver.solutions/hm-aiqc/` and `https://optical-prod.oliver.solutions/hm-aiqc/` are registered as SPA URIs on app `9079054c-9620-4757-a256-23413042f1ef`. |
+| `develop` branch pushed | `git ls-remote origin develop` from your laptop shows commit `e772095` (Phase 2). |
+| You have SSH access to `optical-dev.oliver.solutions` | `ssh optical-dev.oliver.solutions` lands you in. |
+| You have docker permissions on the server | `docker ps` works without sudo. |
+| You have sudo for Apache reload | `sudo systemctl status apache2` works. |
+
+If anything's missing, stop and resolve before proceeding.
+
+---
+
+## 1 — SSH in and clone the repo
+
+```bash
+ssh optical-dev.oliver.solutions
+
+sudo mkdir -p /opt/hm-aiqc
+sudo chown $(whoami):$(whoami) /opt/hm-aiqc
+
+git clone git@bitbucket.org:zlalani/hm_ai_qc_report_tool.git /opt/hm-aiqc
+cd /opt/hm-aiqc
+git checkout develop
+git log -1 --oneline  # should print: e772095 Phase 2: deploy machinery for Dev/Prod cutover
+```
+
+If `git clone` fails on auth, you'll need a deploy key on the server first
+(an SSH key on `optical-dev` whose public half is added to Bitbucket as a
+read-only deploy key for this repo). Same key approach as AI QC.
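The `git log -1 --oneline` check in step 1 can also be scripted rather than eyeballed. The following is a minimal sketch, assuming a POSIX shell with `git` on the PATH. `commit_matches` and `verify_checkout` are hypothetical helper names used for illustration and are not part of the repo.

```shell
# Hypothetical helpers (not part of the repo): confirm a checkout is at the
# commit the runbook expects.

# Return 0 when the full commit hash starts with the expected short hash.
commit_matches() {
  actual="$1"
  expected="$2"
  # An empty expectation would match anything, so reject it explicitly.
  [ -n "$expected" ] || return 1
  case "$actual" in
    "$expected"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Compare the HEAD of the repo in $1 against the short hash in $2.
verify_checkout() {
  repo_dir="$1"
  expected="$2"
  head="$(git -C "$repo_dir" rev-parse HEAD)" || return 2
  if commit_matches "$head" "$expected"; then
    echo "OK: HEAD is $head"
  else
    echo "MISMATCH: HEAD is $head, expected $expected" >&2
    return 1
  fi
}
```

On the server this would be called as `verify_checkout /opt/hm-aiqc e772095`, and a non-zero exit status means the clone is not at the Phase 2 commit.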
+
+---
+
+## 2 — Create the `.env` file
+
+```bash
+cp deploy/.env.dev.example .env
+
+# Generate a real Flask SECRET_KEY
+python3 -c 'import secrets; print(secrets.token_urlsafe(48))'
+# Paste that value into SECRET_KEY=
+
+# Fill in real LLM API keys
+$EDITOR .env
+```
+
+Required keys to fill in (the `.env.dev.example` placeholders):
+- `SECRET_KEY` — from the python3 one-liner above
+- `OPENAI_API_KEY` — copy from sandbox `optical-web-1:/opt/hm_ai_qc/hm_ai_qc_report_tool/.env`
+- `GOOGLE_API_KEY` — same source
+- `ANTHROPIC_API_KEY` — same source
+
+Confirm tenant/client IDs match the SSO plan:
+```bash
+grep "AZURE_" .env
+# AZURE_TENANT_ID=e519c2e6-bc6d-4fdf-8d9c-923c2f002385
+# AZURE_CLIENT_ID=9079054c-9620-4757-a256-23413042f1ef
+```
+
+---
+
+## 3 — Drop in the Box config
+
+The Box service-account JSON lives outside the repo (gitignored):
+
+```bash
+mkdir -p config
+scp optical-web-1:/opt/hm_ai_qc/hm_ai_qc_report_tool/config/box_config.json \
+    optical-dev.oliver.solutions:/opt/hm-aiqc/config/box_config.json
+
+# On the dev server, lock it down:
+chmod 600 /opt/hm-aiqc/config/box_config.json
+```
+
+Or if you can't ssh-copy directly, copy it to your laptop with `scp` and re-upload.
+
+---
+
+## 4 — Run the deploy
+
+```bash
+cd /opt/hm-aiqc
+./deploy.sh dev --dry-run  # preview — no changes
+./deploy.sh dev            # actual deploy, prompts y/N to confirm
+```
+
+The deploy script will:
+1. Save current HEAD to `.last_deploy_rollback` (empty on first run, that's fine — it'll save the initial clone HEAD)
+2. `git fetch` + `git reset --hard origin/develop`
+3. `docker compose build` (first run pulls the python:3.11-slim base image — slow, ~2–5 min)
+4. `docker compose up -d` (entrypoint runs `wait_for_db.py` → `flask db upgrade` → gunicorn)
+5. Poll `http://127.0.0.1:5050/health` every 2s for up to 60s
+6. If `/health` returns 200 with `{"status":"ok","db":true}` → done
+7. If not → auto-rollback to the saved checkpoint and exit non-zero
+
+Watch the logs in another terminal if you want to see the boot:
+```bash
+docker compose logs -f web
+```
+
+---
+
+## 5 — Place the Apache config and reload
+
+Find the existing `optical-dev.oliver.solutions` virtual host:
+```bash
+grep -rn "optical-dev.oliver.solutions" /etc/apache2/sites-available/
+```
+
+Open that file (likely `/etc/apache2/sites-available/optical-dev.conf` or similar)
+and paste the contents of `/opt/hm-aiqc/deploy/apache-dev.conf` inside the
+`<VirtualHost>` block (the HTTPS one). Save.
+
+Verify and reload:
+```bash
+sudo apache2ctl configtest  # should print "Syntax OK"
+sudo systemctl reload apache2
+```
+
+If `configtest` complains about missing modules:
+```bash
+sudo a2enmod proxy proxy_http headers rewrite
+sudo systemctl reload apache2
+```
+
+---
+
+## 6 — Smoke test the public URL
+
+From your **laptop**:
+
+```bash
+curl -i https://optical-dev.oliver.solutions/hm-aiqc/health
+# Expected: HTTP/1.1 200 OK + {"status":"ok","db":true}
+```
+
+If `/health` returns 200, open in a browser:
+
+```
+https://optical-dev.oliver.solutions/hm-aiqc/
+```
+
+You should be redirected to `/auth/login-page`. Click **Sign in with Microsoft**
+and complete the popup with your `*.brandtech.plus` or `*.oliver.agency` work account.
+
+After login, run through each module to confirm:
+
+| Module | Quick test |
+|---|---|
+| Reporting | Tab loads, "Previous Box Reports" populates |
+| HM QC | Upload one image, run, confirm score |
+| Video QC | Upload one short MP4, run, confirm score |
+| Video Master | Enter a known campaign name, confirm matches preview |
+| Campaigns | List loads (will be empty on fresh start) |
+| Usage | Tab loads, your email appears in the user filter |
+
+The Usage tab is the last item because it's the proof that `g.user.email`
+flows into `usage_logs.user` correctly — the whole point of Phase 1.
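The `/health` contract used in steps 4 and 6 (HTTP 200 with `{"status":"ok","db":true}`, polled every 2s for up to 60s) can be sketched as a small shell helper for repeating the smoke test by hand. This is an illustration under stated assumptions, not the code in `deploy.sh`: the function names `health_body_ok` and `wait_for_health` are hypothetical, and the body check is a plain `grep` rather than a real JSON parse.

```shell
# Hypothetical helpers (not taken from deploy.sh).

# Succeed when a /health response body reports status "ok" and db true.
health_body_ok() {
  body="$1"
  printf '%s' "$body" | grep -q '"status"[[:space:]]*:[[:space:]]*"ok"' &&
    printf '%s' "$body" | grep -q '"db"[[:space:]]*:[[:space:]]*true'
}

# Poll a URL until it returns a healthy body or the timeout elapses.
# Arguments: url, interval in seconds (default 2), timeout in seconds (default 60).
wait_for_health() {
  url="$1"
  interval="${2:-2}"
  timeout="${3:-60}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -f makes curl fail on HTTP errors such as 503; -s hides progress output.
    if body="$(curl -fs --max-time 5 "$url")" && health_body_ok "$body"; then
      echo "healthy after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "not healthy after ${timeout}s" >&2
  return 1
}
```

From a laptop, `wait_for_health https://optical-dev.oliver.solutions/hm-aiqc/health` would repeat the public check with the same 2s/60s cadence. Note that the elapsed counter only adds the sleep interval, so slow responses make the real wall-clock limit somewhat longer than the nominal timeout.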
+
+---
+
+## 7 — If something goes wrong
+
+### Deploy script reports rollback
+
+The script already restored the previous code state. Look at the container logs:
+```bash
+docker compose logs --tail=200 web
+```
+
+Most likely causes:
+- Missing env var → fix `.env` and re-run `./deploy.sh dev`
+- Migration error → check `flask db upgrade` output in logs; restore DB if needed (see "Database backup" below)
+- Box config missing → confirm `config/box_config.json` exists
+
+### `/health` returns 503 (db: false)
+
+Container is up but can't reach SQLite. Check that `./database` volume is
+mounted and writable:
+```bash
+docker compose exec web ls -la /app/database
+docker compose exec web touch /app/database/_test_write && docker compose exec web rm /app/database/_test_write
+```
+
+### MSAL popup spins forever / "AADSTS50011" redirect URI mismatch
+
+Confirm the URL in the address bar exactly matches the Entra-registered SPA URI:
+- Must be `https://optical-dev.oliver.solutions/hm-aiqc/` (trailing slash)
+- Apache `RewriteRule` in `apache-dev.conf` adds the trailing slash if you visit
+  `/hm-aiqc` without one — verify that 301 fires.
+
+### "AADSTS70001: Application not found in tenant"
+
+The user is signed into a *different* Microsoft tenant. They must use a
+`*.brandtech.plus` / `*.oliver.agency` work account, not a personal Microsoft
+account.
+
+### Manual rollback
+
+```bash
+cd /opt/hm-aiqc
+./deploy/rollback.sh last           # back to the checkpoint
+# or
+./deploy/rollback.sh <commit-sha>   # back to a specific commit
+```
+
+### Database backup (before risky migrations)
+
+```bash
+cd /opt/hm-aiqc
+mkdir -p database/backups && sqlite3 database/qc_platform.db ".backup database/backups/qc_platform_$(date +%F_%H%M).db"
+```
+
+---
+
+## 8 — Hand-off
+
+Once Dev is green and the smoke tests pass:
+
+1. Tell the team the URL: `https://optical-dev.oliver.solutions/hm-aiqc/`
+2. **Don't** decommission the sandbox yet (`https://ai-sandbox.oliver.solutions/hm-ai-qc-report/`)
+   — leave it running until Prod is also live and has soaked for ~1 week.
+3. When ready to ship Prod: tag `v3.0.0` on `main`, then repeat this runbook
+   on `optical-prod.oliver.solutions` with `./deploy.sh prod v3.0.0` and
+   `deploy/apache-prod.conf`. Same steps, different host.
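The tagging part of hand-off step 3 could look roughly like the sketch below. It assumes release tags follow the `vMAJOR.MINOR.PATCH` shape suggested by `v3.0.0`, that the remote is named `origin`, and that the tag should point at the tip of `origin/main`. The runbook does not spell out any of these, and `is_release_tag` and `tag_release` are hypothetical names.

```shell
# Hypothetical helpers; the tag format and remote name are assumptions.

# Succeed when the argument looks like vMAJOR.MINOR.PATCH, e.g. v3.0.0.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Create an annotated tag on the tip of origin/main and push it.
# Run from a clone with push rights; defining the function runs nothing.
tag_release() {
  tag="$1"
  if ! is_release_tag "$tag"; then
    echo "refusing: '$tag' is not of the form vX.Y.Z" >&2
    return 1
  fi
  git fetch origin main &&
    git tag -a "$tag" -m "Release $tag" origin/main &&
    git push origin "$tag"
}
```

With the tag pushed (`tag_release v3.0.0`), the Prod run would then use `./deploy.sh prod v3.0.0` on `optical-prod.oliver.solutions` as described in step 3.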