hm_ai_qc_report_tool/deploy/DEV_CUTOVER_RUNBOOK.md
nickviljoen 458c75311e Phase 3 prep: add Dev cutover runbook
Self-contained SSH-and-paste guide covering clone → .env → deploy → Apache
reload → smoke test. Includes troubleshooting for the most likely failure
modes (MSAL redirect mismatch, missing env vars, /health 503).
2026-05-09 14:42:12 +02:00


Dev cutover runbook — optical-dev.oliver.solutions

Self-contained "SSH in and paste these" instructions for the first deploy of HM AI QC to the new Dev server. This is Phase 3 of deploy/DEV_PROD_MIGRATION_PLAN.md.

Estimated time: 20 minutes if everything works first try.


0 — Prereqs (verify before starting)

| Check | How |
|---|---|
| Entra redirect URIs added | Confirmed 2026-05-09: https://optical-dev.oliver.solutions/hm-aiqc/ and https://optical-prod.oliver.solutions/hm-aiqc/ are registered as SPA URIs on app 9079054c-9620-4757-a256-23413042f1ef. |
| develop branch pushed | git ls-remote origin develop from your laptop shows commit e772095 (Phase 2). |
| You have SSH access to optical-dev.oliver.solutions | ssh optical-dev.oliver.solutions lands you in. |
| You have docker permissions on the server | docker ps works without sudo. |
| You have sudo for Apache reload | sudo systemctl status apache2 works. |

If anything's missing, stop and resolve before proceeding.


1 — SSH in and clone the repo

ssh optical-dev.oliver.solutions

sudo mkdir -p /opt/hm-aiqc
sudo chown "$(id -un):$(id -gn)" /opt/hm-aiqc   # your user and your primary group

git clone git@bitbucket.org:zlalani/hm_ai_qc_report_tool.git /opt/hm-aiqc
cd /opt/hm-aiqc
git checkout develop
git log -1 --oneline   # should print: e772095 Phase 2: deploy machinery for Dev/Prod cutover

If git clone fails on auth, set up a deploy key on the server first: generate an SSH key on optical-dev and add its public half to Bitbucket as a read-only deploy key for this repo. This is the same key approach used for AI QC.


2 — Create the .env file

cp deploy/.env.dev.example .env

# Generate a real Flask SECRET_KEY
python3 -c 'import secrets; print(secrets.token_urlsafe(48))'
# Paste that value into SECRET_KEY=

# Fill in real LLM API keys
$EDITOR .env

Required keys to fill in (the .env.dev.example placeholders):

  • SECRET_KEY — from the python3 one-liner above
  • OPENAI_API_KEY — copy from sandbox optical-web-1:/opt/hm_ai_qc/hm_ai_qc_report_tool/.env
  • GOOGLE_API_KEY — same source
  • ANTHROPIC_API_KEY — same source
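
Before deploying, it can help to confirm that none of the four keys above was left blank. The helper below is a minimal sketch, not part of the repo: check_env_keys is a hypothetical name, and it only checks that each key has some value, so it cannot tell a real key from an unreplaced placeholder.

```shell
# Print every required key that is absent from, or empty in, an env file.
# Returns non-zero if at least one key is missing.
check_env_keys() {
    env_file=$1
    shift
    missing=0
    for key in "$@"; do
        # A key counts as set only if "KEY=" is followed by at least one character.
        if ! grep -Eq "^${key}=.+" "$env_file"; then
            echo "unset or empty: $key"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage from /opt/hm-aiqc would look like: check_env_keys .env SECRET_KEY OPENAI_API_KEY GOOGLE_API_KEY ANTHROPIC_API_KEY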

Confirm tenant/client IDs match the SSO plan:

grep "AZURE_" .env
# AZURE_TENANT_ID=e519c2e6-bc6d-4fdf-8d9c-923c2f002385
# AZURE_CLIENT_ID=9079054c-9620-4757-a256-23413042f1ef

3 — Drop in the Box config

The Box service-account JSON lives outside the repo (gitignored):

# On the dev server:
mkdir -p /opt/hm-aiqc/config

# From a machine that can SSH to both hosts (for example your laptop):
scp optical-web-1:/opt/hm_ai_qc/hm_ai_qc_report_tool/config/box_config.json \
    optical-dev.oliver.solutions:/opt/hm-aiqc/config/box_config.json

# Back on the dev server, lock it down:
chmod 600 /opt/hm-aiqc/config/box_config.json

If the direct host-to-host copy doesn't work, scp the file down to your laptop first, then scp it up to the dev server.
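
A quick way to confirm the copy landed and the chmod took effect is sketched below. is_private_file is a hypothetical helper, not something shipped in the repo; it reads the mode string from ls, so it only checks for exactly mode 600 on a regular file.

```shell
# Succeeds only if the file exists, is non-empty, and has mode 600
# (read/write for the owner, nothing for group or others).
is_private_file() {
    [ -s "$1" ] || return 1
    # The first 10 characters of `ls -l` output are the file type and mode bits.
    perms=$(ls -ld "$1" | cut -c1-10)
    [ "$perms" = "-rw-------" ]
}
```

Usage: is_private_file /opt/hm-aiqc/config/box_config.json || echo "box_config.json is missing, empty, or too open"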


4 — Run the deploy

cd /opt/hm-aiqc
./deploy.sh dev --dry-run    # preview — no changes
./deploy.sh dev              # actual deploy, prompts y/N to confirm

The deploy script will:

  1. Save current HEAD to .last_deploy_rollback (on the first run there is no earlier deploy, so the checkpoint is simply the initial clone HEAD; that's fine)
  2. git fetch + git reset --hard origin/develop
  3. docker compose build (first run pulls the python:3.11-slim base image, which is slow: ~2-5 min)
  4. docker compose up -d (entrypoint runs wait_for_db.py → flask db upgrade → gunicorn)
  5. Poll http://127.0.0.1:5050/health every 2s for up to 60s
  6. If /health returns 200 with {"status":"ok","db":true} → done
  7. If not → auto-rollback to the saved checkpoint and exit non-zero
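
Steps 5 to 7 amount to a bounded polling loop. The sketch below shows one way such a loop could look; it is not the code in deploy.sh, the function names are hypothetical, and it assumes curl is installed and that /health returns the compact JSON shown above.

```shell
# Succeeds when a /health body reports both status ok and db true.
# Plain substring matching on compact JSON; not a general JSON parser.
health_body_ok() {
    case $1 in
        *'"status":"ok"'*) ;;
        *) return 1 ;;
    esac
    case $1 in
        *'"db":true'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll URL ($1) every INTERVAL ($2) seconds until healthy or TIMEOUT ($3) seconds pass.
wait_for_health() {
    url=$1
    interval=$2
    timeout=$3
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -f makes curl exit non-zero on HTTP errors such as 503.
        if body=$(curl -fsS --max-time "$interval" "$url" 2>/dev/null) && health_body_ok "$body"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}
```

With the values from the list above, the call would be wait_for_health http://127.0.0.1:5050/health 2 60, and a non-zero return is the point at which a rollback would be triggered.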

Watch the logs in another terminal if you want to see the boot:

docker compose logs -f web

5 — Place the Apache config and reload

Find the existing optical-dev.oliver.solutions virtual host:

grep -rn "optical-dev.oliver.solutions" /etc/apache2/sites-available/

Open that file (likely /etc/apache2/sites-available/optical-dev.conf or similar) and paste the contents of /opt/hm-aiqc/deploy/apache-dev.conf inside the <VirtualHost *:443> block (the HTTPS one). Save.

Verify and reload:

sudo apache2ctl configtest          # should print "Syntax OK"
sudo systemctl reload apache2

If configtest complains about missing modules:

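sudo a2enmod proxy proxy_http headers rewrite
sudo apache2ctl configtest          # re-check now that the modules are enabled
sudo systemctl restart apache2      # a2enmod asks for a restart, not a reload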

6 — Smoke test the public URL

From your laptop:

curl -i https://optical-dev.oliver.solutions/hm-aiqc/health
# Expected: a 200 status line (HTTP/1.1 200 OK, or HTTP/2 200 if HTTP/2 is negotiated) and the body {"status":"ok","db":true}

If /health returns 200, open in a browser:

https://optical-dev.oliver.solutions/hm-aiqc/

You should be redirected to /auth/login-page. Click Sign in with Microsoft, complete the popup with your *.brandtech.plus or *.oliver.agency work account.

After login, run through each module to confirm:

| Module | Quick test |
|---|---|
| Reporting | Tab loads, "Previous Box Reports" populates |
| HM QC | Upload one image, run, confirm score |
| Video QC | Upload one short MP4, run, confirm score |
| Video Master | Enter a known campaign name, confirm matches preview |
| Campaigns | List loads (will be empty on fresh start) |
| Usage | Tab loads, your email appears in the user filter |

The Usage tab is the last item because it's the proof that g.user.email flows into usage_logs.user correctly — the whole point of Phase 1.


7 — If something goes wrong

Deploy script reports rollback

The script already restored the previous code state. Look at the container logs:

docker compose logs --tail=200 web

Most likely causes:

  • Missing env var → fix .env and re-run ./deploy.sh dev
  • Migration error → check flask db upgrade output in logs; restore DB if needed (see "Database backup" below)
  • Box config missing → confirm config/box_config.json exists

/health returns 503 (db: false)

Container is up but can't reach SQLite. Check that ./database volume is mounted and writable:

docker compose exec web ls -la /app/database
docker compose exec web touch /app/database/_test_write && docker compose exec web rm /app/database/_test_write

MSAL popup spins forever / "AADSTS50011" redirect URI mismatch

Confirm the URL in the address bar exactly matches the Entra-registered SPA URI:

  • Must be https://optical-dev.oliver.solutions/hm-aiqc/ (trailing slash)
  • Apache RewriteRule in apache-dev.conf adds the trailing slash if you visit /hm-aiqc without one — verify that 301 fires.
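
To verify the redirect without a browser, the response headers for the slash-less URL can be inspected from the command line. The sketch below is a hypothetical helper (redirect_summary is not part of the repo) that reduces raw response headers to the status code and Location value; it assumes the headers come from a single response, as curl -sI without -L produces.

```shell
# Read raw HTTP response headers on stdin and print "<status> <location>".
# The location part is empty when no Location header is present.
redirect_summary() {
    tr -d '\r' | awk '
        NR == 1 { status = $2 }                       # e.g. "HTTP/1.1 301 Moved Permanently"
        tolower($1) == "location:" { location = $2 }  # header names are case-insensitive
        END { print status, location }
    '
}
```

Usage from your laptop: curl -sI https://optical-dev.oliver.solutions/hm-aiqc | redirect_summary. A working rewrite should report status 301 with a location ending in /hm-aiqc/.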

"AADSTS70001: Application not found in tenant"

The user is signed into a different Microsoft tenant. They must use a *.brandtech.plus / *.oliver.agency work account, not a personal Microsoft account. Entra may also report this condition as AADSTS700016.

Manual rollback

cd /opt/hm-aiqc
./deploy/rollback.sh last      # back to the checkpoint
# or
./deploy/rollback.sh <sha>     # back to a specific commit

Database backup (before risky migrations)

cd /opt/hm-aiqc
mkdir -p database/backups    # .backup does not create the target directory
sqlite3 database/qc_platform.db ".backup database/backups/qc_platform_$(date +%F_%H%M).db"
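
A backup is only useful if it can be read back. The sketch below wraps the backup in a hypothetical helper (backup_sqlite is not part of the repo) that creates the target directory, takes the backup, and runs SQLite's PRAGMA integrity_check on the copy. It assumes the sqlite3 CLI is installed and that paths contain no single quotes.

```shell
# Back up SQLite database SRC ($1) into DEST_DIR ($2) under a timestamped name,
# verify the copy, and print its path. Returns non-zero on any failure.
backup_sqlite() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src" .db)_$(date +%F_%H%M).db"
    # .backup takes a consistent snapshot even if the database is in use.
    sqlite3 "$src" ".backup '$dest'" || return 1
    # integrity_check prints the single word "ok" for a healthy database.
    [ "$(sqlite3 "$dest" 'PRAGMA integrity_check;')" = "ok" ] || return 1
    echo "$dest"
}
```

Usage from /opt/hm-aiqc: backup_sqlite database/qc_platform.db database/backups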

8 — Hand-off

Once Dev is green and the smoke tests pass:

  1. Tell the team the URL: https://optical-dev.oliver.solutions/hm-aiqc/
  2. Don't decommission the sandbox yet (https://ai-sandbox.oliver.solutions/hm-ai-qc-report/). Leave it running until Prod is also live and has soaked for about a week.
  3. When ready to ship Prod: tag v3.0.0 on main, then repeat this runbook on optical-prod.oliver.solutions with ./deploy.sh prod v3.0.0 and deploy/apache-prod.conf. Same steps, different host.