hm_ai_qc_report_tool/deploy/PROD_CUTOVER_RUNBOOK.md
nickviljoen 3f124318f9 Phase 4 prep: add Prod cutover runbook
Mirror of DEV_CUTOVER_RUNBOOK.md adjusted for optical-prod:
* Step 1 covers the merge-to-main + tag v3.0.0 done from the laptop
  before any server-side work begins.
* Reflects the gotchas we hit on Dev (per-repo Bitbucket Access Key,
  docker group membership, SSH alias, .env from .env.prod.example,
  Apache Include pattern).
* Adds post-cutover housekeeping: daily DB backup cron, disk-usage
  warning cron, and the sandbox decommission steps for optical-web-1
  after the 1-week soak.
2026-05-09 20:47:05 +02:00

Prod cutover runbook — optical-prod.oliver.solutions

Self-contained "SSH in and paste these" instructions for deploying HM AI QC to Prod. Phase 4 of deploy/DEV_PROD_MIGRATION_PLAN.md.

Estimated time: 30 minutes if everything works first try (the Dev gotchas we already hit are noted inline so this run should be smoother).

Pre-condition: Dev (https://optical-dev.oliver.solutions/hm-aiqc/) has been green and validated end-to-end. All commits going to Prod are already on develop.


0 — Prereqs

| Check | How |
|---|---|
| Entra redirect URIs | https://optical-prod.oliver.solutions/hm-aiqc/ is registered as an SPA URI on app 9079054c-9620-4757-a256-23413042f1ef (confirmed 2026-05-09). |
| Dev is healthy | curl -i https://optical-dev.oliver.solutions/hm-aiqc/health returns 200 with {"db":true,"status":"ok"}. |
| You have SSH access to optical-prod.oliver.solutions | ssh optical-prod.oliver.solutions lands you in. |
| You have permission to talk to Apache | sudo systemctl status apache2 works. |

If anything's missing, stop and resolve before proceeding.
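
The "Dev is healthy" check can be scripted so the go/no-go decision does not depend on reading curl output by eye. A minimal sketch, assuming the /health body is the compact JSON shown in this runbook; health_ok is a hypothetical helper name, not something that exists in the repo:

```shell
# health_ok BODY: hypothetical helper. Succeeds only when the /health body
# contains both "status":"ok" and "db":true (compact JSON, no spaces).
health_ok() {
  case $1 in
    *'"status":"ok"'*) ;;
    *) return 1 ;;
  esac
  case $1 in
    *'"db":true'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (not run here):
#   health_ok "$(curl -sf https://optical-dev.oliver.solutions/hm-aiqc/health)" \
#     && echo "Dev is healthy" || echo "Dev is NOT healthy, stop"
```

The substring match is deliberately simple; if the app ever pretty-prints its JSON, a real parser would be the safer choice.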


1 — From your laptop: merge to main and tag v3.0.0

cd /Users/nickviljoen/Desktop/HM_QC_Bitbucket/hm_ai_qc_report_tool

git fetch origin
git checkout main
git pull --ff-only origin main
git merge --no-ff origin/develop -m "Merge develop into main for v3.0.0 release"
git push origin main

git tag -a v3.0.0 -m "v3.0.0 — Azure AD SSO, Dev/Prod migration"
git push origin v3.0.0

Verify the tag landed:

git ls-remote --tags origin v3.0.0
# Should print one line: <sha>  refs/tags/v3.0.0

2 — SSH to Prod and prepare the deploy key

Bitbucket Access Keys are scoped per-repo, so the Dev server's key won't authorize Prod. Generate a fresh key for Prod, register it on the repo, and add an SSH config alias.

ssh optical-prod.oliver.solutions

# 1. Generate a fresh ed25519 key for this repo on this server
ssh-keygen -t ed25519 -f ~/.ssh/bitbucket_hm_aiqc \
    -C "hm_aiqc deploy key (optical-prod)" -N ""

# 2. Show the public key — copy the whole line that starts with "ssh-ed25519 ..."
cat ~/.ssh/bitbucket_hm_aiqc.pub

# 3. Append the SSH alias (paste-safe one-liner)
printf '\nHost bitbucket-hm-aiqc\n    HostName bitbucket.org\n    User git\n    IdentityFile ~/.ssh/bitbucket_hm_aiqc\n    IdentitiesOnly yes\n' >> ~/.ssh/config

In your browser:

  1. Go to https://bitbucket.org/zlalani/hm_ai_qc_report_tool/admin/access-keys/
  2. Add key → Label: optical-prod (read-only) → paste the public key → Save
  3. Leave it as read-only (the deploy script never pushes from the server).

Verify on the prod server:

ssh -T git@bitbucket-hm-aiqc
# Expected: a line saying you're logged in (or 'authenticated via ssh key' message)
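
The printf one-liner in step 3 above appends unconditionally, so pasting it twice leaves two identical Host blocks in ~/.ssh/config. A small sketch of an idempotent variant, assuming the same alias block; add_host_alias is a hypothetical helper name:

```shell
# add_host_alias CONFIG_FILE: hypothetical helper. Appends the bitbucket-hm-aiqc
# alias block only when no "Host bitbucket-hm-aiqc" line is present yet.
add_host_alias() {
  aha_config=$1
  if grep -q '^Host bitbucket-hm-aiqc$' "$aha_config" 2>/dev/null; then
    return 0    # alias already configured, nothing to do
  fi
  printf '\nHost bitbucket-hm-aiqc\n    HostName bitbucket.org\n    User git\n    IdentityFile ~/.ssh/bitbucket_hm_aiqc\n    IdentitiesOnly yes\n' >> "$aha_config"
}

# Usage (not run here):
#   add_host_alias ~/.ssh/config
```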

3 — Clone the repo at /opt/hm-aiqc

sudo mkdir -p /opt/hm-aiqc
sudo chown "$(id -un)":"$(id -gn)" /opt/hm-aiqc    # primary group may differ from the username

# IMPORTANT: pass the target path explicitly so the clone lands in /opt/hm-aiqc itself, not in a subdir
git clone git@bitbucket-hm-aiqc:zlalani/hm_ai_qc_report_tool.git /opt/hm-aiqc
cd /opt/hm-aiqc

git checkout v3.0.0
git log -1 --oneline      # should show the v3.0.0 tag commit

4 — Docker permissions (one-time)

If docker ps fails with "permission denied", add yourself to the docker group. This was needed on Dev too — likely needed here.

docker ps                                # check if it works first
# Only if the command above failed with "permission denied":
sudo usermod -aG docker "$USER"
exec newgrp docker                       # activate group in the current shell
docker ps                                # confirm: should now work
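
To avoid re-adding yourself to the group on a repeat run, the membership check can be made explicit. A minimal sketch; has_group is a hypothetical helper that takes the group list as an argument so it can be tried without touching the system:

```shell
# has_group GROUP "LIST": hypothetical helper. Succeeds when GROUP appears as
# a whole word in the space-separated LIST (for example the output of id -nG).
has_group() {
  for hg_name in $2; do          # unquoted on purpose: split LIST on whitespace
    if [ "$hg_name" = "$1" ]; then
      return 0
    fi
  done
  return 1
}

# Usage (not run here):
#   has_group docker "$(id -nG)" || sudo usermod -aG docker "$USER"
```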

5 — Create the .env file

cd /opt/hm-aiqc
cp deploy/.env.prod.example .env

# Generate a real Flask SECRET_KEY (DIFFERENT from the Dev key)
python3 -c 'import secrets; print(secrets.token_urlsafe(48))'
# Paste that value into SECRET_KEY=

# Edit and fill in the LLM API keys
$EDITOR .env

Required values:

  • SECRET_KEY — fresh, from the python3 one-liner above (do NOT reuse Dev's)
  • OPENAI_API_KEY
  • GOOGLE_API_KEY
  • ANTHROPIC_API_KEY
  • BOX_CAMPAIGNS_FOLDER_ID=156182880490 — already correct in the example, but double-check after copying
  • ENVIRONMENT=production — already correct in the example
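
The required values above can be checked mechanically before deploying. A minimal sketch, assuming .env uses plain KEY=value lines with no quoting or export prefixes; missing_env_keys is a hypothetical helper name:

```shell
# missing_env_keys FILE KEY...: hypothetical helper. Prints every KEY that is
# absent from FILE or present with an empty value, one per line.
missing_env_keys() {
  mek_file=$1
  shift
  for mek_key in "$@"; do
    if ! grep -Eq "^${mek_key}=.+" "$mek_file"; then
      printf '%s\n' "$mek_key"
    fi
  done
}

# Usage (not run here):
#   missing_env_keys /opt/hm-aiqc/.env SECRET_KEY OPENAI_API_KEY GOOGLE_API_KEY \
#     ANTHROPIC_API_KEY BOX_CAMPAIGNS_FOLDER_ID ENVIRONMENT
```

Empty output means every listed key has some value. It does not detect placeholder values left over from the example file, so a visual pass over .env is still worthwhile.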

Confirm Azure config:

grep -E "AZURE_|ENVIRONMENT|BOX_" .env
# AZURE_TENANT_ID=e519c2e6-bc6d-4fdf-8d9c-923c2f002385
# AZURE_CLIENT_ID=9079054c-9620-4757-a256-23413042f1ef
# ENVIRONMENT=production
# BOX_REPORT_FOLDER_ID=133295752718
# BOX_CAMPAIGNS_FOLDER_ID=156182880490

Source the LLM keys from one of:

  • The Dev server (/opt/hm-aiqc/.env on optical-dev) if Dev and Prod share keys (current plan).
  • The sandbox (optical-web-1:/opt/hm_ai_qc/hm_ai_qc_report_tool/.env) as a backup source.

6 — Drop in the Box service-account config

# On the prod server:
mkdir -p /opt/hm-aiqc/config

# Easiest: scp from your laptop. Replace <local-path> with where you have it.
scp <local-path>/box_config.json optical-prod.oliver.solutions:/opt/hm-aiqc/config/box_config.json

# On the prod server, lock it down:
chmod 600 /opt/hm-aiqc/config/box_config.json

Or copy from another server if scp-ing directly between servers is allowed:

scp optical-dev.oliver.solutions:/opt/hm-aiqc/config/box_config.json \
    optical-prod.oliver.solutions:/opt/hm-aiqc/config/box_config.json
# On the prod server, lock it down:
chmod 600 /opt/hm-aiqc/config/box_config.json
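
Whichever copy route was used, it is worth confirming that the file arrived and is locked down. A minimal sketch, assuming GNU stat (as found on typical Linux servers); is_private_file is a hypothetical helper name:

```shell
# is_private_file PATH: hypothetical helper. Succeeds when PATH is a regular
# file whose permission bits are exactly 600. Relies on GNU stat's -c option.
is_private_file() {
  [ -f "$1" ] && [ "$(stat -c '%a' "$1")" = "600" ]
}

# Usage (not run here):
#   is_private_file /opt/hm-aiqc/config/box_config.json || echo "fix permissions"
```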

7 — Run the deploy

cd /opt/hm-aiqc
./deploy.sh prod v3.0.0 --dry-run    # preview
./deploy.sh prod v3.0.0              # actual deploy, prompts y/N to confirm

The script will:

  1. Save current HEAD to .last_deploy_rollback
  2. Auto-detect "no running container" and proceed (no --force needed)
  3. docker compose build (first run pulls python:3.11-slim — slow, ~25 min)
  4. docker compose up -d (entrypoint runs wait_for_db.py → flask db upgrade → gunicorn)
  5. Poll http://127.0.0.1:5050/health for up to 60s
  6. If healthy → done. If not → auto-rollback (no-op on first deploy since there's no previous state, exits non-zero).
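
For reference, the health poll in step 5 can be approximated by hand if you ever need to wait on the container outside the script. This is a sketch of the general technique, not the code in deploy.sh; wait_for is a hypothetical helper name:

```shell
# wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND...: hypothetical helper.
# Retries COMMAND every INTERVAL seconds; gives up once TIMEOUT is reached.
wait_for() {
  wf_timeout=$1
  wf_interval=$2
  shift 2
  wf_elapsed=0
  while ! "$@"; do
    wf_elapsed=$((wf_elapsed + wf_interval))
    if [ "$wf_elapsed" -ge "$wf_timeout" ]; then
      return 1                  # still failing when the time budget ran out
    fi
    sleep "$wf_interval"
  done
  return 0
}

# Usage (not run here): poll the local health endpoint for up to 60s
#   wait_for 60 2 curl -sf -o /dev/null http://127.0.0.1:5050/health
```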

Watch the logs in another terminal if you want to see boot:

docker compose logs -f web

When done, sanity-check from the server itself:

curl -sf http://127.0.0.1:5050/health
# Expected: {"db":true,"status":"ok"}

8 — Place the Apache snippet and reload

Find the prod vhost:

grep -rln "optical-prod.oliver.solutions" /etc/apache2/sites-available/ /etc/apache2/sites-enabled/

Edit the file (likely /etc/apache2/sites-enabled/optical-prod.oliver.solutions.conf) and add a single Include line inside the <VirtualHost *:80> block, next to the existing per-app Includes (the Dev vhost has the same pattern — match it):

    Include /opt/hm-aiqc/deploy/apache-hm-aiqc.conf

Verify and reload:

sudo apache2ctl configtest          # should print "Syntax OK"
sudo systemctl reload apache2

If configtest complains about missing modules, enable them once:

sudo a2enmod proxy proxy_http headers rewrite
sudo systemctl reload apache2

9 — Smoke test the public URL

From your laptop:

curl -i https://optical-prod.oliver.solutions/hm-aiqc/health
# Expected: HTTP/1.1 200 OK + {"db":true,"status":"ok"}

If /health returns 200, open in a browser:

https://optical-prod.oliver.solutions/hm-aiqc/

Expected flow:

  1. Redirect to /auth/login-page
  2. Sign in with Microsoft popup → complete with your work account
  3. Land on the Reporting page with your name in the top-right

Run the module-by-module checklist (same as Dev):

| Module | Quick test |
|---|---|
| Reporting | Tab loads, Box folder hint visible, search a known job number |
| HM QC | Upload one image, run, confirm score |
| Video QC | Upload one short MP4, run, confirm score |
| Video Master | Search a known campaign title, confirm preview |
| Campaigns | List loads (will be empty, fresh DB) |
| Usage | Your email appears in the user filter for the runs above (proof Phase 1's g.user.email threading works on Prod) |

10 — If something goes wrong

deploy.sh reports rollback

The script already restored the previous code state (or, on first deploy, has nothing to roll back to and just exited non-zero). Look at the container logs:

docker compose logs --tail=200 web

Most likely causes — these are the same as Dev:

  • Missing env var → fix .env and re-run ./deploy.sh prod v3.0.0 --force
  • Migration error → check flask db upgrade output in logs
  • Box config missing → confirm config/box_config.json exists

/health returns 503 (db: false)

Container is up but can't reach SQLite. Check the volume mount:

docker compose exec web ls -la /app/database
docker compose exec web sh -c 'touch /app/database/_test_write && rm /app/database/_test_write'

MSAL popup → AADSTS50011 redirect URI mismatch

Confirm the URL in the address bar exactly matches the Entra-registered SPA URI:

  • Must be https://optical-prod.oliver.solutions/hm-aiqc/ (trailing slash)
  • The apache-hm-aiqc.conf RewriteRule adds the trailing slash if you visit /hm-aiqc without one — verify that 301 fires.
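
One way to verify that the 301 fires is to inspect the raw response headers for the slash-less URL. A minimal sketch that parses headers from stdin; redirect_target is a hypothetical helper name:

```shell
# redirect_target: hypothetical helper. Reads raw HTTP response headers on
# stdin; prints the Location value and succeeds only if the status is 301.
redirect_target() {
  awk '
    NR == 1 && $2 != "301" { exit 1 }
    tolower($1) == "location:" { sub(/\r$/, "", $2); print $2; found = 1 }
    END { exit !found }
  '
}

# Usage (not run here):
#   curl -sI https://optical-prod.oliver.solutions/hm-aiqc | redirect_target
#   # should print https://optical-prod.oliver.solutions/hm-aiqc/
```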

"AADSTS70001: Application not found in tenant"

The signed-in account is in a different Microsoft tenant. Use a *.brandtech.plus / *.oliver.agency work account, not a personal Microsoft account.

Manual rollback

cd /opt/hm-aiqc
./deploy/rollback.sh last      # back to the checkpoint
# or
./deploy/rollback.sh <sha>     # back to a specific commit

Database backup before risky operations

cd /opt/hm-aiqc
mkdir -p database/backups
sqlite3 database/qc_platform.db ".backup database/backups/qc_platform_$(date +%F_%H%M).db"
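
A backup is only useful if it opens cleanly. As a hedged sketch, SQLite's built-in PRAGMA integrity_check can be run against the copy before starting the risky operation; backup_ok is a hypothetical helper name:

```shell
# backup_ok FILE: hypothetical helper. Succeeds when SQLite's integrity check
# on FILE reports exactly "ok".
backup_ok() {
  [ "$(sqlite3 "$1" 'PRAGMA integrity_check;' 2>/dev/null)" = "ok" ]
}

# Usage (not run here), with <timestamp> standing in for the generated name:
#   backup_ok database/backups/qc_platform_<timestamp>.db || echo "backup is damaged"
```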

11 — Post-cutover (within 24 hours)

  • Tell the team the new Prod URL: https://optical-prod.oliver.solutions/hm-aiqc/
  • Don't decommission the sandbox yet. Leave https://ai-sandbox.oliver.solutions/hm-ai-qc-report/ running on optical-web-1 as a fallback.
  • Set up daily DB backups (cron):

    crontab -e
    # Add:
    0 2 * * * cd /opt/hm-aiqc && sqlite3 database/qc_platform.db ".backup database/backups/qc_platform_$(date +\%F).db" && find database/backups -name "qc_platform_*.db" -mtime +30 -delete

  • Add a weekly disk-usage warning (cron):

    0 9 * * 1 du -sh /opt/hm-aiqc/storage 2>/dev/null | awk '$1 ~ /[0-9]+G/ && $1+0 > 10 {print "WARN: storage "$1" — consider archiving"}' | mail -s "[hm-aiqc] storage size" you@oliver.agency
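
The disk-usage check above parses the human-readable output of du -sh, which only matches sizes printed with a G suffix. A possible alternative, sketched here under the assumption that a 10 GB limit is what is wanted, compares kilobyte counts from du -sk instead; storage_warning is a hypothetical helper name:

```shell
# storage_warning USED_KB LIMIT_GB: hypothetical helper. Prints a warning line
# when USED_KB exceeds LIMIT_GB; prints nothing otherwise.
storage_warning() {
  sw_used_kb=$1
  sw_limit_gb=$2
  sw_limit_kb=$((sw_limit_gb * 1024 * 1024))
  if [ "$sw_used_kb" -gt "$sw_limit_kb" ]; then
    echo "WARN: storage $((sw_used_kb / 1024 / 1024))G exceeds ${sw_limit_gb}G, consider archiving"
  fi
}

# Usage (not run here):
#   storage_warning "$(du -sk /opt/hm-aiqc/storage | cut -f1)" 10
```

Because the comparison is numeric, sizes that du -sh would print in T are caught as well.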

12 — Sandbox decommission (after ~1 week of Prod soak)

Only do this once Prod has been stable for at least a week and the team has confirmed no traffic still goes to the sandbox.

# On optical-web-1:
ssh optical-web-1.oliver.solutions
cd /opt/hm_ai_qc/hm_ai_qc_report_tool

# 1. Archive the database and storage to Box first
tar czf /tmp/hm_ai_qc_sandbox_archive_$(date +%F).tar.gz database storage
# Upload /tmp/hm_ai_qc_sandbox_archive_*.tar.gz to Box manually
#   → /CAMPAIGNS/_archive/hm_ai_qc_sandbox/  (or wherever fits)

# 2. Stop and remove the container
docker compose down

# 3. Coordinate with whoever owns the sandbox vhost to remove the
#    /hm-ai-qc-report Location block from
#    optical-web-1's Apache config

Once the Apache block is removed and Apache reloaded, the sandbox URL returns 404 — decommission complete.
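
Before running docker compose down in step 2 above, it may be worth confirming that the archive from step 1 is readable and contains both directories. A minimal sketch; archive_has is a hypothetical helper name:

```shell
# archive_has ARCHIVE DIR: hypothetical helper. Succeeds when the gzipped tar
# ARCHIVE can be listed and contains at least one entry under DIR/.
archive_has() {
  tar tzf "$1" 2>/dev/null | grep -q "^$2/"
}

# Usage (not run here), with <date> standing in for the generated name:
#   archive_has /tmp/hm_ai_qc_sandbox_archive_<date>.tar.gz database \
#     && archive_has /tmp/hm_ai_qc_sandbox_archive_<date>.tar.gz storage \
#     && echo "archive looks complete"
```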


Done

Prod is live, sandbox archived, Phase 4 complete.