Mirror of DEV_CUTOVER_RUNBOOK.md adjusted for optical-prod:
* Step 1 covers the merge-to-main + tag v3.0.0 done from the laptop before any server-side work begins.
* Reflects the gotchas we hit on Dev (per-repo Bitbucket Access Key, docker group membership, SSH alias, .env from .env.prod.example, Apache Include pattern).
* Adds post-cutover housekeeping: daily DB backup cron, disk-usage warning cron, and the sandbox decommission steps for optical-web-1 after the 1-week soak.
Prod cutover runbook — optical-prod.oliver.solutions
Self-contained "SSH in and paste these" instructions for deploying HM AI
QC to Prod. Phase 4 of deploy/DEV_PROD_MIGRATION_PLAN.md.
Estimated time: 30 minutes if everything works first try (the Dev gotchas we already hit are noted inline so this run should be smoother).
Pre-condition: Dev (https://optical-dev.oliver.solutions/hm-aiqc/)
has been green and validated end-to-end. All commits going to Prod are
already on develop.
0 — Prereqs
| Check | How |
|---|---|
| Entra redirect URIs | https://optical-prod.oliver.solutions/hm-aiqc/ is registered as an SPA URI on app 9079054c-9620-4757-a256-23413042f1ef (confirmed 2026-05-09). |
| Dev is healthy | curl -i https://optical-dev.oliver.solutions/hm-aiqc/health returns 200 with {"db":true,"status":"ok"}. |
| You have SSH access to optical-prod.oliver.solutions | ssh optical-prod.oliver.solutions logs you in. |
| You have permission to talk to Apache | sudo systemctl status apache2 works. |
If anything's missing, stop and resolve before proceeding.
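The health checks in this runbook all compare a small JSON body, and the key order of that body is not guaranteed. A minimal sketch of a helper that validates the body regardless of order is shown here; `health_ok` and `check_health` are hypothetical names, not part of the repo, and the sketch assumes `curl` and `grep` are available.

```bash
# health_ok BODY: succeed only if the JSON body reports both
# "status":"ok" and "db":true, in either key order.
health_ok() {
    health_body=$1
    printf '%s' "$health_body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"' &&
    printf '%s' "$health_body" | grep -Eq '"db"[[:space:]]*:[[:space:]]*true'
}

# check_health URL: fetch the endpoint (curl -f fails on HTTP >= 400)
# and validate the body. Hypothetical helper for illustration only.
check_health() {
    fetched=$(curl -sf --max-time 10 "$1") || return 1
    health_ok "$fetched"
}
```

For example, `check_health https://optical-dev.oliver.solutions/hm-aiqc/health && echo "Dev is healthy"` would cover the Dev prerequisite in one command.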
1 — From your laptop: merge to main and tag v3.0.0
cd /Users/nickviljoen/Desktop/HM_QC_Bitbucket/hm_ai_qc_report_tool
git fetch origin
git checkout main
git pull --ff-only origin main
git merge --no-ff origin/develop -m "Merge develop into main for v3.0.0 release"
git push origin main
git tag -a v3.0.0 -m "v3.0.0 — Azure AD SSO, Dev/Prod migration"
git push origin v3.0.0
Verify the tag landed:
git ls-remote --tags origin v3.0.0
# Should print one line: <sha> refs/tags/v3.0.0
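If you prefer a pass/fail answer over reading the output by eye, the "one line" check can be scripted. This is a sketch only; `tag_on_remote` is a hypothetical helper that reads `git ls-remote --tags` output on stdin.

```bash
# tag_on_remote TAG: read `git ls-remote --tags` output on stdin and
# succeed only if exactly one line names refs/tags/TAG with a hex object id.
tag_on_remote() {
    awk -v ref="refs/tags/$1" '
        $2 == ref && $1 ~ /^[0-9a-f]+$/ { n++ }
        END { exit (n == 1 ? 0 : 1) }
    '
}
```

Usage: `git ls-remote --tags origin v3.0.0 | tag_on_remote v3.0.0 || echo "v3.0.0 is not on origin yet"`.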
2 — SSH to Prod and prepare the deploy key
Bitbucket Access Keys are scoped per-repo, so the Dev server's key won't authorize Prod. Generate a fresh key for Prod, register it on the repo, and add an SSH config alias.
ssh optical-prod.oliver.solutions
# 1. Generate a fresh ed25519 key for this repo on this server
ssh-keygen -t ed25519 -f ~/.ssh/bitbucket_hm_aiqc \
-C "hm_aiqc deploy key (optical-prod)" -N ""
# 2. Show the public key — copy the whole line that starts with "ssh-ed25519 ..."
cat ~/.ssh/bitbucket_hm_aiqc.pub
# 3. Append the SSH alias (paste-safe one-liner)
printf '\nHost bitbucket-hm-aiqc\n HostName bitbucket.org\n User git\n IdentityFile ~/.ssh/bitbucket_hm_aiqc\n IdentitiesOnly yes\n' >> ~/.ssh/config
In your browser:
- Go to https://bitbucket.org/zlalani/hm_ai_qc_report_tool/admin/access-keys/
- Add key → Label: optical-prod (read-only) → paste the public key → Save
- Leave it as read-only (the deploy script never pushes from the server).
Verify on the prod server:
ssh -T git@bitbucket-hm-aiqc
# Expected: a line saying you're logged in (or 'authenticated via ssh key' message)
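One caveat with the `printf ... >> ~/.ssh/config` one-liner: running it a second time appends a duplicate Host block. A hedged sketch of an idempotent variant follows; `ensure_ssh_alias` is a hypothetical helper name, and the block it writes is the same one used above.

```bash
# ensure_ssh_alias CONFIG: append the bitbucket-hm-aiqc Host block to
# CONFIG unless a "Host bitbucket-hm-aiqc" line is already present.
ensure_ssh_alias() {
    ssh_cfg=$1
    if grep -Eq '^[[:space:]]*Host[[:space:]]+bitbucket-hm-aiqc[[:space:]]*$' "$ssh_cfg" 2>/dev/null; then
        return 0   # already configured; nothing to do
    fi
    printf '\nHost bitbucket-hm-aiqc\n HostName bitbucket.org\n User git\n IdentityFile ~/.ssh/bitbucket_hm_aiqc\n IdentitiesOnly yes\n' >> "$ssh_cfg"
}
```

Calling `ensure_ssh_alias ~/.ssh/config` is then safe to repeat if this step has to be re-run.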
3 — Clone the repo at /opt/hm-aiqc
sudo mkdir -p /opt/hm-aiqc
sudo chown "$(id -un)":"$(id -gn)" /opt/hm-aiqc
# IMPORTANT: pass the target path explicitly so the clone lands IN /opt/hm-aiqc, not in a subdirectory
git clone git@bitbucket-hm-aiqc:zlalani/hm_ai_qc_report_tool.git /opt/hm-aiqc
cd /opt/hm-aiqc
git checkout v3.0.0
git log -1 --oneline # should show the v3.0.0 tag commit
4 — Docker permissions (one-time)
If docker ps fails with "permission denied", add yourself to the
docker group. This was needed on Dev too — likely needed here.
docker ps # check if it works first
sudo usermod -aG docker $USER
exec newgrp docker # activate group in the current shell
docker ps # confirm — should now work
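The reason `usermod` alone is not enough is that a running shell keeps the group list it started with. The sketch below makes that distinction visible; `group_state` is a hypothetical helper, and it relies only on the standard `id` command.

```bash
# group_state GROUP: print
#   active   if the current shell process already holds GROUP,
#   pending  if the account is a member but this shell predates the change,
#   missing  if the account is not a member at all.
group_state() {
    wanted=$1
    if id -nG 2>/dev/null | tr ' ' '\n' | grep -Fxq "$wanted"; then
        echo active
    elif id -nG "$(id -un 2>/dev/null)" 2>/dev/null | tr ' ' '\n' | grep -Fxq "$wanted"; then
        echo pending
    else
        echo missing
    fi
}
```

`group_state docker` printing `pending` means the `usermod` step worked and only `exec newgrp docker` (or a fresh login) is still needed.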
5 — Create the .env file
cd /opt/hm-aiqc
cp deploy/.env.prod.example .env
# Generate a real Flask SECRET_KEY (DIFFERENT from the Dev key)
python3 -c 'import secrets; print(secrets.token_urlsafe(48))'
# Paste that value into SECRET_KEY=
# Edit and fill in the LLM API keys
$EDITOR .env
Required values:
- SECRET_KEY: fresh, from the python3 one-liner above (do NOT reuse Dev's)
- OPENAI_API_KEY
- GOOGLE_API_KEY
- ANTHROPIC_API_KEY
- BOX_CAMPAIGNS_FOLDER_ID=156182880490: already correct in the example, but double-check after copying
- ENVIRONMENT=production: already correct in the example
Confirm Azure config:
grep -E "AZURE_|ENVIRONMENT|BOX_" .env
# AZURE_TENANT_ID=e519c2e6-bc6d-4fdf-8d9c-923c2f002385
# AZURE_CLIENT_ID=9079054c-9620-4757-a256-23413042f1ef
# ENVIRONMENT=production
# BOX_REPORT_FOLDER_ID=133295752718
# BOX_CAMPAIGNS_FOLDER_ID=156182880490
Source the LLM keys from one of:
- The Dev server (/opt/hm-aiqc/.env on optical-dev) if Dev and Prod share keys (current plan).
- The sandbox (optical-web-1:/opt/hm_ai_qc/hm_ai_qc_report_tool/.env) as a backup source.
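The required values listed in this step can be checked mechanically before deploying, which catches an empty `KEY=` line left over from the example file. This is a minimal sketch; `missing_env_keys` is a hypothetical helper, and it only tests that each key has a non-empty value, not that the value is valid.

```bash
# missing_env_keys FILE: print each required key that is absent from FILE
# or has an empty value; return non-zero if any key is missing.
missing_env_keys() {
    env_file=$1
    any_missing=0
    for key in SECRET_KEY OPENAI_API_KEY GOOGLE_API_KEY ANTHROPIC_API_KEY \
               BOX_CAMPAIGNS_FOLDER_ID ENVIRONMENT; do
        # A key counts as set when "KEY=" is followed by at least one character.
        if ! grep -Eq "^${key}=.+" "$env_file"; then
            echo "$key"
            any_missing=1
        fi
    done
    return "$any_missing"
}
```

Running `missing_env_keys /opt/hm-aiqc/.env` should print nothing and exit 0 once the file is complete.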
6 — Drop in the Box service-account config
mkdir -p config
# Easiest: scp from your laptop. Replace <local-path> with where you have it.
scp <local-path>/box_config.json optical-prod.oliver.solutions:/opt/hm-aiqc/config/box_config.json
# On the prod server, lock it down:
chmod 600 /opt/hm-aiqc/config/box_config.json
Or copy from another server if scp-ing directly between servers is allowed:
scp optical-dev.oliver.solutions:/opt/hm-aiqc/config/box_config.json \
optical-prod.oliver.solutions:/opt/hm-aiqc/config/box_config.json
# Then, on the prod server, lock it down:
chmod 600 /opt/hm-aiqc/config/box_config.json
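Whichever copy route you use, it is worth confirming on the prod server that the file arrived and is locked down. A small sketch, assuming a POSIX `find`; `box_config_ok` is a hypothetical helper name.

```bash
# box_config_ok FILE: succeed if FILE is a non-empty regular file whose
# permission bits are exactly 600 (owner read/write only).
box_config_ok() {
    [ -s "$1" ] && [ -n "$(find "$1" -prune -type f -perm 600)" ]
}
```

Usage: `box_config_ok /opt/hm-aiqc/config/box_config.json || echo "box_config.json is missing, empty, or too permissive"`.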
7 — Run the deploy
cd /opt/hm-aiqc
./deploy.sh prod v3.0.0 --dry-run # preview
./deploy.sh prod v3.0.0 # actual deploy, prompts y/N to confirm
The script will:
- Save current HEAD to .last_deploy_rollback
- Auto-detect "no running container" and proceed (no --force needed)
- docker compose build (first run pulls python:3.11-slim; slow, ~2–5 min)
- docker compose up -d: entrypoint runs wait_for_db.py → flask db upgrade → gunicorn
- Poll http://127.0.0.1:5050/health for up to 60s
- If healthy → done. If not → auto-rollback (no-op on first deploy since there's no previous state, exits non-zero).
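To make the "poll for up to 60s" step concrete, here is a generic sketch of such a loop. It is not the code in deploy.sh, which this runbook does not show; `poll_until` is a hypothetical helper that retries any command until it succeeds or a deadline passes.

```bash
# poll_until TIMEOUT INTERVAL CMD...: run CMD every INTERVAL seconds until
# it succeeds (return 0) or TIMEOUT seconds have elapsed (return 1).
poll_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    deadline=$(( $(date +%s) + timeout_s ))
    while :; do
        "$@" >/dev/null 2>&1 && return 0
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep "$interval_s"
    done
}
```

The same behaviour can be reproduced by hand with `poll_until 60 2 curl -sf http://127.0.0.1:5050/health`, which is handy when debugging a container that is slow to boot.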
Watch the logs in another terminal if you want to see boot:
docker compose logs -f web
When done, sanity-check from the server itself:
curl -sf http://127.0.0.1:5050/health
# Expected: {"db":true,"status":"ok"}
8 — Place the Apache snippet and reload
Find the prod vhost:
grep -rln "optical-prod.oliver.solutions" /etc/apache2/sites-available/ /etc/apache2/sites-enabled/
Edit the file (likely /etc/apache2/sites-enabled/optical-prod.oliver.solutions.conf)
and add a single Include line inside the <VirtualHost *:80> block,
next to the existing per-app Includes (the Dev vhost has the same
pattern — match it):
Include /opt/hm-aiqc/deploy/apache-hm-aiqc.conf
Verify and reload:
sudo apache2ctl configtest # should print "Syntax OK"
sudo systemctl reload apache2
If configtest complains about missing modules, enable them once:
sudo a2enmod proxy proxy_http headers rewrite
sudo systemctl reload apache2
9 — Smoke test the public URL
From your laptop:
curl -i https://optical-prod.oliver.solutions/hm-aiqc/health
# Expected: HTTP/1.1 200 OK + {"db":true,"status":"ok"}
If /health returns 200, open in a browser:
https://optical-prod.oliver.solutions/hm-aiqc/
Expected flow:
- Redirect to /auth/login-page
- Sign in with Microsoft popup → complete with your work account
- Land on the Reporting page with your name in the top-right
Run the module-by-module checklist (same as Dev):
| Module | Quick test |
|---|---|
| Reporting | Tab loads, Box folder hint visible, search a known job number |
| HM QC | Upload one image, run, confirm score |
| Video QC | Upload one short MP4, run, confirm score |
| Video Master | Search a known campaign title, confirm preview |
| Campaigns | List loads (will be empty — fresh DB) |
| Usage | Your email appears in the user filter for the runs above (proof Phase 1's g.user.email threading works on Prod) |
10 — If something goes wrong
deploy.sh reports rollback
The script already restored the previous code state (or, on first deploy, has nothing to roll back to and just exited non-zero). Look at the container logs:
docker compose logs --tail=200 web
Most likely causes — these are the same as Dev:
- Missing env var → fix .env and re-run ./deploy.sh prod v3.0.0 --force
- Migration error → check flask db upgrade output in logs
- Box config missing → confirm config/box_config.json exists
/health returns 503 (db: false)
Container is up but can't reach SQLite. Check the volume mount:
docker compose exec web ls -la /app/database
docker compose exec web sh -c 'touch /app/database/_test_write && rm /app/database/_test_write'
MSAL popup → AADSTS50011 redirect URI mismatch
Confirm the URL in the address bar exactly matches the Entra-registered SPA URI:
- Must be https://optical-prod.oliver.solutions/hm-aiqc/ (trailing slash)
- The apache-hm-aiqc.conf RewriteRule adds the trailing slash if you visit /hm-aiqc without one; verify that 301 fires.
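One way to verify the 301 without a browser is to inspect the response headers for the slash-less URL. The sketch below assumes the redirect's Location header ends in `/hm-aiqc/`; `slash_redirect_ok` is a hypothetical helper that reads headers on stdin.

```bash
# slash_redirect_ok: read HTTP response headers (e.g. from `curl -sI`) on
# stdin; succeed if the status is 301 and Location ends in /hm-aiqc/.
slash_redirect_ok() {
    tr -d '\r' | awk '
        NR == 1 && $2 == "301" { status_ok = 1 }
        tolower($1) == "location:" && $2 ~ /\/hm-aiqc\/$/ { location_ok = 1 }
        END { exit (status_ok && location_ok ? 0 : 1) }
    '
}
```

Usage: `curl -sI https://optical-prod.oliver.solutions/hm-aiqc | slash_redirect_ok && echo "trailing-slash redirect OK"`.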
"AADSTS70001: Application not found in tenant"
The signed-in account is in a different Microsoft tenant. Use a
*.brandtech.plus / *.oliver.agency work account, not personal MS.
Manual rollback
cd /opt/hm-aiqc
./deploy/rollback.sh last # back to the checkpoint
# or
./deploy/rollback.sh <sha> # back to a specific commit
Database backup before risky operations
cd /opt/hm-aiqc
mkdir -p database/backups
sqlite3 database/qc_platform.db ".backup database/backups/qc_platform_$(date +%F_%H%M).db"
11 — Post-cutover (within 24 hours)
- Tell the team the new Prod URL: https://optical-prod.oliver.solutions/hm-aiqc/
- Don't decommission the sandbox yet. Leave https://ai-sandbox.oliver.solutions/hm-ai-qc-report/ running on optical-web-1 as a fallback.
- Set up daily DB backups (cron). The job creates database/backups if it does not exist yet and deletes backups older than 30 days:
    crontab -e
    # Add:
    0 2 * * * cd /opt/hm-aiqc && mkdir -p database/backups && sqlite3 database/qc_platform.db ".backup database/backups/qc_platform_$(date +\%F).db" && find database/backups -name "qc_platform_*.db" -mtime +30 -delete
- Add a weekly disk-usage warning (cron), which flags storage above 10G:
    0 9 * * 1 du -sh /opt/hm-aiqc/storage 2>/dev/null | awk '$1 ~ /[0-9]+G/ && $1+0 > 10 {print "WARN: storage "$1"; consider archiving"}' | mail -s "[hm-aiqc] storage size" you@oliver.agency
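The disk-usage one-liner matches on the "G" suffix of human-readable `du` output, so a size reported in T would slip through, and the pipeline hands off to `mail` whether or not there is a warning. A hedged alternative is to compare raw kilobytes in a small script; `storage_warning` is a hypothetical helper, and the 10 GiB limit simply mirrors the threshold above.

```bash
# storage_warning DIR LIMIT_GB: print a warning line when DIR uses more
# than LIMIT_GB gibibytes; print nothing otherwise.
storage_warning() {
    used_kb=$(du -sk "$1" 2>/dev/null | awk '{print $1}')
    [ -n "$used_kb" ] || return 0          # directory missing or unreadable
    limit_kb=$(( $2 * 1024 * 1024 ))
    if [ "$used_kb" -gt "$limit_kb" ]; then
        echo "WARN: storage $(( used_kb / 1024 / 1024 )) GiB exceeds ${2} GiB; consider archiving"
    fi
}
```

A cron wrapper could then send mail only when there is something to report, for example: `out=$(storage_warning /opt/hm-aiqc/storage 10); [ -n "$out" ] && printf '%s\n' "$out" | mail -s "[hm-aiqc] storage size" you@oliver.agency`.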
12 — Sandbox decommission (after ~1 week of Prod soak)
Only do this once Prod has been stable for at least a week and the team has confirmed no traffic still goes to the sandbox.
# On optical-web-1:
ssh optical-web-1.oliver.solutions
cd /opt/hm_ai_qc/hm_ai_qc_report_tool
# 1. Archive the database and storage to Box first
tar czf /tmp/hm_ai_qc_sandbox_archive_$(date +%F).tar.gz database storage
# Upload /tmp/hm_ai_qc_sandbox_archive_*.tar.gz to Box manually
# → /CAMPAIGNS/_archive/hm_ai_qc_sandbox/ (or wherever fits)
# 2. Stop and remove the container
docker compose down
# 3. Coordinate with whoever owns the sandbox vhost to remove the
# /hm-ai-qc-report Location block from
# optical-web-1's Apache config
Once the Apache block is removed and Apache reloaded, the sandbox URL returns 404 — decommission complete.
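Because `docker compose down` and the vhost removal are hard to undo, it may be worth confirming the archive is readable and complete before running them. A minimal sketch, assuming the archive was created with the `tar czf` command above; `archive_has_dirs` is a hypothetical helper.

```bash
# archive_has_dirs ARCHIVE DIR...: succeed only if the gzip-compressed
# tarball can be listed and contains every named top-level directory.
archive_has_dirs() {
    archive=$1
    shift
    listing=$(tar tzf "$archive" 2>/dev/null) || return 1
    for dir in "$@"; do
        printf '%s\n' "$listing" | grep -Eq "^(\./)?${dir}(/|\$)" || return 1
    done
}
```

Usage: `archive_has_dirs /tmp/hm_ai_qc_sandbox_archive_*.tar.gz database storage && echo "archive looks complete"` (this assumes the glob matches a single file).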
Done
Prod is live, sandbox archived, Phase 4 complete.