# Prod cutover runbook — `optical-prod.oliver.solutions`

Self-contained "SSH in and paste these" instructions for deploying HM AI QC to Prod. Phase 4 of `deploy/DEV_PROD_MIGRATION_PLAN.md`.

**Estimated time:** 30 minutes if everything works first try (the Dev gotchas we already hit are noted inline so this run should be smoother).

**Pre-condition:** Dev (`https://optical-dev.oliver.solutions/hm-aiqc/`) has been green and validated end-to-end. All commits going to Prod are already on `develop`.

---

## 0 — Prereqs

| Check | How |
|---|---|
| Entra redirect URIs | `https://optical-prod.oliver.solutions/hm-aiqc/` is registered as an SPA URI on app `9079054c-9620-4757-a256-23413042f1ef` (confirmed 2026-05-09). |
| Dev is healthy | `curl -i https://optical-dev.oliver.solutions/hm-aiqc/health` returns 200 with `{"db":true,"status":"ok"}`. |
| You have SSH access to `optical-prod.oliver.solutions` | `ssh optical-prod.oliver.solutions` lands you in. |
| You have permission to talk to Apache | `sudo systemctl status apache2` works. |

If anything's missing, stop and resolve before proceeding.

---

## 1 — From your laptop: merge to main and tag v3.0.0

```bash
cd /Users/nickviljoen/Desktop/HM_QC_Bitbucket/hm_ai_qc_report_tool
git fetch origin
git checkout main
git pull --ff-only origin main
git merge --no-ff origin/develop -m "Merge develop into main for v3.0.0 release"
git push origin main
git tag -a v3.0.0 -m "v3.0.0 — Azure AD SSO, Dev/Prod migration"
git push origin v3.0.0
```

Verify the tag landed:

```bash
git ls-remote --tags origin v3.0.0
# Should print one line ending in refs/tags/v3.0.0
```

---

## 2 — SSH to Prod and prepare the deploy key

Bitbucket Access Keys are scoped per-repo, so the Dev server's key won't authorize Prod. Generate a fresh key for Prod, register it on the repo, and add an SSH config alias.

```bash
ssh optical-prod.oliver.solutions

# 1. Generate a fresh ed25519 key for this repo on this server
ssh-keygen -t ed25519 -f ~/.ssh/bitbucket_hm_aiqc \
  -C "hm_aiqc deploy key (optical-prod)" -N ""

# 2. Show the public key and copy the whole line that starts with "ssh-ed25519 ..."
cat ~/.ssh/bitbucket_hm_aiqc.pub

# 3. Append the SSH alias (paste-safe one-liner)
printf '\nHost bitbucket-hm-aiqc\n  HostName bitbucket.org\n  User git\n  IdentityFile ~/.ssh/bitbucket_hm_aiqc\n  IdentitiesOnly yes\n' >> ~/.ssh/config
```

In your browser:

1. Go to **https://bitbucket.org/zlalani/hm_ai_qc_report_tool/admin/access-keys/**
2. **Add key** → Label: `optical-prod (read-only)` → paste the public key → Save
3. Leave it as read-only (the deploy script never pushes from the server).

Verify on the prod server:

```bash
ssh -T git@bitbucket-hm-aiqc
# Expected: a line saying you're logged in (or 'authenticated via ssh key' message)
```

---

## 3 — Clone the repo at /opt/hm-aiqc

```bash
sudo mkdir -p /opt/hm-aiqc
sudo chown "$(id -un):$(id -gn)" /opt/hm-aiqc

# IMPORTANT: pass the explicit target path so the clone lands IN /opt/hm-aiqc, not in a subdir
git clone git@bitbucket-hm-aiqc:zlalani/hm_ai_qc_report_tool.git /opt/hm-aiqc
cd /opt/hm-aiqc
git checkout v3.0.0
git log -1 --oneline   # should show the v3.0.0 tag commit
```

---

## 4 — Docker permissions (one-time)

If `docker ps` fails with "permission denied", add yourself to the `docker` group. This was needed on Dev too, so it is likely needed here.

```bash
docker ps                        # check first; skip the next two lines if this works
sudo usermod -aG docker "$USER"
exec newgrp docker               # activate the group in the current shell
docker ps                        # confirm: should now work
```

---

## 5 — Create the .env file

```bash
cd /opt/hm-aiqc
cp deploy/.env.prod.example .env

# Generate a real Flask SECRET_KEY (DIFFERENT from the Dev key)
python3 -c 'import secrets; print(secrets.token_urlsafe(48))'
# Paste that value into SECRET_KEY=

# Edit and fill in the LLM API keys (falls back to vi if $EDITOR is unset)
"${EDITOR:-vi}" .env
```

Required values:

- `SECRET_KEY`: fresh, from the python3 one-liner above (do NOT reuse Dev's)
- `OPENAI_API_KEY`
- `GOOGLE_API_KEY`
- `ANTHROPIC_API_KEY`
- `BOX_CAMPAIGNS_FOLDER_ID=156182880490`: already correct in the example, but double-check after copying
- `ENVIRONMENT=production`: already correct in the example

Confirm Azure config:

```bash
grep -E "AZURE_|ENVIRONMENT|BOX_" .env
# AZURE_TENANT_ID=e519c2e6-bc6d-4fdf-8d9c-923c2f002385
# AZURE_CLIENT_ID=9079054c-9620-4757-a256-23413042f1ef
# ENVIRONMENT=production
# BOX_REPORT_FOLDER_ID=133295752718
# BOX_CAMPAIGNS_FOLDER_ID=156182880490
```

Source the LLM keys from one of:

- The Dev server (`/opt/hm-aiqc/.env` on `optical-dev`) if Dev and Prod share keys (current plan).
- The sandbox (`optical-web-1:/opt/hm_ai_qc/hm_ai_qc_report_tool/.env`) as a backup source.

---

## 6 — Drop in the Box service-account config

```bash
# On the prod server, in /opt/hm-aiqc:
mkdir -p config

# Easiest: scp from your laptop. Replace /path/to with where you have it.
scp /path/to/box_config.json optical-prod.oliver.solutions:/opt/hm-aiqc/config/box_config.json

# On the prod server, lock it down:
chmod 600 /opt/hm-aiqc/config/box_config.json
```

Or copy from another server if scp-ing directly between servers is allowed:

```bash
scp optical-dev.oliver.solutions:/opt/hm-aiqc/config/box_config.json \
    optical-prod.oliver.solutions:/opt/hm-aiqc/config/box_config.json

# Then, on the prod server:
chmod 600 /opt/hm-aiqc/config/box_config.json
```

---

## 7 — Run the deploy

```bash
cd /opt/hm-aiqc
./deploy.sh prod v3.0.0 --dry-run    # preview
./deploy.sh prod v3.0.0              # actual deploy, prompts y/N to confirm
```

The script will:

1. Save current HEAD to `.last_deploy_rollback`
2. Auto-detect "no running container" and proceed (no `--force` needed)
3. `docker compose build` (the first run pulls `python:3.11-slim`, which is slow: ~2–5 min)
4. `docker compose up -d`: the entrypoint runs `wait_for_db.py` → `flask db upgrade` → gunicorn
5. Poll `http://127.0.0.1:5050/health` for up to 60s
6. If healthy → done. If not → auto-rollback (on first deploy there is no previous state, so this is a no-op and the script exits non-zero).

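
Step 5 in the list above (poll `/health` for up to 60s) can be reproduced by hand if you want to re-check health outside the deploy script. The following is a minimal sketch, not the actual code in `deploy.sh`: the function name `wait_for_health`, its defaults, and the 5-second per-request timeout are all assumptions made for illustration.

```bash
# Sketch only: the real deploy.sh may implement its health poll differently.
# wait_for_health is a hypothetical helper name.
wait_for_health() {
  local url="$1"            # e.g. http://127.0.0.1:5050/health
  local timeout="${2:-60}"  # total seconds of waiting before giving up
  local interval="${3:-2}"  # seconds to sleep between attempts (must be >= 1)
  local waited=0            # counts sleep time only, so the bound is approximate

  while [ "$waited" -lt "$timeout" ]; do
    # -s silences progress output, -f makes curl exit non-zero on HTTP >= 400
    if curl -sf --max-time 5 "$url" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1
}

# Usage (commented out so that pasting the function has no side effects):
# wait_for_health http://127.0.0.1:5050/health 60 || docker compose logs --tail=50 web
```

The function returns 0 as soon as one request succeeds and 1 once the waiting budget is spent, so it can gate a follow-up command with `&&` or `||`.
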
Watch the logs in another terminal if you want to see boot:

```bash
docker compose logs -f web
```

When done, sanity-check from the server itself:

```bash
curl -sf http://127.0.0.1:5050/health
# Expected: {"db":true,"status":"ok"}
```

---

## 8 — Place the Apache snippet and reload

Find the prod vhost:

```bash
grep -rln "optical-prod.oliver.solutions" /etc/apache2/sites-available/ /etc/apache2/sites-enabled/
```

Edit the file (likely `/etc/apache2/sites-enabled/optical-prod.oliver.solutions.conf`) and add a single Include line inside the `<VirtualHost>` block, next to the existing per-app Includes (the Dev vhost has the same pattern, so match it):

```apache
Include /opt/hm-aiqc/deploy/apache-hm-aiqc.conf
```

Verify and reload:

```bash
sudo apache2ctl configtest    # should print "Syntax OK"
sudo systemctl reload apache2
```

If `configtest` complains about missing modules, enable them once:

```bash
sudo a2enmod proxy proxy_http headers rewrite
sudo systemctl reload apache2
```

---

## 9 — Smoke test the public URL

From your **laptop**:

```bash
curl -i https://optical-prod.oliver.solutions/hm-aiqc/health
# Expected: HTTP/1.1 200 OK + {"db":true,"status":"ok"}
```

If `/health` returns 200, open in a browser:

```
https://optical-prod.oliver.solutions/hm-aiqc/
```

Expected flow:

1. Redirect to `/auth/login-page`
2. **Sign in with Microsoft** popup → complete with your work account
3. Land on the Reporting page with your name in the top-right

Run the module-by-module checklist (same as Dev):

| Module | Quick test |
|---|---|
| Reporting | Tab loads, Box folder hint visible, search a known job number |
| HM QC | Upload one image, run, confirm score |
| Video QC | Upload one short MP4, run, confirm score |
| Video Master | Search a known campaign title, confirm preview |
| Campaigns | List loads (will be empty on a fresh DB) |
| Usage | **Your email appears in the user filter** for the runs above (proof Phase 1's `g.user.email` threading works on Prod) |

---

## 10 — If something goes wrong

### deploy.sh reports rollback

The script already restored the previous code state (or, on first deploy, had nothing to roll back to and just exited non-zero). Look at the container logs:

```bash
docker compose logs --tail=200 web
```

Most likely causes (the same as on Dev):

- Missing env var → fix `.env` and re-run `./deploy.sh prod v3.0.0 --force`
- Migration error → check `flask db upgrade` output in logs
- Box config missing → confirm `config/box_config.json` exists

### `/health` returns 503 (db: false)

Container is up but can't reach SQLite. Check that the volume mount exists and is writable:

```bash
docker compose exec web ls -la /app/database
docker compose exec web sh -c 'touch /app/database/_test_write && rm /app/database/_test_write'
```

### MSAL popup → AADSTS50011 redirect URI mismatch

Confirm the URL in the address bar exactly matches the Entra-registered SPA URI:

- Must be `https://optical-prod.oliver.solutions/hm-aiqc/` (trailing slash)
- The `apache-hm-aiqc.conf` `RewriteRule` adds the trailing slash if you visit `/hm-aiqc` without one. Verify that the 301 fires.

### "AADSTS70001: Application not found in tenant"

The signed-in account is in a different Microsoft tenant. Use a `*.brandtech.plus` / `*.oliver.agency` work account, not a personal Microsoft account.

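
For the AADSTS50011 trailing-slash check described above, one way to confirm from the command line that the 301 fires is to ask curl for the status code and redirect target without following the redirect. This is a sketch under assumptions: `check_slash_redirect` is a hypothetical helper name, and it assumes the rewrite redirects to exactly the same URL plus a trailing slash.

```bash
# Sketch only: check_slash_redirect is a hypothetical helper, not part of the repo.
check_slash_redirect() {
  local url="$1"   # the URL WITHOUT its trailing slash
  local result
  # -s: silent, -o /dev/null: discard the body,
  # -w: print the status code and the URL curl would be redirected to
  result=$(curl -s -o /dev/null -w '%{http_code} %{redirect_url}' "$url")
  if [ "$result" = "301 ${url}/" ]; then
    echo "OK: ${url} redirects to ${url}/"
    return 0
  fi
  echo "UNEXPECTED: got '${result}'"
  return 1
}

# Usage (run from your laptop):
# check_slash_redirect https://optical-prod.oliver.solutions/hm-aiqc
```

If the helper reports something other than a 301 to the slash-terminated URL, compare the Prod vhost against the Dev one before retrying the sign-in.
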
### Manual rollback

```bash
cd /opt/hm-aiqc
./deploy/rollback.sh last        # back to the checkpoint
# or
./deploy/rollback.sh <commit>    # back to a specific commit
```

### Database backup before risky operations

```bash
cd /opt/hm-aiqc
mkdir -p database/backups
sqlite3 database/qc_platform.db ".backup database/backups/qc_platform_$(date +%F_%H%M).db"
```

---

## 11 — Post-cutover (within 24 hours)

- [ ] Tell the team the new Prod URL: `https://optical-prod.oliver.solutions/hm-aiqc/`
- [ ] **Don't decommission the sandbox yet.** Leave `https://ai-sandbox.oliver.solutions/hm-ai-qc-report/` running on `optical-web-1` as a fallback.
- [ ] Set up daily DB backups (cron):

  ```bash
  crontab -e
  # Add:
  0 2 * * * cd /opt/hm-aiqc && sqlite3 database/qc_platform.db ".backup database/backups/qc_platform_$(date +\%F).db" && find database/backups -name "qc_platform_*.db" -mtime +30 -delete
  ```

- [ ] Add a weekly disk-usage warning (cron):

  ```bash
  0 9 * * 1 du -sh /opt/hm-aiqc/storage 2>/dev/null | awk '$1 ~ /[0-9]+G/ && $1+0 > 10 {print "WARN: storage "$1" — consider archiving"}' | mail -s "[hm-aiqc] storage size" you@oliver.agency
  ```

---

## 12 — Sandbox decommission (after ~1 week of Prod soak)

Only do this once Prod has been stable for at least a week and the team has confirmed no traffic still goes to the sandbox.

```bash
# On optical-web-1:
ssh optical-web-1.oliver.solutions
cd /opt/hm_ai_qc/hm_ai_qc_report_tool

# 1. Archive the database and storage to Box first
tar czf /tmp/hm_ai_qc_sandbox_archive_$(date +%F).tar.gz database storage
# Upload /tmp/hm_ai_qc_sandbox_archive_*.tar.gz to Box manually
#   → /CAMPAIGNS/_archive/hm_ai_qc_sandbox/ (or wherever fits)

# 2. Stop and remove the container
docker compose down

# 3. Coordinate with whoever owns the sandbox vhost to remove the
#    /hm-ai-qc-report Location block from
#    optical-web-1's Apache config
```

Once the Apache block is removed and Apache reloaded, the sandbox URL returns 404 and the decommission is complete.

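
To confirm that end state (the sandbox URL answering 404), a status-code check from your laptop is enough. The helper below is a sketch: `expect_status` is a hypothetical name introduced here for illustration, and it only compares the HTTP status code, not the response body.

```bash
# Sketch only: expect_status is a hypothetical helper, not part of the repo.
expect_status() {
  local url="$1"    # URL to probe
  local want="$2"   # expected HTTP status code, e.g. 404
  local got
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code
  got=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  if [ "$got" = "$want" ]; then
    echo "OK: ${url} returned ${got}"
    return 0
  fi
  echo "MISMATCH: ${url} returned ${got}, expected ${want}"
  return 1
}

# Usage:
# expect_status https://ai-sandbox.oliver.solutions/hm-ai-qc-report/ 404
# expect_status https://optical-prod.oliver.solutions/hm-aiqc/health 200
```

Running both usage lines together checks that the sandbox is gone and that Prod is still healthy.
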
---

## Done

Prod is live, sandbox archived, Phase 4 complete.