Phase 3 prep: add Dev cutover runbook
Self-contained SSH-and-paste guide covering clone → .env → deploy → Apache reload → smoke test. Includes troubleshooting for the most likely failure modes (MSAL redirect mismatch, missing env vars, /health 503).
parent e772095158
commit 458c75311e
1 changed file with 236 additions and 0 deletions

deploy/DEV_CUTOVER_RUNBOOK.md (Normal file)
@@ -0,0 +1,236 @@

# Dev cutover runbook — `optical-dev.oliver.solutions`

Self-contained "SSH in and paste these" instructions for the first
deploy of HM AI QC to the new Dev server. Phase 3 of
`deploy/DEV_PROD_MIGRATION_PLAN.md`.

**Estimated time:** 20 minutes if everything works first try.

---

## 0 — Prereqs (verify before starting)

| Check | How |
|---|---|
| Entra redirect URIs added | Confirmed 2026-05-09: `https://optical-dev.oliver.solutions/hm-aiqc/` and `https://optical-prod.oliver.solutions/hm-aiqc/` are registered as SPA URIs on app `9079054c-9620-4757-a256-23413042f1ef`. |
| `develop` branch pushed | `git ls-remote origin develop` from your laptop shows commit `e772095` (Phase 2). |
| You have SSH access to `optical-dev.oliver.solutions` | `ssh optical-dev.oliver.solutions` lands you in. |
| You have docker permissions on the server | `docker ps` works without sudo. |
| You have sudo for Apache reload | `sudo systemctl status apache2` works. |

If anything's missing, stop and resolve before proceeding.
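
The two server-side checks in the table can be run in one pass. The following is a minimal sketch and not part of the repo: `run_check` and `preflight_dev` are hypothetical helper names, and the commands inside `preflight_dev` are simply the ones listed in the table.

```bash
# run_check LABEL COMMAND [ARGS...]
# Runs the command quietly, prints one status line, and returns 1 on failure.
run_check() {
  local label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'ok    %s\n' "$label"
  else
    printf 'FAIL  %s\n' "$label"
    return 1
  fi
}

# Server-side prerequisites from the table (run this on optical-dev).
preflight_dev() {
  local failed=0
  run_check "docker works without sudo" docker ps || failed=1
  run_check "sudo can query apache2" sudo systemctl status apache2 || failed=1
  return "$failed"
}
```

After pasting the functions into your shell on the server, run `preflight_dev`. Any `FAIL` line means the corresponding prerequisite still needs resolving.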

---

## 1 — SSH in and clone the repo

```bash
ssh optical-dev.oliver.solutions

sudo mkdir -p /opt/hm-aiqc
# Hand the directory to your own user and primary group
sudo chown "$(id -un):$(id -gn)" /opt/hm-aiqc

git clone git@bitbucket.org:zlalani/hm_ai_qc_report_tool.git /opt/hm-aiqc
cd /opt/hm-aiqc
git checkout develop
git log -1 --oneline    # should print: e772095 Phase 2: deploy machinery for Dev/Prod cutover
```

If `git clone` fails on auth, you'll need a deploy key on the server first
(an SSH key on `optical-dev` whose public half is added to Bitbucket as a
read-only deploy key for this repo). This is the same key approach used for AI QC.

---

## 2 — Create the `.env` file

```bash
cp deploy/.env.dev.example .env

# Generate a real Flask SECRET_KEY
python3 -c 'import secrets; print(secrets.token_urlsafe(48))'
# Paste that value into SECRET_KEY=

# Fill in real LLM API keys
$EDITOR .env
```

Required keys to fill in (the `.env.dev.example` placeholders):
- `SECRET_KEY`: from the python3 one-liner above
- `OPENAI_API_KEY`: copy from sandbox `optical-web-1:/opt/hm_ai_qc/hm_ai_qc_report_tool/.env`
- `GOOGLE_API_KEY`: same source
- `ANTHROPIC_API_KEY`: same source

Confirm tenant/client IDs match the SSO plan:
```bash
grep '^AZURE_' .env
# Expected output:
# AZURE_TENANT_ID=e519c2e6-bc6d-4fdf-8d9c-923c2f002385
# AZURE_CLIENT_ID=9079054c-9620-4757-a256-23413042f1ef
```
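
Before deploying, it can help to confirm that none of the required keys is still blank. The sketch below is an illustration only: `check_env_keys` is a hypothetical helper, it assumes plain `KEY=value` lines, and it only detects absent or empty values (it cannot tell a leftover placeholder from a real key, since the placeholder format in `.env.dev.example` is not shown here).

```bash
# check_env_keys FILE KEY [KEY...]
# Prints every KEY that is absent from FILE or has an empty value.
# Returns 1 if anything was printed, 0 otherwise.
check_env_keys() {
  local file=$1 key value missing=0
  shift
  for key in "$@"; do
    # Use the last KEY=... line if the key appears more than once.
    value=$(sed -n "s/^${key}=//p" "$file" | tail -n 1)
    if [ -z "$value" ]; then
      printf 'missing or empty: %s\n' "$key"
      missing=1
    fi
  done
  return "$missing"
}
```

For this runbook the call would be `check_env_keys .env SECRET_KEY OPENAI_API_KEY GOOGLE_API_KEY ANTHROPIC_API_KEY AZURE_TENANT_ID AZURE_CLIENT_ID`; no output and a zero exit status mean every key has a value.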

---

## 3 — Drop in the Box config

The Box service-account JSON lives outside the repo (gitignored):

```bash
# On the dev server: create the target directory
mkdir -p /opt/hm-aiqc/config

# From your laptop: copy the file from the sandbox host to the dev server
scp optical-web-1:/opt/hm_ai_qc/hm_ai_qc_report_tool/config/box_config.json \
    optical-dev.oliver.solutions:/opt/hm-aiqc/config/box_config.json

# On the dev server, lock it down:
chmod 600 /opt/hm-aiqc/config/box_config.json
```

If a direct host-to-host copy isn't possible, download the file to your laptop with `scp` and upload it to the dev server from there.
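
A quick way to confirm the copy and the `chmod` both took effect is sketched below. `check_secret_file` is a hypothetical helper, not something shipped in the repo; it only checks that the file exists, is non-empty, and has mode 600, and it does not validate the JSON.

```bash
# check_secret_file PATH
# Succeeds only if PATH is a non-empty regular file with mode 600
# (read/write for the owner, nothing for group or others).
check_secret_file() {
  local file=$1 mode
  if [ ! -f "$file" ] || [ ! -s "$file" ]; then
    printf 'missing or empty: %s\n' "$file"
    return 1
  fi
  # The first 10 characters of `ls -l` output are the file type and permission bits.
  mode=$(ls -ld "$file" | cut -c1-10)
  if [ "$mode" != "-rw-------" ]; then
    printf 'unexpected permissions %s on %s\n' "$mode" "$file"
    return 1
  fi
  return 0
}
```

On the dev server, `check_secret_file /opt/hm-aiqc/config/box_config.json` should print nothing and exit with status 0.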

---

## 4 — Run the deploy

```bash
cd /opt/hm-aiqc
./deploy.sh dev --dry-run    # preview only, no changes
./deploy.sh dev              # actual deploy, prompts y/N to confirm
```

The deploy script will:
1. Save the current HEAD to `.last_deploy_rollback` (there is no earlier checkpoint on the first run, which is fine: it records the initial clone's HEAD)
2. `git fetch` + `git reset --hard origin/develop`
3. `docker compose build` (the first run pulls the `python:3.11-slim` base image, which is slow: about 2 to 5 min)
4. `docker compose up -d` (entrypoint runs `wait_for_db.py` → `flask db upgrade` → gunicorn)
5. Poll `http://127.0.0.1:5050/health` every 2s for up to 60s
6. If `/health` returns 200 with `{"status":"ok","db":true}` → done
7. If not → auto-rollback to the saved checkpoint and exit non-zero
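
Steps 5 and 6 amount to a bounded polling loop. The sketch below shows one way such a loop can be written; it is not the code in `deploy.sh`, and the function names (`fetch_health`, `is_healthy`, `wait_for_health`) are made up for illustration. It can be handy for re-checking health by hand after a deploy.

```bash
# fetch_health URL
# Prints the response body; exits non-zero on connection or HTTP errors.
fetch_health() {
  curl -fsS --max-time 2 "$1"
}

# is_healthy BODY
# Succeeds only if the JSON body reports status "ok" and db true.
is_healthy() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"' &&
    printf '%s' "$1" | grep -Eq '"db"[[:space:]]*:[[:space:]]*true'
}

# wait_for_health URL [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Polls until the endpoint is healthy or the timeout elapses.
wait_for_health() {
  local url=$1 timeout=${2:-60} interval=${3:-2} elapsed=0 body
  while [ "$elapsed" -lt "$timeout" ]; do
    if body=$(fetch_health "$url" 2>/dev/null) && is_healthy "$body"; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 1
}
```

For example, `wait_for_health http://127.0.0.1:5050/health 60 2 && echo healthy` mirrors the 2s interval and 60s limit described above. Keeping `fetch_health` as its own function makes it easy to swap `curl` for another client.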

Watch the logs in another terminal if you want to see the boot:
```bash
docker compose logs -f web
```

---

## 5 — Place the Apache config and reload

Find the existing `optical-dev.oliver.solutions` virtual host:
```bash
# -F matches the hostname literally instead of treating the dots as regex wildcards
grep -rnF "optical-dev.oliver.solutions" /etc/apache2/sites-available/
```

Open that file (likely `/etc/apache2/sites-available/optical-dev.conf` or similar)
and paste the contents of `/opt/hm-aiqc/deploy/apache-dev.conf` inside the
`<VirtualHost *:443>` block (the HTTPS one). Save.

Verify and reload:
```bash
sudo apache2ctl configtest    # should print "Syntax OK"
sudo systemctl reload apache2
```

If `configtest` complains about missing modules:
```bash
sudo a2enmod proxy proxy_http headers rewrite
sudo apache2ctl configtest    # re-check before applying
sudo systemctl restart apache2    # a2enmod asks for a restart after enabling modules
```

---

## 6 — Smoke test the public URL

From your **laptop**:

```bash
curl -i https://optical-dev.oliver.solutions/hm-aiqc/health
# Expected: a 200 status line (e.g. HTTP/1.1 200 OK) followed by {"status":"ok","db":true}
```

If `/health` returns 200, open in a browser:

```
https://optical-dev.oliver.solutions/hm-aiqc/
```

You should be redirected to `/auth/login-page`. Click **Sign in with Microsoft**
and complete the popup with your `*.brandtech.plus` or `*.oliver.agency` work account.

After login, run through each module to confirm:

| Module | Quick test |
|---|---|
| Reporting | Tab loads, "Previous Box Reports" populates |
| HM QC | Upload one image, run, confirm score |
| Video QC | Upload one short MP4, run, confirm score |
| Video Master | Enter a known campaign name, confirm matches preview |
| Campaigns | List loads (will be empty on fresh start) |
| Usage | Tab loads, your email appears in the user filter |

The Usage tab is the last item because it proves that `g.user.email`
flows into `usage_logs.user` correctly, which is the whole point of Phase 1.

---

## 7 — If something goes wrong

### Deploy script reports rollback

The script has already restored the previous code state. Look at the container logs:
```bash
docker compose logs --tail=200 web
```

Most likely causes:
- Missing env var → fix `.env` and re-run `./deploy.sh dev`
- Migration error → check the `flask db upgrade` output in the logs; restore the DB if needed (see "Database backup" below)
- Box config missing → confirm `config/box_config.json` exists

### `/health` returns 503 (db: false)

The container is up but can't reach SQLite. Check that the `./database` volume is
mounted and writable:
```bash
docker compose exec web ls -la /app/database
docker compose exec web touch /app/database/_test_write && docker compose exec web rm /app/database/_test_write
```

### MSAL popup spins forever / "AADSTS50011" redirect URI mismatch

Confirm the URL in the address bar exactly matches the Entra-registered SPA URI:
- Must be `https://optical-dev.oliver.solutions/hm-aiqc/` (trailing slash)
- The Apache `RewriteRule` in `apache-dev.conf` adds the trailing slash if you visit
  `/hm-aiqc` without one; verify that the 301 fires.
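
One way to verify the 301 from a terminal is to inspect the response headers. The helper below is a sketch under assumptions: `redirect_adds_slash` is a hypothetical name, and it assumes Apache answers with an absolute URL in the `Location` header.

```bash
# redirect_adds_slash EXPECTED_URL
# Reads HTTP response headers on stdin. Succeeds only if the status code
# is 301 and the Location header equals EXPECTED_URL.
redirect_adds_slash() {
  local expected=$1 headers status location
  headers=$(tr -d '\r')    # strip the carriage returns HTTP headers carry
  status=$(printf '%s\n' "$headers" | awk 'NR == 1 { print $2 }')
  location=$(printf '%s\n' "$headers" | awk 'tolower($1) == "location:" { print $2 }')
  [ "$status" = "301" ] && [ "$location" = "$expected" ]
}
```

From your laptop, pipe a header-only request into it: `curl -sI https://optical-dev.oliver.solutions/hm-aiqc | redirect_adds_slash https://optical-dev.oliver.solutions/hm-aiqc/ && echo "301 OK"`. If nothing is printed, run the `curl -sI` part alone and read the status line and `Location` header directly.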

### "AADSTS70001: Application not found in tenant"

The user is signed into a *different* Microsoft tenant. They must use a
`*.brandtech.plus` / `*.oliver.agency` work account, not a personal Microsoft
account.

### Manual rollback

```bash
cd /opt/hm-aiqc
./deploy/rollback.sh last      # back to the checkpoint
# or
./deploy/rollback.sh <sha>     # back to a specific commit
```

### Database backup (before risky migrations)

```bash
cd /opt/hm-aiqc
mkdir -p database/backups    # .backup fails if the target directory is missing
sqlite3 database/qc_platform.db ".backup database/backups/qc_platform_$(date +%F_%H%M).db"
```

---

## 8 — Hand-off

Once Dev is green and the smoke tests pass:

1. Tell the team the URL: `https://optical-dev.oliver.solutions/hm-aiqc/`
2. **Don't** decommission the sandbox yet (`https://ai-sandbox.oliver.solutions/hm-ai-qc-report/`).
   Leave it running until Prod is also live and has soaked for about a week.
3. When ready to ship Prod: tag `v3.0.0` on `main`, then repeat this runbook
   on `optical-prod.oliver.solutions` with `./deploy.sh prod v3.0.0` and
   `deploy/apache-prod.conf`. Same steps, different host.