Rewrite README to reflect current architecture

Old README still described the nightly scheduler container, didn't
cover backfill / thumbnails / admin gating / multi-token search /
the API endpoints, and pointed at fields that no longer exist on
events. Comprehensive rewrite covering: what the app does today,
architecture diagram, repo layout, local quickstart, full env-var
reference, operations (run/backfill/inspect), API surface, MSAL
setup steps, deploy script + manual vhost Include, the two-table
schema, troubleshooting, and the legacy systemd path preserved at
the end for reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DJP 2026-05-11 17:30:10 -04:00
parent 04440d661d
commit 88a0adcfbf

612
README.md

@@ -1,189 +1,364 @@
# Marriott Box Asset Tagger
Batch-processes images **and videos** in a Box folder, analyzes them with Gemini AI, and writes structured metadata back to Box using the `marriottUsa` metadata template. Videos use Box's 480p MP4 proxy representations to keep bandwidth and Gemini token usage manageable.
AI-driven metadata tagging for images and videos stored in a Marriott Box folder, with a searchable Postgres audit log and a React SPA on top. Gemini analyses each asset against the `marriottUsa` metadata template; the resulting structured metadata, description, and (for videos) scene breakdown are written back to Box. Every Gemini call is also persisted to a local Postgres so it can be searched, audited, and re-displayed without round-tripping Box.
Every Gemini-analysed file is also written to a Postgres `tagging_events` table for search/audit, and there's a small React SPA on top (search across all logged fields, click through to the original Box file, trigger an on-demand tagging pass).
---
## Components
## What you can do
- **`scheduler.py` (tagger container)** — APScheduler that fires `main.main()` on a cron (default daily 02:00).
- **`api.py` (api container, FastAPI)** — search, list runs, kick off ad-hoc runs in a background thread.
- **`db.py` + `schema.sql`** — Postgres logging layer.
- **`frontend/` (Vite + React + TS)** — single-page UI, served by Apache from `/var/www/html/marriott-tagging/` in prod. Auth via `@azure/msal-react` with a `VITE_DEV_AUTH_BYPASS` switch.
- **Trigger a tagging pass** from the SPA's **Run now** button — admin-only. Walks the configured Box folder, skips files already in the local DB, sends new ones to Gemini, validates against the live Box template schema, writes metadata + description (and scene-breakdown comments for videos) back to Box, and inserts a `tagging_events` row per file Gemini saw.
- **Backfill from Box** — admin-only. Walks the Box folder and mirrors any existing `marriottUsa` metadata into the local DB (status = `backfilled`). No Gemini calls, no Box writes. Use this after first deploy, after restoring a wiped DB, or to refresh thumbnails. Re-runnable safely.
- **Search** the request log across every text + JSON field (file name, folder path, description, validated metadata, raw Gemini response, scene breakdown, status, file ID, and the consolidated `search_terms` blob). Multi-word queries are AND'd across tokens; each token also fuzzy-matches via `pg_trgm` similarity so `femalle` still finds "female".
- **See thumbnails** inline in the search results — Box's pre-generated 160×160 JPG for each file is cached in Postgres (`file_assets.thumbnail_bytes`).
- **Click through to Box** on every result — the `box_url` is synthesised per row.
- **Azure AD SSO** for sign-in, with a `DEV_AUTH_BYPASS` switch for local dev and `ADMIN_EMAILS` allowlist gating the destructive endpoints.
## Setup
The cron-driven nightly scheduler that used to fire passes automatically has been **removed**. The tool is manual-only: a human clicks Run now (or POSTs `/api/runs`). This keeps Box and Gemini API costs predictable as the folder grows. `scheduler.py` remains in the repo if you want to wire cron back in.
### 1. Clone and create virtual environment
---
```bash
cd Marriott_Box_Asset_Tagging
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
### 2. Box JWT credentials
Download your Box app's JWT config from the [Box Developer Console](https://app.box.com/developers/console) and save it as `box_config.json` in the project root.
The service account must have:
- Access to folder `370595013246`
- Permission to read/write metadata using the `marriottUsa` template
### 3. Gemini API key
Add your key to `.env`:
## Architecture
```
GEMINI_API_KEY=your_key_here
Apache (shared vhost on optical-dev.oliver.solutions)
├──── /marriott-tagging/api/* ──┐
│ ▼
│ ┌──────────────────┐
│ │ api container │
│ │ (uvicorn, │
│ │ FastAPI) │
│ │ │
│ │ • /api/health │
│ │ • /api/me │
│ │ • /api/events │
│ │ • /api/runs │ ──┐
│ │ • /api/backfill │ ──┤ background
│ │ • /api/files/ │ │ thread runs
│ │ {id}/thumb │ │ main._run_pass
│ └──────────────────┘ │ / _run_backfill
│ │ │ which call →
└──── /marriott-tagging/* ◀────┘ │
(static SPA from ▼
/var/www/html/ ┌──────────────────┐
marriott-tagging/) │ Box API │
│ Gemini API │
└──────────────────┘
┌──────────────────┐
│ db container │
│ (Postgres 16, │
│ bound to │
│ 127.0.0.1) │
│ │
│ • tagging_events │
│ • file_assets │
└──────────────────┘
```
Get a key at [Google AI Studio](https://aistudio.google.com/apikey).
**Containers:** `db` + `api`. They share a Docker network and a named volume (`marriott-tagging_pgdata`).
**Outside the container set:** Apache (host), built SPA at `/var/www/html/marriott-tagging/`, the shared vhost include that proxies `/marriott-tagging/api/` to the api container.
## Usage
---
```bash
source env/bin/activate
python main.py
```
## Repo layout
The script will:
1. Authenticate with Box and Gemini
2. Fetch the `marriottUsa` template schema (fields, types, allowed values)
3. Build a dynamic Gemini prompt from the schema
4. Recursively list all image and video files in the target folder
5. For each image: download, resize, analyze with Gemini, validate metadata, write metadata + description to Box
6. For each video: fetch the 480p MP4 proxy from Box, analyze with Gemini, write metadata + description + a scene-breakdown comment to Box
7. Print a summary of results
| Path | Purpose |
|---|---|
| `main.py` | The tagging pipeline — Box client, Gemini calls, validation, Box writes, Postgres logging, thumbnail fetch. `_run_pass(...)` for normal passes; `_run_backfill(...)` for the Box → DB mirror. |
| `api.py` | FastAPI app — search, run-trigger, backfill-trigger, thumbnail-serve. Background threads do the actual tagging/backfill work so the request returns immediately. |
| `auth.py` | Azure AD JWT validation against the tenant JWKS + the `DEV_AUTH_BYPASS` short-circuit. Exposes `require_auth` and `require_admin` FastAPI dependencies. |
| `db.py` | psycopg3 helpers — `get_conn`, `ensure_schema`, `log_event`, `upsert_file_asset`, `get_thumbnail`, `is_file_already_tagged`. Defensive — DB errors never crash a tagging pass. |
| `schema.sql` | `tagging_events`, `file_assets`, indexes, `pg_trgm` extension. Applied idempotently on api startup via the FastAPI lifespan handler. |
| `scheduler.py` | APScheduler entry point — kept for archival / opt-back-in. Not currently used; the compose file no longer wires up a `tagger` service. |
| `frontend/` | Vite + React + TS SPA. `src/App.tsx` is the main page; `src/auth.tsx` does MSAL with the bypass switch; `src/api.ts` is the client. |
| `Dockerfile` | python:3.12-slim, non-root `appuser`. Same image runs the api container (and could run the scheduler if reactivated). |
| `docker-compose.yml` | `name: marriott-tagging` pinned. db (postgres:16) + api (built from Dockerfile). All host ports bound to `127.0.0.1`. |
| `deploy/deploy.sh` | Idempotent server deploy: port auto-pick, git pull, rebuild, SPA build via one-shot node:20-alpine, rsync to `/var/www/html/`, `/api/health` poll. |
| `deploy/apache-marriott-tagging.conf.tmpl` | Apache vhost include — proxy `/marriott-tagging/api/`, alias `/marriott-tagging` to the SPA web-root, SPA fallback. `__API_PORT__` rendered by `deploy.sh`. |
| `marriott-tagger.service` / `.timer` | Legacy systemd path. Not used in Docker mode. |
## Run with Docker
---
Brings up Postgres, the scheduler (`tagger`), and the FastAPI backend (`api`). The frontend is built separately by `deploy/deploy.sh` (or `npm run dev` locally) and consumed by the API.
## Quick start — local dev (macOS / Linux)
### 1. Prereqs
- Docker Desktop (or Docker Engine + Compose v2)
- `box_config.json` in the project root
- A `.env` copied from `.env.example`, filled in
- Docker Desktop or Docker Engine with Compose v2
- Node 20+ (for `npm run dev`)
- `box_config.json` in the repo root (JWT config from the Box Developer Console)
- `.env` from `.env.example`
```bash
cp .env.example .env
# minimum: GEMINI_API_KEY, POSTGRES_PASSWORD
# leave DEV_AUTH_BYPASS=true for now if you don't have Azure IDs ready
# At minimum: set GEMINI_API_KEY and POSTGRES_PASSWORD
$EDITOR .env
```
### 2. Build and start the backend services
### 2. Bring up Postgres + API
```bash
docker compose up --build -d
```
This brings up three services:
- `db` — Postgres 16, named volume `pgdata`, port bound to `127.0.0.1:${POSTGRES_HOST_PORT:-5432}`.
- `tagger` — runs `scheduler.py` (cron-driven Gemini passes).
- `api` — runs `uvicorn api:app` on container port 8000, published to `127.0.0.1:${MARRIOTT_API_PORT:-8004}`.
This starts:
- `db` — Postgres 16, named volume `pgdata`, host port `127.0.0.1:${POSTGRES_HOST_PORT:-5432}`.
- `api`: `uvicorn api:app`, host port `127.0.0.1:${MARRIOTT_API_PORT:-8004}`.
### 3. Run the frontend (local dev)
Check health:
```bash
curl -s http://127.0.0.1:8004/api/health | jq
```
### 3. Run the SPA
```bash
cd frontend
npm install
npm run dev # http://localhost:5173
npm run dev # http://localhost:5173
```
Vite proxies `/api/*` to the FastAPI host port (default `8004`). With `VITE_DEV_AUTH_BYPASS=true` you'll be auto-signed-in as the dev user.
Vite proxies `/api/*` to `127.0.0.1:${MARRIOTT_API_PORT:-8004}`. With the default `VITE_DEV_AUTH_BYPASS=true` you're auto-signed-in as the dev user.
### 4. Fire a tagging pass
### 4. Try a backfill
Three ways:
- **UI** — open the SPA and click **Run now**. Polls live until done.
- **API**: `curl -X POST http://127.0.0.1:8004/api/runs` (DEV_AUTH_BYPASS=true) or with a Bearer token in prod.
- **Container**: `docker compose exec tagger python main.py` (bypasses the API entirely).
In the SPA click **Backfill from Box**. The active panel polls every 2 s and shows each file as it's processed. Thumbnails appear inline as rows land.
### 5. Inspect the DB
---
## Configuration reference
All variables live in `.env` (gitignored). `.env.example` has the full list with comments.
### Required to start
| Variable | Purpose |
|---|---|
| `GEMINI_API_KEY` | Google AI Studio key for Gemini calls. |
| `POSTGRES_USER` / `POSTGRES_PASSWORD` / `POSTGRES_DB` | DB creds. The compose file uses these to create the role + database. |
### Ports (auto-managed by `deploy.sh` on the server)
| Variable | Default | Range scanned by deploy.sh |
|---|---|---|
| `POSTGRES_HOST_PORT` | `5432` | 5435-5499 |
| `MARRIOTT_API_PORT` | `8004` | 8003-8099 |
Both bound to `127.0.0.1` only — Postgres and the FastAPI process are never on the public internet. Apache reverse-proxies to `MARRIOTT_API_PORT`.
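
The range scan over the two port ranges above can be pictured with a short Python sketch. This is an illustration only: `deploy.sh` is a shell script and its actual probing method is not shown in this README, so `port_is_free` and `pick_free_port` are hypothetical helpers.

```python
import socket
from typing import Callable


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False  # already in use, or not bindable by this user
        return True


def pick_free_port(
    start: int,
    end: int,
    is_free: Callable[[int], bool] = port_is_free,
) -> int:
    """Return the first free port in the inclusive range [start, end]."""
    for port in range(start, end + 1):
        if is_free(port):
            return port
    raise RuntimeError(f"no free port in {start}-{end}")
```

Calling `pick_free_port(5435, 5499)` and `pick_free_port(8003, 8099)` would mirror the two ranges in the table. The `is_free` parameter exists only so the scan can be exercised without opening sockets.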
### Auth
| Variable | Purpose |
|---|---|
| `DEV_AUTH_BYPASS` | `true` skips MSAL entirely; the api treats every caller as `DEV_AUTH_EMAIL`. Defaults to `true` to keep dev/first-deploy unblocked. |
| `DEV_AUTH_EMAIL` / `DEV_AUTH_NAME` | Identity stamped on requests when bypassed. |
| `DEV_AUTH_IS_ADMIN` | `true` (default) keeps the bypass user as admin; flip to `false` to preview the read-only UX. |
| `AZURE_TENANT_ID` / `AZURE_CLIENT_ID` | Your Azure AD app registration. Backend uses them to validate JWTs (JWKS fetch + `aud`/`iss` check). |
| `ADMIN_EMAILS` | Comma-separated allowlist that gates `POST /api/runs` and `POST /api/backfill`. Case-insensitive. Members see the destructive buttons in the SPA; everyone else gets read-only search. |
| `VITE_DEV_AUTH_BYPASS` / `VITE_AZURE_TENANT_ID` / `VITE_AZURE_CLIENT_ID` | Frontend mirrors. Baked into the SPA bundle at build time — changing them requires a re-build (`deploy.sh` handles this). |
| `VITE_PUBLIC_BASE` | Used by Vite for the SPA's `base` (asset prefix) AND by MSAL as the redirect-URI root. In local dev: `http://localhost:5173`. On the server, `deploy.sh` overrides with the prod URL automatically. |
### Behavioural
| Variable | Purpose |
|---|---|
| `CORS_ORIGINS` | Comma-separated. Only set in local dev when Vite is on `:5173` and FastAPI on host `:8004`. Empty in prod (Apache makes them same-origin). |
| `TZ` | Container timezone. Defaults to `UTC`. |
| `SCHEDULE_CRON` / `RUN_AT_STARTUP` | Read by `scheduler.py` only. Unused by default (no scheduler container in compose). |
### Pipeline tuning (`main.py` constants)
Not in `.env` — edited at the top of `main.py`:
| Setting | Default | Description |
|---|---|---|
| `BOX_FOLDER_ID` | varies | Root Box folder to scan recursively. |
| `METADATA_TEMPLATE_KEY` | `marriottUsa` | Box metadata template key. |
| `GEMINI_MODEL` | `gemini-2.5-flash` | Model used for both image + video analysis. |
| `EXCLUDED_FOLDER_PREFIXES` | `("z_", "zz_", "zzz_")` | Subfolder names to skip. |
| `GEMINI_DELAY` / `GEMINI_VIDEO_DELAY` | `7` / `10` s | Per-call rate-limit sleep. |
| `MAX_IMAGE_SIZE` | `1000 px` | Longest side after resize before sending to Gemini. |
| `VIDEO_SIZE_LIMIT_INLINE` | `20 MB` | Below this, Gemini gets the video inline; above, the File API is used. |
| `VIDEO_SOURCE_SIZE_LIMIT` | `5 GB` | Skip videos with source file above this. |
| `VIDEO_PROXY_SIZE_LIMIT` | `400 MB` | Skip videos with 480p proxy above this. |
| `MAX_FILES_PER_RUN` | `200` | Hard cap on newly-tagged files per pass. |
| `MAX_RUN_DURATION` | `4 h` | Hard wall-clock cap per pass. |
| `DESCRIPTION_MAX_LENGTH` | `255` | Box description field char limit. |
| `SKIP_ALREADY_TAGGED` | `True` | Toggles the DB-based skip check. |
| `THUMBNAIL_DIM` | `160` | Pixel dimension for cached thumbnails. |
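
As a rough illustration of how the two per-pass caps in the table above interact, here is a minimal sketch. It assumes `MAX_RUN_DURATION` is held in seconds and that the caps are checked before each file; `run_with_caps`, its parameters, and the `stop_reason` labels are hypothetical and not taken from `main.py`.

```python
import time
from typing import Callable, Iterable

MAX_FILES_PER_RUN = 200          # default from the table above
MAX_RUN_DURATION = 4 * 60 * 60   # 4 h, expressed in seconds (assumed unit)


def run_with_caps(
    files: Iterable[str],
    tag_file: Callable[[str], None],
    max_files: int = MAX_FILES_PER_RUN,
    max_seconds: float = MAX_RUN_DURATION,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Tag files until the list is exhausted or either cap is reached."""
    started = clock()
    tagged = 0
    stop_reason = "exhausted"
    for file_id in files:
        if tagged >= max_files:
            stop_reason = "file_cap"
            break
        if clock() - started >= max_seconds:
            stop_reason = "time_cap"
            break
        tag_file(file_id)  # stands in for the Gemini call plus Box writes
        tagged += 1
    return {"tagged": tagged, "stop_reason": stop_reason}
```

Files left over when a cap is hit are simply not visited, so a later pass can pick them up.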
---
## Operations
### Trigger a tagging pass
- **From the SPA** — click **Run now**. UI polls live; events stream into the active panel.
- **From a shell** (works with `DEV_AUTH_BYPASS=true`):
```bash
curl -X POST http://127.0.0.1:8004/api/runs
```
- **From inside the api container** (bypasses the API entirely):
```bash
docker compose exec api python main.py
```
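
For scripted use, the same trigger-then-poll flow can be sketched in Python with only the standard library. The sketch assumes the response shapes listed in the API reference (`run_id` in the `POST /api/runs` response, and a top-level `live_state` field on `/api/runs/{run_id}/events`); `http_json` and `run_and_wait` are hypothetical names, and the 2 s interval simply mirrors the SPA's polling.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

API_BASE = "http://127.0.0.1:8004"  # local default for MARRIOTT_API_PORT


def http_json(method: str, url: str, token: Optional[str] = None) -> dict:
    """Send a request and decode the JSON body; a token is only needed with SSO on."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    req = urllib.request.Request(url, method=method, headers=headers)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def run_and_wait(
    fetch: Callable[..., dict] = http_json,
    base: str = API_BASE,
    token: Optional[str] = None,
    poll_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """POST /api/runs, then poll the run's events until it stops running."""
    run = fetch("POST", f"{base}/api/runs", token)
    events_url = f"{base}/api/runs/{run['run_id']}/events"
    while True:
        detail = fetch("GET", events_url, token)
        if detail.get("live_state") != "running":
            return detail  # expected to be "completed" or "failed"
        sleep(poll_seconds)
```

`fetch` and `sleep` are injectable so the loop can be tried against a stub before pointing it at a real api container.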
### Trigger a backfill
- **From the SPA** — click **Backfill from Box** (admin-only; confirms first).
- **From a shell**:
```bash
curl -X POST http://127.0.0.1:8004/api/backfill
```
Backfill is idempotent: re-running won't duplicate `tagging_events` rows, and `file_assets` rows are upserted (preserving previously-captured thumbnails if today's fetch fails).
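
One way to get the "keep the old thumbnail if today's fetch fails" behaviour is an upsert that coalesces the new value with the stored one. The sketch below demonstrates the idea with Python's built-in `sqlite3` standing in for Postgres so it runs anywhere; the real statement in `db.py` is not shown in this README, and `upsert_asset_sketch` plus the reduced column set are assumptions.

```python
import sqlite3

UPSERT_SQL = """
INSERT INTO file_assets (file_id, thumbnail_bytes, search_terms)
VALUES (?, ?, ?)
ON CONFLICT (file_id) DO UPDATE SET
    thumbnail_bytes = COALESCE(excluded.thumbnail_bytes, file_assets.thumbnail_bytes),
    search_terms    = excluded.search_terms
"""


def upsert_asset_sketch(conn, file_id, thumbnail, search_terms):
    """Insert or update one row; a None thumbnail leaves the stored one intact."""
    conn.execute(UPSERT_SQL, (file_id, thumbnail, search_terms))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_assets ("
    "file_id TEXT PRIMARY KEY, thumbnail_bytes BLOB, search_terms TEXT)"
)

upsert_asset_sketch(conn, "123", b"jpg-bytes", "lobby")
# Second pass: the thumbnail fetch failed, so None is passed in.
upsert_asset_sketch(conn, "123", None, "lobby pool")

row = conn.execute(
    "SELECT thumbnail_bytes, search_terms FROM file_assets WHERE file_id = '123'"
).fetchone()
count = conn.execute("SELECT count(*) FROM file_assets").fetchone()[0]
```

The `ON CONFLICT ... DO UPDATE` and `COALESCE` constructs have the same form in Postgres, with `%s` placeholders under psycopg instead of `?`.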
### Inspect the DB
```bash
docker compose exec db psql -U marriott marriott_tagging -c '\d tagging_events'
docker compose exec db psql -U marriott marriott_tagging -c \
"SELECT status, count(*) FROM tagging_events GROUP BY status;"
docker compose exec db psql -U marriott marriott_tagging
```
### Auth: enabling Azure AD SSO
```sql
-- Row counts by status
SELECT status, count(*) FROM tagging_events GROUP BY status;
1. Register (or reuse) an Azure AD app. Redirect URI:
- Local dev: `http://localhost:5173`
- Prod: `https://optical-dev.oliver.solutions/marriott-tagging/`
2. Expose an API with scope `access_as_user` whose audience is the same client ID.
3. Fill `.env`:
-- Recent events
SELECT created_at, media_type, file_name, status
FROM tagging_events ORDER BY created_at DESC LIMIT 20;
-- Thumbnail coverage
SELECT count(*) AS total,
count(*) FILTER (WHERE thumbnail_bytes IS NOT NULL) AS with_thumb,
avg(octet_length(thumbnail_bytes))::int AS avg_bytes
FROM file_assets;
-- All events for a given run
SELECT file_name, status, error_message
FROM tagging_events
WHERE run_id = '<uuid>'
ORDER BY created_at;
```
From your laptop (via SSH tunnel — Postgres isn't on the public internet):
```bash
ssh -L 55432:127.0.0.1:5435 user@optical-dev.oliver.solutions
psql postgresql://marriott:<password>@127.0.0.1:55432/marriott_tagging
```
### Logs
```bash
docker compose logs -f api # API + background tagging/backfill threads
docker compose logs -f db
```
---
## API reference
All endpoints behind `/api`. With `DEV_AUTH_BYPASS=true` no token is needed; with SSO enabled, include `Authorization: Bearer <access_token>`.
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | `/api/health` | none | Liveness + DB-reachable check + auth-config summary. |
| GET | `/api/me` | required | `{ oid, name, email, dev, is_admin }`. SPA uses `is_admin` to hide the destructive buttons. |
| GET | `/api/events?q=…&limit=…` | required | Search. Whitespace-tokenises `q`; each token must match (substring OR pg_trgm similarity > 0.3) across the searched columns. Results ranked by summed similarity. `limit` 1-500 (default 100). |
| POST | `/api/runs` | admin | Kicks off a tagging pass in a daemon thread. Returns `{ run_id, state: "running", started_by }`. |
| GET | `/api/runs?limit=…` | required | Recent runs from `tagging_events`, grouped by `run_id`, with counts and live state if still running. |
| GET | `/api/runs/{run_id}/events` | required | Per-event detail for a single run. Includes `live_state` (`running` / `completed` / `failed`) and `live_error`. |
| POST | `/api/backfill` | admin | Kicks off a backfill in a daemon thread. Same response shape as `/api/runs`. |
| GET | `/api/files/{file_id}/thumbnail` | required | Streams the cached JPG thumbnail (`Cache-Control: max-age=86400`) or 404. |
Every event in `/api/events` / `/api/runs/{id}/events` includes a synthesised `box_url` (`https://app.box.com/file/<file_id>`) and a `has_thumbnail` boolean. The frontend builds the thumbnail URL via `thumbnailUrl(file_id)` which respects the SPA's base prefix.
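
The matching rule described for `/api/events` (every token must match, by substring or by trigram similarity above 0.3) can be approximated in plain Python to see why a typo such as `femalle` still matches. This is a sketch, not the query the API runs: the trigram function only imitates `pg_trgm`'s padding scheme, and comparing each token against individual words of the text is an assumption.

```python
import re

SIMILARITY_THRESHOLD = 0.3  # threshold quoted for /api/events


def trigrams(text: str) -> set:
    """Imitate pg_trgm: lowercase, split into words, pad each word, take 3-grams."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def token_matches(token: str, haystack: str) -> bool:
    """A token matches by substring, or by fuzzy similarity to any word."""
    hay = haystack.lower()
    if token.lower() in hay:
        return True
    return any(similarity(token, w) > SIMILARITY_THRESHOLD for w in hay.split())


def matches(query: str, haystack: str) -> bool:
    """Whitespace-tokenise the query; every token must match (AND semantics)."""
    # An empty query matches everything in this sketch; the endpoint's own
    # behaviour for an empty q is not documented above.
    return all(token_matches(t, haystack) for t in query.split())
```

With this approximation, `femalle` and `female` share six of nine distinct trigrams, which is well above the threshold, while an unrelated second token makes the whole query fail.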
---
## Auth setup
### Dev / first deploy
Keep `DEV_AUTH_BYPASS=true` and `VITE_DEV_AUTH_BYPASS=true`. Every request authenticates as `DEV_AUTH_EMAIL`, and the dev user is admin by default (toggle `DEV_AUTH_IS_ADMIN=false` to test the read-only UX).
### Enabling Azure AD SSO
1. **Azure AD app registration** (reuse an existing one if you have it).
- Redirect URIs (Single-page application platform):
- Local: `http://localhost:5173`
- Prod: `https://optical-dev.oliver.solutions/marriott-tagging/`
- **Expose an API** with scope `access_as_user` whose Application ID URI is `api://<client-id>`.
2. **Backend `.env`** (the api container):
```
DEV_AUTH_BYPASS=false
AZURE_TENANT_ID=...
AZURE_CLIENT_ID=...
VITE_DEV_AUTH_BYPASS=false
VITE_AZURE_TENANT_ID=...
VITE_AZURE_CLIENT_ID=...
AZURE_TENANT_ID=<tenant-uuid>
AZURE_CLIENT_ID=<client-uuid>
ADMIN_EMAILS=alice@oliver.agency,bob@oliver.agency
```
3. **Frontend `.env`** (baked into the SPA at build time):
```
VITE_DEV_AUTH_BYPASS=false
VITE_AZURE_TENANT_ID=<tenant-uuid>
VITE_AZURE_CLIENT_ID=<client-uuid>
```
4. Rebuild + redeploy:
```bash
./deploy/deploy.sh
docker compose up -d --force-recreate api
```
4. `docker compose up -d --force-recreate api` and rebuild the SPA (`deploy.sh` does this on the server; locally `cd frontend && npm run build`).
Backend validates JWT signature against the tenant's JWKS, checks `aud == AZURE_CLIENT_ID` and `iss` matches one of the tenant URLs. With bypass=true, every request is logged as the `DEV_AUTH_EMAIL` user.
Backend validation: fetches the tenant's JWKS, verifies the RS256 signature, checks `aud == AZURE_CLIENT_ID` and `iss` matches one of the tenant issuer URLs. Admin gating: the email claim (`preferred_username` / `upn` / `email`) must match an entry in `ADMIN_EMAILS` (case-insensitive).
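
The admin-gating half of that check is easy to sketch in isolation. The snippet below assumes the JWT has already been verified and decoded into a claims dict; the helper names and the order in which the three claims are tried are assumptions, not necessarily what `auth.py` does.

```python
from typing import Mapping, Optional

EMAIL_CLAIMS = ("preferred_username", "upn", "email")  # assumed lookup order


def parse_admin_emails(raw: str) -> frozenset:
    """Turn the comma-separated ADMIN_EMAILS value into a lowercase set."""
    return frozenset(e.strip().lower() for e in raw.split(",") if e.strip())


def email_from_claims(claims: Mapping[str, object]) -> Optional[str]:
    """Return the first non-empty email-like claim, lowercased."""
    for key in EMAIL_CLAIMS:
        value = claims.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip().lower()
    return None


def is_admin(claims: Mapping[str, object], admin_emails: frozenset) -> bool:
    """Case-insensitive allowlist check against already-validated claims."""
    email = email_from_claims(claims)
    return email is not None and email in admin_emails
```

Normalising both sides to lowercase once, at parse time and at lookup time, is what makes the comparison case-insensitive.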
### Stop / tear down
---
```bash
docker compose down # stops containers, keeps the DB volume
docker compose down -v # also deletes the DB volume (destroys data)
```
## Server deployment — optical-dev.oliver.solutions
### Notes
- Postgres failures never stop the tagger — `db.log_event` swallows errors. Box is the source of truth for "already tagged".
- The `marriott-tagger.service` / `.timer` files below remain for the older systemd deployment path; the Docker path is the recommended one. Don't run both on the same host.
## Server Deployment (Docker — optical-dev.oliver.solutions)
This is the recommended path on the shared `optical-dev.oliver.solutions` dev server. Apps live under `/opt/<slug>/` with an idempotent `deploy/deploy.sh`. Mirrors the OSOP / adeo split-build pattern: backend in Docker, SPA built and served by Apache from `/var/www/html/marriott-tagging/`.
Mirrors the OSOP / adeo split-build pattern: backend in Docker, SPA built and served by Apache.
**Public URL:** `https://optical-dev.oliver.solutions/marriott-tagging/`
### First-time setup
```bash
# 1. Clone into /opt
sudo git clone git@bitbucket.org:zlalani/marriott-box-image-video-tagging.git \
/opt/marriott-box-image-video-tagging
sudo chown -R "$USER:$USER" /opt/marriott-box-image-video-tagging
cd /opt/marriott-box-image-video-tagging
# 2. Drop credentials in place (NOT in git)
cp .env.example .env
$EDITOR .env # GEMINI_API_KEY, POSTGRES_PASSWORD,
# Azure IDs (or DEV_AUTH_BYPASS=true)
$EDITOR box_config.json # paste the Box JWT config
$EDITOR .env # fill required values
$EDITOR box_config.json # paste Box JWT config
# 3. Deploy
./deploy/deploy.sh
```
The script will:
- Sanity-check `.env`, `box_config.json`, docker, git, compose v2.
- Pick free host ports — Postgres (default 5435, range 5435-5499) and API (default 8004, range 8003-8099) — persisted to `.env`.
- Render `deploy/apache-marriott-tagging.conf` from `.tmpl` with the picked API port.
- `git pull --ff-only`, `docker compose build`, `docker compose up -d` (db + tagger + api).
- Build the Vite SPA in a one-shot `node:20` container; rsync `frontend/dist/` to `/var/www/html/marriott-tagging/`.
- Poll `/api/health` until ready and verify the tagger container is running.
- Print the Apache `Include` line you need to add to the shared vhost.
`deploy.sh` will:
1. Sanity-check `.env`, `box_config.json`, docker, git, compose v2.
2. Auto-pick free host ports (`POSTGRES_HOST_PORT` 5435-5499, `MARRIOTT_API_PORT` 8003-8099), persisting choices back to `.env`.
3. Render `deploy/apache-marriott-tagging.conf` from the `.tmpl` with the picked api port.
4. `git pull --ff-only`, `docker compose build`, `docker compose up -d`.
5. Build the Vite SPA in a one-shot `node:20-alpine` container (with `VITE_PUBLIC_BASE=https://optical-dev.oliver.solutions/marriott-tagging`), rsync `frontend/dist/` to `/var/www/html/marriott-tagging/`.
6. Poll `/api/health` until ready; verify the api container is running.
7. Print the Apache `Include` line to add to the shared vhost.
### One-time vhost step (manual)
Add **inside** `</VirtualHost>` of `/etc/apache2/sites-enabled/optical-dev.oliver.solutions.conf`:
**One-time vhost step (manual):**
Edit `/etc/apache2/sites-enabled/optical-dev.oliver.solutions.conf` and add **inside** `</VirtualHost>`:
```apache
Include /opt/marriott-box-image-video-tagging/deploy/apache-marriott-tagging.conf
```
Then:
```bash
sudo apachectl configtest && sudo systemctl reload apache2
```
The deploy script intentionally does NOT touch the shared vhost — it's shared across many apps, and a per-app script editing it risks breaking others.
### Re-deploying
```bash
@@ -192,143 +367,168 @@ cd /opt/marriott-box-image-video-tagging
```
Flags:
- `--no-pull` skip `git pull`
- `--no-build` skip `docker compose build`
- `--no-frontend` skip Vite build + SPA sync
- `--run-now` also fire a tagging pass via `/api/runs` (works with DEV_AUTH_BYPASS=true)
- `--logs` tail scheduler logs after deploy
### Verifying it ran
| Flag | Effect |
|---|---|
| `--no-pull` | Skip `git pull` (deploy whatever is in the working tree). |
| `--no-build` | Skip `docker compose build` (faster when only env / config changed). |
| `--no-frontend` | Skip Vite build + SPA sync. |
| `--run-now` | Also POST `/api/runs` to fire a tagging pass immediately (only works with `DEV_AUTH_BYPASS=true`). |
| `--logs` | Tail api logs after deploy. |
### Common follow-ups
- **Code changed but container kept the old image:** `docker compose up -d --build --force-recreate api`.
- **SPA changed but you don't want to rebuild the Python image:** `./deploy/deploy.sh --no-build`.
- **Schema added/changed:** the api lifespan handler runs `ensure_schema` on startup, so a recreated api container applies it. New tables / indexes / extensions land automatically.
---
## Database schema
### `tagging_events` (append-only log)
One row per file the tagger sent to Gemini OR mirrored from Box. Skipped-as-already-tagged files are not logged.
| Column | Type | Notes |
|---|---|---|
| `id` | bigserial PK | |
| `run_id` | uuid NOT NULL | UUID per tagging/backfill pass — groups rows belonging to one run. |
| `created_at` | timestamptz NOT NULL | |
| `file_id`, `file_name`, `folder_path` | text | Box identifiers + display. |
| `media_type` | text | `image` or `video`. |
| `gemini_model` | text | E.g. `gemini-2.5-flash`. |
| `prompt` | text | Full prompt sent to Gemini (null for backfilled rows). |
| `raw_response` | jsonb | Untouched Gemini response (null for backfilled rows). |
| `description` | text | Description written to Box. |
| `scenes` | jsonb | Video scene breakdown. |
| `validated_metadata` | jsonb | Cleaned dict actually written to Box. |
| `metadata_write_success`, `description_write_success`, `scene_comment_write_success` | boolean | Per Box write. |
| `status` | text | `success`, `backfilled`, `gemini_error`, `validation_error`, `metadata_write_error`. |
| `error_message` | text | Free-form error if `status` is an error. |
| `duration_ms` | int | Gemini-call elapsed time (null for backfilled rows). |
Indexes: `run_id`, `file_id`, `created_at DESC`.
### `file_assets` (per-file state)
One row per Box file_id, upserted by both the tagging pass and backfill.
| Column | Type | Notes |
|---|---|---|
| `file_id` | text PK | Matches `tagging_events.file_id`. |
| `thumbnail_bytes` | bytea | Box's 160×160 JPG. ~10-20 KB. |
| `thumbnail_content_type` | text | E.g. `image/jpeg`. |
| `thumbnail_size` | int | 160 today. |
| `search_terms` | text | Lowercased, whitespace-normalised text blob: file_name + folder + description + metadata values. |
| `updated_at` | timestamptz | |
Index: `updated_at DESC`. Extension: `pg_trgm` (for fuzzy search via `similarity()`).
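
To make the `search_terms` column concrete, here is a minimal sketch of how such a blob could be assembled from the pieces listed in the table. `build_search_terms` is a hypothetical helper, and the exact ordering and flattening of metadata values in the real pipeline may differ.

```python
import re
from typing import Any, Iterator, Mapping, Optional


def _flatten(value: Any) -> Iterator[str]:
    """Yield every scalar inside nested dicts/lists as a string."""
    if value is None:
        return
    if isinstance(value, Mapping):
        for inner in value.values():
            yield from _flatten(inner)
    elif isinstance(value, (list, tuple, set)):
        for inner in value:
            yield from _flatten(inner)
    else:
        yield str(value)


def build_search_terms(
    file_name: Optional[str],
    folder_path: Optional[str],
    description: Optional[str],
    metadata: Optional[Mapping[str, Any]],
) -> str:
    """Join the searchable fields, collapse whitespace, and lowercase."""
    parts = [file_name, folder_path, description, *_flatten(metadata)]
    blob = " ".join(p for p in parts if p)
    return re.sub(r"\s+", " ", blob).strip().lower()
```

Only metadata values are included here, not the field keys, which matches the "metadata values" wording in the table.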
---
## Troubleshooting
### Blank page at the deployed URL
The asset paths were baked with the wrong base. View the page source: if the `<script>` tag reads `src="/assets/..."` instead of `src="/marriott-tagging/assets/..."`, `VITE_PUBLIC_BASE` was set incorrectly at build time. `deploy.sh` now overrides it with the prod URL automatically, so `git pull && ./deploy/deploy.sh --no-build` rebuilds the SPA with the correct base.
### 404 on a new API endpoint
The api container is running an old image. Force a recreate:
```bash
# Scheduler logs (next cron-fired pass is at SCHEDULE_CRON; default 02:00 UTC)
docker compose logs -f tagger
# API logs
docker compose logs -f api
# Postgres request log
docker compose exec db psql -U marriott marriott_tagging -c \
"SELECT status, count(*) FROM tagging_events GROUP BY status;"
docker compose up -d --build --force-recreate api
```
Postgres is bound to `127.0.0.1` only — not reachable from outside the server. To inspect from your laptop, tunnel: `ssh -L 55432:127.0.0.1:<POSTGRES_HOST_PORT> user@optical-dev.oliver.solutions`, then `psql postgresql://marriott:***@127.0.0.1:55432/marriott_tagging`.
### 500 on search
### Notes
Usually `pg_trgm` extension missing. The api lifespan handler installs it on startup, but a stale running container might not have re-applied schema:
- The Docker deploy and the `systemd` deploy below target the same `/opt/marriott-box-image-video-tagging/` directory. Pick one on any given server — don't run both, they'll both fire the tagger and double-write to Box.
- The SPA build bakes `VITE_AZURE_*` and `VITE_DEV_AUTH_BYPASS` into the bundle. Flipping the bypass requires a re-build (`./deploy/deploy.sh` does this).
```bash
docker compose exec db psql -U marriott marriott_tagging \
-c "CREATE EXTENSION IF NOT EXISTS pg_trgm;"
```
## Server Deployment (systemd, Ubuntu)
Or just `docker compose up -d --force-recreate api`.
The repo includes `marriott-tagger.service` and `marriott-tagger.timer` for running the tagger as a scheduled service. These steps are written for **Ubuntu 22.04 / 24.04** but should work on any systemd-based distribution with minor path tweaks (e.g. `/sbin/nologin` instead of `/usr/sbin/nologin` on Red Hat-family).
### "Run now" did nothing visible
### 0. Prerequisites
Probably the background thread crashed during init. Check api logs:
```bash
docker compose logs api --tail 60
```
Common causes:
- `box_config.json` not mounted into the api container. Confirm with `docker compose exec api ls -la /app/box_config.json`. The compose file bind-mounts `./box_config.json`, so if the file did not exist when compose came up, it will not be mounted correctly.
- `GEMINI_API_KEY` empty in the api container — `docker compose exec api printenv GEMINI_API_KEY`.
- Every file already has metadata in Box / the DB — the pass completes silently with `0 tagged`.
### Postgres host-port conflict
`deploy.sh` scans 5435-5499. If your laptop already has a Postgres listening on those, bump the upper bound in `deploy.sh` or set `POSTGRES_HOST_PORT` manually in `.env`.
---
## Legacy: systemd deployment (Ubuntu)
The `marriott-tagger.service` / `.timer` unit files are kept in the repo for a pre-Docker deployment path that runs `main.py` directly via a systemd timer. **Don't run this alongside the Docker deploy on the same host — both will fire passes and double-write to Box.**
### Setup
```bash
sudo apt update
sudo apt install -y git python3 python3-venv python3-pip
```
`python3-venv` is a separate apt package on Ubuntu — `python3 -m venv` will fail without it.
### 1. Clone the repo on the server
```bash
sudo mkdir -p /opt/marriott-box-image-video-tagging
sudo chown $USER:$USER /opt/marriott-box-image-video-tagging
git clone git@bitbucket.org:zlalani/marriott-box-image-video-tagging.git /opt/marriott-box-image-video-tagging
cd /opt/marriott-box-image-video-tagging
```
### 2. Create the service user
```bash
sudo useradd --system --shell /usr/sbin/nologin --home-dir /opt/marriott-box-image-video-tagging marriott-tagger
sudo chown -R marriott-tagger:marriott-tagger /opt/marriott-box-image-video-tagging
```
### 3. Drop credentials in place (NOT in git)
```bash
# Drop credentials
sudo -u marriott-tagger tee /opt/marriott-box-image-video-tagging/box_config.json > /dev/null < /path/to/local/box_config.json
sudo -u marriott-tagger tee /opt/marriott-box-image-video-tagging/.env > /dev/null <<'EOF'
GEMINI_API_KEY=your_key_here
EOF
sudo chmod 600 /opt/marriott-box-image-video-tagging/box_config.json /opt/marriott-box-image-video-tagging/.env
```
### 4. Set up the virtualenv
```bash
# venv
sudo -u marriott-tagger python3 -m venv /opt/marriott-box-image-video-tagging/env
sudo -u marriott-tagger /opt/marriott-box-image-video-tagging/env/bin/pip install -r /opt/marriott-box-image-video-tagging/requirements.txt
```
### 5. Install the systemd unit files
```bash
sudo cp /opt/marriott-box-image-video-tagging/marriott-tagger.service /etc/systemd/system/
sudo cp /opt/marriott-box-image-video-tagging/marriott-tagger.timer /etc/systemd/system/
# Install + enable
sudo cp marriott-tagger.service marriott-tagger.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now marriott-tagger.timer
```
### 6. Verify
In this mode there's no Postgres, no SPA, no api — just `main.py` running on a cron. Tagging-events logging requires `DATABASE_URL` to be set in `.env`; otherwise `db.log_event` no-ops gracefully and you lose the audit log.
```bash
# Show the next scheduled run
systemctl list-timers marriott-tagger.timer
# Trigger a one-off run immediately (timer will still run on schedule)
sudo systemctl start marriott-tagger.service
# Tail the logs (live)
sudo journalctl -u marriott-tagger -f
# Inspect the most recent run's full output
sudo journalctl -u marriott-tagger --since "1 day ago"
```
---
## How the tagging pipeline works
- **Dynamic prompt**: Gemini's prompt is built at runtime from the live Box template definition (`fetch_template_schema`). Field additions / option changes propagate automatically.
- **Metadata + description**: Each file gets structured metadata (filterable in Box search) and a short description (visible in Box list views, also indexed by Box search).
- **Search-keyword tail**: Descriptions are formatted as `<summary>. <comma-separated keywords>.` — the tail covers synonyms / broader terms (food/dining/eating/meal/restaurant) so a Box search for "Food" still hits assets tagged with enum value `Dining`.
- **Video scene breakdown**: Videos additionally get a timestamped scene breakdown written as a comment on the Box file — high-level chapter map for finding moments inside long videos.
- **DB-based skip**: Once a file has a `success` or `backfilled` row, the next pass skips it locally (no Box call, no Gemini call). Run **Backfill from Box** once to mirror Box's existing metadata into the local DB before relying on this.
- **Validation**: Gemini output is validated against the template schema — invalid enum values are dropped, multi-select arrays are filtered to allowed options only.
- **Large-video gating**: Videos exceeding the source or proxy size limits are skipped cleanly rather than wasting time / API budget. Skips are reported as `skipped`, not `errored`.
- **Per-run limiter**: A run will tag at most `MAX_FILES_PER_RUN` files in `MAX_RUN_DURATION` wall-clock seconds. Whichever cap hits first, the run exits cleanly with a summary; the next run picks up the remaining untagged files. This keeps a sudden 1000-file upload from blowing through your Gemini budget in one click.
- **Thumbnail cache**: After a successful tag (or as part of backfill), the file's 160×160 JPG is fetched from Box and stored in `file_assets.thumbnail_bytes`. The SPA renders it inline in search results; `Cache-Control: private, max-age=86400` means the browser caches it for a day.
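
The validation step in the pipeline notes (invalid enum values dropped, multi-select arrays filtered to allowed options) could look roughly like the sketch below. This is an illustration only: the `validate_metadata` helper, the schema shape, and the field names are assumptions, not the actual code in `main.py` or the real `marriottUsa` template. Dropping keys that are absent from the template is also an assumption.

```python
from typing import Any

# Hypothetical schema shape: field key -> {"type": ..., "options": [...]}.
# The real structure returned by fetch_template_schema may differ.
Schema = dict[str, dict[str, Any]]


def validate_metadata(raw: dict[str, Any], schema: Schema) -> dict[str, Any]:
    """Return only the values from `raw` that the template schema allows."""
    clean: dict[str, Any] = {}
    for key, value in raw.items():
        field = schema.get(key)
        if field is None:
            continue  # assumption: keys absent from the template are dropped
        allowed = set(field.get("options", []))
        if field["type"] == "enum":
            # Keep an enum value only if it is one of the allowed options.
            if isinstance(value, str) and value in allowed:
                clean[key] = value
        elif field["type"] == "multiSelect":
            # Filter the array down to allowed options; drop it if nothing is left.
            if isinstance(value, list):
                kept = [v for v in value if isinstance(v, str) and v in allowed]
                if kept:
                    clean[key] = kept
        else:
            clean[key] = value  # free-form field types pass through untouched
    return clean


# Illustrative field names and options, not the real marriottUsa template.
example_schema: Schema = {
    "category": {"type": "enum", "options": ["Dining", "Spa"]},
    "amenities": {"type": "multiSelect", "options": ["Pool", "Gym"]},
    "photographer": {"type": "string"},
}
example_output = validate_metadata(
    {"category": "Food", "amenities": ["Pool", "Sauna"], "photographer": "A. Smith"},
    example_schema,
)
```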
---
### Updating the service
## Credentials & files NOT in git
```bash
cd /opt/marriott-box-image-video-tagging
sudo -u marriott-tagger git pull
# If unit files changed:
sudo cp marriott-tagger.service marriott-tagger.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl restart marriott-tagger.timer
```
- `box_config.json` — Box JWT config. Bind-mounted read-only into the api container.
- `.env` — All env vars including `GEMINI_API_KEY`, `POSTGRES_PASSWORD`, `AZURE_CLIENT_ID`, etc.
- `deploy/apache-marriott-tagging.conf` — generated by `deploy.sh` from the `.tmpl`.
- `frontend/node_modules/`, `frontend/dist/` — npm install / Vite build artefacts.
## Configuration
Edit the constants at the top of `main.py`:
| Setting | Default | Description |
|---------|---------|-------------|
| `BOX_FOLDER_ID` | (varies) | Box folder to process |
| `METADATA_TEMPLATE_KEY` | `marriottUsa` | Box metadata template key |
| `GEMINI_MODEL` | `gemini-2.5-flash` | Gemini model for analysis |
| `EXCLUDED_FOLDER_PREFIXES` | `("z_", "zz_", "zzz_")` | Subfolder name prefixes to skip |
| `GEMINI_DELAY` | `7` | Seconds between Gemini image calls |
| `GEMINI_VIDEO_DELAY` | `10` | Seconds between Gemini video calls |
| `SKIP_ALREADY_TAGGED` | `True` | Skip files with existing metadata |
| `MAX_IMAGE_SIZE` | `1000` | Max pixel dimension for image resize |
| `VIDEO_SIZE_LIMIT_INLINE` | `20 MB` | Below this, send video inline; above, use Gemini File API |
| `VIDEO_SOURCE_SIZE_LIMIT` | `5 GB` | Skip videos whose source file exceeds this |
| `VIDEO_PROXY_SIZE_LIMIT` | `400 MB` | Skip videos whose 480p proxy exceeds this (~60 min runtime) |
| `MAX_FILES_PER_RUN` | `200` | Hard cap on newly-tagged files per run |
| `MAX_RUN_DURATION` | `4h` | Hard wall-clock cap per run |
| `DESCRIPTION_MAX_LENGTH` | `255` | Box description field char limit |
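
To show how the three video size settings in the table interact, here is a minimal sketch of the routing decision. It assumes binary units (1 MB = 1024 × 1024 bytes), assumes the inline threshold is compared against the 480p proxy size rather than the source size, and uses a hypothetical `route_video` helper; the real logic in `main.py` may be structured differently.

```python
from enum import Enum

MB = 1024 * 1024
GB = 1024 * MB

# Defaults copied from the table above; the units are an assumption.
VIDEO_SIZE_LIMIT_INLINE = 20 * MB
VIDEO_SOURCE_SIZE_LIMIT = 5 * GB
VIDEO_PROXY_SIZE_LIMIT = 400 * MB


class VideoRoute(Enum):
    SKIP = "skip"          # too large: reported as skipped, not errored
    INLINE = "inline"      # small enough to send in the request body
    FILE_API = "file_api"  # upload through the Gemini File API first


def route_video(source_bytes: int, proxy_bytes: int) -> VideoRoute:
    """Decide how (or whether) to send a video for analysis."""
    if source_bytes > VIDEO_SOURCE_SIZE_LIMIT or proxy_bytes > VIDEO_PROXY_SIZE_LIMIT:
        return VideoRoute.SKIP
    if proxy_bytes < VIDEO_SIZE_LIMIT_INLINE:
        return VideoRoute.INLINE
    return VideoRoute.FILE_API
```

Checking the skip conditions first means an oversized video is rejected before any download or upload work is attempted.
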
## How It Works
- **Dynamic prompt**: The Gemini prompt is built at runtime from the actual Box template definition. If Marriott adds/changes fields or options in Box Admin, the script adapts automatically.
- **Metadata + description**: Each file gets structured metadata (for filtered search) and a short description (visible in Box list views, also indexed by Box search).
- **Search-keyword tail**: Each description is formatted as `<summary sentence>. <comma-separated keywords>.` — the keyword tail covers synonyms and broader category terms (e.g. food/dining/eating/meal/restaurant) so a search for "Food" hits assets tagged with the enum value `Dining`, etc.
- **Video scene breakdown**: Videos additionally get a timestamped scene breakdown written as a comment on the Box file — a high-level chapter map for finding moments inside long videos.
- **Resumable**: Files with existing metadata are skipped by default, so the script can be re-run after interruptions or when new files are added.
- **Validation**: Gemini output is validated against the template schema — invalid enum values are dropped, multiSelect arrays are filtered to allowed options only.
- **Large-video gating**: Videos exceeding the source or proxy size limits are skipped cleanly rather than wasting time / API budget on content beyond Gemini's context window. Skips are reported in the summary as `skipped`, not `errored`.
- **Per-run limiter**: A daily run tags at most `MAX_FILES_PER_RUN` new files and runs for at most `MAX_RUN_DURATION` of wall-clock time. Whichever cap hits first, the run exits cleanly with a summary line; the next scheduled run picks up the remaining untagged files. This keeps a sudden 1000-file upload from blowing through your Gemini budget in one night.
`.env.example` is checked in; copy it to `.env` and fill in.