Rewrite README to reflect current architecture
Old README still described the nightly scheduler container, didn't cover backfill / thumbnails / admin gating / multi-token search / the API endpoints, and pointed at fields that no longer exist on events.

Comprehensive rewrite covering: what the app does today, architecture diagram, repo layout, local quickstart, full env-var reference, operations (run/backfill/inspect), API surface, MSAL setup steps, deploy script + manual vhost Include, the two-table schema, troubleshooting, and the legacy systemd path preserved at the end for reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 04440d661d
commit 88a0adcfbf
1 changed file with 406 additions and 206 deletions
612
README.md
|
|
@@ -1,189 +1,364 @@
|
|||
# Marriott Box Asset Tagger
|
||||
|
||||
Batch-processes images **and videos** in a Box folder, analyzes them with Gemini AI, and writes structured metadata back to Box using the `marriottUsa` metadata template. Videos use Box's 480p MP4 proxy representations to keep bandwidth and Gemini token usage manageable.
|
||||
AI-driven metadata tagging for images and videos stored in a Marriott Box folder, with a searchable Postgres audit log and a React SPA on top. Gemini analyses each asset against the `marriottUsa` metadata template; the resulting structured metadata, description, and (for videos) scene breakdown are written back to Box. Every Gemini call is also persisted to a local Postgres so it can be searched, audited, and re-displayed without round-tripping Box.
|
||||
|
||||
Every Gemini-analysed file is also written to a Postgres `tagging_events` table for search/audit, and there's a small React SPA on top (search across all logged fields, click through to the original Box file, trigger an on-demand tagging pass).
|
||||
---
|
||||
|
||||
## Components
|
||||
## What you can do
|
||||
|
||||
- **`scheduler.py` (tagger container)** — APScheduler that fires `main.main()` on a cron (default daily 02:00).
|
||||
- **`api.py` (api container, FastAPI)** — search, list runs, kick off ad-hoc runs in a background thread.
|
||||
- **`db.py` + `schema.sql`** — Postgres logging layer.
|
||||
- **`frontend/` (Vite + React + TS)** — single-page UI, served by Apache from `/var/www/html/marriott-tagging/` in prod. Auth via `@azure/msal-react` with a `VITE_DEV_AUTH_BYPASS` switch.
|
||||
- **Trigger a tagging pass** from the SPA's **Run now** button — admin-only. Walks the configured Box folder, skips files already in the local DB, sends new ones to Gemini, validates against the live Box template schema, writes metadata + description (and scene-breakdown comments for videos) back to Box, and inserts a `tagging_events` row per file Gemini saw.
|
||||
- **Backfill from Box** — admin-only. Walks the Box folder and mirrors any existing `marriottUsa` metadata into the local DB (status = `backfilled`). No Gemini calls, no Box writes. Use this after first deploy, after restoring a wiped DB, or to refresh thumbnails. Re-runnable safely.
|
||||
- **Search** the request log across every text + JSON field (file name, folder path, description, validated metadata, raw Gemini response, scene breakdown, status, file ID, and the consolidated `search_terms` blob). Multi-word queries are AND'd across tokens; each token also fuzzy-matches via `pg_trgm` similarity so `femalle` still finds "female".
|
||||
- **See thumbnails** inline in the search results — Box's pre-generated 160×160 JPG for each file is cached in Postgres (`file_assets.thumbnail_bytes`).
|
||||
- **Click through to Box** on every result — the `box_url` is synthesised per row.
|
||||
- **Azure AD SSO** for sign-in, with a `DEV_AUTH_BYPASS` switch for local dev and `ADMIN_EMAILS` allowlist gating the destructive endpoints.
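The multi-token search behaviour in the list above can be approximated in plain Python. This is only a sketch under stated assumptions: the real matching runs inside Postgres through `pg_trgm`, and the helper names (`trigrams`, `similarity`, `token_matches`, `matches`), the trigram padding rule, and the choice to compare each token against individual words are illustrative guesses, not code taken from `api.py`. The `0.3` threshold is the one quoted in the API reference.

```python
import re


def trigrams(word: str) -> set:
    """Character trigrams of a lowercased word, padded roughly the way pg_trgm pads."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (a Jaccard-style score)."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def token_matches(token: str, text: str, threshold: float = 0.3) -> bool:
    """A token matches on a plain substring hit OR a fuzzy hit against any word."""
    haystack = text.lower()
    token = token.lower()
    if token in haystack:
        return True
    words = re.findall(r"\w+", haystack)
    return any(similarity(token, word) > threshold for word in words)


def matches(query: str, text: str, threshold: float = 0.3) -> bool:
    """Whitespace-tokenise the query; every token must match (logical AND)."""
    tokens = query.split()
    return bool(tokens) and all(token_matches(t, text, threshold) for t in tokens)


if __name__ == "__main__":
    row = "female model in hotel lobby"
    print(matches("femalle", row))        # misspelling still matches via trigrams
    print(matches("lobby female", row))   # both tokens present
    print(matches("female beach", row))   # second token has no match
```

Requiring every token to match keeps multi-word queries narrow, while the trigram fallback tolerates small misspellings in any single token.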
|
||||
|
||||
## Setup
|
||||
The cron-driven nightly scheduler that used to fire passes automatically has been **removed**. The tool is manual-only: a human clicks Run now (or POSTs `/api/runs`). This keeps Box and Gemini API costs predictable as the folder grows. `scheduler.py` remains in the repo if you want to wire cron back in.
|
||||
|
||||
### 1. Clone and create virtual environment
|
||||
---
|
||||
|
||||
```bash
|
||||
cd Marriott_Box_Asset_Tagging
|
||||
python3 -m venv env
|
||||
source env/bin/activate
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
### 2. Box JWT credentials
|
||||
|
||||
Download your Box app's JWT config from the [Box Developer Console](https://app.box.com/developers/console) and save it as `box_config.json` in the project root.
|
||||
|
||||
The service account must have:
|
||||
- Access to folder `370595013246`
|
||||
- Permission to read/write metadata using the `marriottUsa` template
|
||||
|
||||
### 3. Gemini API key
|
||||
|
||||
Add your key to `.env`:
|
||||
## Architecture
|
||||
|
||||
```
|
||||
GEMINI_API_KEY=your_key_here
|
||||
Apache (shared vhost on optical-dev.oliver.solutions)
  │
  ├──── /marriott-tagging/api/* ──┐
  │                               ▼
  │                     ┌──────────────────┐
  │                     │ api container    │
  │                     │ (uvicorn,        │
  │                     │  FastAPI)        │
  │                     │                  │
  │                     │ • /api/health    │
  │                     │ • /api/me        │
  │                     │ • /api/events    │
  │                     │ • /api/runs      │ ──┐
  │                     │ • /api/backfill  │ ──┤ background
  │                     │ • /api/files/    │   │ thread runs
  │                     │   {id}/thumb     │   │ main._run_pass
  │                     └──────────────────┘   │ / _run_backfill
  │                               │            │ which call →
  └──── /marriott-tagging/*  ◀────┘            │
        (static SPA from                       ▼
         /var/www/html/               ┌──────────────────┐
         marriott-tagging/)           │ Box API          │
                                      │ Gemini API       │
                                      └──────────────────┘
                                               │
                                               ▼
                                      ┌──────────────────┐
                                      │ db container     │
                                      │ (Postgres 16,    │
                                      │  bound to        │
                                      │  127.0.0.1)      │
                                      │                  │
                                      │ • tagging_events │
                                      │ • file_assets    │
                                      └──────────────────┘
|
||||
```
|
||||
|
||||
Get a key at [Google AI Studio](https://aistudio.google.com/apikey).
|
||||
**Containers:** `db` + `api`. They share a Docker network and a named volume (`marriott-tagging_pgdata`).
|
||||
**Outside the container set:** Apache (host), built SPA at `/var/www/html/marriott-tagging/`, the shared vhost include that proxies `/marriott-tagging/api/` to the api container.
|
||||
|
||||
## Usage
|
||||
---
|
||||
|
||||
```bash
|
||||
source env/bin/activate
|
||||
python main.py
|
||||
```
|
||||
## Repo layout
|
||||
|
||||
The script will:
|
||||
1. Authenticate with Box and Gemini
|
||||
2. Fetch the `marriottUsa` template schema (fields, types, allowed values)
|
||||
3. Build a dynamic Gemini prompt from the schema
|
||||
4. Recursively list all image and video files in the target folder
|
||||
5. For each image: download, resize, analyze with Gemini, validate metadata, write metadata + description to Box
|
||||
6. For each video: fetch the 480p MP4 proxy from Box, analyze with Gemini, write metadata + description + a scene-breakdown comment to Box
|
||||
7. Print a summary of results
|
||||
| Path | Purpose |
|---|---|
| `main.py` | The tagging pipeline: Box client, Gemini calls, validation, Box writes, Postgres logging, thumbnail fetch. `_run_pass(...)` for normal passes; `_run_backfill(...)` for the Box → DB mirror. |
| `api.py` | FastAPI app: search, run-trigger, backfill-trigger, thumbnail-serve. Background threads do the actual tagging/backfill work so the request returns immediately. |
| `auth.py` | Azure AD JWT validation against the tenant JWKS + the `DEV_AUTH_BYPASS` short-circuit. Exposes `require_auth` and `require_admin` FastAPI dependencies. |
| `db.py` | psycopg3 helpers: `get_conn`, `ensure_schema`, `log_event`, `upsert_file_asset`, `get_thumbnail`, `is_file_already_tagged`. Defensive: DB errors never crash a tagging pass. |
| `schema.sql` | `tagging_events`, `file_assets`, indexes, `pg_trgm` extension. Applied idempotently on api startup via the FastAPI lifespan handler. |
| `scheduler.py` | APScheduler entry point, kept for archival / opt-back-in. Not currently used; the compose file no longer wires up a `tagger` service. |
| `frontend/` | Vite + React + TS SPA. `src/App.tsx` is the main page; `src/auth.tsx` does MSAL with the bypass switch; `src/api.ts` is the client. |
| `Dockerfile` | python:3.12-slim, non-root `appuser`. Same image runs the api container (and could run the scheduler if reactivated). |
| `docker-compose.yml` | `name: marriott-tagging` pinned. db (postgres:16) + api (built from Dockerfile). All host ports bound to `127.0.0.1`. |
| `deploy/deploy.sh` | Idempotent server deploy: port auto-pick, git pull, rebuild, SPA build via one-shot node:20-alpine, rsync to `/var/www/html/`, `/api/health` poll. |
| `deploy/apache-marriott-tagging.conf.tmpl` | Apache vhost include: proxy `/marriott-tagging/api/`, alias `/marriott-tagging` to the SPA web-root, SPA fallback. `__API_PORT__` rendered by `deploy.sh`. |
| `marriott-tagger.service` / `.timer` | Legacy systemd path. Not used in Docker mode. |
|
||||
|
||||
## Run with Docker
|
||||
---
|
||||
|
||||
Brings up Postgres, the scheduler (`tagger`), and the FastAPI backend (`api`). The frontend is built separately by `deploy/deploy.sh` (or `npm run dev` locally) and consumed by the API.
|
||||
## Quick start — local dev (macOS / Linux)
|
||||
|
||||
### 1. Prereqs
|
||||
|
||||
- Docker Desktop (or Docker Engine + Compose v2)
|
||||
- `box_config.json` in the project root
|
||||
- A `.env` copied from `.env.example`, filled in
|
||||
- Docker Desktop or Docker Engine with Compose v2
|
||||
- Node 20+ (for `npm run dev`)
|
||||
- `box_config.json` in the repo root (JWT config from the Box Developer Console)
|
||||
- `.env` from `.env.example`
|
||||
|
||||
```bash
|
||||
cp .env.example .env
|
||||
# minimum: GEMINI_API_KEY, POSTGRES_PASSWORD
|
||||
# leave DEV_AUTH_BYPASS=true for now if you don't have Azure IDs ready
|
||||
# At minimum: set GEMINI_API_KEY and POSTGRES_PASSWORD
|
||||
$EDITOR .env
|
||||
```
|
||||
|
||||
### 2. Build and start the backend services
|
||||
### 2. Bring up Postgres + API
|
||||
|
||||
```bash
|
||||
docker compose up --build -d
|
||||
```
|
||||
|
||||
This brings up three services:
|
||||
- `db` — Postgres 16, named volume `pgdata`, port bound to `127.0.0.1:${POSTGRES_HOST_PORT:-5432}`.
|
||||
- `tagger` — runs `scheduler.py` (cron-driven Gemini passes).
|
||||
- `api` — runs `uvicorn api:app` on container port 8000, published to `127.0.0.1:${MARRIOTT_API_PORT:-8004}`.
|
||||
This starts:
|
||||
- `db` — Postgres 16, named volume `pgdata`, host port `127.0.0.1:${POSTGRES_HOST_PORT:-5432}`.
|
||||
- `api` — `uvicorn api:app`, host port `127.0.0.1:${MARRIOTT_API_PORT:-8004}`.
|
||||
|
||||
### 3. Run the frontend (local dev)
|
||||
Check health:
|
||||
|
||||
```bash
|
||||
curl -s http://127.0.0.1:8004/api/health | jq
|
||||
```
|
||||
|
||||
### 3. Run the SPA
|
||||
|
||||
```bash
|
||||
cd frontend
|
||||
npm install
|
||||
npm run dev # http://localhost:5173
|
||||
npm run dev # http://localhost:5173
|
||||
```
|
||||
|
||||
Vite proxies `/api/*` to the FastAPI host port (default `8004`). With `VITE_DEV_AUTH_BYPASS=true` you'll be auto-signed-in as the dev user.
|
||||
Vite proxies `/api/*` to `127.0.0.1:${MARRIOTT_API_PORT:-8004}`. With the default `VITE_DEV_AUTH_BYPASS=true` you're auto-signed-in as the dev user.
|
||||
|
||||
### 4. Fire a tagging pass
|
||||
### 4. Try a backfill
|
||||
|
||||
Three ways:
|
||||
- **UI** — open the SPA and click **Run now**. Polls live until done.
|
||||
- **API** — `curl -X POST http://127.0.0.1:8004/api/runs` (DEV_AUTH_BYPASS=true) or with a Bearer token in prod.
|
||||
- **Container** — `docker compose exec tagger python main.py` (bypasses the API entirely).
|
||||
In the SPA click **Backfill from Box**. The active panel polls every 2 s and shows each file as it's processed. Thumbnails appear inline as rows land.
|
||||
|
||||
### 5. Inspect the DB
|
||||
---
|
||||
|
||||
## Configuration reference
|
||||
|
||||
All variables live in `.env` (gitignored). `.env.example` has the full list with comments.
|
||||
|
||||
### Required to start
|
||||
|
||||
| Variable | Purpose |
|---|---|
| `GEMINI_API_KEY` | Google AI Studio key for Gemini calls. |
| `POSTGRES_USER` / `POSTGRES_PASSWORD` / `POSTGRES_DB` | DB creds. The compose file uses these to create the role + database. |
|
||||
|
||||
### Ports (auto-managed by `deploy.sh` on the server)
|
||||
|
||||
| Variable | Default | Range scanned by deploy.sh |
|---|---|---|
| `POSTGRES_HOST_PORT` | `5432` | 5435 – 5499 |
| `MARRIOTT_API_PORT` | `8004` | 8003 – 8099 |
|
||||
|
||||
Both bound to `127.0.0.1` only — Postgres and the FastAPI process are never on the public internet. Apache reverse-proxies to `MARRIOTT_API_PORT`.
|
||||
|
||||
### Auth
|
||||
|
||||
| Variable | Purpose |
|---|---|
| `DEV_AUTH_BYPASS` | `true` skips MSAL entirely; the api treats every caller as `DEV_AUTH_EMAIL`. Defaults to `true` to keep dev/first-deploy unblocked. |
| `DEV_AUTH_EMAIL` / `DEV_AUTH_NAME` | Identity stamped on requests when bypassed. |
| `DEV_AUTH_IS_ADMIN` | `true` (default) keeps the bypass user as admin; flip to `false` to preview the read-only UX. |
| `AZURE_TENANT_ID` / `AZURE_CLIENT_ID` | Your Azure AD app registration. Backend uses them to validate JWTs (JWKS fetch + `aud`/`iss` check). |
| `ADMIN_EMAILS` | Comma-separated allowlist that gates `POST /api/runs` and `POST /api/backfill`. Case-insensitive. Members see the destructive buttons in the SPA; everyone else gets read-only search. |
| `VITE_DEV_AUTH_BYPASS` / `VITE_AZURE_TENANT_ID` / `VITE_AZURE_CLIENT_ID` | Frontend mirrors. Baked into the SPA bundle at build time; changing them requires a re-build (`deploy.sh` handles this). |
| `VITE_PUBLIC_BASE` | Used by Vite for the SPA's `base` (asset prefix) AND by MSAL as the redirect-URI root. In local dev: `http://localhost:5173`. On the server, `deploy.sh` overrides with the prod URL automatically. |
|
||||
|
||||
### Behavioural
|
||||
|
||||
| Variable | Purpose |
|---|---|
| `CORS_ORIGINS` | Comma-separated. Only set in local dev when Vite is on `:5173` and FastAPI on host `:8004`. Empty in prod (Apache makes them same-origin). |
| `TZ` | Container timezone. Defaults to `UTC`. |
| `SCHEDULE_CRON` / `RUN_AT_STARTUP` | Read by `scheduler.py` only. Unused by default (no scheduler container in compose). |
|
||||
|
||||
### Pipeline tuning (`main.py` constants)
|
||||
|
||||
Not in `.env` — edited at the top of `main.py`:
|
||||
|
||||
| Setting | Default | Description |
|---|---|---|
| `BOX_FOLDER_ID` | varies | Root Box folder to scan recursively. |
| `METADATA_TEMPLATE_KEY` | `marriottUsa` | Box metadata template key. |
| `GEMINI_MODEL` | `gemini-2.5-flash` | Model used for both image + video analysis. |
| `EXCLUDED_FOLDER_PREFIXES` | `("z_", "zz_", "zzz_")` | Subfolder names to skip. |
| `GEMINI_DELAY` / `GEMINI_VIDEO_DELAY` | `7` / `10` s | Per-call rate-limit sleep. |
| `MAX_IMAGE_SIZE` | `1000 px` | Longest side after resize before sending to Gemini. |
| `VIDEO_SIZE_LIMIT_INLINE` | `20 MB` | Below this, Gemini gets the video inline; above, the File API is used. |
| `VIDEO_SOURCE_SIZE_LIMIT` | `5 GB` | Skip videos with source file above this. |
| `VIDEO_PROXY_SIZE_LIMIT` | `400 MB` | Skip videos with 480p proxy above this. |
| `MAX_FILES_PER_RUN` | `200` | Hard cap on newly-tagged files per pass. |
| `MAX_RUN_DURATION` | `4 h` | Hard wall-clock cap per pass. |
| `DESCRIPTION_MAX_LENGTH` | `255` | Box description field char limit. |
| `SKIP_ALREADY_TAGGED` | `True` | Toggles the DB-based skip check. |
| `THUMBNAIL_DIM` | `160` | Pixel dimension for cached thumbnails. |
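
To make the two hard caps in the table above concrete, here is a minimal sketch of how a pass might enforce `MAX_FILES_PER_RUN` and `MAX_RUN_DURATION`. It is not taken from `main.py`: the `run_with_caps` function, the `tag_file` callback, the injectable `clock`, and the returned stop reasons are all hypothetical names chosen for illustration. Only the two constant names and their defaults come from the table.

```python
import time

MAX_FILES_PER_RUN = 200          # default from the table above
MAX_RUN_DURATION = 4 * 60 * 60   # 4 h, expressed in seconds


def run_with_caps(files, tag_file,
                  max_files=MAX_FILES_PER_RUN,
                  max_seconds=MAX_RUN_DURATION,
                  clock=time.monotonic):
    """Tag files until the list is exhausted or one of the caps is hit.

    `tag_file(f)` is a hypothetical callback that returns True when the
    file was newly tagged and False when it was skipped, so skipped files
    do not count towards the per-run file cap.
    """
    started = clock()
    tagged = 0
    for f in files:
        if tagged >= max_files:
            return tagged, "file_cap"
        if clock() - started >= max_seconds:
            return tagged, "time_cap"
        if tag_file(f):
            tagged += 1
    return tagged, "done"


if __name__ == "__main__":
    # Ten candidate files, cap of three newly tagged files per pass.
    print(run_with_caps(range(10), lambda f: True, max_files=3))
```

Checking both caps before each file, rather than after, means a pass never starts work it is not allowed to finish counting.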
|
||||
|
||||
---
|
||||
|
||||
## Operations
|
||||
|
||||
### Trigger a tagging pass
|
||||
|
||||
- **From the SPA** — click **Run now**. UI polls live; events stream into the active panel.
|
||||
- **From a shell** (works with `DEV_AUTH_BYPASS=true`):
|
||||
```bash
|
||||
curl -X POST http://127.0.0.1:8004/api/runs
|
||||
```
|
||||
- **From inside the api container** (bypasses the API entirely):
|
||||
```bash
|
||||
docker compose exec api python main.py
|
||||
```
|
||||
|
||||
### Trigger a backfill
|
||||
|
||||
- **From the SPA** — click **Backfill from Box** (admin-only; confirms first).
|
||||
- **From a shell**:
|
||||
```bash
|
||||
curl -X POST http://127.0.0.1:8004/api/backfill
|
||||
```
|
||||
|
||||
Backfill is idempotent: re-running won't duplicate `tagging_events` rows, and `file_assets` rows are upserted (preserving previously-captured thumbnails if today's fetch fails).
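The "preserve the previous thumbnail if today's fetch fails" behaviour can be sketched with an upsert that falls back to the stored value when the new one is `NULL`. This is an illustration under stated assumptions: it uses SQLite from the Python standard library so it runs anywhere, whereas the project uses Postgres through psycopg3, and the table is reduced to three columns. `upsert_file_asset_sketch` and `make_db` are hypothetical names, not the real `db.upsert_file_asset`.

```python
import sqlite3

# COALESCE keeps the stored thumbnail whenever the incoming value is NULL.
UPSERT_SQL = """
INSERT INTO file_assets (file_id, thumbnail_bytes, search_terms)
VALUES (?, ?, ?)
ON CONFLICT (file_id) DO UPDATE SET
    thumbnail_bytes = COALESCE(excluded.thumbnail_bytes, file_assets.thumbnail_bytes),
    search_terms    = excluded.search_terms
"""


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the real file_assets table (reduced columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_assets ("
        " file_id TEXT PRIMARY KEY,"
        " thumbnail_bytes BLOB,"
        " search_terms TEXT)"
    )
    return conn


def upsert_file_asset_sketch(conn, file_id, thumbnail, search_terms):
    """Insert or update one row; pass thumbnail=None when the fetch failed."""
    conn.execute(UPSERT_SQL, (file_id, thumbnail, search_terms))
    conn.commit()


if __name__ == "__main__":
    conn = make_db()
    upsert_file_asset_sketch(conn, "123", b"jpeg-bytes", "lobby")
    upsert_file_asset_sketch(conn, "123", None, "lobby pool")  # fetch failed today
    print(conn.execute("SELECT * FROM file_assets").fetchall())
```

Because the primary key is the Box file ID, re-running the same backfill updates rows in place instead of adding new ones.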
|
||||
|
||||
### Inspect the DB
|
||||
|
||||
```bash
|
||||
docker compose exec db psql -U marriott marriott_tagging -c '\d tagging_events'
|
||||
|
||||
docker compose exec db psql -U marriott marriott_tagging -c \
|
||||
"SELECT status, count(*) FROM tagging_events GROUP BY status;"
|
||||
docker compose exec db psql -U marriott marriott_tagging
|
||||
```
|
||||
|
||||
### Auth: enabling Azure AD SSO
|
||||
```sql
|
||||
-- Row counts by status
|
||||
SELECT status, count(*) FROM tagging_events GROUP BY status;
|
||||
|
||||
1. Register (or reuse) an Azure AD app. Redirect URI:
|
||||
- Local dev: `http://localhost:5173`
|
||||
- Prod: `https://optical-dev.oliver.solutions/marriott-tagging/`
|
||||
2. Expose an API with scope `access_as_user` whose audience is the same client ID.
|
||||
3. Fill `.env`:
|
||||
-- Recent events
|
||||
SELECT created_at, media_type, file_name, status
|
||||
FROM tagging_events ORDER BY created_at DESC LIMIT 20;
|
||||
|
||||
-- Thumbnail coverage
|
||||
SELECT count(*) AS total,
|
||||
count(*) FILTER (WHERE thumbnail_bytes IS NOT NULL) AS with_thumb,
|
||||
avg(octet_length(thumbnail_bytes))::int AS avg_bytes
|
||||
FROM file_assets;
|
||||
|
||||
-- All events for a given run
|
||||
SELECT file_name, status, error_message
|
||||
FROM tagging_events
|
||||
WHERE run_id = '<uuid>'
|
||||
ORDER BY created_at;
|
||||
```
|
||||
|
||||
From your laptop (via SSH tunnel — Postgres isn't on the public internet):
|
||||
|
||||
```bash
|
||||
ssh -L 55432:127.0.0.1:5435 user@optical-dev.oliver.solutions
|
||||
psql postgresql://marriott:<password>@127.0.0.1:55432/marriott_tagging
|
||||
```
|
||||
|
||||
### Logs
|
||||
|
||||
```bash
|
||||
docker compose logs -f api # API + background tagging/backfill threads
|
||||
docker compose logs -f db
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## API reference
|
||||
|
||||
All endpoints behind `/api`. With `DEV_AUTH_BYPASS=true` no token is needed; with SSO enabled, include `Authorization: Bearer <access_token>`.
|
||||
|
||||
| Method | Path | Auth | Description |
|---|---|---|---|
| GET | `/api/health` | none | Liveness + DB-reachable check + auth-config summary. |
| GET | `/api/me` | required | `{ oid, name, email, dev, is_admin }`. SPA uses `is_admin` to hide the destructive buttons. |
| GET | `/api/events?q=…&limit=…` | required | Search. Whitespace-tokenises `q`; each token must match (substring OR pg_trgm similarity > 0.3) across the searched columns. Results ranked by summed similarity. `limit` 1-500 (default 100). |
| POST | `/api/runs` | admin | Kicks off a tagging pass in a daemon thread. Returns `{ run_id, state: "running", started_by }`. |
| GET | `/api/runs?limit=…` | required | Recent runs from `tagging_events`, grouped by `run_id`, with counts and live state if still running. |
| GET | `/api/runs/{run_id}/events` | required | Per-event detail for a single run. Includes `live_state` (`running` / `completed` / `failed`) and `live_error`. |
| POST | `/api/backfill` | admin | Kicks off a backfill in a daemon thread. Same response shape as `/api/runs`. |
| GET | `/api/files/{file_id}/thumbnail` | required | Streams the cached JPG thumbnail (`Cache-Control: max-age=86400`) or 404. |
|
||||
|
||||
Every event in `/api/events` / `/api/runs/{id}/events` includes a synthesised `box_url` (`https://app.box.com/file/<file_id>`) and a `has_thumbnail` boolean. The frontend builds the thumbnail URL via `thumbnailUrl(file_id)` which respects the SPA's base prefix.
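As a rough illustration of the two derived fields described above, the sketch below builds a `box_url` from a file ID and a thumbnail path that respects a base prefix. The URL pattern and the `/api/files/{file_id}/thumbnail` route come from this README; the function names (`box_url`, `thumbnail_path`, `decorate_event`) and the `base` parameter are assumptions for illustration, and the real frontend helper `thumbnailUrl` is TypeScript, not Python.

```python
from urllib.parse import quote

BOX_FILE_URL = "https://app.box.com/file/{file_id}"


def box_url(file_id: str) -> str:
    """Synthesise the Box web URL for a file ID (pattern from the README)."""
    return BOX_FILE_URL.format(file_id=quote(str(file_id), safe=""))


def thumbnail_path(file_id: str, base: str = "") -> str:
    """Thumbnail endpoint path, prefixed with the SPA base (e.g. '/marriott-tagging')."""
    return f"{base.rstrip('/')}/api/files/{quote(str(file_id), safe='')}/thumbnail"


def decorate_event(event: dict, has_thumbnail: bool) -> dict:
    """Return a copy of an event row with the two derived fields attached."""
    return {**event, "box_url": box_url(event["file_id"]), "has_thumbnail": has_thumbnail}


if __name__ == "__main__":
    print(decorate_event({"file_id": "12345", "file_name": "lobby.jpg"}, True))
    print(thumbnail_path("12345", "/marriott-tagging/"))
```

Deriving both values from `file_id` at response time means nothing extra has to be stored per event row.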
|
||||
|
||||
---
|
||||
|
||||
## Auth setup
|
||||
|
||||
### Dev / first deploy
|
||||
|
||||
Keep `DEV_AUTH_BYPASS=true` and `VITE_DEV_AUTH_BYPASS=true`. Every request authenticates as `DEV_AUTH_EMAIL`, and the dev user is admin by default (toggle `DEV_AUTH_IS_ADMIN=false` to test the read-only UX).
|
||||
|
||||
### Enabling Azure AD SSO
|
||||
|
||||
1. **Azure AD app registration** (reuse an existing one if you have it).
|
||||
- Redirect URIs (Single-page application platform):
|
||||
- Local: `http://localhost:5173`
|
||||
- Prod: `https://optical-dev.oliver.solutions/marriott-tagging/`
|
||||
- **Expose an API** with scope `access_as_user` whose Application ID URI is `api://<client-id>`.
|
||||
2. **Backend `.env`** (the api container):
|
||||
```
|
||||
DEV_AUTH_BYPASS=false
|
||||
AZURE_TENANT_ID=...
|
||||
AZURE_CLIENT_ID=...
|
||||
VITE_DEV_AUTH_BYPASS=false
|
||||
VITE_AZURE_TENANT_ID=...
|
||||
VITE_AZURE_CLIENT_ID=...
|
||||
AZURE_TENANT_ID=<tenant-uuid>
|
||||
AZURE_CLIENT_ID=<client-uuid>
|
||||
ADMIN_EMAILS=alice@oliver.agency,bob@oliver.agency
|
||||
```
|
||||
3. **Frontend `.env`** (baked into the SPA at build time):
|
||||
```
|
||||
VITE_DEV_AUTH_BYPASS=false
|
||||
VITE_AZURE_TENANT_ID=<tenant-uuid>
|
||||
VITE_AZURE_CLIENT_ID=<client-uuid>
|
||||
```
|
||||
4. Rebuild + redeploy:
|
||||
```bash
|
||||
./deploy/deploy.sh
|
||||
docker compose up -d --force-recreate api
|
||||
```
|
||||
4. `docker compose up -d --force-recreate api` and rebuild the SPA (`deploy.sh` does this on the server; locally `cd frontend && npm run build`).
|
||||
|
||||
Backend validates JWT signature against the tenant's JWKS, checks `aud == AZURE_CLIENT_ID` and `iss` matches one of the tenant URLs. With bypass=true, every request is logged as the `DEV_AUTH_EMAIL` user.
|
||||
Backend validation: fetches the tenant's JWKS, verifies the RS256 signature, checks `aud == AZURE_CLIENT_ID` and `iss` matches one of the tenant issuer URLs. Admin gating: the email claim (`preferred_username` / `upn` / `email`) must match an entry in `ADMIN_EMAILS` (case-insensitive).
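The admin-gating half of that check is simple enough to sketch without any JWT library. The example assumes the token has already been validated and decoded into a `claims` dict. The function names and the order in which the three claims are tried are assumptions for illustration, not the actual contents of `auth.py`.

```python
from typing import Optional


def parse_admin_emails(raw: str) -> frozenset:
    """Parse the comma-separated ADMIN_EMAILS value into a lowercased set."""
    return frozenset(part.strip().lower() for part in raw.split(",") if part.strip())


def email_from_claims(claims: dict) -> Optional[str]:
    """Pick the first populated email-like claim (the order here is an assumption)."""
    for key in ("preferred_username", "upn", "email"):
        value = claims.get(key)
        if value:
            return str(value).strip().lower()
    return None


def is_admin(claims: dict, admin_emails: frozenset) -> bool:
    """Case-insensitive allowlist check against already-validated token claims."""
    email = email_from_claims(claims)
    return email is not None and email in admin_emails


if __name__ == "__main__":
    admins = parse_admin_emails("alice@oliver.agency, bob@oliver.agency")
    print(is_admin({"preferred_username": "Alice@Oliver.agency"}, admins))
    print(is_admin({"email": "carol@oliver.agency"}, admins))
```

Lowercasing both the allowlist and the claim once, at the boundary, keeps the comparison itself a plain set lookup.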
|
||||
|
||||
### Stop / tear down
|
||||
---
|
||||
|
||||
```bash
|
||||
docker compose down # stops containers, keeps the DB volume
|
||||
docker compose down -v # also deletes the DB volume (destroys data)
|
||||
```
|
||||
## Server deployment — optical-dev.oliver.solutions
|
||||
|
||||
### Notes
|
||||
|
||||
- Postgres failures never stop the tagger — `db.log_event` swallows errors. Box is the source of truth for "already tagged".
|
||||
- The `marriott-tagger.service` / `.timer` files below remain for the older systemd deployment path; the Docker path is the recommended one. Don't run both on the same host.
|
||||
|
||||
## Server Deployment (Docker — optical-dev.oliver.solutions)
|
||||
|
||||
This is the recommended path on the shared `optical-dev.oliver.solutions` dev server. Apps live under `/opt/<slug>/` with an idempotent `deploy/deploy.sh`. Mirrors the OSOP / adeo split-build pattern: backend in Docker, SPA built and served by Apache from `/var/www/html/marriott-tagging/`.
|
||||
Mirrors the OSOP / adeo split-build pattern: backend in Docker, SPA built and served by Apache.
|
||||
|
||||
**Public URL:** `https://optical-dev.oliver.solutions/marriott-tagging/`
|
||||
|
||||
### First-time setup
|
||||
|
||||
```bash
|
||||
# 1. Clone into /opt
|
||||
sudo git clone git@bitbucket.org:zlalani/marriott-box-image-video-tagging.git \
|
||||
/opt/marriott-box-image-video-tagging
|
||||
sudo chown -R "$USER:$USER" /opt/marriott-box-image-video-tagging
|
||||
cd /opt/marriott-box-image-video-tagging
|
||||
|
||||
# 2. Drop credentials in place (NOT in git)
|
||||
cp .env.example .env
|
||||
$EDITOR .env # GEMINI_API_KEY, POSTGRES_PASSWORD,
|
||||
# Azure IDs (or DEV_AUTH_BYPASS=true)
|
||||
$EDITOR box_config.json # paste the Box JWT config
|
||||
$EDITOR .env # fill required values
|
||||
$EDITOR box_config.json # paste Box JWT config
|
||||
|
||||
# 3. Deploy
|
||||
./deploy/deploy.sh
|
||||
```
|
||||
|
||||
The script will:
|
||||
- Sanity-check `.env`, `box_config.json`, docker, git, compose v2.
|
||||
- Pick free host ports — Postgres (default 5435, range 5435-5499) and API (default 8004, range 8003-8099) — persisted to `.env`.
|
||||
- Render `deploy/apache-marriott-tagging.conf` from `.tmpl` with the picked API port.
|
||||
- `git pull --ff-only`, `docker compose build`, `docker compose up -d` (db + tagger + api).
|
||||
- Build the Vite SPA in a one-shot `node:20` container; rsync `frontend/dist/` to `/var/www/html/marriott-tagging/`.
|
||||
- Poll `/api/health` until ready and verify the tagger container is running.
|
||||
- Print the Apache `Include` line you need to add to the shared vhost.
|
||||
`deploy.sh` will:
|
||||
1. Sanity-check `.env`, `box_config.json`, docker, git, compose v2.
|
||||
2. Auto-pick free host ports (`POSTGRES_HOST_PORT` 5435-5499, `MARRIOTT_API_PORT` 8003-8099), persisting choices back to `.env`.
|
||||
3. Render `deploy/apache-marriott-tagging.conf` from the `.tmpl` with the picked api port.
|
||||
4. `git pull --ff-only`, `docker compose build`, `docker compose up -d`.
|
||||
5. Build the Vite SPA in a one-shot `node:20-alpine` container (with `VITE_PUBLIC_BASE=https://optical-dev.oliver.solutions/marriott-tagging`), rsync `frontend/dist/` to `/var/www/html/marriott-tagging/`.
|
||||
6. Poll `/api/health` until ready; verify the api container is running.
|
||||
7. Print the Apache `Include` line to add to the shared vhost.
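The port auto-pick in step 2 boils down to "try each port in the range and keep the first one that can be bound". The sketch below shows that idea in Python purely for illustration. `deploy.sh` is a shell script and may probe ports differently, and `is_port_free` / `pick_free_port` are hypothetical names.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """A port counts as free if we can bind a TCP socket to it right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
        return True


def pick_free_port(start: int, end: int, host: str = "127.0.0.1") -> int:
    """Return the first free port in [start, end], scanning upwards."""
    for port in range(start, end + 1):
        if is_port_free(port, host):
            return port
    raise RuntimeError(f"no free port in {start}-{end} on {host}")


if __name__ == "__main__":
    # Ranges taken from the README; the result depends on the machine.
    print("POSTGRES_HOST_PORT =", pick_free_port(5435, 5499))
    print("MARRIOTT_API_PORT  =", pick_free_port(8003, 8099))
```

A bind probe like this is inherently racy (another process can claim the port between the check and `docker compose up`), which is one reason to persist the chosen port to `.env` and reuse it on later deploys.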
|
||||
|
||||
### One-time vhost step (manual)
|
||||
|
||||
Add the following **inside** the `<VirtualHost>` block (before the closing `</VirtualHost>`) of `/etc/apache2/sites-enabled/optical-dev.oliver.solutions.conf`:
|
||||
|
||||
**One-time vhost step (manual):**
|
||||
Edit `/etc/apache2/sites-enabled/optical-dev.oliver.solutions.conf` and add the following **inside** the `<VirtualHost>` block, before the closing `</VirtualHost>`:
|
||||
```apache
|
||||
Include /opt/marriott-box-image-video-tagging/deploy/apache-marriott-tagging.conf
|
||||
```
|
||||
|
||||
Then:
|
||||
|
||||
```bash
|
||||
sudo apachectl configtest && sudo systemctl reload apache2
|
||||
```
|
||||
|
||||
The deploy script intentionally does NOT touch the shared vhost — it's shared across many apps, and a per-app script editing it risks breaking others.
|
||||
|
||||
### Re-deploying
|
||||
|
||||
```bash
|
||||
|
|
@@ -192,143 +367,168 @@ cd /opt/marriott-box-image-video-tagging
|
|||
```
|
||||
|
||||
Flags:
|
||||
- `--no-pull` skip `git pull`
|
||||
- `--no-build` skip `docker compose build`
|
||||
- `--no-frontend` skip Vite build + SPA sync
|
||||
- `--run-now` also fire a tagging pass via `/api/runs` (works with DEV_AUTH_BYPASS=true)
|
||||
- `--logs` tail scheduler logs after deploy
|
||||
|
||||
### Verifying it ran
|
||||
| Flag | Effect |
|---|---|
| `--no-pull` | Skip `git pull` (deploy whatever is in the working tree). |
| `--no-build` | Skip `docker compose build` (faster when only env / config changed). |
| `--no-frontend` | Skip Vite build + SPA sync. |
| `--run-now` | Also POST `/api/runs` to fire a tagging pass immediately (only works with `DEV_AUTH_BYPASS=true`). |
| `--logs` | Tail api logs after deploy. |
|
||||
|
||||
### Common follow-ups
|
||||
|
||||
- **Code changed but container kept the old image:** `docker compose up -d --build --force-recreate api`.
|
||||
- **SPA changed but you don't want to rebuild the Python image:** `./deploy/deploy.sh --no-build`.
|
||||
- **Schema added/changed:** the api lifespan handler runs `ensure_schema` on startup, so a recreated api container applies it. New tables / indexes / extensions land automatically.
|
||||
|
||||
---
|
||||
|
||||
## Database schema
|
||||
|
||||
### `tagging_events` (append-only log)
|
||||
|
||||
One row per file the tagger sent to Gemini OR mirrored from Box. Skipped-as-already-tagged files are not logged.
|
||||
|
||||
| Column | Type | Notes |
|---|---|---|
| `id` | bigserial PK | |
| `run_id` | uuid NOT NULL | UUID per tagging/backfill pass; groups rows belonging to one run. |
| `created_at` | timestamptz NOT NULL | |
| `file_id`, `file_name`, `folder_path` | text | Box identifiers + display. |
| `media_type` | text | `image` or `video`. |
| `gemini_model` | text | E.g. `gemini-2.5-flash`. |
| `prompt` | text | Full prompt sent to Gemini (null for backfilled rows). |
| `raw_response` | jsonb | Untouched Gemini response (null for backfilled rows). |
| `description` | text | Description written to Box. |
| `scenes` | jsonb | Video scene breakdown. |
| `validated_metadata` | jsonb | Cleaned dict actually written to Box. |
| `metadata_write_success`, `description_write_success`, `scene_comment_write_success` | boolean | Per Box write. |
| `status` | text | `success`, `backfilled`, `gemini_error`, `validation_error`, `metadata_write_error`. |
| `error_message` | text | Free-form error if `status` is an error. |
| `duration_ms` | int | Gemini-call elapsed time (null for backfilled rows). |
|
||||
|
||||
Indexes: `run_id`, `file_id`, `created_at DESC`.
|
||||
|
||||
### `file_assets` (per-file state)
|
||||
|
||||
One row per Box file_id, upserted by both the tagging pass and backfill.
|
||||
|
||||
| Column | Type | Notes |
|---|---|---|
| `file_id` | text PK | Matches `tagging_events.file_id`. |
| `thumbnail_bytes` | bytea | Box's 160×160 JPG. ~10-20 KB. |
| `thumbnail_content_type` | text | E.g. `image/jpeg`. |
| `thumbnail_size` | int | 160 today. |
| `search_terms` | text | Lowercased, whitespace-normalised text blob: file_name + folder + description + metadata values. |
| `updated_at` | timestamptz | |
|
||||
|
||||
Index: `updated_at DESC`. Extension: `pg_trgm` (for fuzzy search via `similarity()`).
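The `search_terms` column is described as a lowercased, whitespace-normalised blob of file name, folder, description, and metadata values. A minimal sketch of building such a blob follows. The function name `build_search_terms`, the field order, and the way list values are flattened are assumptions for illustration; the real logic lives in the Python backend and may differ.

```python
import re


def build_search_terms(file_name, folder_path, description, metadata):
    """Join the searchable fields into one lowercased, single-spaced string."""
    parts = [file_name, folder_path, description]
    for value in (metadata or {}).values():
        if isinstance(value, (list, tuple)):
            # Multi-select style fields contribute each of their values.
            parts.extend(str(v) for v in value)
        elif value is not None:
            parts.append(str(value))
    blob = " ".join(p for p in parts if p)
    # Collapse every run of whitespace (including newlines) to one space.
    return re.sub(r"\s+", " ", blob).strip().lower()


if __name__ == "__main__":
    print(build_search_terms(
        "IMG_01.JPG",
        "Hotels/Lobby",
        "A  bright\nlobby",
        {"gender": "Female", "tags": ["Pool", "Sunset"]},
    ))
```

Precomputing one normalised blob per file lets a single text column serve both substring and trigram matching without touching the JSON fields at query time.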
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Blank page at the deployed URL
|
||||
|
||||
The asset paths were baked with the wrong base. View the page source; if the `<script>` tag reads `src="/assets/..."` instead of `src="/marriott-tagging/assets/..."`, `VITE_PUBLIC_BASE` was set incorrectly at build time. `deploy.sh` now overrides this with the prod URL automatically, so `git pull && ./deploy/deploy.sh --no-build` rebuilds the SPA with the correct base.
|
||||
|
||||
### 404 on a new API endpoint
|
||||
|
||||
The api container is running an old image. Force a recreate:
|
||||
|
||||
```bash
|
||||
# Scheduler logs (next cron-fired pass is at SCHEDULE_CRON; default 02:00 UTC)
|
||||
docker compose logs -f tagger
|
||||
|
||||
# API logs
|
||||
docker compose logs -f api
|
||||
|
||||
# Postgres request log
|
||||
docker compose exec db psql -U marriott marriott_tagging -c \
|
||||
"SELECT status, count(*) FROM tagging_events GROUP BY status;"
|
||||
docker compose up -d --build --force-recreate api
|
||||
```
|
||||
|
||||
Postgres is bound to `127.0.0.1` only — not reachable from outside the server. To inspect from your laptop, tunnel: `ssh -L 55432:127.0.0.1:<POSTGRES_HOST_PORT> user@optical-dev.oliver.solutions`, then `psql postgresql://marriott:***@127.0.0.1:55432/marriott_tagging`.
|
||||
### 500 on search
|
||||
|
||||
### Notes
|
||||
Usually the `pg_trgm` extension is missing. The api lifespan handler installs it on startup, but a stale running container might not have re-applied the schema:
|
||||
|
||||
- The Docker deploy and the `systemd` deploy below target the same `/opt/marriott-box-image-video-tagging/` directory. Pick one on any given server — don't run both, they'll both fire the tagger and double-write to Box.
|
||||
- The SPA build bakes `VITE_AZURE_*` and `VITE_DEV_AUTH_BYPASS` into the bundle. Flipping the bypass requires a re-build (`./deploy/deploy.sh` does this).
|
||||
```bash
|
||||
docker compose exec db psql -U marriott marriott_tagging \
|
||||
-c "CREATE EXTENSION IF NOT EXISTS pg_trgm;"
|
||||
```
|
||||
|
||||
## Server Deployment (systemd, Ubuntu)
|
||||
Or just `docker compose up -d --force-recreate api`.
|
||||
|
||||
The repo includes `marriott-tagger.service` and `marriott-tagger.timer` for running the tagger as a scheduled service. These steps are written for **Ubuntu 22.04 / 24.04** but should work on any systemd-based distribution with minor path tweaks (e.g. `/sbin/nologin` instead of `/usr/sbin/nologin` on Red Hat-family).
|
||||
### "Run now" did nothing visible
|
||||
|
||||
### 0. Prerequisites
|
||||
Probably the background thread crashed during init. Check api logs:

```bash
docker compose logs api --tail 60
```

Common causes:

- `box_config.json` is not mounted into the api container. Confirm with `docker compose exec api ls -la /app/box_config.json`. The compose file bind-mounts `./box_config.json`; if the file didn't exist when compose came up, it was not mounted.
- `GEMINI_API_KEY` is empty in the api container. Check with `docker compose exec api printenv GEMINI_API_KEY`.
- Every file already has metadata in Box / the DB, so the pass completes silently with `0 tagged`.

### Postgres host-port conflict

`deploy.sh` scans ports 5435-5499. If your laptop already has a Postgres listening on those ports, bump the upper bound in `deploy.sh` or set `POSTGRES_HOST_PORT` manually in `.env`.
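
To pick a value for `POSTGRES_HOST_PORT` by hand, a scan over the same range can be sketched in a few lines. This is only an illustration of the idea, not a description of how `deploy.sh` actually probes ports; the helper names `port_is_free` and `first_free_port` are invented for this example.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Covers "address already in use" as well as permission errors.
            return False
        return True


def first_free_port(start=5435, end=5499, is_free=port_is_free):
    """Scan start..end inclusive and return the first free port, or None."""
    for port in range(start, end + 1):
        if is_free(port):
            return port
    return None


if __name__ == "__main__":
    port = first_free_port()
    if port is None:
        print("No free port in 5435-5499; widen the range or free one up.")
    else:
        print(f"POSTGRES_HOST_PORT={port}")
```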

---

## Legacy: systemd deployment (Ubuntu)

The `marriott-tagger.service` / `.timer` unit files are kept in the repo for a pre-Docker deployment path that runs `main.py` directly via a systemd timer. **Don't run this alongside the Docker deploy on the same host: both will fire passes and double-write to Box.**

### Setup

```bash
sudo apt update
sudo apt install -y git python3 python3-venv python3-pip
```

`python3-venv` is a separate apt package on Ubuntu; `python3 -m venv` will fail without it.

### 1. Clone the repo on the server

```bash
sudo mkdir -p /opt/marriott-box-image-video-tagging
sudo chown $USER:$USER /opt/marriott-box-image-video-tagging
git clone git@bitbucket.org:zlalani/marriott-box-image-video-tagging.git /opt/marriott-box-image-video-tagging
cd /opt/marriott-box-image-video-tagging
```

### 2. Create the service user

```bash
sudo useradd --system --shell /usr/sbin/nologin --home-dir /opt/marriott-box-image-video-tagging marriott-tagger
sudo chown -R marriott-tagger:marriott-tagger /opt/marriott-box-image-video-tagging
```

### 3. Drop credentials in place (NOT in git)

```bash
# Drop credentials
sudo -u marriott-tagger tee /opt/marriott-box-image-video-tagging/box_config.json > /dev/null < /path/to/local/box_config.json
sudo -u marriott-tagger tee /opt/marriott-box-image-video-tagging/.env > /dev/null <<'EOF'
GEMINI_API_KEY=your_key_here
EOF
sudo chmod 600 /opt/marriott-box-image-video-tagging/box_config.json /opt/marriott-box-image-video-tagging/.env
```

### 4. Set up the virtualenv

```bash
# venv
sudo -u marriott-tagger python3 -m venv /opt/marriott-box-image-video-tagging/env
sudo -u marriott-tagger /opt/marriott-box-image-video-tagging/env/bin/pip install -r /opt/marriott-box-image-video-tagging/requirements.txt
```

### 5. Install the systemd unit files

```bash
sudo cp /opt/marriott-box-image-video-tagging/marriott-tagger.service /etc/systemd/system/
sudo cp /opt/marriott-box-image-video-tagging/marriott-tagger.timer /etc/systemd/system/
# Install + enable
sudo cp marriott-tagger.service marriott-tagger.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now marriott-tagger.timer
```

### 6. Verify

```bash
# Show the next scheduled run
systemctl list-timers marriott-tagger.timer

# Trigger a one-off run immediately (timer will still run on schedule)
sudo systemctl start marriott-tagger.service

# Tail the logs (live)
sudo journalctl -u marriott-tagger -f

# Inspect the most recent run's full output
sudo journalctl -u marriott-tagger --since "1 day ago"
```

In this mode there's no Postgres, no SPA, and no api, just `main.py` running on a systemd timer. Tagging-events logging requires `DATABASE_URL` to be set in `.env`; otherwise `db.log_event` no-ops gracefully and you lose the audit log.

---

## How the tagging pipeline works

- **Dynamic prompt**: Gemini's prompt is built at runtime from the live Box template definition (`fetch_template_schema`). Field additions / option changes propagate automatically.
- **Metadata + description**: Each file gets structured metadata (filterable in Box search) and a short description (visible in Box list views, also indexed by Box search).
- **Search-keyword tail**: Descriptions are formatted as `<summary>. <comma-separated keywords>.` The tail covers synonyms / broader terms (food/dining/eating/meal/restaurant) so a Box search for "Food" still hits assets tagged with the enum value `Dining`.
- **Video scene breakdown**: Videos additionally get a timestamped scene breakdown written as a comment on the Box file: a high-level chapter map for finding moments inside long videos.
- **DB-based skip**: Once a file has a `success` or `backfilled` row, the next pass skips it locally (no Box call, no Gemini call). Run **Backfill from Box** once to mirror Box's existing metadata into the local DB before relying on this.
- **Validation**: Gemini output is validated against the template schema: invalid enum values are dropped, and multi-select arrays are filtered to allowed options only.
- **Large-video gating**: Videos exceeding the source or proxy size limits are skipped cleanly rather than wasting time / API budget. Skips are reported as `skipped`, not `errored`.
- **Per-run limiter**: A run will tag at most `MAX_FILES_PER_RUN` files in `MAX_RUN_DURATION` wall-clock seconds. Whichever cap hits first, the run exits cleanly with a summary; the next run picks up the remaining untagged files. This keeps a sudden 1000-file upload from blowing through your Gemini budget in one click.
- **Thumbnail cache**: After a successful tag (or as part of backfill), the file's 160×160 JPG is fetched from Box and stored in `file_assets.thumbnail_bytes`. The SPA renders it inline in search results; `Cache-Control: private, max-age=86400` means the browser caches it for a day.
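
The validation rule this README describes for Gemini output (invalid enum values dropped, multi-select arrays filtered to allowed options) can be sketched as follows. The simplified schema shape, the function name `validate_metadata`, and the sample field keys are assumptions for illustration; the real schema returned by `fetch_template_schema` is likely structured differently, and dropping unknown fields is an assumption of this sketch rather than documented behaviour.

```python
def validate_metadata(raw, schema):
    """Keep only values the template allows; drop everything else."""
    clean = {}
    for key, value in raw.items():
        field = schema.get(key)
        if field is None:
            continue  # Assumption: fields not in the template are dropped.
        options = field.get("options", [])
        if field["type"] == "enum":
            # A single value must be one of the allowed options.
            if value in options:
                clean[key] = value
        elif field["type"] == "multiSelect":
            # Keep only the allowed members; drop the field if none survive.
            if not isinstance(value, list):
                continue
            allowed = [v for v in value if v in options]
            if allowed:
                clean[key] = allowed
        else:
            clean[key] = value  # Free-form types (e.g. string) pass through.
    return clean


# Hypothetical template fields and model output, for illustration only.
schema = {
    "roomType": {"type": "enum", "options": ["Lobby", "Dining"]},
    "activities": {"type": "multiSelect", "options": ["Swimming", "Dining"]},
    "caption": {"type": "string"},
}
raw = {
    "roomType": "Spa",                   # not an allowed option
    "activities": ["Dining", "Skiing"],  # one allowed, one not
    "caption": "Pool at dusk",
    "mood": "calm",                      # not in the template
}
print(validate_metadata(raw, schema))
```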

---

### Updating the service

```bash
cd /opt/marriott-box-image-video-tagging
sudo -u marriott-tagger git pull
# If unit files changed:
sudo cp marriott-tagger.service marriott-tagger.timer /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl restart marriott-tagger.timer
```

## Credentials & files NOT in git

- `box_config.json`: Box JWT config. Bind-mounted read-only into the api container.
- `.env`: all env vars, including `GEMINI_API_KEY`, `POSTGRES_PASSWORD`, `AZURE_CLIENT_ID`, etc.
- `deploy/apache-marriott-tagging.conf`: generated by `deploy.sh` from the `.tmpl`.
- `frontend/node_modules/`, `frontend/dist/`: npm install / Vite build artefacts.

## Configuration

Edit the constants at the top of `main.py`:

| Setting | Default | Description |
|---------|---------|-------------|
| `BOX_FOLDER_ID` | (varies) | Box folder to process |
| `METADATA_TEMPLATE_KEY` | `marriottUsa` | Box metadata template key |
| `GEMINI_MODEL` | `gemini-2.5-flash` | Gemini model for analysis |
| `EXCLUDED_FOLDER_PREFIXES` | `("z_", "zz_", "zzz_")` | Subfolder name prefixes to skip |
| `GEMINI_DELAY` | `7` | Seconds between Gemini image calls |
| `GEMINI_VIDEO_DELAY` | `10` | Seconds between Gemini video calls |
| `SKIP_ALREADY_TAGGED` | `True` | Skip files with existing metadata |
| `MAX_IMAGE_SIZE` | `1000` | Max pixel dimension for image resize |
| `VIDEO_SIZE_LIMIT_INLINE` | `20 MB` | Below this, send video inline; above, use Gemini File API |
| `VIDEO_SOURCE_SIZE_LIMIT` | `5 GB` | Skip videos whose source file exceeds this |
| `VIDEO_PROXY_SIZE_LIMIT` | `400 MB` | Skip videos whose 480p proxy exceeds this (~60 min runtime) |
| `MAX_FILES_PER_RUN` | `200` | Hard cap on newly-tagged files per run |
| `MAX_RUN_DURATION` | `4h` | Hard wall-clock cap per run |
| `DESCRIPTION_MAX_LENGTH` | `255` | Box description field char limit |
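
To show how `DESCRIPTION_MAX_LENGTH` might interact with the `<summary>. <comma-separated keywords>.` description format used by this project, here is a hedged sketch. The function `build_description` and its strategy of dropping trailing keywords until the text fits are assumptions for illustration; the README does not say how `main.py` enforces the limit.

```python
DESCRIPTION_MAX_LENGTH = 255  # Box description field char limit (from the table)


def build_description(summary, keywords, max_length=DESCRIPTION_MAX_LENGTH):
    """Format '<summary>. <kw1, kw2, ...>.' and keep it within max_length."""
    head = summary.strip().rstrip(".") + "."
    if len(head) >= max_length:
        # The summary alone overflows: hard-truncate (assumed behaviour).
        return head[:max_length]
    kept = []
    for keyword in keywords:
        candidate = head + " " + ", ".join(kept + [keyword]) + "."
        if len(candidate) > max_length:
            break  # Adding this keyword would exceed the limit.
        kept.append(keyword)
    if not kept:
        return head
    return head + " " + ", ".join(kept) + "."


# Hypothetical summary and keyword tail, for illustration only.
print(build_description("Guests dining on a terrace", ["food", "dining", "meal"]))
```

Dropping whole keywords rather than cutting mid-word keeps the tail searchable, which seems to be the point of the keyword tail, but that choice is this sketch's, not necessarily the project's.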

## How It Works

- **Dynamic prompt**: The Gemini prompt is built at runtime from the actual Box template definition. If Marriott adds/changes fields or options in Box Admin, the script adapts automatically.
- **Metadata + description**: Each file gets structured metadata (for filtered search) and a short description (visible in Box list views, also indexed by Box search).
- **Search-keyword tail**: Each description is formatted as `<summary sentence>. <comma-separated keywords>.` The keyword tail covers synonyms and broader category terms (e.g. food/dining/eating/meal/restaurant) so a search for "Food" hits assets tagged with the enum value `Dining`, etc.
- **Video scene breakdown**: Videos additionally get a timestamped scene breakdown written as a comment on the Box file: a high-level chapter map for finding moments inside long videos.
- **Resumable**: Files with existing metadata are skipped by default, so the script can be re-run after interruptions or when new files are added.
- **Validation**: Gemini output is validated against the template schema: invalid enum values are dropped, and multiSelect arrays are filtered to allowed options only.
- **Large-video gating**: Videos exceeding the source or proxy size limits are skipped cleanly rather than wasting time / API budget on content beyond Gemini's context window. Skips are reported in the summary as `skipped`, not `errored`.
- **Per-run limiter**: A daily run will tag at most `MAX_FILES_PER_RUN` new files within `MAX_RUN_DURATION` of wall-clock time. Whichever cap hits first, the run exits cleanly with a summary line; the next scheduled run picks up the remaining untagged files. This keeps a sudden 1000-file upload from blowing through your Gemini budget in one night.
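
The per-run limiter amounts to a loop with two exit conditions. The sketch below is a simplified illustration under assumed names (`run_pass`, `tag_file`, and the injectable `clock` are not taken from the repo), with the duration expressed in seconds.

```python
import time


def run_pass(files, tag_file, max_files, max_seconds, clock=time.monotonic):
    """Tag files until either the file cap or the wall-clock cap is hit."""
    started = clock()
    tagged = 0
    stop_reason = "no more files"
    for item in files:
        if tagged >= max_files:
            stop_reason = "file cap reached"
            break
        if clock() - started >= max_seconds:
            stop_reason = "time cap reached"
            break
        tag_file(item)
        tagged += 1
    # Whatever was not reached stays untagged for the next run to pick up.
    return {"tagged": tagged, "stop_reason": stop_reason}


# Ten pending files, a cap of three: the run stops early and reports why.
summary = run_pass(range(10), tag_file=lambda f: None, max_files=3, max_seconds=60)
print(summary)
```

Checking the caps before each file, rather than after, means a run never starts a file it is not allowed to finish counting, and the injectable `clock` makes the time cap easy to exercise in tests.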

`.env.example` is checked in; copy it to `.env` and fill in the values.