14 commits

Author SHA1 Message Date
DJP
c653fe42a6 Auto-refresh Azure access tokens on 401
The SPA's MSAL access token has a 1h lifetime. When the tab idles
past that, the first request afterwards goes out with a
cached-but-expired token, the backend (correctly) 401s with
"Signature has expired", and the user has to hard-refresh.
acquireTokenSilent doesn't always pre-empt this because its expiry
check can pass on a cached entry that has expired by the time the
backend validates it.

Make the client recover: getToken now accepts { forceRefresh }, and
the api client retries any 401 once with a forced-refresh token. If
the retry also 401s, we propagate the error: MSAL itself can't
refresh (the user is genuinely signed out), and the user is routed
back to the gate on the next action.

No backend change: the JWT expiry check is correct. Bypass mode is
unaffected (token is "" either way; the retry is a no-op for it).
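
The retry-once behaviour described above can be sketched as follows. The real client is the SPA's TypeScript api client; this sketch uses Python only to stay consistent with the backend examples in this log. `Response`, `request_with_refresh`, `send`, and `get_token` are hypothetical names, not the project's own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    status: int
    body: str = ""


def request_with_refresh(
    send: Callable[[str], Response],
    get_token: Callable[[bool], str],
) -> Response:
    """Send once; on a 401, retry exactly once with a force-refreshed token.

    `send(token)` performs the HTTP call and `get_token(force_refresh)`
    stands in for the SPA's getToken({ forceRefresh }). Both are assumptions.
    """
    response = send(get_token(False))
    if response.status != 401:
        return response
    # The second attempt bypasses the token cache. Whatever it returns is
    # final, so a second 401 propagates to the caller instead of looping.
    return send(get_token(True))
```

Capping the retry at one attempt keeps a genuinely signed-out session from spinning: the second 401 surfaces unchanged.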

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 18:20:46 -04:00
DJP
30ac050af9 Env-tunable per-run caps (MAX_FILES_PER_RUN, MAX_RUN_DURATION_SECS)
The two per-run limiters in main.py now read from the environment with
their current hardcoded values as defaults. Lets us tune cadence (e.g.
200 → 500 newly-tagged files per click) without rebuilding the image —
edit .env and `docker compose up -d --force-recreate api`.

docker-compose.yml threads both vars into the api container.
.env.example documents them with empty defaults.
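
A minimal sketch of reading the two caps from the environment. It assumes the hardcoded defaults are still the 200 files / 4 hours introduced in a04e8c1e37, and that an empty value (as shipped in .env.example) should fall back to the default. `env_int` is a hypothetical helper, not necessarily what main.py uses.

```python
import os
from typing import Mapping, Optional


def env_int(name: str, default: int, environ: Optional[Mapping[str, str]] = None) -> int:
    """Return the integer value of `name`, or `default` when unset or empty."""
    env = os.environ if environ is None else environ
    raw = env.get(name, "").strip()
    if not raw:
        # Unset and empty are treated alike, so an empty line in .env
        # keeps the built-in default.
        return default
    return int(raw)


# Assumed defaults: 200 newly-tagged files and a 4 hour wall-clock cap.
MAX_FILES_PER_RUN = env_int("MAX_FILES_PER_RUN", 200)
MAX_RUN_DURATION_SECS = env_int("MAX_RUN_DURATION_SECS", 4 * 60 * 60)
```

A non-numeric value raises `ValueError` at import time in this sketch, which fails fast rather than silently running with an unintended cap.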

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 14:10:22 -04:00
DJP
6ac845fe34 Support multiple Box root folders via BOX_FOLDER_IDS
main.py now accepts a comma-separated list of Box folder IDs from
BOX_FOLDER_IDS (env). Each root is walked recursively; files surfaced
under more than one root are deduped by file_id in list_all_media so
a tagging pass / backfill processes each Box file once even when
roots overlap. Falls back to the original single hardcoded folder
when the env var is unset, so existing deployments behave the same.
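
The parsing and dedupe steps above might look roughly like this. The function names and the dict shape of a file record are assumptions for illustration; only `BOX_FOLDER_IDS`, `file_id`, and the fallback behaviour come from the commit message.

```python
from typing import Iterable, Optional


def parse_folder_ids(raw: Optional[str], fallback: str) -> list:
    """Split a comma-separated BOX_FOLDER_IDS value; use `fallback` if empty."""
    ids = [part.strip() for part in (raw or "").split(",") if part.strip()]
    return ids or [fallback]


def dedupe_by_file_id(files: Iterable[dict]) -> list:
    """Keep the first occurrence of each file_id, preserving walk order."""
    seen = set()
    unique = []
    for entry in files:
        if entry["file_id"] in seen:
            continue  # already surfaced under an earlier root
        seen.add(entry["file_id"])
        unique.append(entry)
    return unique
```

For example, `parse_folder_ids(" 123, 456 ,", "999")` yields `["123", "456"]`, while an unset variable yields `["999"]`.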

docker-compose.yml threads BOX_FOLDER_IDS into the api container.
.env.example documents the new var.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 11:41:46 -04:00
DJP
2b2d0cc2f8 Gate the SPA behind a login screen when MSAL is enabled
Previously the auth provider rendered the app shell unconditionally
and fired loginPopup() from a useEffect once MSAL was ready. If the
popup got blocked or dismissed, the user saw the full app even
without an authenticated account — API calls then 401'd silently
or returned 403s on the destructive endpoints. Now `RealAuthProvider`
returns a dedicated login gate when `!authed`, with an explicit
"Sign in with Microsoft" button (no auto-popup). The app shell
only mounts after a successful sign-in.

Bypass mode is unaffected — `AuthProvider` still short-circuits to
the dev user without going through the gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 14:10:54 -04:00
DJP
88a0adcfbf Rewrite README to reflect current architecture
Old README still described the nightly scheduler container, didn't
cover backfill / thumbnails / admin gating / multi-token search /
the API endpoints, and pointed at fields that no longer exist on
events. Comprehensive rewrite covering: what the app does today,
architecture diagram, repo layout, local quickstart, full env-var
reference, operations (run/backfill/inspect), API surface, MSAL
setup steps, deploy script + manual vhost Include, the two-table
schema, troubleshooting, and the legacy systemd path preserved at
the end for reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 17:30:10 -04:00
DJP
04440d661d Cache Box thumbnails + search blob; render in UI
Search results were text-only — hard to scan visually with thousands
of assets coming. Now every file that Gemini tags or backfill mirrors also
gets its Box-generated 160x160 JPG thumbnail (~10-20 KB) pulled and
stashed in Postgres, plus a consolidated `search_terms` blob
(file_name + folder + description + flattened metadata values).
Search results render the thumbnail inline; rows missing one show a
striped placeholder. Search SQL now LEFT JOINs file_assets and hits
search_terms too, so backfilled rows are properly searchable.

- schema.sql: new `file_assets` table (file_id PK, thumbnail_bytes
  bytea, search_terms text, updated_at). Idempotent.
- db.py: `upsert_file_asset` (INSERT … ON CONFLICT preserving
  existing thumbnail bytes if today's fetch failed) and
  `get_thumbnail`. Both swallow exceptions per the established
  defensive pattern.
- main.py: `fetch_thumbnail` (Box SDK get_file_thumbnail_by_id, JPG
  at 160 px, handles BoxAPIError 202/404 as soft misses) and
  `build_search_terms` (lowercase, whitespace-collapsed text blob).
  `_persist_file_asset` wires both into the image+video success
  paths of `_run_pass` and into every iteration of `_run_backfill`.
- Backfill skip logic refined: always upsert file_assets (idempotent
  PK), only skip the tagging_events insert if a good row already
  exists. Re-running Backfill from Box populates thumbnails for
  rows backfilled before this feature shipped.
- api.py: `GET /api/files/{file_id}/thumbnail` streams the bytea
  with Cache-Control max-age=86400. Search SQL gains the LEFT JOIN
  and emits `has_thumbnail` per row. Search also matches against
  fa.search_terms so backfilled rows surface for free-text queries
  that hit their metadata.
- frontend: Event type adds `has_thumbnail`; `thumbnailUrl(fileId)`
  helper builds the prefix-aware URL via Vite's BASE. EventList
  renders the thumbnail (lazy, with onError fallback) or a striped
  placeholder. .thumb styling + .event-head layout in styles.css.
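
As an illustration of the `build_search_terms` bullet above, here is one way a lowercase, whitespace-collapsed blob could be assembled. The signature and the handling of list-valued or missing metadata are assumptions; the commit message specifies only the inputs and the normalisation.

```python
import re
from typing import Any, Mapping, Optional


def build_search_terms(
    file_name: Optional[str],
    folder: Optional[str],
    description: Optional[str],
    metadata: Optional[Mapping[str, Any]],
) -> str:
    """Join the searchable fields into one lowercase, single-spaced string."""
    parts = [file_name, folder, description]
    for value in (metadata or {}).values():
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            # Flatten multi-select style values into individual terms.
            parts.extend(str(item) for item in value)
        else:
            parts.append(str(value))
    blob = " ".join(part for part in parts if part)
    return re.sub(r"\s+", " ", blob).strip().lower()
```

Only metadata values are included, not keys, matching the "flattened metadata values" wording above.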

Verified locally: schema applies via lifespan; upsert + get_thumbnail
roundtrip; /api/files/999/thumbnail returns 200 with bytes; /api/events
returns has_thumbnail per row; multi-token "female city" search finds
a row whose validated_metadata contains both tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:20:13 -04:00
DJP
6f367d5b29 Run ensure_schema on api startup
Search hits the DB without going through the run-thread path, so on a
fresh DB the pg_trgm extension added by schema.sql was never created
and /api/events 500'd on similarity(). Move the bootstrap to a FastAPI
lifespan handler so schema (table + indexes + extensions) is applied
the first time the api container comes up, regardless of whether any
run/backfill has fired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 16:00:17 -04:00
DJP
1f2c2ff8e1 Multi-token + fuzzy search; admin-only Run Now / Backfill
Search:
- Previously /api/events did one ILIKE %q% across the columns, so
  "female city" required the literal substring "female city" to
  appear somewhere. Now the query is tokenised on whitespace; every
  token must match somewhere (AND), and each token matches either
  by substring (ILIKE) across the searched columns OR by trigram
  similarity (pg_trgm) against a concatenated text blob with a 0.3
  threshold — handles typos like "femalle" → "female".
- Results ranked by summed similarity score across all tokens, then
  recency. Empty query falls back to "newest 100".
- schema.sql: CREATE EXTENSION IF NOT EXISTS pg_trgm (idempotent;
  applied by ensure_schema on api startup).
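
The search semantics above can be sketched as a parameterised query builder. This is a simplification under stated assumptions: `searchable_events` and its `blob` column are hypothetical names, the ILIKE is applied to the blob rather than to each searched column, and `%s` placeholders follow the psycopg convention. `similarity()` is the pg_trgm function named elsewhere in this log.

```python
from typing import Any

SIMILARITY_THRESHOLD = 0.3
DEFAULT_LIMIT = 100


def build_search_query(q: str, limit: int = DEFAULT_LIMIT) -> tuple[str, list[Any]]:
    """Return (sql, params) for a multi-token AND search with fuzzy fallback."""
    tokens = q.split()  # tokenise on whitespace
    if not tokens:
        # An empty query falls back to the newest rows.
        return (
            "SELECT * FROM searchable_events ORDER BY created_at DESC LIMIT %s",
            [limit],
        )
    # Rank by the summed similarity of every token, then by recency.
    score = " + ".join("similarity(blob, %s)" for _ in tokens)
    # Every token must match, either as a substring or by trigram similarity.
    where = " AND ".join(
        "(blob ILIKE %s OR similarity(blob, %s) > %s)" for _ in tokens
    )
    sql = (
        f"SELECT *, ({score}) AS score FROM searchable_events "
        f"WHERE {where} ORDER BY score DESC, created_at DESC LIMIT %s"
    )
    params: list[Any] = list(tokens)
    for token in tokens:
        # Note: % and _ inside a token are not escaped in this sketch.
        params += [f"%{token}%", token, SIMILARITY_THRESHOLD]
    params.append(limit)
    return sql, params
```

Passing tokens as bound parameters rather than interpolating them keeps user input out of the SQL text.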

Admin gating:
- auth.py: User now carries `is_admin`. Computed from a
  comma-separated ADMIN_EMAILS env var (case-insensitive match
  against `preferred_username`/`upn`/`email` claim). New
  `require_admin` FastAPI dependency 403s non-admins.
- In DEV_AUTH_BYPASS mode the dev user is admin by default; flip
  DEV_AUTH_IS_ADMIN=false to test the read-only UX without enabling
  SSO.
- POST /api/runs and POST /api/backfill now gated by require_admin.
- /api/me carries is_admin so the SPA can hide the destructive
  buttons for non-admins.
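
A rough sketch of the `is_admin` computation, assuming a match on any of the three claims is sufficient (the commit message does not say whether the first present claim wins). The function names are hypothetical, and the FastAPI `require_admin` dependency itself is not shown.

```python
from typing import Any, Mapping, Optional

# Claim names listed in the commit message.
EMAIL_CLAIMS = ("preferred_username", "upn", "email")


def parse_admin_emails(raw: Optional[str]) -> frozenset:
    """Normalise a comma-separated ADMIN_EMAILS value to lowercase entries."""
    return frozenset(
        entry.strip().lower() for entry in (raw or "").split(",") if entry.strip()
    )


def is_admin(claims: Mapping[str, Any], admin_emails: frozenset) -> bool:
    """True if any email-like claim matches an admin entry, ignoring case."""
    for name in EMAIL_CLAIMS:
        value = claims.get(name)
        if isinstance(value, str) and value.strip().lower() in admin_emails:
            return True
    return False
```

With an unset or empty ADMIN_EMAILS, the set is empty and nobody is an admin, which is the safer default for the destructive endpoints.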

Frontend:
- App.tsx fetches /api/me on mount and hides Run Now + Backfill
  unless `is_admin` is true. Non-admins still see search + results +
  recent-runs table.

docker-compose / .env.example: thread ADMIN_EMAILS +
DEV_AUTH_IS_ADMIN into the api container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 15:51:50 -04:00
DJP
9e6a75feb6 Manual-only runs, DB-based skip check, backfill-from-Box
Previously a nightly APScheduler container fired the tagger on every
file in the configured Box folder. With ~5000 files coming, that's
~5000 Box HTTP calls every night just to ask "is this tagged?". Move
to manual-only mode and source the skip decision from the local DB.

- `db.is_file_already_tagged(conn, file_id)` — returns True iff the
  DB has a row with status IN ('success','backfilled'). Used by both
  image and video loops in main.py instead of the previous
  `check_existing_metadata(box_client, file_id)` Box round-trip.
- `fetch_existing_metadata(box_client, file_id)` (main.py) — returns
  the user-defined template fields as a flat dict by stripping the
  Box `$id`/`$type`/etc. attrs from the SDK response.
- `_run_backfill(run_id, db_conn)` (main.py) — walks the Box folder
  and inserts a `status='backfilled'` row for every file that already
  has marriottUsa metadata in Box. Read-only against Box; safe to re-run.
  Use this after first deploy, or to repopulate the DB from Box.
- `POST /api/backfill` mirrors `POST /api/runs` (background thread,
  same live-state record).
- SPA: new "Backfill from Box" button next to "Run now" (with a
  confirm dialog and a yellow `.status-backfilled` event treatment).
- docker-compose.yml: removed the `tagger` (scheduler) service.
  Manual triggers via the SPA / `POST /api/runs` only. scheduler.py
  stays in the repo for archival / opt-back-in.
- deploy.sh: readiness now checks the `api` container instead of
  `tagger`; `--logs` tails api logs.
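
Two of the helpers listed above, sketched under assumptions. SQLite stands in for Postgres here so the example runs without a server (the real code would use `%s` placeholders on a Postgres connection), and the two-column table is a cut-down stand-in for `tagging_events`. `strip_box_system_fields` is a hypothetical name for the stripping step inside `fetch_existing_metadata`.

```python
import sqlite3


def strip_box_system_fields(instance: dict) -> dict:
    """Drop Box's `$`-prefixed attrs, keeping the user-defined template fields."""
    return {key: value for key, value in instance.items() if not key.startswith("$")}


def is_file_already_tagged(conn: sqlite3.Connection, file_id: str) -> bool:
    """True iff a 'success' or 'backfilled' row exists for this file_id."""
    row = conn.execute(
        "SELECT 1 FROM tagging_events "
        "WHERE file_id = ? AND status IN ('success', 'backfilled') LIMIT 1",
        (file_id,),
    ).fetchone()
    return row is not None


# In-memory stand-in for the real table, with illustrative rows only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tagging_events (file_id TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO tagging_events VALUES (?, ?)",
    [("111", "success"), ("222", "error"), ("333", "backfilled")],
)
```

Here `is_file_already_tagged(conn, "222")` is false because an errored row does not count as tagged, so that file would be retried on the next run.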

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 15:41:10 -04:00
DJP
dafd097d24 Force prod URL as VITE_PUBLIC_BASE for server builds
VITE_PUBLIC_BASE in .env is for local `npm run dev`; honoring it at
build time on the server baked `base: "/"` into the bundle and asset
URLs came out as `/assets/...` instead of `/marriott-tagging/assets/...`,
so the script tag 404'd and the SPA rendered blank. Deploy.sh now
always builds with the prod URL.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 15:27:41 -04:00
DJP
99e978b895 Dockerize, add Postgres request log, FastAPI + React SPA
Run model: long-running scheduler container (APScheduler) replacing the
systemd timer in Docker deployments. Every Gemini-analysed file is also
persisted to a Postgres `tagging_events` table (run_id, prompt, raw
response, validated metadata, Box-write outcomes, status, error, timing)
for search and audit. Box is still updated exactly as before and remains
the source of truth for "already tagged" — `db.log_event` swallows DB
failures so an outage can't stop a tagging pass.
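
The "swallow DB failures" pattern described above, as a minimal sketch. The `insert` callable is a hypothetical stand-in for the actual INSERT against `tagging_events`; the real `db.log_event` signature is not shown in this log.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("tagger.db")


def log_event(insert: Callable[[dict], Any], event: dict) -> bool:
    """Try to persist one event; never let a DB error reach the caller."""
    try:
        insert(event)
    except Exception:
        # Record the failure with its traceback, then carry on so the
        # tagging pass keeps going during a database outage.
        logger.exception(
            "tagging_events insert failed for file_id=%s", event.get("file_id")
        )
        return False
    return True
```

Returning a boolean instead of raising lets the caller count dropped audit rows without changing the tagging flow.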

Backend:
- `db.py` + `schema.sql` — append-only `tagging_events` with indexes on
  run_id, file_id, created_at.
- `scheduler.py` — APScheduler BlockingScheduler with `SCHEDULE_CRON`
  (default daily 02:00), `RUN_AT_STARTUP`, SIGTERM handling.
- `api.py` (FastAPI) — `/api/health`, `/api/me`, `/api/events?q=…`
  (single-input search across file_name, folder_path, description,
  status, file_id, validated_metadata::text, raw_response::text,
  scenes::text), `POST /api/runs` (fire-and-forget pass in a background
  thread), `/api/runs`, `/api/runs/{id}/events`. Every event response
  carries a synthesised `box_url`.
- `auth.py` — Azure AD bearer-token validation against the tenant JWKS
  (signature + aud + iss). `DEV_AUTH_BYPASS=true` short-circuits to a
  configurable dev user, mirrored on the frontend by
  `VITE_DEV_AUTH_BYPASS`.

Frontend (Vite + React + TS):
- `frontend/` SPA, Montserrat + black/white/#FFC407 palette.
- @azure/msal-react with the bypass switch (auto sign-in when bypass is off).
- Search bar across all logged fields, results list with metadata tags,
  status pills, and "Open in Box ↗" links.
- "Run now" button kicks off a tagging pass via `POST /api/runs` and
  polls `/api/runs/{id}/events` every 2 s for live progress.

Docker / compose:
- `docker-compose.yml` pins `name: marriott-tagging`. Three services:
  `db` (postgres:16, named volume, bound to 127.0.0.1 only), `tagger`
  (scheduler.py), `api` (uvicorn). Same image, different `command`.
- `Dockerfile` — python:3.12-slim, non-root user.

Deploy (optical-dev.oliver.solutions):
- `deploy/deploy.sh` — idempotent. Auto-picks free host ports
  (POSTGRES_HOST_PORT 5435-5499, MARRIOTT_API_PORT 8003-8099), renders
  `apache-marriott-tagging.conf` from the .tmpl, builds the SPA in a
  one-shot node:20-alpine container, rsyncs `dist/` to
  `/var/www/html/marriott-tagging/`, polls `/api/health`, and prints the
  shared-vhost Include line.
- `apache-marriott-tagging.conf.tmpl` — proxy `/marriott-tagging/api/`
  to the API container, alias `/marriott-tagging` to the SPA web-root,
  SPA fallback to `index.html`.

systemd unit files left in place for the existing Ubuntu deployment
path; do not run both on the same host (would double-fire the tagger).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 14:56:58 -04:00
Simeon.Schecter
010a3955a8 Document Ubuntu systemd deployment and current configuration
- Add full Server Deployment section with prereqs, user setup, credential
  placement, venv setup, unit installation, verification, and update flow
- Tailored for Ubuntu 22.04/24.04 (notes python3-venv apt package gotcha)
- True up Configuration table with current constants (video size limits,
  per-run cap, video delay, excluded folder prefixes)
- Update How It Works to cover keyword-tail descriptions, scene-breakdown
  comments, large-video gating, and the per-run limiter
- Mention videos in the intro (was previously images-only)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-06 14:35:58 -04:00
Simeon.Schecter
a04e8c1e37 Add asset tagger pipeline with keyword-tail descriptions and large-video gating
- Box JWT + Gemini integration for image and video metadata tagging
- Description format includes search-keyword tail to address synonym gaps
  (e.g. "Food" search now hits assets tagged "Dining")
- Skip videos exceeding 5GB source or 400MB proxy (~60min runtime, beyond
  Gemini context budget) — counted as skipped, not errored
- Hardened None-response handling in Gemini JSON parser
- Per-run limiter: 200 newly-tagged files / 4 hour wall-clock cap, with
  clean exit and resumable progress on next run
- systemd service + timer for daily 2am tagging passes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-05-06 14:09:28 -04:00
simeonschecter@oliver.agency
9a837a33b9 Initial commit
2026-05-06 17:44:39 +00:00