The SPA's MSAL access token has a 1h lifetime. When the tab idles
past it, the first request afterwards goes out with a
cached-but-expired token, the backend (correctly) 401s with
"Signature has expired", and the user has to hard-refresh.
acquireTokenSilent doesn't always pre-empt this: its expiry check can
pass on a cached entry that has expired by the time the backend
validates it.
Make the client recover: getToken now accepts { forceRefresh }, and
the api client retries any 401 once with a forced-refresh token. If
the retry also 401s we propagate the error: MSAL itself can't
refresh, so the user is genuinely signed out and is routed back to
the gate on the next action.
No backend change: the JWT expiry check is correct. Bypass mode is
unaffected (token is "" either way; the retry is a no-op for it).
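Illustrative sketch of the retry-once flow. It is written in Python
only to show the control flow; the real client is the TypeScript SPA.
`send` and `get_token` are hypothetical stand-ins for the api client's
request function and for getToken({ forceRefresh }).

```python
from typing import Callable

def request_with_retry(
    send: Callable[[str], int],
    get_token: Callable[[bool], str],
) -> int:
    """Send once with the cached token; on 401, retry once with a fresh one.

    `send(token)` returns an HTTP status code (hypothetical helper).
    `get_token(force_refresh)` mirrors getToken({ forceRefresh }).
    """
    status = send(get_token(False))
    if status != 401:
        return status
    # Exactly one retry. If this also 401s, the status is propagated
    # to the caller unchanged.
    return send(get_token(True))
```

The single retry keeps a genuinely signed-out session from looping:
a second 401 goes straight back to the caller.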
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two per-run limiters in main.py now read from the environment with
their current hardcoded values as defaults. Lets us tune cadence (e.g.
200 → 500 newly-tagged files per click) without rebuilding the image —
edit .env and `docker compose up -d --force-recreate api`.
docker-compose.yml threads both vars into the api container.
.env.example documents them with empty defaults.
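A minimal sketch of the env-with-default pattern, not the code in
main.py. The variable names `MAX_FILES_PER_RUN` and `MAX_RUN_HOURS`
are hypothetical, and the defaults (200 files, 4 hours) are assumed
from the limiter values mentioned in the log. Empty strings are
treated as unset because .env.example ships the vars empty.

```python
import os

def env_number(name, default, cast=int):
    """Read a numeric setting; unset or empty values fall back to default."""
    raw = os.environ.get(name, "").strip()
    return cast(raw) if raw else default

# Hypothetical variable names; defaults assumed from the commit text.
MAX_FILES_PER_RUN = env_number("MAX_FILES_PER_RUN", 200)
MAX_RUN_HOURS = env_number("MAX_RUN_HOURS", 4.0, cast=float)
```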
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main.py now accepts a comma-separated list of Box folder IDs from
BOX_FOLDER_IDS (env). Each root is walked recursively; files surfaced
under more than one root are deduped by file_id in list_all_media so
a tagging pass / backfill processes each Box file once even when
roots overlap. Falls back to the original single hardcoded folder
when the env var is unset, so existing deployments behave the same.
docker-compose.yml threads BOX_FOLDER_IDS into the api container.
.env.example documents the new var.
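Rough sketch of the parsing and dedupe steps, under assumptions: the
helper names are hypothetical, `DEFAULT_FOLDER_ID` is a placeholder
for the real hardcoded folder, and files are modelled as dicts with
an "id" key rather than Box SDK objects.

```python
DEFAULT_FOLDER_ID = "0"  # placeholder, not the real hardcoded folder ID

def parse_folder_ids(raw, default=DEFAULT_FOLDER_ID):
    """Split a comma-separated env value; fall back to the single default."""
    ids = [part.strip() for part in (raw or "").split(",")]
    ids = [folder_id for folder_id in ids if folder_id]
    return ids or [default]

def dedupe_by_file_id(files):
    """Keep the first occurrence of each file id, preserving walk order."""
    seen = set()
    unique = []
    for f in files:
        if f["id"] not in seen:
            seen.add(f["id"])
            unique.append(f)
    return unique
```

Keeping the first occurrence means a file reachable from two
overlapping roots is attributed to whichever root was walked first.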
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the auth provider rendered the app shell unconditionally
and fired loginPopup() from a useEffect once MSAL was ready. If the
popup got blocked or dismissed, the user saw the full app even
without an authenticated account — API calls then 401'd silently
or returned 403s on the destructive endpoints. Now `RealAuthProvider`
returns a dedicated login gate when `!authed`, with an explicit
"Sign in with Microsoft" button (no auto-popup). The app shell
only mounts after a successful sign-in.
Bypass mode is unaffected — `AuthProvider` still short-circuits to
the dev user without going through the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Old README still described the nightly scheduler container, didn't
cover backfill / thumbnails / admin gating / multi-token search /
the API endpoints, and pointed at fields that no longer exist on
events. Comprehensive rewrite covering: what the app does today,
architecture diagram, repo layout, local quickstart, full env-var
reference, operations (run/backfill/inspect), API surface, MSAL
setup steps, deploy script + manual vhost Include, the two-table
schema, troubleshooting, and the legacy systemd path preserved at
the end for reference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Search results were text-only — hard to scan visually with thousands
of assets coming. Now every file that Gemini tags or backfill mirrors also
gets its Box-generated 160x160 JPG thumbnail (~10-20 KB) pulled and
stashed in Postgres, plus a consolidated `search_terms` blob
(file_name + folder + description + flattened metadata values).
Search results render the thumbnail inline; rows missing one show a
striped placeholder. Search SQL now LEFT JOINs file_assets and hits
search_terms too, so backfilled rows are properly searchable.
- schema.sql: new `file_assets` table (file_id PK, thumbnail_bytes
bytea, search_terms text, updated_at). Idempotent.
- db.py: `upsert_file_asset` (INSERT … ON CONFLICT preserving
existing thumbnail bytes if today's fetch failed) and
`get_thumbnail`. Both swallow exceptions per the established
defensive pattern.
- main.py: `fetch_thumbnail` (Box SDK get_file_thumbnail_by_id, JPG
at 160 px, handles BoxAPIError 202/404 as soft misses) and
`build_search_terms` (lowercase, whitespace-collapsed text blob).
`_persist_file_asset` wires both into the image+video success
paths of `_run_pass` and into every iteration of `_run_backfill`.
- Backfill skip logic refined: always upsert file_assets (idempotent
PK), only skip the tagging_events insert if a good row already
exists. Re-running Backfill from Box populates thumbnails for
rows backfilled before this feature shipped.
- api.py: `GET /api/files/{file_id}/thumbnail` streams the bytea
with Cache-Control max-age=86400. Search SQL gains the LEFT JOIN
and emits `has_thumbnail` per row. Search also matches against
fa.search_terms so backfilled rows surface for free-text queries
that hit their metadata.
- frontend: Event type adds `has_thumbnail`; `thumbnailUrl(fileId)`
helper builds the prefix-aware URL via Vite's BASE. EventList
renders the thumbnail (lazy, with onError fallback) or a striped
placeholder. .thumb styling + .event-head layout in styles.css.
Verified locally: schema applies via lifespan; upsert + get_thumbnail
roundtrip; /api/files/999/thumbnail returns 200 with bytes; /api/events
returns has_thumbnail per row; multi-token "female city" search finds
a row whose validated_metadata contains both tokens.
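For reference, one way a `build_search_terms`-style helper could be
written. This is a sketch, not the function in main.py: the signature
and the `flatten_values` helper are assumptions, and metadata is
assumed to be a JSON-like dict.

```python
import re

def flatten_values(value):
    """Yield every leaf value of a nested dict/list structure as a string."""
    if isinstance(value, dict):
        for v in value.values():
            yield from flatten_values(v)
    elif isinstance(value, (list, tuple)):
        for v in value:
            yield from flatten_values(v)
    elif value is not None:
        yield str(value)

def build_search_terms(file_name, folder, description, metadata):
    """Join the searchable fields into one lowercase, single-spaced blob."""
    parts = [file_name, folder, description, *flatten_values(metadata)]
    text = " ".join(p for p in parts if p)
    return re.sub(r"\s+", " ", text).strip().lower()
```

Only metadata values are included, not keys, matching the "flattened
metadata values" wording above.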
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Search hits the DB without going through the run-thread path, so on a
fresh DB the pg_trgm extension added by schema.sql was never created
and /api/events 500'd on similarity(). Move the bootstrap to a FastAPI
lifespan handler so schema (table + indexes + extensions) is applied
the first time the api container comes up, regardless of whether any
run/backfill has fired.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Search:
- Previously /api/events did one ILIKE %q% across the columns, so
"female city" required the literal substring "female city" to
appear somewhere. Now the query is tokenised on whitespace; every
token must match somewhere (AND), and each token matches either
by substring (ILIKE) across the searched columns OR by trigram
similarity (pg_trgm) against a concatenated text blob with a 0.3
threshold — handles typos like "femalle" → "female".
- Results ranked by summed similarity score across all tokens, then
recency. Empty query falls back to "newest 100".
- schema.sql: CREATE EXTENSION IF NOT EXISTS pg_trgm (idempotent;
applied by ensure_schema on api startup).
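Sketch of how the tokenised query could be assembled, assuming
psycopg-style `%s` placeholders. `BLOB` is a hypothetical,
shortened stand-in for the concatenated text blob, and for brevity the
ILIKE also runs against the blob rather than column by column. The
real SQL in api.py may differ.

```python
# Hypothetical subset of the searched columns, concatenated for matching.
BLOB = "concat_ws(' ', file_name, folder_path, description)"

def build_search_query(q, limit=100, threshold=0.3):
    """Return (sql, params): AND across tokens, ILIKE OR trigram per token."""
    tokens = q.split()
    if not tokens:
        # Empty query: newest rows only.
        sql = "SELECT * FROM tagging_events ORDER BY created_at DESC LIMIT %s"
        return sql, [limit]
    score = " + ".join(f"similarity({BLOB}, %s)" for _ in tokens)
    where = " AND ".join(
        f"({BLOB} ILIKE %s OR similarity({BLOB}, %s) >= %s)" for _ in tokens
    )
    sql = (
        f"SELECT *, ({score}) AS score FROM tagging_events "
        f"WHERE {where} ORDER BY score DESC, created_at DESC LIMIT %s"
    )
    # Parameter order follows placeholder order: score terms, then WHERE.
    params = list(tokens)
    for token in tokens:
        params += [f"%{token}%", token, threshold]
    params.append(limit)
    return sql, params
```

Tokens are passed as parameters, never interpolated into the SQL
string. A token containing `%` or `_` would still act as an ILIKE
wildcard in this sketch.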
Admin gating:
- auth.py: User now carries `is_admin`. Computed from a
comma-separated ADMIN_EMAILS env var (case-insensitive match
against `preferred_username`/`upn`/`email` claim). New
`require_admin` FastAPI dependency 403s non-admins.
- In DEV_AUTH_BYPASS mode the dev user is admin by default; flip
DEV_AUTH_IS_ADMIN=false to test the read-only UX without enabling
SSO.
- POST /api/runs and POST /api/backfill now gated by require_admin.
- /api/me carries is_admin so the SPA can hide the destructive
buttons for non-admins.
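A minimal sketch of the ADMIN_EMAILS check, assuming token claims
arrive as a plain dict. The helper names are hypothetical, and
matching on any of the three claims (rather than the first one
present) is an assumption.

```python
def parse_admin_emails(raw):
    """Turn a comma-separated env value into a lowercase set of emails."""
    return {e.strip().lower() for e in (raw or "").split(",") if e.strip()}

def is_admin(claims, admin_emails):
    """True if any identifying claim matches an admin email, ignoring case."""
    for key in ("preferred_username", "upn", "email"):
        value = claims.get(key)
        if isinstance(value, str) and value.strip().lower() in admin_emails:
            return True
    return False
```

An unset or empty ADMIN_EMAILS yields an empty set, so nobody is
admin by default in this sketch.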
Frontend:
- App.tsx fetches /api/me on mount and hides Run Now + Backfill
unless `is_admin` is true. Non-admins still see search + results +
recent-runs table.
docker-compose / .env.example: thread ADMIN_EMAILS +
DEV_AUTH_IS_ADMIN into the api container.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously a nightly APScheduler container fired the tagger on every
file in the configured Box folder. With ~5000 files coming, that's
~5000 Box HTTP calls every night just to ask "is this tagged?". Move
to manual-only mode and source the skip decision from the local DB.
- `db.is_file_already_tagged(conn, file_id)` — returns True iff the
DB has a row with status IN ('success','backfilled'). Used by both
image and video loops in main.py instead of the previous
`check_existing_metadata(box_client, file_id)` Box round-trip.
- `fetch_existing_metadata(box_client, file_id)` (main.py) — returns
the user-defined template fields as a flat dict by stripping the
Box `$id`/`$type`/etc. attrs from the SDK response.
- `_run_backfill(run_id, db_conn)` (main.py) — walks the Box folder
and inserts a `status='backfilled'` row for every file that already
has marriottUsa metadata in Box. Read-only against Box; safe to re-run.
Use this after first deploy, or to repopulate the DB from Box.
- `POST /api/backfill` mirrors `POST /api/runs` (background thread,
same live-state record).
- SPA: new "Backfill from Box" button next to "Run now" (with a
confirm dialog and a yellow `.status-backfilled` event treatment).
- docker-compose.yml: removed the `tagger` (scheduler) service.
Manual triggers via the SPA / `POST /api/runs` only. scheduler.py
stays in the repo for archival / opt-back-in.
- deploy.sh: readiness now checks the `api` container instead of
`tagger`; `--logs` tails api logs.
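The attribute stripping in `fetch_existing_metadata` can be pictured
roughly as below. This assumes the SDK response has already been
converted to a plain dict; `strip_box_attrs` is a hypothetical name,
not the function in main.py.

```python
def strip_box_attrs(raw):
    """Drop Box system attributes (keys starting with '$'), keep user fields."""
    return {k: v for k, v in raw.items() if not k.startswith("$")}
```

For example, `{"$id": "x", "$type": "t", "city": "NYC"}` reduces to
`{"city": "NYC"}`.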
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VITE_PUBLIC_BASE in .env is for local `npm run dev`. Honoring it at
build time on the server baked `base: "/"` into the bundle, so asset
URLs came out as `/assets/...` instead of
`/marriott-tagging/assets/...`, the script tag 404'd, and the SPA
rendered blank. deploy.sh now always builds with the prod URL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run model: long-running scheduler container (APScheduler) replacing the
systemd timer in Docker deployments. Every Gemini-analysed file is also
persisted to a Postgres `tagging_events` table (run_id, prompt, raw
response, validated metadata, Box-write outcomes, status, error, timing)
for search and audit. Box is still updated exactly as before and remains
the source of truth for "already tagged" — `db.log_event` swallows DB
failures so an outage can't stop a tagging pass.
Backend:
- `db.py` + `schema.sql` — append-only `tagging_events` with indexes on
run_id, file_id, created_at.
- `scheduler.py` — APScheduler BlockingScheduler with `SCHEDULE_CRON`
(default daily 02:00), `RUN_AT_STARTUP`, SIGTERM handling.
- `api.py` (FastAPI) — `/api/health`, `/api/me`, `/api/events?q=…`
(single-input search across file_name, folder_path, description,
status, file_id, validated_metadata::text, raw_response::text,
scenes::text), `POST /api/runs` (fire-and-forget pass in a background
thread), `/api/runs`, `/api/runs/{id}/events`. Every event response
carries a synthesised `box_url`.
- `auth.py` — Azure AD bearer-token validation against the tenant JWKS
(signature + aud + iss). `DEV_AUTH_BYPASS=true` short-circuits to a
configurable dev user, mirrored on the frontend by
`VITE_DEV_AUTH_BYPASS`.
Frontend (Vite + React + TS):
- `frontend/` SPA, Montserrat + black/white/#FFC407 palette.
- @azure/msal-react with the bypass switch (auto-signin when bypass off).
- Search bar across all logged fields, results list with metadata tags,
status pills, and "Open in Box ↗" links.
- "Run now" button kicks off a tagging pass via `POST /api/runs` and
polls `/api/runs/{id}/events` every 2 s for live progress.
Docker / compose:
- `docker-compose.yml` pins `name: marriott-tagging`. Three services:
`db` (postgres:16, named volume, bound to 127.0.0.1 only), `tagger`
(scheduler.py), `api` (uvicorn). Same image, different `command`.
- `Dockerfile` — python:3.12-slim, non-root user.
Deploy (optical-dev.oliver.solutions):
- `deploy/deploy.sh` — idempotent. Auto-picks free host ports
(POSTGRES_HOST_PORT 5435-5499, MARRIOTT_API_PORT 8003-8099), renders
`apache-marriott-tagging.conf` from the .tmpl, builds the SPA in a
one-shot node:20-alpine container, rsyncs `dist/` to
`/var/www/html/marriott-tagging/`, polls `/api/health`, and prints the
shared-vhost Include line.
- `apache-marriott-tagging.conf.tmpl` — proxy `/marriott-tagging/api/`
to the API container, alias `/marriott-tagging` to the SPA web-root,
SPA fallback to `index.html`.
systemd unit files left in place for the existing Ubuntu deployment
path; do not run both on the same host (would double-fire the tagger).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add full Server Deployment section with prereqs, user setup, credential
placement, venv setup, unit installation, verification, and update flow
- Tailored for Ubuntu 22.04/24.04 (notes python3-venv apt package gotcha)
- True up Configuration table with current constants (video size limits,
per-run cap, video delay, excluded folder prefixes)
- Update How It Works to cover keyword-tail descriptions, scene-breakdown
comments, large-video gating, and the per-run limiter
- Mention videos in the intro (was previously images-only)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Box JWT + Gemini integration for image and video metadata tagging
- Description format includes search-keyword tail to address synonym gaps
(e.g. "Food" search now hits assets tagged "Dining")
- Skip videos exceeding 5GB source or 400MB proxy (~60min runtime, beyond
Gemini context budget) — counted as skipped, not errored
- Hardened None-response handling in Gemini JSON parser
- Per-run limiter: 200 newly-tagged files / 4 hour wall-clock cap, with
clean exit and resumable progress on next run
- systemd service + timer for daily 2am tagging passes
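Sketch of a per-run limiter along these lines. `RunLimiter` and its
methods are hypothetical names, not the code in main.py; the injected
`clock` exists only so the time cap can be exercised without waiting.

```python
import time

class RunLimiter:
    """Stop a pass after N newly-tagged files or a wall-clock budget."""

    def __init__(self, max_files=200, max_seconds=4 * 3600, clock=time.monotonic):
        self.max_files = max_files
        self.max_seconds = max_seconds
        self.clock = clock
        self.started = clock()
        self.tagged = 0

    def record_tagged(self):
        """Count one newly-tagged file (skipped files are not counted)."""
        self.tagged += 1

    def should_stop(self):
        """Return a reason string when a cap is hit, else None."""
        if self.tagged >= self.max_files:
            return "file cap reached"
        if self.clock() - self.started >= self.max_seconds:
            return "time cap reached"
        return None
```

Checking `should_stop()` between files gives the clean exit: the
loop ends at a file boundary, and the next run resumes because
already-tagged files are skipped.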
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>