Search results were text-only — hard to scan visually with thousands
of assets coming. Now every file Gemini-tags or backfill mirrors also
gets its Box-generated 160x160 JPG thumbnail (~10-20 KB) pulled and
stashed in Postgres, plus a consolidated `search_terms` blob
(file_name + folder + description + flattened metadata values).
Search results render the thumbnail inline; rows missing one show a
striped placeholder. Search SQL now LEFT JOINs file_assets and hits
search_terms too, so backfilled rows are properly searchable.
- schema.sql: new `file_assets` table (file_id PK, thumbnail_bytes
bytea, search_terms text, updated_at). idempotent.
- db.py: `upsert_file_asset` (INSERT … ON CONFLICT preserving
existing thumbnail bytes if today's fetch failed) and
`get_thumbnail`. Both swallow exceptions per the established
defensive pattern.
- main.py: `fetch_thumbnail` (Box SDK get_file_thumbnail_by_id, JPG
at 160 px, handles BoxAPIError 202/404 as soft misses) and
`build_search_terms` (lowercase, whitespace-collapsed text blob).
`_persist_file_asset` wires both into the image+video success
paths of `_run_pass` and into every iteration of `_run_backfill`.
- Backfill skip logic refined: always upsert file_assets (idempotent
PK), only skip the tagging_events insert if a good row already
exists. Re-running Backfill from Box populates thumbnails for
rows backfilled before this feature shipped.
- api.py: `GET /api/files/{file_id}/thumbnail` streams the bytea
with Cache-Control max-age=86400. Search SQL gains the LEFT JOIN
and emits `has_thumbnail` per row. Search also matches against
fa.search_terms so backfilled rows surface for free-text queries
that hit their metadata.
- frontend: Event type adds `has_thumbnail`; `thumbnailUrl(fileId)`
helper builds the prefix-aware URL via Vite's BASE. EventList
renders the thumbnail (lazy, with onError fallback) or a striped
placeholder. .thumb styling + .event-head layout in styles.css.
Verified locally: schema applies via lifespan; upsert + get_thumbnail
roundtrip; /api/files/999/thumbnail returns 200 with bytes; /api/events
returns has_thumbnail per row; multi-token "female city" search finds
a row whose validated_metadata contains both tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously a nightly APScheduler container fired the tagger on every
file in the configured Box folder. With ~5000 files coming, that's
~5000 Box HTTP calls every night just to ask "is this tagged?". Move
to manual-only mode and source the skip decision from the local DB.
- `db.is_file_already_tagged(conn, file_id)` — returns True iff the
DB has a row with status IN ('success','backfilled'). Used by both
image and video loops in main.py instead of the previous
`check_existing_metadata(box_client, file_id)` Box round-trip.
- `fetch_existing_metadata(box_client, file_id)` (main.py) — returns
the user-defined template fields as a flat dict by stripping the
Box `$id`/`$type`/etc. attrs from the SDK response.
- `_run_backfill(run_id, db_conn)` (main.py) — walks the Box folder
and inserts a `status='backfilled'` row for every file Box already
has marriottUsa metadata for. Read-only against Box; safe to re-run.
Use this after first deploy, or to repopulate the DB from Box.
- `POST /api/backfill` mirrors `POST /api/runs` (background thread,
same live-state record).
- SPA: new "Backfill from Box" button next to "Run now" (with a
confirm dialog and a yellow `.status-backfilled` event treatment).
- docker-compose.yml: removed the `tagger` (scheduler) service.
Manual triggers via the SPA / `POST /api/runs` only. scheduler.py
stays in the repo for archival / opt-back-in.
- deploy.sh: readiness now checks the `api` container instead of
`tagger`; `--logs` tails api logs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run model: long-running scheduler container (APScheduler) replacing the
systemd timer in Docker deployments. Every Gemini-analysed file is also
persisted to a Postgres `tagging_events` table (run_id, prompt, raw
response, validated metadata, Box-write outcomes, status, error, timing)
for search and audit. Box is still updated exactly as before and remains
the source of truth for "already tagged" — `db.log_event` swallows DB
failures so an outage can't stop a tagging pass.
Backend:
- `db.py` + `schema.sql` — append-only `tagging_events` with indexes on
run_id, file_id, created_at.
- `scheduler.py` — APScheduler BlockingScheduler with `SCHEDULE_CRON`
(default daily 02:00), `RUN_AT_STARTUP`, SIGTERM handling.
- `api.py` (FastAPI) — `/api/health`, `/api/me`, `/api/events?q=…`
(single-input search across file_name, folder_path, description,
status, file_id, validated_metadata::text, raw_response::text,
scenes::text), `POST /api/runs` (fire-and-forget pass in a background
thread), `/api/runs`, `/api/runs/{id}/events`. Every event response
carries a synthesised `box_url`.
- `auth.py` — Azure AD bearer-token validation against the tenant JWKS
(signature + aud + iss). `DEV_AUTH_BYPASS=true` short-circuits to a
configurable dev user, mirrored on the frontend by
`VITE_DEV_AUTH_BYPASS`.
Frontend (Vite + React + TS):
- `frontend/` SPA, Montserrat + black/white/#FFC407 palette.
- @azure/msal-react with the bypass switch (auto-signin when bypass off).
- Search bar across all logged fields, results list with metadata tags,
status pills, and "Open in Box ↗" links.
- "Run now" button kicks off a tagging pass via `POST /api/runs` and
polls `/api/runs/{id}/events` every 2 s for live progress.
Docker / compose:
- `docker-compose.yml` pins `name: marriott-tagging`. Three services:
`db` (postgres:16, named volume, bound to 127.0.0.1 only), `tagger`
(scheduler.py), `api` (uvicorn). Same image, different `command`.
- `Dockerfile` — python:3.12-slim, non-root user.
Deploy (optical-dev.oliver.solutions):
- `deploy/deploy.sh` — idempotent. Auto-picks free host ports
(POSTGRES_HOST_PORT 5435-5499, MARRIOTT_API_PORT 8003-8099), renders
`apache-marriott-tagging.conf` from the .tmpl, builds the SPA in a
one-shot node:20-alpine container, rsyncs `dist/` to
`/var/www/html/marriott-tagging/`, polls `/api/health`, and prints the
shared-vhost Include line.
- `apache-marriott-tagging.conf.tmpl` — proxy `/marriott-tagging/api/`
to the API container, alias `/marriott-tagging` to the SPA web-root,
SPA fallback to `index.html`.
systemd unit files left in place for the existing Ubuntu deployment
path; do not run both on the same host (would double-fire the tagger).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>