video-accessibility/docs/project/database_schema.md
Vadym Samoilenko a3b300b76a docs: add canonical documentation + audit cleanup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 14:22:51 +01:00


Database Schema — Accessible Video Processing Platform

Database: MongoDB Atlas

Database name: configured via the MONGODB_DB env var (default: accessible_video)


Collections

jobs

Central document for each video accessibility job.

| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Primary key |
| org_id | ObjectId | Owning organisation |
| client_user_id | ObjectId | User who uploaded the video |
| status | string | JobStatus enum (16 values; see architecture.md) |
| source_language | string | BCP-47 code (e.g., en-US) |
| requested_outputs | array[string] | Output language codes |
| source | object | { gcs_path, filename, duration_seconds } |
| outputs | object | Per-language { captions_vtt, ad_vtt, ad_mp3, accessible_mp4 } GCS paths |
| review | object | QC state { reviewer_id, approved_at, rejected_at, reason } |
| language_qc | object | Per-language QC state (see LanguageQCState below) |
| vtt_versions | array | Version snapshot references (see vtt_versions collection) |
| glossary_id | ObjectId | Client glossary to use for translation |
| retry_count | int | Number of task retries |
| error | string | Last error message |
| created_at | datetime | ISO 8601 |
| updated_at | datetime | ISO 8601 |
| completed_at | datetime | ISO 8601 |

LanguageQCState (per-language, nested in language_qc):

| Field | Type | Description |
|---|---|---|
| status | string | pending, assigned, approved, rejected, feedback_requested |
| linguist_id | ObjectId | Assigned linguist (nullable) |
| assigned_at | datetime | When linguist was assigned |
| reviewed_at | datetime | When approved/rejected |
| reason | string | Rejection or feedback reason |
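
As a minimal sketch of how this per-language state might be initialised when a job is created: the helper name `initial_language_qc` is hypothetical, and keying `language_qc` by output language code is an assumption, since the schema only says the state is per-language.

```python
from typing import Any, Dict, List

# Status values listed in the LanguageQCState table.
LANGUAGE_QC_STATUSES = {
    "pending", "assigned", "approved", "rejected", "feedback_requested",
}


def initial_language_qc(requested_outputs: List[str]) -> Dict[str, Dict[str, Any]]:
    """Build a language_qc map with one pending entry per requested language.

    Assumption: the map is keyed by language code.
    """
    return {
        lang: {
            "status": "pending",
            "linguist_id": None,   # nullable until a linguist is assigned
            "assigned_at": None,
            "reviewed_at": None,
            "reason": None,
        }
        for lang in requested_outputs
    }
```

For example, `initial_language_qc(["fr-FR", "de-DE"])` would yield two entries, both with status `pending`.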

Indexes:

| Index | Fields | Purpose |
|---|---|---|
| Primary | _id | Document lookup |
| org_status | org_id + status | List jobs by org and status |
| client | client_user_id | Client's own jobs |
| created | created_at (desc) | Time-sorted listing |
| status | status | Status-filtered queries |
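
The index table above could be expressed in code roughly as follows. This is a sketch, not the project's `create_indexes()` function: the function name `create_job_indexes` is hypothetical, and ascending order is assumed wherever the table does not say `(desc)`. The `_id` index is omitted because MongoDB creates it automatically.

```python
from typing import Any, List, Tuple

# Same values as pymongo.ASCENDING / pymongo.DESCENDING.
ASCENDING, DESCENDING = 1, -1

# (index name, key specification) pairs mirroring the jobs index table.
JOB_INDEXES: List[Tuple[str, List[Tuple[str, int]]]] = [
    ("org_status", [("org_id", ASCENDING), ("status", ASCENDING)]),
    ("client", [("client_user_id", ASCENDING)]),
    ("created", [("created_at", DESCENDING)]),
    ("status", [("status", ASCENDING)]),
]


def create_job_indexes(jobs: Any) -> List[str]:
    """Create the jobs indexes on a pymongo-style collection.

    `jobs` is any object exposing create_index(keys, name=...), such as a
    pymongo Collection. Returns the index names that were requested.
    """
    for name, keys in JOB_INDEXES:
        jobs.create_index(keys, name=name)
    return [name for name, _ in JOB_INDEXES]
```

Passing explicit names keeps the deployed index names aligned with the ones documented here, rather than MongoDB's auto-generated names such as `org_id_1_status_1`.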

users

| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Primary key |
| email | string | Unique, lowercase |
| hashed_password | string | bcrypt hash (null for SSO-only users) |
| role | string | client, reviewer, linguist, pm, admin |
| org_id | ObjectId | Primary organisation |
| is_active | boolean | Account enabled flag |
| microsoft_id | string | Entra ID subject claim (nullable) |
| created_at | datetime | |
| updated_at | datetime | |

Indexes:

| Index | Fields | Purpose |
|---|---|---|
| email_unique | email (unique) | Login lookup |
| org | org_id | Members-of-org query |
| microsoft | microsoft_id (sparse) | SSO user lookup |
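
Because `email` is stored lowercase and enforced by a unique index, callers presumably need to normalise addresses before inserting or looking up users; otherwise `Ada@Example.com` and `ada@example.com` would count as different keys. A minimal sketch, where both helper names are hypothetical and trimming whitespace and filtering on `is_active` are assumptions rather than documented behaviour:

```python
from typing import Any, Dict


def normalize_email(raw: str) -> str:
    """Trim surrounding whitespace and lowercase the address."""
    return raw.strip().lower()


def login_filter(raw_email: str) -> Dict[str, Any]:
    """Query filter for a login lookup served by the email_unique index.

    Assumption: disabled accounts (is_active == False) should not match.
    """
    return {"email": normalize_email(raw_email), "is_active": True}
```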

organizations

| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Primary key |
| name | string | Organisation display name |
| slug | string | URL-safe identifier |
| member_ids | array[ObjectId] | User IDs in this org |
| created_at | datetime | |

Indexes:

| Index | Fields | Purpose |
|---|---|---|
| slug_unique | slug (unique) | Org lookup by slug |

glossaries

| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Primary key |
| org_id | ObjectId | Owning organisation |
| name | string | Glossary display name |
| terms | array | Array of GlossaryTerm documents |
| created_at | datetime | |
| updated_at | datetime | |

GlossaryTerm (embedded in terms):

| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Term ID |
| source_term | string | Term in source language |
| target_language | string | BCP-47 code |
| preferred_translation | string | Required translation |
| context | string | Usage notes (optional) |
| embedding | array[float] | Vector embedding for similarity search |

Indexes:

| Index | Fields | Purpose |
|---|---|---|
| org | org_id | List org glossaries |
| vector | terms.embedding (Atlas Vector Search) | Similarity retrieval |

Atlas Vector Search index name: glossary_embedding_index
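
A similarity query against this index would typically be an aggregation pipeline that starts with a `$vectorSearch` stage. The sketch below only builds the pipeline. The function name `glossary_search_pipeline` and the oversampling factor for `numCandidates` are assumptions. Whether Atlas accepts the nested `terms.embedding` path depends on the index definition, which this document does not show.

```python
from typing import Any, Dict, List


def glossary_search_pipeline(
    query_vector: List[float], limit: int = 5
) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline for similarity retrieval on glossaries."""
    return [
        {
            "$vectorSearch": {
                "index": "glossary_embedding_index",  # name from this document
                "path": "terms.embedding",            # field from the index table
                "queryVector": query_vector,
                "numCandidates": limit * 20,          # assumed oversampling factor
                "limit": limit,
            }
        },
        # Return the similarity score alongside identifying fields.
        {
            "$project": {
                "name": 1,
                "org_id": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With a pymongo database handle, the result would be passed to `db.glossaries.aggregate(...)`. Restricting results to one organisation inside `$vectorSearch` would need `org_id` declared as a filter field in the index definition, which is not confirmed here.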


vtt_versions

Immutable version snapshots created before each VTT save.

| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Primary key |
| job_id | ObjectId | Parent job |
| language | string | Language code |
| version_number | int | Sequential version number |
| content | string | Full VTT file content at time of snapshot |
| author_id | ObjectId | User who made the change |
| created_at | datetime | Snapshot timestamp |
| diff_from_prev | string | Diff against previous version (optional) |

Indexes:

| Index | Fields | Purpose |
|---|---|---|
| job_lang | job_id + language + version_number | Version history listing |
| job_lang_created | job_id + language + created_at (desc) | Time-sorted history |
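
To illustrate how a snapshot document with a sequential `version_number` and an optional `diff_from_prev` might be assembled, here is a sketch using only the standard library. The function name `build_snapshot`, numbering from 1, and the unified-diff format are all assumptions. The schema does not say how the diff is encoded.

```python
import difflib
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_snapshot(
    job_id: Any,
    language: str,
    content: str,
    author_id: Any,
    previous: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a vtt_versions document for the content being snapshotted.

    `previous` is the latest existing snapshot for this job and language,
    or None when this is the first version.
    """
    diff = None
    version_number = 1  # assumed starting point
    if previous is not None:
        version_number = previous["version_number"] + 1
        diff = "".join(
            difflib.unified_diff(
                previous["content"].splitlines(keepends=True),
                content.splitlines(keepends=True),
                fromfile=f"v{previous['version_number']}",
                tofile=f"v{version_number}",
            )
        )
    return {
        "job_id": job_id,
        "language": language,
        "version_number": version_number,
        "content": content,
        "author_id": author_id,
        "created_at": datetime.now(timezone.utc),
        "diff_from_prev": diff,
    }
```

The latest snapshot could be found with a query on `job_id` and `language` sorted by `version_number` descending, which the `job_lang` index is positioned to serve.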

audit_logs

Immutable audit trail for all reviewer, linguist, and PM actions.

| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Primary key |
| actor_id | ObjectId | User performing the action |
| actor_email | string | Denormalised for readability |
| action | string | Action type enum (see below) |
| job_id | ObjectId | Affected job (nullable) |
| org_id | ObjectId | Organisation context |
| before_state | string | Job status before action |
| after_state | string | Job status after action |
| metadata | object | Action-specific context (reason, language, etc.) |
| created_at | datetime | Event timestamp |

Action types:

| Action | Trigger |
|---|---|
| job_approved | QC approve |
| job_rejected | QC reject |
| qc_feedback_sent | QC feedback |
| language_approved | Language-level QC approve |
| language_rejected | Language-level QC reject |
| linguist_assigned | PM assigns linguist |
| vtt_edited | VTT content saved |
| vtt_restored | Version restore |
| job_retry | Admin manual retry |
| user_invited | PM/Admin invites member |

Indexes:

| Index | Fields | Purpose |
|---|---|---|
| job | job_id + created_at | Per-job audit trail |
| org_created | org_id + created_at (desc) | Org-level audit log |
| actor | actor_id + created_at | Per-user action history |
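
A sketch of how an audit entry might be built and validated against the action types listed above. The function name `build_audit_entry` is hypothetical, and rejecting unknown actions with a `ValueError` is an assumed policy, not something the schema prescribes.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Action types from the audit_logs action table.
AUDIT_ACTIONS = frozenset({
    "job_approved", "job_rejected", "qc_feedback_sent",
    "language_approved", "language_rejected", "linguist_assigned",
    "vtt_edited", "vtt_restored", "job_retry", "user_invited",
})


def build_audit_entry(
    actor_id: Any,
    actor_email: str,
    action: str,
    org_id: Any,
    job_id: Optional[Any] = None,
    before_state: Optional[str] = None,
    after_state: Optional[str] = None,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build an audit_logs document; raises ValueError for unknown actions."""
    if action not in AUDIT_ACTIONS:
        raise ValueError(f"unknown audit action: {action!r}")
    return {
        "actor_id": actor_id,
        "actor_email": actor_email,  # denormalised copy for readability
        "action": action,
        "job_id": job_id,            # nullable, e.g. for user_invited
        "org_id": org_id,
        "before_state": before_state,
        "after_state": after_state,
        "metadata": metadata or {},
        "created_at": datetime.now(timezone.utc),
    }
```

Since the collection is described as immutable, entries like this would only ever be inserted, never updated or deleted.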

invitations

| Field | Type | Description |
|---|---|---|
| _id | ObjectId | Primary key |
| email | string | Invitee email |
| org_id | ObjectId | Org being joined |
| role | string | Role to assign on accept |
| token | string | Unique invite token (hashed) |
| expires_at | datetime | 7-day expiry |
| accepted_at | datetime | Nullable; set on accept |
| created_by | ObjectId | User who sent invite |
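
The hashed token and 7-day expiry could be produced along these lines. This is a sketch under stated assumptions: the schema does not name the hash algorithm, so SHA-256 is a stand-in, and `hash_token` and `new_invitation` are hypothetical names.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Optional, Tuple

INVITE_TTL = timedelta(days=7)  # 7-day expiry from the invitations table


def hash_token(token: str) -> str:
    """SHA-256 hex digest of the raw token (algorithm is an assumption)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def new_invitation(
    email: str,
    org_id: Any,
    role: str,
    created_by: Any,
    now: Optional[datetime] = None,
) -> Tuple[str, Dict[str, Any]]:
    """Return (raw_token, invitation_document).

    The raw token goes into the invite link; only its hash is stored.
    """
    now = now or datetime.now(timezone.utc)
    raw_token = secrets.token_urlsafe(32)
    document = {
        "email": email,
        "org_id": org_id,
        "role": role,
        "token": hash_token(raw_token),
        "expires_at": now + INVITE_TTL,
        "accepted_at": None,  # set when the invite is accepted
        "created_by": created_by,
    }
    return raw_token, document
```

On accept, the incoming token would be hashed with the same function and matched against `token`, with `expires_at` checked against the current time.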

Maintenance

Update triggers: a new collection is added, an index is added or removed, or a field is added to a model.

Verification: all collections listed here exist in production Atlas. Index names match the create_indexes() function in backend/app/core/database.py (currently commented out; indexes were created manually).
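
Since the indexes were created manually, the verification step could be partly automated by comparing the index names documented here with what each collection reports. A minimal sketch, where the function name `missing_indexes` is hypothetical and the input shapes are assumptions:

```python
from typing import Dict, Iterable, List, Mapping


def missing_indexes(
    expected: Mapping[str, Iterable[str]],
    actual: Mapping[str, Iterable[str]],
) -> Dict[str, List[str]]:
    """Return, per collection, documented index names that are absent.

    Both arguments map collection name -> index names. Collections with
    nothing missing are left out of the report.
    """
    report: Dict[str, List[str]] = {}
    for collection, names in expected.items():
        absent = sorted(set(names) - set(actual.get(collection, ())))
        if absent:
            report[collection] = absent
    return report
```

With pymongo, the `actual` mapping could be gathered from `db[name].index_information()`, whose keys are index names. The automatic `_id_` index appears there too and is harmless, because only documented names are checked.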