video-accessibility/docs/project/database_schema.md
Vadym Samoilenko a3b300b76a docs: add canonical documentation + audit cleanup
- AGENTS.md: canonical project entry point (Quick Nav, pipeline, constraints)
- docs/: complete docs tree — architecture, API spec, DB schema, infra,
  runbook, requirements, tech stack, principles, reference ADRs, guides,
  tasks backlog, testing strategy
- tests/README.md: test commands, structure, known gaps
- README.md / CLAUDE.md / DEPLOYMENT.md: updated with canonical doc links
- .archive/: backup of pre-documentation-pipeline originals
- backend/uv.lock: uv dependency lockfile
- Delete committed __pycache__ .pyc files (should have been gitignored)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 14:22:51 +01:00


# Database Schema — Accessible Video Processing Platform
<!-- SCOPE: database-schema | owner: ln-113 | generated: 2026-04-29 -->
**Database:** MongoDB Atlas

**Database name:** configured via `MONGODB_DB` env var (default: `accessible_video`)

---
## Collections
### `jobs`
Central document for each video accessibility job.
| Field | Type | Description |
|-------|------|-------------|
| _id | ObjectId | Primary key |
| org_id | ObjectId | Owning organisation |
| client_user_id | ObjectId | User who uploaded the video |
| status | string | JobStatus enum (16 values — see architecture.md) |
| source_language | string | BCP-47 code (e.g., `en-US`) |
| requested_outputs | array[string] | Output language codes |
| source | object | `{ gcs_path, filename, duration_seconds }` |
| outputs | object | Per-language `{ captions_vtt, ad_vtt, ad_mp3, accessible_mp4 }` GCS paths |
| review | object | QC state `{ reviewer_id, approved_at, rejected_at, reason }` |
| language_qc | object | Per-language QC state (see LanguageQCState below) |
| vtt_versions | array | Version snapshot references (see `vtt_versions` collection) |
| glossary_id | ObjectId | Client glossary to use for translation |
| retry_count | int | Number of task retries |
| error | string | Last error message |
| created_at | datetime | Creation timestamp (ISO 8601) |
| updated_at | datetime | Last-update timestamp (ISO 8601) |
| completed_at | datetime | Completion timestamp (ISO 8601) |
**LanguageQCState (per-language, nested in `language_qc`):**
| Field | Type | Description |
|-------|------|-------------|
| status | string | `pending`, `assigned`, `approved`, `rejected`, `feedback_requested` |
| linguist_id | ObjectId | Assigned linguist (nullable) |
| assigned_at | datetime | When linguist was assigned |
| reviewed_at | datetime | When approved/rejected |
| reason | string | Rejection or feedback reason |
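
As an illustration of the nested shape, the sketch below models one `language_qc` entry in plain Python. The names `QCStatus`, `LanguageQCState` (as a dataclass) and `assign_linguist` are hypothetical, IDs are plain strings rather than ObjectId values so the example needs no driver, and keying `language_qc` by language code is an assumption rather than something this document states.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class QCStatus(str, Enum):
    """The five per-language statuses listed in the table above."""
    PENDING = "pending"
    ASSIGNED = "assigned"
    APPROVED = "approved"
    REJECTED = "rejected"
    FEEDBACK_REQUESTED = "feedback_requested"


@dataclass
class LanguageQCState:
    status: QCStatus = QCStatus.PENDING
    linguist_id: Optional[str] = None      # ObjectId in MongoDB; str here (assumption)
    assigned_at: Optional[datetime] = None
    reviewed_at: Optional[datetime] = None
    reason: Optional[str] = None


def assign_linguist(state, linguist_id, now=None):
    """Return a copy of `state` with a linguist assigned (hypothetical helper)."""
    return replace(
        state,
        status=QCStatus.ASSIGNED,
        linguist_id=linguist_id,
        assigned_at=now or datetime.now(timezone.utc),
    )


# Assumption: `language_qc` maps a language code to its QC state.
language_qc = {"fr-FR": LanguageQCState(), "de-DE": LanguageQCState()}
language_qc["fr-FR"] = assign_linguist(language_qc["fr-FR"], "662f0c0e9b1d4a0012345678")
```
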
**Indexes:**
| Index | Fields | Purpose |
|-------|--------|---------|
| Primary | `_id` | Document lookup |
| org_status | `org_id` + `status` | List jobs by org and status |
| client | `client_user_id` | Client's own jobs |
| created | `created_at` (desc) | Time-sorted listing |
| status | `status` | Status-filtered queries |
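
The index table above could be expressed as data and applied through a driver, roughly as sketched here. `JOBS_INDEXES`, `apply_indexes` and `FakeCollection` are hypothetical names; the ascending direction for fields not marked "desc" is an assumption; and the only driver call relied on is pymongo's `Collection.create_index(keys, **options)`, which returns the index name.

```python
# pymongo's ASCENDING and DESCENDING constants are 1 and -1; plain ints keep this stdlib-only.
ASCENDING, DESCENDING = 1, -1

# Names and fields copied from the table above; the default `_id` index is omitted.
JOBS_INDEXES = [
    {"name": "org_status", "keys": [("org_id", ASCENDING), ("status", ASCENDING)]},
    {"name": "client", "keys": [("client_user_id", ASCENDING)]},
    {"name": "created", "keys": [("created_at", DESCENDING)]},
    {"name": "status", "keys": [("status", ASCENDING)]},
]


def apply_indexes(collection, specs):
    """Create each index on a pymongo-style collection and return the names created."""
    created = []
    for spec in specs:
        options = {k: v for k, v in spec.items() if k != "keys"}
        created.append(collection.create_index(spec["keys"], **options))
    return created


class FakeCollection:
    """Stand-in that records calls, so the sketch runs without a database."""

    def __init__(self):
        self.calls = []

    def create_index(self, keys, **options):
        self.calls.append((keys, options))
        return options["name"]


fake = FakeCollection()
names = apply_indexes(fake, JOBS_INDEXES)
```

With a real connection, `fake` would be replaced by the `jobs` collection object.
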
---
### `users`
| Field | Type | Description |
|-------|------|-------------|
| _id | ObjectId | Primary key |
| email | string | Unique, lowercase |
| hashed_password | string | bcrypt hash (null for SSO-only users) |
| role | string | `client`, `reviewer`, `linguist`, `pm`, `admin` |
| org_id | ObjectId | Primary organisation |
| is_active | boolean | Account enabled flag |
| microsoft_id | string | Entra ID subject claim (nullable) |
| created_at | datetime | Creation timestamp |
| updated_at | datetime | Last-update timestamp |
**Indexes:**
| Index | Fields | Purpose |
|-------|--------|---------|
| email_unique | `email` (unique) | Login lookup |
| org | `org_id` | Members-of-org query |
| microsoft | `microsoft_id` (sparse) | SSO user lookup |
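
Because `email` is stored lowercase under a unique index, input presumably needs normalising before lookup or insert. A minimal sketch follows; `normalise_email` and `is_sso_only` are hypothetical helpers, and trimming whitespace is an assumption beyond what the table states.

```python
def normalise_email(raw):
    """Trim and lowercase so `Ana@Example.com` and `ana@example.com` hit the same index entry."""
    return raw.strip().lower()


def is_sso_only(user):
    """True when the document has no password hash but does have an Entra ID subject."""
    return user.get("hashed_password") is None and user.get("microsoft_id") is not None


# Illustrative document; the values are made up.
sso_user = {
    "email": normalise_email("  Ana@Example.com "),
    "hashed_password": None,
    "role": "reviewer",
    "microsoft_id": "entra-subject-123",
    "is_active": True,
}
```
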
---
### `organizations`
| Field | Type | Description |
|-------|------|-------------|
| _id | ObjectId | Primary key |
| name | string | Organisation display name |
| slug | string | URL-safe identifier |
| member_ids | array[ObjectId] | User IDs in this org |
| created_at | datetime | Creation timestamp |
**Indexes:**
| Index | Fields | Purpose |
|-------|--------|---------|
| slug_unique | `slug` (unique) | Org lookup by slug |
---
### `glossaries`
| Field | Type | Description |
|-------|------|-------------|
| _id | ObjectId | Primary key |
| org_id | ObjectId | Owning organisation |
| name | string | Glossary display name |
| terms | array | Array of GlossaryTerm documents |
| created_at | datetime | Creation timestamp |
| updated_at | datetime | Last-update timestamp |
**GlossaryTerm (embedded in `terms`):**
| Field | Type | Description |
|-------|------|-------------|
| _id | ObjectId | Term ID |
| source_term | string | Term in source language |
| target_language | string | BCP-47 code |
| preferred_translation | string | Required translation |
| context | string | Usage notes (optional) |
| embedding | array[float] | Vector embedding for similarity search |
**Indexes:**
| Index | Fields | Purpose |
|-------|--------|---------|
| org | `org_id` | List org glossaries |
| vector | `terms.embedding` (Atlas Vector Search) | Similarity retrieval |
**Atlas Vector Search index name:** `glossary_embedding_index`
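
A similarity lookup against this index would typically run as an aggregation pipeline whose first stage is `$vectorSearch`. The sketch below only builds that pipeline; `build_glossary_search` is a hypothetical name, the `numCandidates` and `limit` defaults are assumed tuning values, and whether `terms.embedding` (a field inside an array of embedded documents) is searchable this way depends on the actual index definition, which this document does not show.

```python
def build_glossary_search(query_vector, limit=5, num_candidates=100):
    """Build an aggregation pipeline that starts with a `$vectorSearch` stage."""
    return [
        {
            "$vectorSearch": {
                "index": "glossary_embedding_index",  # index name from this document
                "path": "terms.embedding",            # indexed field from the table above
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,      # assumed tuning value
                "limit": limit,
            }
        },
        # Keep a few fields plus the similarity score reported by Atlas.
        {"$project": {"name": 1, "org_id": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


pipeline = build_glossary_search([0.12, -0.03, 0.98], limit=3)
```

The resulting list would be passed to the `glossaries` collection's `aggregate()` method.
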
---
### `vtt_versions`
Immutable version snapshots created before each VTT save.
| Field | Type | Description |
|-------|------|-------------|
| _id | ObjectId | Primary key |
| job_id | ObjectId | Parent job |
| language | string | Language code |
| version_number | int | Sequential version number |
| content | string | Full VTT file content at time of snapshot |
| author_id | ObjectId | User who made the change |
| created_at | datetime | Snapshot timestamp |
| diff_from_prev | string | Diff against previous version (optional) |
**Indexes:**
| Index | Fields | Purpose |
|-------|--------|---------|
| job_lang | `job_id` + `language` + `version_number` | Version history listing |
| job_lang_created | `job_id` + `language` + `created_at` (desc) | Time-sorted history |
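
One way a snapshot document might be assembled is sketched below. `make_snapshot` is a hypothetical helper; numbering from 1, storing a unified diff in `diff_from_prev`, and using plain strings for IDs are all assumptions, not confirmed by the backend code.

```python
import difflib
from datetime import datetime, timezone


def make_snapshot(job_id, language, content, author_id, previous=None):
    """Build the next `vtt_versions` document; `previous` is the latest stored snapshot."""
    number = previous["version_number"] + 1 if previous else 1  # assumed to start at 1
    diff = None
    if previous:
        diff = "".join(
            difflib.unified_diff(
                previous["content"].splitlines(keepends=True),
                content.splitlines(keepends=True),
                fromfile=f"v{number - 1}",
                tofile=f"v{number}",
            )
        )
    return {
        "job_id": job_id,
        "language": language,
        "version_number": number,
        "content": content,
        "author_id": author_id,
        "created_at": datetime.now(timezone.utc),
        "diff_from_prev": diff,
    }


first = make_snapshot("job-1", "en-US", "WEBVTT\n\n00:00.000 --> 00:02.000\nHello\n", "user-1")
second = make_snapshot(
    "job-1", "en-US", "WEBVTT\n\n00:00.000 --> 00:02.000\nHello there\n", "user-2",
    previous=first,
)
```

Since snapshots are immutable, each call produces a new document to insert rather than an update to an existing one.
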
---
### `audit_logs`
Immutable audit trail for all reviewer, linguist, PM, and admin actions.
| Field | Type | Description |
|-------|------|-------------|
| _id | ObjectId | Primary key |
| actor_id | ObjectId | User performing the action |
| actor_email | string | Denormalised for readability |
| action | string | Action type enum (see below) |
| job_id | ObjectId | Affected job (nullable) |
| org_id | ObjectId | Organisation context |
| before_state | string | Job status before action |
| after_state | string | Job status after action |
| metadata | object | Action-specific context (reason, language, etc.) |
| created_at | datetime | Event timestamp |
**Action types:**
| Action | Trigger |
|--------|---------|
| `job_approved` | QC approve |
| `job_rejected` | QC reject |
| `qc_feedback_sent` | QC feedback |
| `language_approved` | Language-level QC approve |
| `language_rejected` | Language-level QC reject |
| `linguist_assigned` | PM assigns linguist |
| `vtt_edited` | VTT content saved |
| `vtt_restored` | Version restore |
| `job_retry` | Admin manual retry |
| `user_invited` | PM/Admin invites member |
**Indexes:**
| Index | Fields | Purpose |
|-------|--------|---------|
| job | `job_id` + `created_at` | Per-job audit trail |
| org_created | `org_id` + `created_at` (desc) | Org-level audit log |
| actor | `actor_id` + `created_at` | Per-user action history |
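
A writer for this collection might validate the action type before building the document, as in the sketch below. `build_audit_entry` is a hypothetical helper, IDs are plain strings, and the status strings in the usage example are placeholders, since the real JobStatus values live in architecture.md.

```python
from datetime import datetime, timezone

# The ten action types listed in the table above.
AUDIT_ACTIONS = frozenset({
    "job_approved", "job_rejected", "qc_feedback_sent",
    "language_approved", "language_rejected", "linguist_assigned",
    "vtt_edited", "vtt_restored", "job_retry", "user_invited",
})


def build_audit_entry(actor_id, actor_email, action, org_id, job_id=None,
                      before_state=None, after_state=None, metadata=None):
    """Return an insert-ready `audit_logs` document, rejecting unknown actions."""
    if action not in AUDIT_ACTIONS:
        raise ValueError(f"unknown audit action: {action!r}")
    return {
        "actor_id": actor_id,
        "actor_email": actor_email,
        "action": action,
        "job_id": job_id,                 # nullable, e.g. for `user_invited`
        "org_id": org_id,
        "before_state": before_state,
        "after_state": after_state,
        "metadata": metadata or {},
        "created_at": datetime.now(timezone.utc),
    }


entry = build_audit_entry(
    "user-7", "reviewer@example.com", "job_rejected", "org-1", job_id="job-1",
    before_state="status_before",   # placeholder, not a real JobStatus value
    after_state="status_after",     # placeholder, not a real JobStatus value
    metadata={"reason": "Captions out of sync"},
)
```
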
---
### `invitations`
| Field | Type | Description |
|-------|------|-------------|
| _id | ObjectId | Primary key |
| email | string | Invitee email |
| org_id | ObjectId | Org being joined |
| role | string | Role to assign on accept |
| token | string | Unique invite token (hashed) |
| expires_at | datetime | Expiry timestamp; invites are valid for 7 days |
| accepted_at | datetime | Nullable — set on accept |
| created_by | ObjectId | User who sent invite |
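
The hashed token and 7-day expiry could work roughly as follows. This is a sketch under assumptions: SHA-256 as the hash, a 32-byte URL-safe token, and the helper names `hash_token`, `create_invitation` and `is_valid` are all hypothetical, since the document does not say how tokens are generated or hashed.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

INVITE_TTL = timedelta(days=7)  # 7-day expiry from the table above


def hash_token(raw_token):
    """Hash the raw token so only the digest is stored (SHA-256 is an assumption)."""
    return hashlib.sha256(raw_token.encode("utf-8")).hexdigest()


def create_invitation(email, org_id, role, created_by, now=None):
    """Return (raw_token, document); the raw token goes in the invite link only."""
    now = now or datetime.now(timezone.utc)
    raw_token = secrets.token_urlsafe(32)
    doc = {
        "email": email,
        "org_id": org_id,
        "role": role,
        "token": hash_token(raw_token),
        "expires_at": now + INVITE_TTL,
        "accepted_at": None,
        "created_by": created_by,
    }
    return raw_token, doc


def is_valid(doc, raw_token, now=None):
    """An invite is usable if it is unaccepted, unexpired, and the token matches."""
    now = now or datetime.now(timezone.utc)
    return (
        doc["accepted_at"] is None
        and now < doc["expires_at"]
        and secrets.compare_digest(doc["token"], hash_token(raw_token))
    )


issued_at = datetime(2026, 4, 29, 12, 0, tzinfo=timezone.utc)
raw, invitation = create_invitation("new.user@example.com", "org-1", "linguist", "user-3", now=issued_at)
```
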
---
## Maintenance
**Update triggers:** A collection is added, an index is added or removed, or a field is added to a model.
**Verification:** All collections listed here exist in production Atlas. Index names match the `create_indexes()` function in `backend/app/core/database.py` (currently commented out; the indexes were created manually).
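
Since the indexes were created by hand, a periodic drift check against this document may be useful. The sketch below compares expected index names with the names actually present; `EXPECTED_INDEX_NAMES` and `find_missing_indexes` are hypothetical, and it assumes the manually created indexes carry the names used in the tables above rather than MongoDB's auto-generated ones. The Atlas Vector Search index is left out because it is not a regular collection index.

```python
# Index names copied from the tables in this document (default `_id` indexes omitted).
EXPECTED_INDEX_NAMES = {
    "jobs": {"org_status", "client", "created", "status"},
    "users": {"email_unique", "org", "microsoft"},
    "organizations": {"slug_unique"},
    "glossaries": {"org"},
    "vtt_versions": {"job_lang", "job_lang_created"},
    "audit_logs": {"job", "org_created", "actor"},
}


def find_missing_indexes(expected, actual):
    """Return {collection: sorted missing index names}; `actual` maps collection to names."""
    missing = {}
    for collection, names in expected.items():
        absent = names - set(actual.get(collection, ()))
        if absent:
            missing[collection] = sorted(absent)
    return missing


# Illustrative input; in practice each entry could come from pymongo's
# `Collection.index_information()`, whose keys are index names.
actual = {
    "jobs": ["_id_", "org_status", "client", "created", "status"],
    "users": ["_id_", "email_unique", "org"],
    "organizations": ["_id_", "slug_unique"],
    "glossaries": ["_id_", "org"],
    "vtt_versions": ["_id_", "job_lang", "job_lang_created"],
    "audit_logs": ["_id_", "job", "actor"],
}
report = find_missing_indexes(EXPECTED_INDEX_NAMES, actual)
```
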
<!-- END SCOPE: database-schema -->