Vadym Samoilenko a3b300b76a docs: add canonical documentation + audit cleanup

- AGENTS.md: canonical project entry point (Quick Nav, pipeline, constraints)
- docs/: complete docs tree — architecture, API spec, DB schema, infra,
  runbook, requirements, tech stack, principles, reference ADRs, guides,
  tasks backlog, testing strategy
- tests/README.md: test commands, structure, known gaps
- README.md / CLAUDE.md / DEPLOYMENT.md: updated with canonical doc links
- .archive/: backup of pre-documentation-pipeline originals
- backend/uv.lock: uv dependency lockfile
- Delete committed __pycache__ .pyc files (should have been gitignored)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-29 14:22:51 +01:00


Database Schema — Accessible Video Processing Platform

Database: MongoDB Atlas

Database name: configured via the `MONGODB_DB` env var (default: `accessible_video`)
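As a minimal sketch of the configuration rule above, the database name could be resolved as follows. The helper name `resolve_db_name` is hypothetical, and treating an empty variable the same as an unset one is an assumption, not something the schema notes state.

```python
import os

# Default taken from the schema notes; the constant name is illustrative.
DEFAULT_DB_NAME = "accessible_video"


def resolve_db_name(env=None):
    """Return the database name from MONGODB_DB, or the documented default."""
    env = os.environ if env is None else env
    # Assumption: an empty value falls back to the default, like an unset one.
    return env.get("MONGODB_DB") or DEFAULT_DB_NAME
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in tests.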

Collections

`jobs`

Central document for each video accessibility job.

| Field | Type | Description |
| --- | --- | --- |
| _id | ObjectId | Primary key |
| org_id | ObjectId | Owning organisation |
| client_user_id | ObjectId | User who uploaded the video |
| status | string | JobStatus enum (16 values; see architecture.md) |
| source_language | string | BCP-47 code (e.g., `en-US`) |
| requested_outputs | array[string] | Output language codes |
| source | object | `{ gcs_path, filename, duration_seconds }` |
| outputs | object | Per-language `{ captions_vtt, ad_vtt, ad_mp3, accessible_mp4 }` GCS paths |
| review | object | QC state `{ reviewer_id, approved_at, rejected_at, reason }` |
| language_qc | object | Per-language QC state (see LanguageQCState below) |
| vtt_versions | array | Version snapshot references (see `vtt_versions` collection) |
| glossary_id | ObjectId | Client glossary to use for translation |
| retry_count | int | Number of task retries |
| error | string | Last error message |
| created_at | datetime | ISO 8601 |
| updated_at | datetime | ISO 8601 |
| completed_at | datetime | ISO 8601 |

LanguageQCState (per-language, nested in language_qc):

| Field | Type | Description |
| --- | --- | --- |
| status | string | `pending`, `assigned`, `approved`, `rejected`, `feedback_requested` |
| linguist_id | ObjectId | Assigned linguist (nullable) |
| assigned_at | datetime | When linguist was assigned |
| reviewed_at | datetime | When approved/rejected |
| reason | string | Rejection or feedback reason |
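The nested per-language state can be sketched as a small dataclass. This is illustrative only: it assumes `language_qc` is keyed by output language code and that new entries start as `pending`, neither of which is confirmed here. `linguist_id` is typed as `str` to avoid a `bson` dependency; in MongoDB it is an ObjectId.

```python
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional

# Status values as listed in the LanguageQCState table.
LANGUAGE_QC_STATUSES = {
    "pending", "assigned", "approved", "rejected", "feedback_requested",
}


@dataclass
class LanguageQCState:
    status: str = "pending"  # assumed initial value
    linguist_id: Optional[str] = None  # ObjectId in the real collection
    assigned_at: Optional[datetime] = None
    reviewed_at: Optional[datetime] = None
    reason: Optional[str] = None

    def __post_init__(self):
        # Reject any status that is not in the documented list.
        if self.status not in LANGUAGE_QC_STATUSES:
            raise ValueError(f"unknown QC status: {self.status!r}")


def initial_language_qc(requested_outputs):
    """Build a language_qc map with one default entry per output language."""
    return {lang: asdict(LanguageQCState()) for lang in requested_outputs}
```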

Indexes:

| Index | Fields | Purpose |
| --- | --- | --- |
| Primary | `_id` | Document lookup |
| org_status | `org_id` + `status` | List jobs by org and status |
| client | `client_user_id` | Client's own jobs |
| created | `created_at` (desc) | Time-sorted listing |
| status | `status` | Status-filtered queries |
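For illustration, the secondary `jobs` indexes could be declared as data and applied through PyMongo's `Collection.create_index`. The helper `ensure_jobs_indexes` is hypothetical, and ascending order is assumed wherever the table does not say "desc". The `_id` index is omitted because MongoDB creates it automatically.

```python
# 1 and -1 are the values of pymongo.ASCENDING and pymongo.DESCENDING.
ASC, DESC = 1, -1

JOBS_INDEXES = [
    {"name": "org_status", "keys": [("org_id", ASC), ("status", ASC)]},
    {"name": "client", "keys": [("client_user_id", ASC)]},
    {"name": "created", "keys": [("created_at", DESC)]},
    {"name": "status", "keys": [("status", ASC)]},
]


def ensure_jobs_indexes(jobs_collection):
    """Create each index and return the names reported by the driver.

    `jobs_collection` is expected to behave like pymongo's Collection,
    whose create_index(keys, name=...) returns the index name.
    """
    return [
        jobs_collection.create_index(spec["keys"], name=spec["name"])
        for spec in JOBS_INDEXES
    ]
```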

`users`

| Field | Type | Description |
| --- | --- | --- |
| _id | ObjectId | Primary key |
| email | string | Unique, lowercase |
| hashed_password | string | bcrypt hash (null for SSO-only users) |
| role | string | `client`, `reviewer`, `linguist`, `pm`, `admin` |
| org_id | ObjectId | Primary organisation |
| is_active | boolean | Account enabled flag |
| microsoft_id | string | Entra ID subject claim (nullable) |
| created_at | datetime | |
| updated_at | datetime | |

Indexes:

| Index | Fields | Purpose |
| --- | --- | --- |
| email_unique | `email` (unique) | Login lookup |
| org | `org_id` | Members-of-org query |
| microsoft | `microsoft_id` (sparse) | SSO user lookup |
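Because `email` is unique and stored in lowercase, writes and login lookups need to normalise the address the same way. The sketch below shows one way to do that. The function names are hypothetical, defaulting `is_active` to `True` is an assumption, and IDs are plain strings here rather than ObjectIds.

```python
from datetime import datetime, timezone

# Role values as listed in the users table.
USER_ROLES = ("client", "reviewer", "linguist", "pm", "admin")


def normalize_email(email):
    """Trim whitespace and lowercase so the unique index sees one form."""
    return email.strip().lower()


def new_user_document(email, role, org_id, hashed_password=None,
                      microsoft_id=None, now=None):
    """Build a users document; hashed_password stays None for SSO-only users."""
    if role not in USER_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    now = now or datetime.now(timezone.utc)
    return {
        "email": normalize_email(email),
        "hashed_password": hashed_password,
        "role": role,
        "org_id": org_id,
        "is_active": True,  # assumed default
        "microsoft_id": microsoft_id,
        "created_at": now,
        "updated_at": now,
    }
```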

`organizations`

| Field | Type | Description |
| --- | --- | --- |
| _id | ObjectId | Primary key |
| name | string | Organisation display name |
| slug | string | URL-safe identifier |
| member_ids | array[ObjectId] | User IDs in this org |
| created_at | datetime | |

Indexes:

| Index | Fields | Purpose |
| --- | --- | --- |
| slug_unique | `slug` (unique) | Org lookup by slug |

`glossaries`

| Field | Type | Description |
| --- | --- | --- |
| _id | ObjectId | Primary key |
| org_id | ObjectId | Owning organisation |
| name | string | Glossary display name |
| terms | array | Array of GlossaryTerm documents |
| created_at | datetime | |
| updated_at | datetime | |

GlossaryTerm (embedded in terms):

| Field | Type | Description |
| --- | --- | --- |
| _id | ObjectId | Term ID |
| source_term | string | Term in source language |
| target_language | string | BCP-47 code |
| preferred_translation | string | Required translation |
| context | string | Usage notes (optional) |
| embedding | array[float] | Vector embedding for similarity search |

Indexes:

| Index | Fields | Purpose |
| --- | --- | --- |
| org | `org_id` | List org glossaries |
| vector | `terms.embedding` (Atlas Vector Search) | Similarity retrieval |

Atlas Vector Search index name: glossary_embedding_index
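To make the idea of similarity retrieval over `terms.embedding` concrete, here is a local stand-in that ranks glossary terms by cosine similarity in plain Python. It is not the Atlas Vector Search query the platform uses. The function names and the choice of cosine similarity are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_terms(glossary, query_embedding, target_language, k=3):
    """Return the k terms for one target language closest to the query."""
    candidates = [
        term for term in glossary["terms"]
        if term["target_language"] == target_language
    ]
    ranked = sorted(
        candidates,
        key=lambda term: cosine_similarity(term["embedding"], query_embedding),
        reverse=True,
    )
    return ranked[:k]
```

A sketch like this can serve as a reference when checking results from the hosted index on a small glossary.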

`vtt_versions`

Immutable version snapshots created before each VTT save.

| Field | Type | Description |
| --- | --- | --- |
| _id | ObjectId | Primary key |
| job_id | ObjectId | Parent job |
| language | string | Language code |
| version_number | int | Sequential version number |
| content | string | Full VTT file content at time of snapshot |
| author_id | ObjectId | User who made the change |
| created_at | datetime | Snapshot timestamp |
| diff_from_prev | string | Diff against previous version (optional) |

Indexes:

| Index | Fields | Purpose |
| --- | --- | --- |
| job_lang | `job_id` + `language` + `version_number` | Version history listing |
| job_lang_created | `job_id` + `language` + `created_at` (desc) | Time-sorted history |
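A snapshot document with a sequential `version_number` and an optional `diff_from_prev` could be assembled as sketched below. The helper `build_snapshot` is hypothetical. Starting at version 1 and using a unified diff from the standard library's `difflib` are assumptions, since the diff format is not specified here.

```python
import difflib
from datetime import datetime, timezone


def build_snapshot(job_id, language, content, author_id, previous=None, now=None):
    """Build a vtt_versions document; `previous` is the latest snapshot, if any."""
    version_number = previous["version_number"] + 1 if previous else 1
    diff = None
    if previous is not None:
        # Line-based unified diff between the previous and the new content.
        diff = "".join(difflib.unified_diff(
            previous["content"].splitlines(keepends=True),
            content.splitlines(keepends=True),
            fromfile=f"v{version_number - 1}",
            tofile=f"v{version_number}",
        ))
    return {
        "job_id": job_id,
        "language": language,
        "version_number": version_number,
        "content": content,
        "author_id": author_id,
        "created_at": now or datetime.now(timezone.utc),
        "diff_from_prev": diff,
    }
```

Since snapshots are immutable, a sketch like this only ever produces new documents and never modifies `previous`.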

`audit_logs`

Immutable audit trail for all reviewer, linguist, and PM actions.

| Field | Type | Description |
| --- | --- | --- |
| _id | ObjectId | Primary key |
| actor_id | ObjectId | User performing the action |
| actor_email | string | Denormalised for readability |
| action | string | Action type enum (see below) |
| job_id | ObjectId | Affected job (nullable) |
| org_id | ObjectId | Organisation context |
| before_state | string | Job status before action |
| after_state | string | Job status after action |
| metadata | object | Action-specific context (reason, language, etc.) |
| created_at | datetime | Event timestamp |

Action types:

| Action | Trigger |
| --- | --- |
| `job_approved` | QC approve |
| `job_rejected` | QC reject |
| `qc_feedback_sent` | QC feedback |
| `language_approved` | Language-level QC approve |
| `language_rejected` | Language-level QC reject |
| `linguist_assigned` | PM assigns linguist |
| `vtt_edited` | VTT content saved |
| `vtt_restored` | Version restore |
| `job_retry` | Admin manual retry |
| `user_invited` | PM/Admin invites member |

Indexes:

| Index | Fields | Purpose |
| --- | --- | --- |
| job | `job_id` + `created_at` | Per-job audit trail |
| org_created | `org_id` + `created_at` (desc) | Org-level audit log |
| actor | `actor_id` + `created_at` | Per-user action history |
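The `audit_logs` fields and action types above can be combined into a small builder that rejects unknown actions before anything is written. The function name `build_audit_entry` is hypothetical, and IDs are plain strings here rather than ObjectIds.

```python
from datetime import datetime, timezone

# Action names as listed in the action types table.
AUDIT_ACTIONS = frozenset({
    "job_approved", "job_rejected", "qc_feedback_sent",
    "language_approved", "language_rejected", "linguist_assigned",
    "vtt_edited", "vtt_restored", "job_retry", "user_invited",
})


def build_audit_entry(actor_id, actor_email, action, org_id, job_id=None,
                      before_state=None, after_state=None, metadata=None,
                      now=None):
    """Build one audit_logs document, validating the action name."""
    if action not in AUDIT_ACTIONS:
        raise ValueError(f"unknown audit action: {action!r}")
    return {
        "actor_id": actor_id,
        "actor_email": actor_email,
        "action": action,
        "job_id": job_id,  # nullable, e.g. for user_invited
        "org_id": org_id,
        "before_state": before_state,
        "after_state": after_state,
        "metadata": dict(metadata or {}),
        "created_at": now or datetime.now(timezone.utc),
    }
```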

`invitations`

| Field | Type | Description |
| --- | --- | --- |
| _id | ObjectId | Primary key |
| email | string | Invitee email |
| org_id | ObjectId | Org being joined |
| role | string | Role to assign on accept |
| token | string | Unique invite token (hashed) |
| expires_at | datetime | 7-day expiry |
| accepted_at | datetime | Nullable; set on accept |
| created_by | ObjectId | User who sent invite |
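The hashed token and 7-day expiry could work as sketched below. The schema only says the token is hashed, so SHA-256 is an assumption, as are the helper names and the 32-byte token length.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

INVITE_TTL = timedelta(days=7)  # 7-day expiry from the invitations table


def hash_token(raw_token):
    """Hash the raw token for storage (SHA-256 is an assumed choice)."""
    return hashlib.sha256(raw_token.encode("utf-8")).hexdigest()


def new_invitation(email, org_id, role, created_by, now=None):
    """Return (raw_token, document); only the hash is stored in the document."""
    now = now or datetime.now(timezone.utc)
    raw_token = secrets.token_urlsafe(32)
    document = {
        "email": email,
        "org_id": org_id,
        "role": role,
        "token": hash_token(raw_token),
        "expires_at": now + INVITE_TTL,
        "accepted_at": None,
        "created_by": created_by,
    }
    return raw_token, document


def is_valid(document, raw_token, now=None):
    """True if the invitation is unused, unexpired, and the token matches."""
    now = now or datetime.now(timezone.utc)
    return (
        document["accepted_at"] is None
        and now < document["expires_at"]
        and secrets.compare_digest(document["token"], hash_token(raw_token))
    )
```

The raw token would go into the invite link, while the stored document never contains it.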

Maintenance

Update triggers: a new collection is added, an index is added or removed, or a field is added to a model.

Verification: all collections listed here exist in production Atlas. Index names match the `create_indexes()` function in `backend/app/core/database.py` (currently commented out; indexes were created manually).
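The index-name verification could be partly automated as sketched below, using PyMongo's `Collection.index_information()`, which returns a dict keyed by index name. The function `missing_indexes` is hypothetical. The Atlas Vector Search index is left out because search indexes are managed separately from regular indexes. Manually created indexes may carry different names than the ones in this document, so a reported gap is a prompt to check rather than proof that an index is absent.

```python
# Expected secondary index names per collection, copied from the tables above.
EXPECTED_INDEX_NAMES = {
    "jobs": {"org_status", "client", "created", "status"},
    "users": {"email_unique", "org", "microsoft"},
    "organizations": {"slug_unique"},
    "glossaries": {"org"},
    "vtt_versions": {"job_lang", "job_lang_created"},
    "audit_logs": {"job", "org_created", "actor"},
}


def missing_indexes(db):
    """Return {collection: [missing index names]} for a pymongo-like database."""
    missing = {}
    for collection_name, expected in EXPECTED_INDEX_NAMES.items():
        # index_information() maps each existing index name to its details.
        existing = set(db[collection_name].index_information())
        gap = expected - existing
        if gap:
            missing[collection_name] = sorted(gap)
    return missing
```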
