nickviljoen eb31ac1498 Initial Commit

2025-10-15 16:25:04 +02:00

Video Master-Adaptation Detection - Technical Documentation

Overview
How It Works
Architecture
Matching Algorithm
CLI Reference
Batch Matching & HTML Reports
Advanced Usage
Understanding Results
Performance Tuning
Troubleshooting
API Reference

Overview

This tool identifies which master video files were used to create adaptation videos (cutdowns, re-edits, speed changes, crops, etc.). It uses spatial-only matching that compares video content regardless of temporal order, making it robust to:

Speed changes (slow-motion, time-lapse, speed ramping)
Duration changes (15s adaptation from 20s master)
Shot reordering (non-linear edits)
Different aspect ratios (with separate masters per aspect ratio)
Cropping and transformations
Re-encoding and compression

Key Features

✅ Spatial-only video matching - Ignores timing, focuses on content
✅ Audio fingerprinting - Chromaprint-based robust audio matching
✅ Multi-master detection - Identifies all masters used in an adaptation
✅ Percentage contribution - Shows how much of each master was used
✅ Confidence scoring - Weighted scoring combining video + audio
✅ Batch processing - Bulk add masters from directories

How It Works

1. Fingerprinting Phase

When you add a master video or match an adaptation, the tool:

Extracts frames at 2 frames per second (default, configurable)
Creates perceptual hashes (8×8 DCT-based hashing)
Extracts audio fingerprint using Chromaprint (if available)
Stores fingerprints as JSON files for future comparisons
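
The hashing step can be sketched in pure Python. This is a minimal illustration of a DCT-based 8×8 perceptual hash, not the tool's actual code: the function name `dct_phash`, the median threshold, and the assumption that each frame has already been decoded and downscaled to a small square grayscale grid (32×32 here) are all hypothetical.

```python
import math


def dct_phash(gray, hash_size=8):
    """Return a 64-bit hash string for a square grayscale image (list of rows)."""
    n = len(gray)
    # cos_table[u][x] = cos((2x + 1) * u * pi / (2n)), the DCT-II basis
    cos_table = [
        [math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n)]
        for u in range(hash_size)
    ]
    # Keep only the low-frequency hash_size x hash_size block of the 2-D DCT
    coeffs = []
    for u in range(hash_size):
        for v in range(hash_size):
            total = 0.0
            for y in range(n):
                row, cu = gray[y], cos_table[u][y]
                for x in range(n):
                    total += row[x] * cu * cos_table[v][x]
            coeffs.append(total)
    # Threshold each coefficient against the median of the AC terms
    ac = sorted(coeffs[1:])
    median = ac[len(ac) // 2]
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > median else 0)
    return f"0x{bits:016x}"


# Synthetic 32x32 "frame" standing in for a decoded, downscaled video frame
frame = [[(x * x * 7 + y * 13 + x * y * 3) % 256 for x in range(32)] for y in range(32)]
print(dct_phash(frame))
```

The output has the same shape as the `hash` strings stored in the fingerprint files, so two such hashes can be compared bit by bit.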

2. Matching Phase

When matching an adaptation against masters:

Generates adaptation fingerprint (same process as masters)
Spatial comparison: For each adaptation frame, finds the most similar frame in each master (anywhere in the timeline)
Calculates percentage: (matching adaptation frames / total adaptation frames) × 100%
Combines signals: Weighted combination of video (70%) + audio (30%)
Ranks results: Sorted by combined confidence score

Key Insight: Spatial-Only Matching

Traditional video matching fails when adaptations are:

Speed-changed (frames at different timestamps)
Reordered (shots in different sequence)
Edited (missing sections, insertions)

Solution: We ask "Does this frame exist ANYWHERE in the master?" instead of "Does this frame exist at timestamp T?"

This makes matching robust to timing changes while still accurately identifying source content.
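
The difference between the two questions can be shown with a toy example. The sketch below uses made-up 64-bit hashes and hypothetical helper names (`aligned_match_pct`, `spatial_match_pct`); it reverses the shot order of a four-frame master and compares timestamp-aligned matching with spatial-only matching.

```python
def similarity(a, b, bits=64):
    """1.0 for identical hashes, 0.0 when every bit differs."""
    return 1.0 - bin(a ^ b).count("1") / bits


def aligned_match_pct(adapt, master, threshold=0.70):
    """Compare frame i of the adaptation only with frame i of the master."""
    hits = sum(1 for a, m in zip(adapt, master) if similarity(a, m) >= threshold)
    return 100.0 * hits / len(adapt)


def spatial_match_pct(adapt, master, threshold=0.70):
    """Compare each adaptation frame with every master frame, ignoring time."""
    hits = sum(1 for a in adapt if max(similarity(a, m) for m in master) >= threshold)
    return 100.0 * hits / len(adapt)


# Four visually distinct "frames" (toy hashes, not real video data)
master = [0x0000000000000000, 0xFFFFFFFF00000000, 0x00000000FFFFFFFF, 0xFFFFFFFFFFFFFFFF]
adaptation = list(reversed(master))  # same shots, opposite order

print(aligned_match_pct(adaptation, master))  # 0.0
print(spatial_match_pct(adaptation, master))  # 100.0
```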

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        CLI Layer (cli.py)                       │
│  Commands: add-master, list-masters, match, clear, status       │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│                   Matcher Layer (matcher.py)                    │
│  • Loads fingerprints                                           │
│  • Orchestrates comparison                                      │
│  • Calculates percentages & confidence                          │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│              Fingerprinter Layer (fingerprinter.py)             │
│  • Video frame extraction (FFmpeg)                              │
│  • Perceptual hashing (8×8 DCT)                                 │
│  • Audio fingerprinting (Chromaprint)                           │
│  • Spatial-only comparison                                      │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│                     Storage Layer                               │
│  • data/fingerprints/*.json - Fingerprint files                 │
│  • data/masters.json - Master video database                    │
└─────────────────────────────────────────────────────────────────┘

Core Components

1. `VideoFingerprinter` (fingerprinter.py)

Extracts video frames and generates perceptual hashes
Creates audio fingerprints using Chromaprint
Supports configurable sampling rate (frames per second)
Stores fingerprints as JSON for reuse

2. `VideoMatcher` (matcher.py)

Manages master video database
Performs spatial-only matching
Calculates percentage contributions
Generates confidence scores

3. `CLI` (cli.py)

User-facing command-line interface
Rich terminal output with tables and colors
Progress bars for batch operations

Matching Algorithm

Spatial-Only Video Matching

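def hash_similarity(hash_a, hash_b, bits=64):
    """Similarity in [0, 1]: 1 minus the normalized Hamming distance."""
    distance = bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")
    return 1.0 - distance / bits


def compare_spatial_only(adaptation_fp, master_fp, threshold=0.70):
    # Simplified version; frame lists follow the fingerprint JSON structure
    adaptation_frames = adaptation_fp["video_fp"]["frames"]
    master_frames = master_fp["video_fp"]["frames"]
    if not adaptation_frames:
        return 0.0

    matches = 0
    for adapt_frame in adaptation_frames:
        best_similarity = 0.0

        # Compare against ALL master frames (ignore time)
        for master_frame in master_frames:
            similarity = hash_similarity(adapt_frame["hash"], master_frame["hash"])
            best_similarity = max(best_similarity, similarity)

        # Count the frame if its best match anywhere in the master is close enough
        if best_similarity >= threshold:
            matches += 1

    return (matches / len(adaptation_frames)) * 100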

Key Parameters

Parameter	Default	Description
`samples_per_second`	2.0	Frames extracted per second (configurable in code)
`frame_threshold`	0.70	Minimum similarity for frame match (0-1)
`threshold`	0.30	Minimum fraction of adaptation frames that must match for a master to be reported (0-1)

Confidence Calculation

combined_score = (video_percentage / 100 × 0.7) + (audio_similarity × 0.3)

Confidence Levels:
- Very High: combined_score ≥ 0.90
- High:      combined_score ≥ 0.75
- Medium:    combined_score ≥ 0.60
- Low:       combined_score ≥ 0.50
- Very Low:  combined_score < 0.50
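
The formula and levels above translate directly into code. The following is a small sketch with hypothetical function names (`combined_score`, `confidence_level`), checked against the two rows of the sample output shown under Understanding Results.

```python
def combined_score(video_percentage, audio_similarity):
    """Weighted combination: 70% video match fraction, 30% audio similarity."""
    return (video_percentage / 100) * 0.7 + audio_similarity * 0.3


def confidence_level(score):
    """Map a combined score (0-1) to the documented confidence label."""
    for floor, label in ((0.90, "Very High"), (0.75, "High"), (0.60, "Medium"), (0.50, "Low")):
        if score >= floor:
            return label
    return "Very Low"


for video_pct, audio in ((100.0, 0.5), (73.3, 0.5)):
    score = combined_score(video_pct, audio)
    print(round(score, 3), confidence_level(score))
# 0.85 High
# 0.663 Medium
```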

CLI Reference

`add-master` - Add Master Video

Add a master video to the library.

python cli.py add-master <video_path> [--id <custom_id>]

Examples:

# Auto-generate ID from filename
python cli.py add-master /path/to/master.mp4

# Use custom ID
python cli.py add-master /path/to/master.mp4 --id master_v1

`list-masters` - List All Masters

Display all master videos in the library.

python cli.py list-masters

Output:

Master ID
Filename
Duration
File path

`match` - Match Adaptation Video

Match an adaptation against all masters using spatial-only matching.

python cli.py match <video_path> [OPTIONS]

Options:

--threshold, -t (default: 0.3): Minimum fraction of adaptation frames that must match (0-1)
--frame-threshold, -f (default: 0.70): Similarity threshold for individual frames (0-1)

Examples:

# Default matching
python cli.py match /path/to/adaptation.mp4

# Stricter matching (require 50% of frames)
python cli.py match /path/to/adaptation.mp4 -t 0.5

# More sensitive frame matching
python cli.py match /path/to/adaptation.mp4 -f 0.65

# Combined: require 70% match with sensitive frame detection
python cli.py match /path/to/adaptation.mp4 -t 0.7 -f 0.65

`status` - System Status

Check system dependencies and library statistics.

python cli.py status

Shows:

FFmpeg availability
Chromaprint/AcoustID status
TMK status
Number of master videos

`batch-match` - Batch Match Folder

Match all videos in a folder and generate an HTML report.

python cli.py batch-match <folder_path> [OPTIONS]

Options:

--threshold, -t (default: 0.3): Minimum fraction of adaptation frames that must match (0-1)
--frame-threshold, -f (default: 0.70): Frame similarity threshold (0-1)
--output, -o: Output HTML file path (default: auto-generated timestamp)

Examples:

# Process all videos in folder
python cli.py batch-match /path/to/adaptations/

# Custom thresholds
python cli.py batch-match /path/to/adaptations/ -t 0.5 -f 0.75

# Custom output filename
python cli.py batch-match /path/to/adaptations/ -o report.html

Output:

Generates timestamped HTML report: matching_report_YYYYMMDD_HHMMSS.html
Shows summary statistics in terminal
Provides clickable file:// URL to open report

`clear` - Clear Library

Remove all master videos from the library.

python cli.py clear

⚠️ Warning: This deletes all fingerprints and master records. Cannot be undone.

Batch Matching & HTML Reports

Overview

The batch matching feature allows you to process an entire folder of adaptation videos and generate a comprehensive HTML report showing which masters were used for each adaptation.

Usage

Command Line:

# Basic usage
python cli.py batch-match /path/to/adaptations/

# With custom thresholds
python cli.py batch-match /path/to/adaptations/ -t 0.5 -f 0.75

# Specify output filename
python cli.py batch-match /path/to/adaptations/ -o my_report.html

Standalone Script:

# You can also use the standalone script
python batch_match.py /path/to/adaptations/
python batch_match.py /path/to/adaptations/ --output reports/batch_results.html

HTML Report Features

The generated HTML report includes:

1. Summary Dashboard

Total adaptations processed
Number of matched adaptations
Number with no matches
Total master matches across all adaptations

2. Per-Adaptation Cards

Each adaptation is shown in a card with:

Adaptation filename
Number of matches badge
List of all matching masters
Error message (if processing failed)

3. Per-Master Match Details

For each matching master:

Master ID and filename
Color-coded confidence badge:
- 🟢 Green - Very High/High confidence
- 🟡 Yellow - Medium confidence
- 🔴 Red - Low/Very Low confidence
Master duration
Video match percentage
Frames matched (X/Y format)
Combined confidence score
Visual progress bar showing match percentage

4. Design Features

Modern gradient design (purple theme)
Responsive layout (works on mobile/tablet/desktop)
Hover effects on cards
Print-friendly styling
Clean, professional appearance
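
The four summary dashboard figures can be derived from the per-adaptation results in one pass. This sketch assumes a hypothetical result shape (a list of dicts with a `matches` list and an optional `error` string); the real structure inside batch_match.py may differ.

```python
def summarize(results):
    """Compute the summary dashboard counts from per-adaptation results."""
    matched = sum(1 for r in results if r.get("matches"))
    return {
        "total_adaptations": len(results),
        "matched": matched,
        "no_matches": len(results) - matched,  # includes adaptations that failed to process
        "total_master_matches": sum(len(r.get("matches", [])) for r in results),
    }


# Hypothetical batch results for three adaptations
results = [
    {"filename": "ad_15s.mp4", "matches": ["master_C", "master_B"]},
    {"filename": "ad_6s.mp4", "matches": ["master_A"]},
    {"filename": "broken.mp4", "matches": [], "error": "could not read file"},
]
print(summarize(results))
```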

Example Workflow

# 1. Add all masters
python bulk_add_masters.py "masters/" -r

# 2. Process all adaptations
python cli.py batch-match "adaptations/"

# 3. Open the generated report
open matching_report_20251010_153045.html

# 4. Review results:
#    - Which adaptations matched which masters
#    - Confidence levels for each match
#    - Any processing errors

Use Cases

Quality Control:

Verify adaptations were created from correct masters
Check if all expected masters were used
Identify adaptations with low confidence matches

Production Tracking:

Document which masters were used for each delivery
Generate audit trail of master usage
Track adaptation creation workflow

Asset Management:

Identify unused masters
Find duplicate or similar adaptations
Organize video library by source masters

Report Customization

The HTML report can be customized by editing batch_match.py:

# Line 23: Change color scheme
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);

# Line 80: Adjust card styling
.adaptation {
    background: white;
    padding: 25px;
    border-radius: 15px;
}

# Line 150: Modify confidence colors
.confidence-very-high { background: #51cf66; }
.confidence-high { background: #69db7c; }

Advanced Usage

Bulk Adding Masters

Use the bulk_add_masters.py script to add multiple videos at once:

# Add all .mp4 files from a directory
python bulk_add_masters.py /path/to/masters/

# Recursively add from subdirectories
python bulk_add_masters.py /path/to/masters/ --recursive

# Add specific pattern
python bulk_add_masters.py /path/to/masters/ --pattern "*.mov"

Adjusting Sampling Rate

The default is 2 frames per second, optimized for fast-paced advertising content with quick edits.

Edit src/video_matcher/fingerprinter.py:106:

samples_per_second = 2.0  # Default: good for ads with quick cuts
samples_per_second = 1.0  # Faster: basic matching, may miss quick edits
samples_per_second = 3.0  # Slower: catches sub-second cuts

Trade-offs:

Rate	Frames (20 s video)	Use Case	Pros	Cons
0.5 fps	10 frames	Long-form content	Fast, small files	May miss cuts
1.0 fps	20 frames	General purpose	Balanced	Misses quick edits
2.0 fps	40 frames	Ads/Marketing	Catches quick cuts	2x storage vs 1.0 fps
3.0 fps	60 frames	Sub-second cuts	Very detailed	3x slower

Recommendation: Keep 2 fps for advertising/marketing content with fast edits.
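
The frame counts in the table, and the cost of the brute-force comparison, follow from simple arithmetic. The sketch below assumes frame count is roughly duration × sampling rate (the tool may differ by a frame at the boundaries) and that every adaptation frame is compared with every master frame, as in the matching algorithm. The function names are illustrative only.

```python
def frame_count(duration_s, samples_per_second):
    """Approximate number of sampled frames for a clip."""
    return int(duration_s * samples_per_second)


def comparisons_per_master(adapt_duration_s, master_duration_s, samples_per_second):
    """Hash comparisons needed to match one adaptation against one master."""
    return frame_count(adapt_duration_s, samples_per_second) * frame_count(
        master_duration_s, samples_per_second
    )


# A 15 s adaptation matched against a 20 s master at each sampling rate
for rate in (0.5, 1.0, 2.0, 3.0):
    print(rate, frame_count(20, rate), comparisons_per_master(15, 20, rate))
```

Because both clips are sampled at the same rate, doubling the rate doubles the stored frames per video but roughly quadruples the comparisons per master.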

Handling Different Aspect Ratios

Best Practice: Maintain separate masters for each aspect ratio:

masters/
├── 16x9/
│   ├── master_A_16x9.mp4
│   ├── master_B_16x9.mp4
├── 9x16/
│   ├── master_A_9x16.mp4
│   ├── master_B_9x16.mp4
└── 1x1/
    ├── master_A_1x1.mp4
    └── master_B_1x1.mp4

Add all versions to the library:

python bulk_add_masters.py masters/16x9/ -r
python bulk_add_masters.py masters/9x16/ -r
python bulk_add_masters.py masters/1x1/ -r

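Because frames are compared by perceptual hash, the master in the same aspect ratio as the adaptation should score highest, so no aspect-ratio flag is needed when matching.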

Understanding Results

Sample Output

Found 2 master(s) matching this adaptation:

╭──────┬────────────┬─────────────┬────────┬───────┬──────────┬────────────╮
│ Rank │ Master ID  │ Video Match │ Frames │ Audio │ Combined │ Confidence │
├──────┼────────────┼─────────────┼────────┼───────┼──────────┼────────────┤
│    1 │ master_C   │      100.0% │ 15/15  │ 0.500 │    0.850 │ High       │
│    2 │ master_B   │       73.3% │ 11/15  │ 0.500 │    0.663 │ Medium     │
╰──────┴────────────┴─────────────┴────────┴───────┴──────────┴────────────╯

Best Match:
  Master: master_C
  Video frames matched: 100.0% (15/15 frames)
  Average frame similarity: 94.4%
  Audio similarity: 0.500
  Combined confidence: 85.0%

Interpreting Scores

Video Match Percentage:

100%: All adaptation frames found in master
75-99%: Most frames match, likely correct master
50-74%: Partial match, possibly similar content
<50%: Unlikely to be source master

Average Frame Similarity:

>90%: Near-identical frames (same encoding/quality)
75-90%: Very similar (different encoding/compression)
60-75%: Similar content (crops, color grading)
<60%: Different content or heavy transformations

Combined Score:

Weighted combination: 70% video + 30% audio
Audio helps disambiguate visually similar masters
Higher combined score = more confident match

When Multiple Masters Match

If an adaptation uses content from multiple masters:

Best Match:
  Master: master_A - 60% of frames

Other Potential Matches:
  • master_B: 40% of frames

This indicates the adaptation combined:

60% content from master_A
40% content from master_B
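
A 60/40 split like this arises when each master is scored independently against the same adaptation. The toy sketch below uses made-up 64-bit hashes and a hypothetical `match_pct` helper; the adaptation takes three frames from master_A and two from master_B.

```python
def similarity(a, b, bits=64):
    """1.0 for identical hashes, 0.0 when every bit differs."""
    return 1.0 - bin(a ^ b).count("1") / bits


def match_pct(adapt, master, threshold=0.70):
    """Percentage of adaptation frames found anywhere in the master."""
    hits = sum(1 for a in adapt if max(similarity(a, m) for m in master) >= threshold)
    return 100.0 * hits / len(adapt)


# Toy hashes chosen so no master_A frame resembles any master_B frame
master_a = [0x0000000000000000, 0x00000000FFFFFFFF, 0x0000FFFF0000FFFF]
master_b = [0xFFFFFFFFFFFFFFFF, 0xFFFF0000FFFF0000]

# Adaptation intercuts the two masters: A, B, A, B, A
adaptation = [master_a[0], master_b[0], master_a[1], master_b[1], master_a[2]]

print(match_pct(adaptation, master_a))  # 60.0
print(match_pct(adaptation, master_b))  # 40.0
```

Since each master is scored on its own, the percentages only sum to 100% when the masters share no similar frames; masters with overlapping footage can both score high, as in the sample output above.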

Performance Tuning

Speed vs Accuracy

For faster matching (lower accuracy):

# Reduce sampling rate (1.0 = 1 frame per second)
samples_per_second = 1.0

# Increase thresholds (stricter matching)
frame_threshold = 0.75
threshold = 0.5

For better accuracy (slower):

# Increase sampling rate (3.0 = 3 frames per second)
samples_per_second = 3.0

# Lower thresholds (more sensitive)
frame_threshold = 0.65
threshold = 0.3

Default (balanced for ads):

samples_per_second = 2.0  # Catches quick edits
frame_threshold = 0.70
threshold = 0.3

Large Libraries

For libraries with 100+ masters:

Pre-filter by duration:
- Skip masters that are too short/long for the adaptation
Use audio pre-filtering:
- Match audio first, then only check video for audio matches
Parallel processing:
- Compare against multiple masters simultaneously
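
Two of these ideas, duration pre-filtering and parallel comparison, can be sketched as follows. Nothing here exists in the tool today: the function names, the ratio bounds, and the `compare` callback are assumptions for illustration. The master records follow the `master_id` / `duration` fields of masters.json.

```python
from concurrent.futures import ThreadPoolExecutor


def prefilter_by_duration(masters, adaptation_duration, min_ratio=0.5, max_ratio=10.0):
    """Keep masters whose duration is plausible for the adaptation.

    The bounds are illustrative; slow-motion or heavily cut-down adaptations
    may need a wider range.
    """
    kept = []
    for master in masters:
        ratio = master["duration"] / adaptation_duration
        if min_ratio <= ratio <= max_ratio:
            kept.append(master)
    return kept


def match_in_parallel(masters, compare, max_workers=4):
    """Run compare(master) for each master concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compare, masters))


masters = [
    {"master_id": "sting_5s", "duration": 5.0},
    {"master_id": "spot_20s", "duration": 20.0},
    {"master_id": "longform_10min", "duration": 600.0},
]
candidates = prefilter_by_duration(masters, adaptation_duration=15.0)
# Placeholder comparison: a real one would load fingerprints and compare hashes
scores = match_in_parallel(candidates, compare=lambda m: (m["master_id"], 0.0))
print(scores)
```

Threads help most when the comparison spends its time in I/O or native code; for pure-Python hash comparison, `ProcessPoolExecutor` from the same module is the usual alternative.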

Troubleshooting

Common Issues

❌ No matches found

Cause: Thresholds too strict, or videos unrelated

Solution:

# Try more lenient settings
python cli.py match video.mp4 -t 0.2 -f 0.65

❌ Too many false positives

Cause: Thresholds too lenient, similar-looking content

Solution:

# Stricter matching
python cli.py match video.mp4 -t 0.5 -f 0.75

❌ Speed-changed adaptations not matching

Cause: Timing is not the problem, because spatial-only matching already ignores it. A failed match usually means the frames themselves differ too much.

Check:

Ensure video content is actually similar
Lower frame_threshold if heavily processed

❌ Different aspect ratios not matching

Solution: Ensure you have masters in the same aspect ratio

# Add masters for each aspect ratio
python cli.py add-master master_16x9.mp4
python cli.py add-master master_1x1.mp4

❌ Audio similarity always 0.500

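Cause: Chromaprint comparison not fully implemented (placeholder)

Note: This is a POC limitation. Video matching still works. With audio fixed at 0.500, the combined score cannot exceed 0.85 (0.7 × 1.0 + 0.3 × 0.5), so the best reachable confidence level is High until audio comparison is implemented.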

API Reference

VideoFingerprinter

from video_matcher.fingerprinter import VideoFingerprinter

fp = VideoFingerprinter(data_dir="data/fingerprints")

# Generate fingerprint
fingerprint = fp.fingerprint_video(
    video_path="/path/to/video.mp4",
    video_id="my_video"
)

# Load existing fingerprint
existing = fp.load_fingerprint("my_video")

# List all fingerprints
all_ids = fp.list_fingerprints()

VideoMatcher

from video_matcher.matcher import VideoMatcher

matcher = VideoMatcher(data_dir="data")

# Add master
matcher.add_master(
    video_path="/path/to/master.mp4",
    master_id="master_1"
)

# List masters
masters = matcher.list_masters()

# Match adaptation
matches = matcher.match_adaptation(
    video_path="/path/to/adaptation.mp4",
    threshold=0.3,
    frame_threshold=0.70
)

# Clear all masters
matcher.clear_masters()

Comparison Functions

from video_matcher.fingerprinter import (
    compare_spatial_only,
    compare_audio_fingerprints
)

# Spatial video comparison
result = compare_spatial_only(
    adaptation_fp=adapt_fp,
    master_fp=master_fp,
    similarity_threshold=0.75
)
# Returns: {
#   'matching_frames': 12,
#   'total_frames': 15,
#   'percentage': 80.0,
#   'average_similarity': 0.87
# }

# Audio comparison
audio_score = compare_audio_fingerprints(
    fp1=adapt_audio,
    fp2=master_audio
)
# Returns: float (0-1)

File Formats

Fingerprint JSON Structure

{
  "video_id": "master_example",
  "path": "/path/to/video.mp4",
  "filename": "video.mp4",
  "info": {
    "duration": 20.0,
    "width": 1920,
    "height": 1080,
    "fps": 25.0,
    "has_audio": true,
    "codec": "h264"
  },
  "audio_fp": {
    "duration": 20.0,
    "fingerprint": "AQAAZEw4Kc9w...",
    "method": "chromaprint"
  },
  "video_fp": {
    "method": "basic_hash",
    "samples_per_second": 1.0,
    "num_frames": 20,
    "frames": [
      {
        "frame_id": 0,
        "timestamp": 0.0,
        "hash": "0xcfcfc7e3c3e3e3e3"
      }
    ]
  }
}
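
A fingerprint file in this shape can be read with the standard `json` module. The sketch below parses a trimmed inline copy of the structure rather than a file on disk, and the helper name `frame_hashes` is hypothetical.

```python
import json

# Trimmed copy of the fingerprint structure (video section only)
SAMPLE = """
{
  "video_id": "master_example",
  "info": {"duration": 20.0},
  "video_fp": {
    "method": "basic_hash",
    "samples_per_second": 1.0,
    "num_frames": 20,
    "frames": [
      {"frame_id": 0, "timestamp": 0.0, "hash": "0xcfcfc7e3c3e3e3e3"}
    ]
  }
}
"""


def frame_hashes(fingerprint):
    """Decode the hex hash strings into integers for bitwise comparison."""
    return [int(frame["hash"], 16) for frame in fingerprint["video_fp"]["frames"]]


fingerprint = json.loads(SAMPLE)
hashes = frame_hashes(fingerprint)
print(fingerprint["video_id"], len(hashes), hex(hashes[0]))
```

Once the hashes are integers, the Hamming distance between two frames is the number of set bits in their XOR.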

Masters Database (masters.json)

{
  "masters": [
    {
      "master_id": "master_example",
      "fingerprint_id": "master_master_example",
      "path": "/path/to/video.mp4",
      "filename": "video.mp4",
      "duration": 20.0
    }
  ]
}

Future Enhancements

Production-Ready Improvements

TMK Integration - Facebook's TMK+PDQF (Temporal Match Kernel) for more robust matching
Segment Timeline - Show exactly which parts came from which master
Web UI - Drag-drop interface with side-by-side comparison
Batch Processing - Process hundreds of adaptations in parallel
Database Storage - PostgreSQL/MongoDB instead of JSON files
Vector Search - Milvus/Qdrant for sub-second matching in large libraries
GPU Acceleration - CUDA-based hash computation
CLIP Embeddings - Handle heavy crops, overlays, graphics
Shot Detection - PySceneDetect for segment-level matching
Audio Refinement - Proper Chromaprint comparison implementation

Suggested Architecture for Scale

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Web UI      │────▶│  API Gateway │────▶│  Job Queue   │
│  (React)     │     │  (FastAPI)   │     │  (Celery)    │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                   │
                     ┌──────────────┐     ┌───────▼───────┐
                     │  Vector DB   │────▶│  Workers      │
                     │  (Qdrant)    │     │  (GPU-based)  │
                     └──────────────┘     └───────────────┘

License

MIT License - See LICENSE file for details.

Support & Contact

For questions, issues, or contributions, please open an issue on the GitHub repository.

Documentation Version: 1.0
Last Updated: 2025-10-05
