# Video Master-Adaptation Detection - Technical Documentation

## Table of Contents

1. [Overview](#overview)
2. [How It Works](#how-it-works)
3. [Architecture](#architecture)
4. [Matching Algorithm](#matching-algorithm)
5. [CLI Reference](#cli-reference)
6. [Batch Matching & HTML Reports](#batch-matching--html-reports)
7. [Advanced Usage](#advanced-usage)
8. [Understanding Results](#understanding-results)
9. [Performance Tuning](#performance-tuning)
10. [Troubleshooting](#troubleshooting)
11. [API Reference](#api-reference)

---

## Overview

This tool identifies which master video files were used to create adaptation videos (cutdowns, re-edits, speed changes, crops, etc.). It uses **spatial-only matching**, which compares video content regardless of temporal order. This makes it robust to:

- **Speed changes** (slow-motion, time-lapse, speed ramping)
- **Duration changes** (15s adaptation from 20s master)
- **Shot reordering** (non-linear edits)
- **Different aspect ratios** (with separate masters per aspect ratio)
- **Cropping and transformations**
- **Re-encoding and compression**

### Key Features

- ✅ **Spatial-only video matching** - Ignores timing, focuses on content
- ✅ **Audio fingerprinting** - Chromaprint-based robust audio matching
- ✅ **Multi-master detection** - Identifies all masters used in an adaptation
- ✅ **Percentage contribution** - Shows how much of each master was used
- ✅ **Confidence scoring** - Weighted scoring combining video + audio
- ✅ **Batch processing** - Bulk add masters from directories

---

## How It Works

### 1. Fingerprinting Phase

When you add a master video or match an adaptation, the tool:

1. **Extracts frames** at 2 frames per second (default, configurable)
2. **Creates perceptual hashes** (8×8 DCT-based hashing)
3. **Extracts an audio fingerprint** using Chromaprint (if available)
4. **Stores fingerprints** as JSON files for future comparisons

### 2. Matching Phase

When matching an adaptation against masters, the tool:

1. **Generates the adaptation fingerprint** (same process as for masters)
2. **Compares spatially**: for each adaptation frame, finds the most similar frame in each master (anywhere in the timeline)
3. **Calculates the percentage**: (matching adaptation frames / total adaptation frames) × 100%
4. **Combines signals**: weighted combination of video (70%) + audio (30%)
5. **Ranks results**: sorted by combined confidence score

### Key Insight: Spatial-Only Matching

Traditional, timeline-based video matching fails when adaptations are:

- Speed-changed (frames at different timestamps)
- Reordered (shots in a different sequence)
- Edited (missing sections, insertions)

**Solution**: We ask "Does this frame exist ANYWHERE in the master?" instead of "Does this frame exist at timestamp T?"

This makes matching robust to timing changes while still accurately identifying source content.

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│  CLI Layer (cli.py)                                             │
│  Commands: add-master, list-masters, match, clear, status       │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│  Matcher Layer (matcher.py)                                     │
│  • Loads fingerprints                                           │
│  • Orchestrates comparison                                      │
│  • Calculates percentages & confidence                          │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│  Fingerprinter Layer (fingerprinter.py)                         │
│  • Video frame extraction (FFmpeg)                              │
│  • Perceptual hashing (8×8 DCT)                                 │
│  • Audio fingerprinting (Chromaprint)                           │
│  • Spatial-only comparison                                      │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│  Storage Layer                                                  │
│  • data/fingerprints/*.json - Fingerprint files                 │
│  • data/masters.json - Master video database                    │
└─────────────────────────────────────────────────────────────────┘
```

### Core Components

#### 1. `VideoFingerprinter` (fingerprinter.py)

- Extracts video frames and generates perceptual hashes
- Creates audio fingerprints using Chromaprint
- Supports a configurable sampling rate (frames per second)
- Stores fingerprints as JSON for reuse

#### 2. `VideoMatcher` (matcher.py)

- Manages the master video database
- Performs spatial-only matching
- Calculates percentage contributions
- Generates confidence scores

#### 3. `CLI` (cli.py)

- User-facing command-line interface
- Rich terminal output with tables and colors
- Progress bars for batch operations

---

## Matching Algorithm

### Spatial-Only Video Matching

The listing below is a simplified illustration of the comparison. The packaged function returns additional fields (see [API Reference](#api-reference)).

```python
def hash_similarity(hash_a, hash_b, bits=64):
    """Similarity in [0, 1] between two hex-encoded 64-bit hashes."""
    # Assumption: similarity = 1 - (Hamming distance / number of hash bits)
    distance = bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")
    return 1.0 - distance / bits


def compare_spatial_only(adaptation_fp, master_fp, threshold=0.70):
    """Return the percentage of adaptation frames found anywhere in the master."""
    adaptation_frames = adaptation_fp["video_fp"]["frames"]
    master_frames = master_fp["video_fp"]["frames"]
    if not adaptation_frames:
        return 0.0

    matches = 0
    for adapt_frame in adaptation_frames:
        # Compare against ALL master frames (ignore time)
        best_similarity = max(
            (hash_similarity(adapt_frame["hash"], m["hash"]) for m in master_frames),
            default=0.0,
        )
        if best_similarity >= threshold:
            matches += 1

    return (matches / len(adaptation_frames)) * 100
```

### Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `samples_per_second` | 2.0 | Frames extracted per second (configurable in code) |
| `frame_threshold` | 0.70 | Minimum similarity for two frames to count as a match (0-1) |
| `threshold` | 0.30 | Minimum fraction of adaptation frames that must match for a master to be reported (0-1) |

### Confidence Calculation

```
combined_score = (video_percentage / 100 × 0.7) + (audio_similarity × 0.3)

Confidence Levels:
- Very High: combined_score ≥ 0.90
- High:      combined_score ≥ 0.75
- Medium:    combined_score ≥ 0.60
- Low:       combined_score ≥ 0.50
- Very Low:  combined_score < 0.50
```

---

## CLI Reference

### `add-master` - Add Master Video

Add a master video to the library.

```bash
python cli.py add-master [--id ]
```

**Examples:**

```bash
# Auto-generate ID from filename
python cli.py add-master /path/to/master.mp4

# Use custom ID
python cli.py add-master /path/to/master.mp4 --id master_v1
```

### `list-masters` - List All Masters

Display all master videos in the library.

```bash
python cli.py list-masters
```

**Output:**

- Master ID
- Filename
- Duration
- File path

### `match` - Match Adaptation Video

Match an adaptation against all masters using spatial-only matching.

```bash
python cli.py match [OPTIONS]
```

**Options:**

- `--threshold`, `-t` (default: 0.3): Minimum fraction of adaptation frames that must match (0-1)
- `--frame-threshold`, `-f` (default: 0.70): Similarity threshold for individual frames (0-1)

**Examples:**

```bash
# Default matching
python cli.py match /path/to/adaptation.mp4

# Stricter matching (require 50% of frames)
python cli.py match /path/to/adaptation.mp4 -t 0.5

# More sensitive frame matching
python cli.py match /path/to/adaptation.mp4 -f 0.65

# Combined: require 70% match with sensitive frame detection
python cli.py match /path/to/adaptation.mp4 -t 0.7 -f 0.65
```

### `status` - System Status

Check system dependencies and library statistics.

```bash
python cli.py status
```

**Shows:**

- FFmpeg availability
- Chromaprint/AcoustID status
- TMK status
- Number of master videos

### `batch-match` - Batch Match Folder

Match all videos in a folder and generate an HTML report.

```bash
python cli.py batch-match [OPTIONS]
```

**Options:**

- `--threshold`, `-t` (default: 0.3): Minimum fraction of adaptation frames that must match (0-1)
- `--frame-threshold`, `-f` (default: 0.70): Frame similarity threshold (0-1)
- `--output`, `-o`: Output HTML file path (default: auto-generated timestamped filename)

**Examples:**

```bash
# Process all videos in folder
python cli.py batch-match /path/to/adaptations/

# Custom thresholds
python cli.py batch-match /path/to/adaptations/ -t 0.5 -f 0.75

# Custom output filename
python cli.py batch-match /path/to/adaptations/ -o report.html
```

**Output:**

- Generates a timestamped HTML report: `matching_report_YYYYMMDD_HHMMSS.html`
- Shows summary statistics in the terminal
- Provides a clickable file:// URL to open the report

### `clear` - Clear Library

Remove all master videos from the library.

```bash
python cli.py clear
```

⚠️ **Warning:** This deletes all fingerprints and master records. It cannot be undone.

---

## Batch Matching & HTML Reports

### Overview

The batch matching feature processes an entire folder of adaptation videos and generates an HTML report showing which masters were used for each adaptation.

### Usage

**Command Line:**

```bash
# Basic usage
python cli.py batch-match /path/to/adaptations/

# With custom thresholds
python cli.py batch-match /path/to/adaptations/ -t 0.5 -f 0.75

# Specify output filename
python cli.py batch-match /path/to/adaptations/ -o my_report.html
```

**Standalone Script:**

```bash
# You can also use the standalone script
python batch_match.py /path/to/adaptations/
python batch_match.py /path/to/adaptations/ --output reports/batch_results.html
```

### HTML Report Features

The generated HTML report includes:

**1. Summary Dashboard**

- Total adaptations processed
- Number of matched adaptations
- Number with no matches
- Total master matches across all adaptations

**2. Per-Adaptation Cards**

Each adaptation is shown in a card with:

- Adaptation filename
- Number-of-matches badge
- List of all matching masters
- Error message (if processing failed)

**3. Per-Master Match Details**

For each matching master:

- Master ID and filename
- Color-coded confidence badge:
  - 🟢 **Green** - Very High/High confidence
  - 🟡 **Yellow** - Medium confidence
  - 🔴 **Red** - Low/Very Low confidence
- Master duration
- Video match percentage
- Frames matched (X/Y format)
- Combined confidence score
- Visual progress bar showing match percentage

**4. Design Features**

- Modern gradient design (purple theme)
- Responsive layout (works on mobile/tablet/desktop)
- Hover effects on cards
- Print-friendly styling
- Clean, professional appearance

### Example Workflow

```bash
# 1. Add all masters
python bulk_add_masters.py "masters/" -r

# 2. Process all adaptations
python cli.py batch-match "adaptations/"

# 3. Open the generated report (macOS; use xdg-open on Linux)
open matching_report_20251010_153045.html

# 4. Review results:
#    - Which adaptations matched which masters
#    - Confidence levels for each match
#    - Any processing errors
```

### Use Cases

**Quality Control:**

- Verify adaptations were created from the correct masters
- Check whether all expected masters were used
- Identify adaptations with low-confidence matches

**Production Tracking:**

- Document which masters were used for each delivery
- Generate an audit trail of master usage
- Track the adaptation creation workflow

**Asset Management:**

- Identify unused masters
- Find duplicate or similar adaptations
- Organize the video library by source masters

### Report Customization

The HTML report can be customized by editing the CSS embedded in `batch_match.py`:

```css
/* Line 23: Change color scheme */
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);

/* Line 80: Adjust card styling */
.adaptation {
    background: white;
    padding: 25px;
    border-radius: 15px;
}

/* Line 150: Modify confidence colors */
.confidence-very-high { background: #51cf66; }
.confidence-high { background: #69db7c; }
```

---

## Advanced Usage

### Bulk Adding Masters

Use the `bulk_add_masters.py` script to add multiple videos at once:

```bash
# Add all .mp4 files from a directory
python bulk_add_masters.py /path/to/masters/

# Recursively add from subdirectories
python bulk_add_masters.py /path/to/masters/ --recursive

# Add specific pattern
python bulk_add_masters.py /path/to/masters/ --pattern "*.mov"
```

### Adjusting Sampling Rate

The default is **2 frames per second**, optimized for fast-paced advertising content with quick edits.

Edit `src/video_matcher/fingerprinter.py:106` and set one of:

```python
samples_per_second = 2.0    # Default: good for ads with quick cuts
# samples_per_second = 1.0  # Faster: basic matching, may miss quick edits
# samples_per_second = 3.0  # Slower: catches sub-second cuts
```

**Trade-offs:**

| Rate | Frames in a 20s Video | Use Case | Pros | Cons (vs. 1 fps) |
|------|-----------------------|----------|------|------------------|
| 0.5 fps | 10 frames | Long-form content | Fast, small files | May miss cuts |
| 1.0 fps | 20 frames | General purpose | Balanced | Misses quick edits |
| **2.0 fps** | **40 frames** | **Ads/Marketing** | **Catches quick cuts** | **2x storage** |
| 3.0 fps | 60 frames | Sub-second cuts | Very detailed | 3x slower |

**Recommendation:** Keep 2 fps for advertising/marketing content with fast edits.

### Handling Different Aspect Ratios

**Best Practice:** Maintain separate masters for each aspect ratio:

```
masters/
├── 16x9/
│   ├── master_A_16x9.mp4
│   └── master_B_16x9.mp4
├── 9x16/
│   ├── master_A_9x16.mp4
│   └── master_B_9x16.mp4
└── 1x1/
    ├── master_A_1x1.mp4
    └── master_B_1x1.mp4
```

Add all versions to the library:

```bash
python bulk_add_masters.py masters/16x9/ -r
python bulk_add_masters.py masters/9x16/ -r
python bulk_add_masters.py masters/1x1/ -r
```

The matcher compares each adaptation against every master in the library, so the master in the matching aspect ratio should rank highest.

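
To put the sampling-rate trade-offs from this section into rough numbers, the sketch below estimates how many frames are sampled and how many frame-to-frame comparisons an all-pairs spatial comparison performs against a single master. The helper names `frames_sampled` and `pairwise_comparisons` are hypothetical and not part of the tool, and the real fingerprinter may round frame counts differently.

```python
def frames_sampled(duration_s: float, samples_per_second: float) -> int:
    """Approximate number of frames extracted from a video (assumed: truncation)."""
    return int(duration_s * samples_per_second)


def pairwise_comparisons(adapt_s: float, master_s: float, samples_per_second: float) -> int:
    """Hash comparisons when every adaptation frame is checked against every master frame."""
    adapt_frames = frames_sampled(adapt_s, samples_per_second)
    master_frames = frames_sampled(master_s, samples_per_second)
    return adapt_frames * master_frames


if __name__ == "__main__":
    # Example: a 15s adaptation matched against one 20s master.
    for rate in (0.5, 1.0, 2.0, 3.0):
        frames = frames_sampled(20, rate)
        comparisons = pairwise_comparisons(15, 20, rate)
        print(f"{rate} fps: {frames} master frames, {comparisons} comparisons")
```

Under these assumptions, doubling the sampling rate doubles the fingerprint size but roughly quadruples the comparisons per master, which is worth weighing before raising the rate on a large library.
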
---

## Understanding Results

### Sample Output

```
Found 2 master(s) matching this adaptation:

╭──────┬────────────┬─────────────┬────────┬───────┬──────────┬────────────╮
│ Rank │ Master ID  │ Video Match │ Frames │ Audio │ Combined │ Confidence │
├──────┼────────────┼─────────────┼────────┼───────┼──────────┼────────────┤
│ 1    │ master_C   │ 100.0%      │ 15/15  │ 0.500 │ 0.850    │ High       │
│ 2    │ master_B   │ 73.3%       │ 11/15  │ 0.500 │ 0.663    │ Medium     │
╰──────┴────────────┴─────────────┴────────┴───────┴──────────┴────────────╯

Best Match:
  Master: master_C
  Video frames matched: 100.0% (15/15 frames)
  Average frame similarity: 94.4%
  Audio similarity: 0.500
  Combined confidence: 85.0%
```

### Interpreting Scores

**Video Match Percentage:**

- **100%**: All adaptation frames found in the master
- **75-99%**: Most frames match, likely the correct master
- **50-74%**: Partial match, possibly similar content
- **<50%**: Unlikely to be the source master

**Average Frame Similarity:**

- **>90%**: Near-identical frames (same encoding/quality)
- **75-90%**: Very similar (different encoding/compression)
- **60-75%**: Similar content (crops, color grading)
- **<60%**: Different content or heavy transformations

**Combined Score:**

- Weighted combination: 70% video + 30% audio
- Audio helps disambiguate visually similar masters
- Higher combined score = more confident match

### When Multiple Masters Match

If an adaptation uses content from multiple masters:

```
Best Match:
  Master: master_A - 60% of frames

Other Potential Matches:
  • master_B: 40% of frames
```

This indicates the adaptation combined:

- 60% content from master_A
- 40% content from master_B

---

## Performance Tuning

### Speed vs Accuracy

**For faster matching (lower accuracy):**

```python
# Reduce sampling rate (1.0 = 1 frame per second)
samples_per_second = 1.0

# Increase thresholds (stricter matching)
frame_threshold = 0.75
threshold = 0.5
```

**For better accuracy (slower):**

```python
# Increase sampling rate (3.0 = 3 frames per second)
samples_per_second = 3.0

# Lower thresholds (more sensitive)
frame_threshold = 0.65
threshold = 0.3
```

**Default (balanced for ads):**

```python
samples_per_second = 2.0  # Catches quick edits
frame_threshold = 0.70
threshold = 0.3
```

### Large Libraries

For libraries with 100+ masters:

1. **Pre-filter by duration:**
   - Skip masters that are too short/long for the adaptation
2. **Use audio pre-filtering:**
   - Match audio first, then only check video for audio matches
3. **Parallel processing:**
   - Compare against multiple masters simultaneously

---

## Troubleshooting

### Common Issues

**❌ No matches found**

**Cause:** Thresholds too strict, or the videos are unrelated

**Solution:**

```bash
# Try more lenient settings
python cli.py match video.mp4 -t 0.2 -f 0.65
```

---

**❌ Too many false positives**

**Cause:** Thresholds too lenient, or similar-looking content

**Solution:**

```bash
# Stricter matching
python cli.py match video.mp4 -t 0.5 -f 0.75
```

---

**❌ Speed-changed adaptations not matching**

**Cause:** Speed changes are already handled, because spatial matching ignores timing. The problem is usually elsewhere.

**Check:**

- Ensure the video content is actually similar
- Lower `frame_threshold` if the adaptation is heavily processed

---

**❌ Different aspect ratios not matching**

**Solution:** Ensure the library contains masters in the same aspect ratio as the adaptation

```bash
# Add masters for each aspect ratio
python cli.py add-master master_16x9.mp4
python cli.py add-master master_1x1.mp4
```

---

**❌ Audio similarity always 0.500**

**Cause:** Chromaprint comparison is not fully implemented (placeholder value)

**Note:** This is a POC limitation. Video matching still works.

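
Because the audio score is currently fixed at 0.500, it also limits how high the combined score can go. The sketch below applies the formula and confidence bands from the Confidence Calculation section. The names `combined_score` and `confidence_level` are illustrative and are not the tool's API.

```python
VIDEO_WEIGHT = 0.7  # weights as documented under Confidence Calculation
AUDIO_WEIGHT = 0.3


def combined_score(video_percentage: float, audio_similarity: float) -> float:
    """Weighted score in [0, 1] from a video match percentage and an audio similarity."""
    return (video_percentage / 100) * VIDEO_WEIGHT + audio_similarity * AUDIO_WEIGHT


def confidence_level(score: float) -> str:
    """Map a combined score to the documented confidence bands."""
    if score >= 0.90:
        return "Very High"
    if score >= 0.75:
        return "High"
    if score >= 0.60:
        return "Medium"
    if score >= 0.50:
        return "Low"
    return "Very Low"


if __name__ == "__main__":
    placeholder_audio = 0.5  # the fixed value the POC currently reports
    for video_pct in (100.0, 11 / 15 * 100, 50.0):
        score = combined_score(video_pct, placeholder_audio)
        print(f"video {video_pct:.1f}% -> {score:.3f} ({confidence_level(score)})")
```

With audio pinned at 0.5, even a 100% video match scores about 0.85, so results top out at High until a real audio comparison is in place. This is consistent with the sample output under Understanding Results, where a 15/15 frame match is reported as High rather than Very High.
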
---

## API Reference

### VideoFingerprinter

```python
from video_matcher.fingerprinter import VideoFingerprinter

fp = VideoFingerprinter(data_dir="data/fingerprints")

# Generate fingerprint
fingerprint = fp.fingerprint_video(
    video_path="/path/to/video.mp4",
    video_id="my_video"
)

# Load existing fingerprint
existing = fp.load_fingerprint("my_video")

# List all fingerprints
all_ids = fp.list_fingerprints()
```

### VideoMatcher

```python
from video_matcher.matcher import VideoMatcher

matcher = VideoMatcher(data_dir="data")

# Add master
matcher.add_master(
    video_path="/path/to/master.mp4",
    master_id="master_1"
)

# List masters
masters = matcher.list_masters()

# Match adaptation
matches = matcher.match_adaptation(
    video_path="/path/to/adaptation.mp4",
    threshold=0.3,
    frame_threshold=0.70
)

# Clear all masters
matcher.clear_masters()
```

### Comparison Functions

```python
from video_matcher.fingerprinter import (
    compare_spatial_only,
    compare_audio_fingerprints
)

# Spatial video comparison
result = compare_spatial_only(
    adaptation_fp=adapt_fp,
    master_fp=master_fp,
    similarity_threshold=0.75
)
# Returns: {
#     'matching_frames': 12,
#     'total_frames': 15,
#     'percentage': 80.0,
#     'average_similarity': 0.87
# }

# Audio comparison
audio_score = compare_audio_fingerprints(
    fp1=adapt_audio,
    fp2=master_audio
)
# Returns: float (0-1)
```

---

## File Formats

### Fingerprint JSON Structure

```json
{
  "video_id": "master_example",
  "path": "/path/to/video.mp4",
  "filename": "video.mp4",
  "info": {
    "duration": 20.0,
    "width": 1920,
    "height": 1080,
    "fps": 25.0,
    "has_audio": true,
    "codec": "h264"
  },
  "audio_fp": {
    "duration": 20.0,
    "fingerprint": "AQAAZEw4Kc9w...",
    "method": "chromaprint"
  },
  "video_fp": {
    "method": "basic_hash",
    "samples_per_second": 1.0,
    "num_frames": 20,
    "frames": [
      {
        "frame_id": 0,
        "timestamp": 0.0,
        "hash": "0xcfcfc7e3c3e3e3e3"
      }
    ]
  }
}
```

### Masters Database (masters.json)

```json
{
  "masters": [
    {
      "master_id": "master_example",
      "fingerprint_id": "master_master_example",
      "path": "/path/to/video.mp4",
      "filename": "video.mp4",
      "duration": 20.0
    }
  ]
}
```

---

## Future Enhancements

### Production-Ready Improvements

1. **TMK Integration** - Facebook's Temporal Match Kernel for more robust matching
2. **Segment Timeline** - Show exactly which parts came from which master
3. **Web UI** - Drag-and-drop interface with side-by-side comparison
4. **Batch Processing** - Process hundreds of adaptations in parallel
5. **Database Storage** - PostgreSQL/MongoDB instead of JSON files
6. **Vector Search** - Milvus/Qdrant for sub-second matching in large libraries
7. **GPU Acceleration** - CUDA-based hash computation
8. **CLIP Embeddings** - Handle heavy crops, overlays, graphics
9. **Shot Detection** - PySceneDetect for segment-level matching
10. **Audio Refinement** - Proper Chromaprint comparison implementation

### Suggested Architecture for Scale

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Web UI     │────▶│ API Gateway  │────▶│  Job Queue   │
│   (React)    │     │  (FastAPI)   │     │  (Celery)    │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                    ┌──────────────┐     ┌───────▼───────┐
                    │  Vector DB   │────▶│    Workers    │
                    │  (Qdrant)    │     │  (GPU-based)  │
                    └──────────────┘     └───────────────┘
```

---

## License

MIT License - See LICENSE file for details.

---

## Support & Contact

For questions, issues, or contributions, please open an issue on the GitHub repository.

**Documentation Version:** 1.0
**Last Updated:** 2025-10-05