# Batch Processing Improvements - Implementation Summary
**Date**: 2025-11-10
**Status**: ✅ All Phases Completed
## Overview
This change set improves batch video processing in four areas: consistent model selection, specialized synthesis strategies, enhanced logging, and configurable options. Every video in a batch is processed with the same prompt, and the resulting per-video summaries are synthesized with a strategy chosen from the detected prompt type.
---
## Changes Implemented
### ✅ Phase 1: Enhanced Logging
**File Modified**: `backend/video_processor.py`
**Changes**:
- Added structured logging with `[Stage 1]`, `[Stage 2]`, `[Traceability]`, and `[Metrics]` prefixes
- Implemented configurable debug-level logging for prompts and summaries
- Added performance metrics tracking (stage times, avg time per video, API call count)
- Added video-to-summary-to-result traceability logging
**New Log Output**:
```
Batch abc123: [Stage 1] Processing video 1/3: meeting1.mp4
Batch abc123: [Stage 1] Video 1 complete: 1,245 chars in 45.2s
Batch abc123: [Stage 2] Detected prompt type: meeting_summary
Batch abc123: [Stage 2] Synthesis complete: 3,456 chars in 15.3s
Batch abc123: [Traceability] Video-to-summary mapping:
Batch abc123: - Video 1: meeting1.mp4 → Summary 1
Batch abc123: [Metrics] Stage 1: 135.6s, Stage 2: 15.3s, Total: 150.9s
```
**Lines Modified**: 987-1055, 1123-1247
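The prefixed log lines above could be produced by a small helper along these lines. This is a minimal sketch, not the code in `backend/video_processor.py`: the names `batch_log` and `format_metrics`, the logger name, and the definition of average time per video as Stage 1 time divided by video count are all assumptions made for illustration.

```python
import logging
from typing import List

# Hypothetical logger name; the real module may configure logging differently.
logger = logging.getLogger("video_processor")


def batch_log(batch_id: str, tag: str, message: str) -> str:
    """Emit one line in the 'Batch <id>: [<tag>] <message>' format and return it."""
    line = f"Batch {batch_id}: [{tag}] {message}"
    logger.info(line)
    return line


def format_metrics(stage1_s: float, stage2_s: float, video_count: int) -> List[str]:
    """Build the [Metrics] messages from stage timings (assumed definitions)."""
    total = stage1_s + stage2_s
    # Assumption: the per-video average covers Stage 1 only.
    avg = stage1_s / video_count if video_count else 0.0
    return [
        f"Stage 1: {stage1_s:.1f}s, Stage 2: {stage2_s:.1f}s, Total: {total:.1f}s",
        f"Avg time per video: {avg:.1f}s",
    ]
```

Routing every line through one helper keeps the prefixes consistent, which is what makes the `grep "Traceability"` and `grep "Metrics"` filters later in this document reliable.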
---
### ✅ Phase 2: Model Consistency Fix
**File Modified**: `backend/video_processor.py`
**Changes**:
- Changed synthesis model from `gemini-2.0-flash-exp` to `gemini-2.5-pro`
- Added model configuration constants at class level
- Made models configurable via environment variables
**Before**:
```python
# Individual processing
model="gemini-2.5-pro"
# Batch synthesis
model="gemini-2.0-flash-exp" # ❌ INCONSISTENT
```
**After**:
```python
# Both use same model
self.processing_model = "gemini-2.5-pro"
self.synthesis_model = "gemini-2.5-pro" # ✅ CONSISTENT
```
**Lines Modified**: 48-50, 82-88, 339, 553, 1252
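One way to combine class-level defaults with environment overrides is sketched below. The class name `VideoProcessorConfig` and the `models_consistent` helper are hypothetical and only illustrate the idea; the environment variable names are the ones listed under Phase 4.

```python
import os
from typing import Mapping, Optional


class VideoProcessorConfig:
    """Hypothetical settings holder; not the actual class in video_processor.py."""

    # Class-level defaults: both stages use the same model.
    DEFAULT_PROCESSING_MODEL = "gemini-2.5-pro"
    DEFAULT_SYNTHESIS_MODEL = "gemini-2.5-pro"

    def __init__(self, env: Optional[Mapping[str, str]] = None):
        # Accepting a mapping makes the class testable without touching os.environ.
        env = os.environ if env is None else env
        self.processing_model = env.get("VIDEO_PROCESSOR_MODEL", self.DEFAULT_PROCESSING_MODEL)
        self.synthesis_model = env.get("VIDEO_SYNTHESIS_MODEL", self.DEFAULT_SYNTHESIS_MODEL)

    def models_consistent(self) -> bool:
        """True when both stages are configured to use the same model."""
        return self.processing_model == self.synthesis_model
```

A check such as `models_consistent()` could be logged at startup so that an override that splits the two stages is visible in the logs.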
---
### ✅ Phase 3: Specialized Synthesis Strategies
**File Modified**: `backend/video_processor.py`
**Changes**:
- Added `_detect_prompt_type()` method to classify prompts
- Added `_create_synthesis_prompt_meeting()` for meeting summaries
- Added `_create_synthesis_prompt_documentation()` for process docs
- Updated `_synthesize_final_result()` to route to specialized strategies
**Prompt Type Detection**:
```python
from typing import List

def _detect_prompt_type(self, prompt: str, summaries: List[str]) -> str:
    """
    Detects: meeting_summary | documentation | documentation_with_charts | generic
    """
    # Keywords: meeting, discussion, action item → meeting_summary
    # Keywords: documentation, process, training → documentation
    # Keywords: diagram, chart, mermaid → documentation_with_charts
```
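The keyword lists above suggest a simple substring classifier. The sketch below is one possible reading, not the actual method body: the standalone function name, the precedence order (chart keywords checked first), and the decision to ignore `summaries` are assumptions.

```python
from typing import List

# Keyword lists taken from the comments in the snippet above.
MEETING_KEYWORDS = ("meeting", "discussion", "action item")
DOC_KEYWORDS = ("documentation", "process", "training")
CHART_KEYWORDS = ("diagram", "chart", "mermaid")


def detect_prompt_type(prompt: str, summaries: List[str]) -> str:
    """Classify a prompt by keyword; `summaries` is accepted but unused in this sketch."""
    text = prompt.lower()
    # Assumption: chart keywords win, since they refine the documentation type.
    if any(keyword in text for keyword in CHART_KEYWORDS):
        return "documentation_with_charts"
    if any(keyword in text for keyword in MEETING_KEYWORDS):
        return "meeting_summary"
    if any(keyword in text for keyword in DOC_KEYWORDS):
        return "documentation"
    return "generic"
```

Plain substring matching is easy to reason about but can misfire, for example "process" also matches "processing", so a real implementation may want word-boundary matching.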
**Meeting Synthesis Strategy**:
- Consolidates discussion points across all videos
- Creates master action items list (removes duplicates)
- Formats with clear sections: Overview, Discussion, Action Items, Outcomes
**Documentation Synthesis Strategy**:
- Combines steps into sequential guide
- Numbers steps continuously (Step 1, Step 2, ...)
- Includes Prerequisites, Tips, Troubleshooting sections
**Lines Added**: 1195-1441
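Routing from prompt type to a strategy can be expressed as a dispatch table. The following is a rough sketch under stated assumptions: the builder function names, the prompt wording, the generic fallback, and the reuse of the documentation builder for `documentation_with_charts` are illustrative and are not taken from `_synthesize_final_result()`.

```python
from typing import Callable, Dict, List


def _join_summaries(summaries: List[str]) -> str:
    """Label each per-video summary so the model can tell them apart."""
    return "\n\n".join(
        f"--- Video {i} summary ---\n{text}" for i, text in enumerate(summaries, start=1)
    )


def build_meeting_prompt(prompt: str, summaries: List[str]) -> str:
    return (
        f"{prompt}\n\n"
        "Consolidate the discussion points from all videos and merge duplicate action items.\n"
        "Use these sections: Overview, Discussion, Action Items, Outcomes.\n\n"
        f"{_join_summaries(summaries)}"
    )


def build_documentation_prompt(prompt: str, summaries: List[str]) -> str:
    return (
        f"{prompt}\n\n"
        "Combine the steps into one sequential guide and number them continuously.\n"
        "Include these sections: Prerequisites, Tips, Troubleshooting.\n\n"
        f"{_join_summaries(summaries)}"
    )


def build_generic_prompt(prompt: str, summaries: List[str]) -> str:
    return f"{prompt}\n\nCombine the following summaries:\n\n{_join_summaries(summaries)}"


# Assumption: chart-oriented documentation reuses the documentation builder.
STRATEGIES: Dict[str, Callable[[str, List[str]], str]] = {
    "meeting_summary": build_meeting_prompt,
    "documentation": build_documentation_prompt,
    "documentation_with_charts": build_documentation_prompt,
}


def build_synthesis_prompt(prompt_type: str, prompt: str, summaries: List[str]) -> str:
    """Pick the builder for the detected type, falling back to the generic one."""
    builder = STRATEGIES.get(prompt_type, build_generic_prompt)
    return builder(prompt, summaries)
```

With a table like this, adding a new strategy means adding one builder and one dictionary entry, and unknown prompt types degrade to the generic prompt instead of failing.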
---
### ✅ Phase 4: Configuration Options
**Files Modified**:
- `backend/video_processor.py`
- `backend/.env.example` (created)
**New Environment Variables**:
| Variable | Default | Description |
|----------|---------|-------------|
| `VIDEO_PROCESSOR_MODEL` | `gemini-2.5-pro` | Model for individual video processing |
| `VIDEO_SYNTHESIS_MODEL` | `gemini-2.5-pro` | Model for batch synthesis |
| `BATCH_PROCESSING_LOG_PROMPTS` | `false` | Enable prompt logging (debug) |
| `BATCH_PROCESSING_LOG_SUMMARIES` | `false` | Enable summary preview logging (debug) |
**Usage Example**:
```bash
# Enable detailed logging for debugging
export BATCH_PROCESSING_LOG_PROMPTS=true
export BATCH_PROCESSING_LOG_SUMMARIES=true
# Optional: use a different model for synthesis (the two stages will then no longer share a model)
export VIDEO_SYNTHESIS_MODEL=gemini-2.0-flash-exp
```
**Lines Modified**: 82-88, 1003-1004, 1016-1017, 1150-1151, 1170-1171, 1190-1192, 1204-1205, 1240-1242, 1272-1273
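The two logging flags are booleans stored as strings, so they need explicit parsing. Below is a minimal sketch of how that might look; the helper names, the accepted truthy spellings, the 500-character truncation, and the log message wording are assumptions, since this document only shows the values `true` and `false`.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger("video_processor")  # hypothetical logger name

# Assumption: these spellings count as "enabled"; only "true" appears in this document.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool = False, env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag from the environment, tolerating case and whitespace."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


def log_prompt_if_enabled(
    batch_id: str,
    prompt: str,
    env: Optional[Mapping[str, str]] = None,
    max_chars: int = 500,
) -> bool:
    """Log a truncated prompt at debug level only when the flag is set."""
    if not env_flag("BATCH_PROCESSING_LOG_PROMPTS", env=env):
        return False
    logger.debug("Batch %s: [Stage 1] Prompt: %s", batch_id, prompt[:max_chars])
    return True
```

Defaulting to `False` keeps prompts and summaries, which may contain sensitive content, out of the logs unless someone opts in.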
---
### ✅ Documentation Updates
**File Modified**: `CLAUDE.md`
**Sections Added/Updated**:
1. **Backend Setup**: Added .env example with all configuration options
2. **Production Deployment**: Updated environment configuration section
3. **Key Architecture Components**: Added comprehensive Batch Processing Architecture section
4. **Configuration Files**: Documented all environment variables
5. **Troubleshooting**: Added Batch Processing Issues section with debugging guide
**New Documentation Sections**:
- Batch Processing Architecture
- Batch Processing Flow (4-stage explanation)
- Logging Levels guide
- Troubleshooting: Inconsistent summaries
- Troubleshooting: Prompt visibility
- Troubleshooting: Video-to-result mapping
- Troubleshooting: Performance issues
---
## How to Use
### Normal Operation (Default)
```bash
# No changes needed - works out of the box
GOOGLE_API_KEY=your_key
```
### Enable Debugging
```bash
# In backend/.env
GOOGLE_API_KEY=your_key
BATCH_PROCESSING_LOG_PROMPTS=true
BATCH_PROCESSING_LOG_SUMMARIES=true
# Restart backend
sudo systemctl restart video-query
# View logs with filtering
journalctl -u video-query -f | grep "Batch"
```
### View Traceability (Always Enabled)
```bash
# See which video produced which summary
journalctl -u video-query -f | grep "Traceability"
```
### View Performance Metrics (Always Enabled)
```bash
# See timing breakdown and API call counts
journalctl -u video-query -f | grep "Metrics"
```
---
## Verification
### Test Batch Processing
```bash
# Process multiple videos as batch
curl -X POST http://localhost:5010/api/process-batch \
  -H "Content-Type: application/json" \
  -d '{
    "videos": [
      {"file_path": "/tmp/video1.mp4", "filename": "meeting_part1.mp4", "order": 1},
      {"file_path": "/tmp/video2.mp4", "filename": "meeting_part2.mp4", "order": 2}
    ],
    "prompt": "Generate a detailed meeting summary with action items",
    "batch_id": "test-batch-001"
  }'
# Check logs for:
# 1. Prompt type detection: "Detected prompt type: meeting_summary"
# 2. Model consistency: "model: gemini-2.5-pro" for both stages
# 3. Traceability: Video-to-summary mapping
# 4. Performance: Stage 1/2 timing
```
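For scripted tests, the same request can be assembled in Python with the standard library. This sketch only builds the request and does not send it. The function name is hypothetical, and deriving `filename` from the last path component is an assumption (the curl example above uses filenames that differ from the paths); the endpoint and field names come from the curl example.

```python
import json
import urllib.request
from typing import List

BATCH_ENDPOINT = "http://localhost:5010/api/process-batch"  # URL from the curl example


def build_batch_request(file_paths: List[str], prompt: str, batch_id: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for one batch."""
    videos = [
        # Assumption: filename defaults to the last path component; order is 1-based.
        {"file_path": path, "filename": path.rsplit("/", 1)[-1], "order": order}
        for order, path in enumerate(file_paths, start=1)
    ]
    body = json.dumps({"videos": videos, "prompt": prompt, "batch_id": batch_id})
    return urllib.request.Request(
        BATCH_ENDPOINT,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request, provided the backend is running on port 5010.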
### Expected Log Output
```
2025-11-10 10:30:00 - Batch test-batch-001: Processing 2 videos (meeting_part1.mp4, meeting_part2.mp4)
2025-11-10 10:30:00 - Batch test-batch-001: [Stage 1] Direct processing of 2 videos
2025-11-10 10:30:05 - Batch test-batch-001: [Stage 1] Processing video 1/2: meeting_part1.mp4
2025-11-10 10:30:50 - Batch test-batch-001: [Stage 1] Video 1 complete: 1,234 chars in 45.2s
2025-11-10 10:30:55 - Batch test-batch-001: [Stage 1] Processing video 2/2: meeting_part2.mp4
2025-11-10 10:31:40 - Batch test-batch-001: [Stage 1] Video 2 complete: 1,567 chars in 45.1s
2025-11-10 10:31:40 - Batch test-batch-001: [Stage 1] Complete - 2 summaries in 95.3s
2025-11-10 10:31:40 - Batch test-batch-001: [Traceability] Video-to-summary mapping:
2025-11-10 10:31:40 - Batch test-batch-001: - Video 1: meeting_part1.mp4 → Summary 1
2025-11-10 10:31:40 - Batch test-batch-001: - Video 2: meeting_part2.mp4 → Summary 2
2025-11-10 10:31:40 - Batch test-batch-001: [Stage 2] Synthesizing 2 summaries
2025-11-10 10:31:40 - Batch test-batch-001: [Stage 2] Combined summaries: 2 summaries, 2801 total chars
2025-11-10 10:31:40 - Batch test-batch-001: [Stage 2] Detected prompt type: meeting_summary
2025-11-10 10:31:40 - Batch test-batch-001: [Stage 2] Sending synthesis request to Gemini API (model: gemini-2.5-pro)
2025-11-10 10:31:55 - Batch test-batch-001: [Stage 2] Synthesis complete: 3,456 chars in 15.2s
2025-11-10 10:31:55 - Batch test-batch-001: [Metrics] Stage 1: 95.3s, Stage 2: 15.2s, Total: 110.5s
2025-11-10 10:31:55 - Batch test-batch-001: [Metrics] Avg time per video: 47.7s
```
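When checking logs automatically, the `[Metrics]` line can be parsed and sanity-checked. The sketch below assumes the exact line format shown in the expected output above; the function names and the tolerance value are hypothetical.

```python
import re
from typing import Dict, Optional

# Matches the timing line format shown in the expected log output.
METRICS_RE = re.compile(
    r"\[Metrics\] Stage 1: (?P<stage1>[\d.]+)s, "
    r"Stage 2: (?P<stage2>[\d.]+)s, "
    r"Total: (?P<total>[\d.]+)s"
)


def parse_metrics(line: str) -> Optional[Dict[str, float]]:
    """Extract stage timings from a log line, or return None if it is not a timing line."""
    match = METRICS_RE.search(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


def totals_consistent(metrics: Dict[str, float], tolerance: float = 0.05) -> bool:
    """Check that Stage 1 + Stage 2 equals Total, allowing for rounding in the log."""
    return abs(metrics["stage1"] + metrics["stage2"] - metrics["total"]) <= tolerance
```

Lines that carry other metrics, such as the average time per video, do not match the pattern and return `None`, so they can be skipped when scanning a log file.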
---
## Benefits
### 1. Model Consistency ✅
- **Before**: Different models for processing vs synthesis
- **After**: Both stages default to the same model (`gemini-2.5-pro`)
- **Impact**: Per-video summaries and the final synthesis no longer differ in quality because of a model mismatch
### 2. Specialized Synthesis ✅
- **Before**: Generic synthesis for all content types
- **After**: Tailored strategies for meetings, documentation, diagrams
- **Impact**: Better quality summaries that match user intent
### 3. Enhanced Visibility ✅
- **Before**: Limited logging, hard to debug issues
- **After**: Comprehensive logging with traceability and metrics
- **Impact**: Easy troubleshooting and performance optimization
### 4. Configurability ✅
- **Before**: Models and logging hardcoded
- **After**: Configurable via environment variables
- **Impact**: Flexible for different use cases and debugging
---
## Files Changed
| File | Lines Modified | Changes |
|------|---------------|---------|
| `backend/video_processor.py` | ~200 lines | Model config, logging, synthesis strategies |
| `backend/.env.example` | New file | Configuration documentation |
| `CLAUDE.md` | ~100 lines | Architecture docs, troubleshooting guide |
| `BATCH_PROCESSING_IMPROVEMENTS.md` | New file | This summary document |
---
## Rollback Instructions
If issues arise, rollback is simple:
### Option 1: Use Git
```bash
cd /path/to/video-query
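# Assumes these changes landed in the most recent commit; otherwise replace HEAD~1 with the last known-good commit
git checkout HEAD~1 backend/video_processor.py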
sudo systemctl restart video-query
```
### Option 2: Disable New Features
```bash
# In backend/.env
VIDEO_SYNTHESIS_MODEL=gemini-2.0-flash-exp # Revert to old model
BATCH_PROCESSING_LOG_PROMPTS=false
BATCH_PROCESSING_LOG_SUMMARIES=false
sudo systemctl restart video-query
```
---
## Next Steps
### Recommended Testing
1. **Test with meeting videos**: Verify meeting-specific synthesis
2. **Test with documentation videos**: Verify documentation synthesis
3. **Test with diagrams**: Verify diagram merging
4. **Load test**: Process batch with 5+ videos
5. **Performance test**: Compare stage 1 vs stage 2 times
### Future Enhancements (Optional)
1. Add structured JSON logging for log aggregation tools
2. Add Prometheus metrics for monitoring
3. Add batch processing status webhooks
4. Add configurable synthesis strategies per user/tenant
5. Add caching for similar prompts
---
## Support
### Enable Debug Logging
```bash
# In backend/.env
BATCH_PROCESSING_LOG_PROMPTS=true
BATCH_PROCESSING_LOG_SUMMARIES=true
# View filtered logs
journalctl -u video-query -f | grep -E "(Batch|Stage|Traceability|Metrics)"
```
### Common Issues
See `CLAUDE.md` → Troubleshooting → Batch Processing Issues
### Questions
Refer to updated documentation in `CLAUDE.md`:
- Batch Processing Architecture section
- Configuration Files section
- Troubleshooting section
---
## Implementation Summary
**Phase 1**: Enhanced Logging - COMPLETE
**Phase 2**: Model Consistency - COMPLETE
**Phase 3**: Specialized Synthesis - COMPLETE
**Phase 4**: Configuration Options - COMPLETE
**Documentation**: Updated CLAUDE.md - COMPLETE
**Total Implementation Time**: ~3 hours
**Testing Recommended**: 1-2 hours
**Production Risk**: Low (backward compatible, configurable)
---
**End of Implementation Summary**