SamoilenkoVadym ae19179752 Phase 1.4: ExifTool integration for enhanced metadata support

Added ExifTool integration to support 300+ file formats with improved
performance and unified API for metadata operations.

Changes:
- Added PyExifTool>=0.5.6 to requirements.txt
- Created comprehensive ExifTool setup guide (docs/EXIFTOOL_SETUP.md)
- Created ExifToolExtractor for reading metadata from images/video/PDF
- Created ExifToolUpdater for writing metadata to images/video/PDF
- Updated README with ExifTool installation instructions

ExifTool Benefits:
- Unified API for images, videos, PDFs (vs 5+ separate libraries)
- Support for 300+ formats (HEIC, RAW, MKV, and more)
- 10-60x faster batch operations with stay_open mode
- Better PDF metadata writing (current pypdf is read-only)
- Battle-tested tool with 20+ years of development

Architecture:
- Hybrid approach: ExifTool for images/video/PDF, Python libs for Office
- Graceful fallback if ExifTool not installed
- Automatic detection on startup with helpful messages
- Tag mapping from ExifTool tags to standard fields (title/subject/keywords)

Implementation follows existing extractor/updater patterns for consistency.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-25 15:26:01 +00:00

2.8 KiB


Oliver Metadata Tool

Universal metadata creation and management tool for all file types. Create, import, and manage metadata from multiple sources with an intuitive web interface.

Features

Excel-based metadata lookup: Reads metadata from the "Celum ID to Adobe Asset Path Mapping Spreadsheet"
Multi-format support: PDF, images (JPG, PNG, etc.), Office documents (Word, Excel, PowerPoint), video files
Unicode support: Full support for Chinese, Japanese, and Korean characters (CGA region)
OCR capabilities: Multi-language text extraction with Tesseract
Web interface: Flask-based UI for easy batch processing
Dual-sheet Excel lookup: Primary lookup from DSB sheet, fallback to Medsurg sheet

Requirements

Python 3.8+
Tesseract OCR (for image text extraction)
Poppler (for PDF processing)
ExifTool 12.15+ (recommended - enables 300+ file formats and improved performance)

Installation

Install system dependencies:

# macOS
brew install tesseract tesseract-lang poppler exiftool

# Linux (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-jpn tesseract-ocr-kor poppler-utils libimage-exiftool-perl

Note: ExifTool is optional but highly recommended. It provides:

Support for 300+ file formats
10-60x faster batch operations
Better PDF metadata writing
See docs/EXIFTOOL_SETUP.md for detailed setup instructions
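
Because ExifTool is optional, it can help to check at startup whether a usable version is on the PATH. The following is a minimal sketch, not the project's actual detection code: the function names `detect_exiftool` and `exiftool_status` are hypothetical, and the only facts taken from this README are the `exiftool` executable name and the 12.15 minimum version. It relies on the standard `exiftool -ver` option, which prints the version number.

```python
import shutil
import subprocess
from typing import Optional

# Minimum version taken from the Requirements section of this README.
MIN_EXIFTOOL_VERSION = 12.15


def detect_exiftool(executable: str = "exiftool") -> Optional[str]:
    """Return the version string reported by ExifTool, or None if it is unusable."""
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "-ver"],
            capture_output=True,
            text=True,
            timeout=10,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() or None


def exiftool_status(version: Optional[str]) -> str:
    """Classify a detected version as 'missing', 'unknown', 'outdated' or 'ok'."""
    if version is None:
        return "missing"
    try:
        # ExifTool version numbers are plain decimals such as 12.76.
        number = float(version)
    except ValueError:
        return "unknown"
    return "ok" if number >= MIN_EXIFTOOL_VERSION else "outdated"


if __name__ == "__main__":
    print("ExifTool status:", exiftool_status(detect_exiftool()))
```

A caller could use the returned status to decide whether to enable the ExifTool code path or stay on the pure-Python libraries.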

Create virtual environment and install Python packages:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Set up environment variables (create .env file):

UPLOAD_FOLDER=uploads
OUTPUT_FOLDER=output
TESSERACT_PATH=/opt/homebrew/bin/tesseract
OCR_LANGUAGES=eng+chi_sim+chi_tra+jpn+kor
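
As an illustration of how these variables might be consumed, here is a small sketch that reads them from the process environment. It is not the contents of `src/config.py`: the `Settings` class, the `load_settings` function, and all default values are assumptions made for the example. It also assumes the `.env` file has already been loaded into the environment by whatever mechanism the project uses, which this README does not specify.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in the .env example."""
    upload_folder: str
    output_folder: str
    tesseract_path: str
    ocr_languages: Tuple[str, ...]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables; the defaults are assumptions."""
    return Settings(
        upload_folder=env.get("UPLOAD_FOLDER", "uploads"),
        output_folder=env.get("OUTPUT_FOLDER", "output"),
        tesseract_path=env.get("TESSERACT_PATH", "tesseract"),
        # Tesseract joins language codes with '+', e.g. "eng+jpn".
        ocr_languages=tuple(env.get("OCR_LANGUAGES", "eng").split("+")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests.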

Usage

Web Interface

python web_app.py

Open a browser at http://localhost:5001

GUI Application

python run_gui.py

Excel Data Structure

The tool reads metadata from an Excel file with two sheets:

Sheet 1: DSB Celum ID to Path mapping (Primary)

Column B: Celum ID
Column E: Title
Column F: External Description/Alt Text

Sheet 2: Medsurg Metadata Cheat (Fallback)

Column: Solventum DAM Asset Path (contains filename)
Metadata columns for Title and Description

Lookup is performed by filename (without extension), case-insensitive.
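
The lookup rule above (primary sheet first, fallback sheet second, keyed by the case-insensitive filename without extension) can be sketched as follows. This is an illustration only, not the code in `excel_metadata_lookup.py`: the names `lookup_key` and `find_metadata` are hypothetical, and both sheets are assumed to have already been read into plain dictionaries keyed by the normalized name.

```python
from pathlib import PurePosixPath
from typing import Dict, Optional

Metadata = Dict[str, str]


def lookup_key(name_or_path: str) -> str:
    """Normalize a filename or asset path: drop directories and extension, fold case."""
    return PurePosixPath(name_or_path).stem.casefold()


def find_metadata(
    filename: str,
    primary: Dict[str, Metadata],
    fallback: Dict[str, Metadata],
) -> Optional[Metadata]:
    """Look in the primary (DSB) index first, then in the fallback (Medsurg) index."""
    key = lookup_key(filename)
    if key in primary:
        return primary[key]
    return fallback.get(key)


if __name__ == "__main__":
    # Hypothetical rows; real data comes from the Excel sheets.
    dsb = {lookup_key("12345"): {"title": "From DSB"}}
    medsurg = {lookup_key("/content/dam/Brochure_01.pdf"): {"title": "From Medsurg"}}
    print(find_metadata("12345.JPG", dsb, medsurg))
    print(find_metadata("BROCHURE_01.png", dsb, medsurg))
```

Normalizing both the index keys and the query with the same function keeps the comparison consistent regardless of extension or letter case.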

Architecture

web_app.py - Flask web application
run_gui.py - GUI launcher
src/ - Core modules
- extractors/ - Content extraction for different file types
- updaters/ - Metadata update for different file types
- excel_metadata_lookup.py - Excel-based metadata lookup
- main.py - Core processing logic
- config.py - Configuration management
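
The commit that introduced ExifTool describes a hybrid approach: ExifTool for images, video, and PDF, Python libraries for Office documents, and a graceful fallback when ExifTool is not installed. A minimal sketch of that routing decision is shown below. The function name `choose_backend`, the backend labels, and the extension sets are all assumptions for illustration and do not reflect the actual modules under `src/extractors/` or `src/updaters/`.

```python
from pathlib import PurePath

# Illustrative extension sets; the real project may support many more formats.
EXIFTOOL_EXTENSIONS = frozenset(
    {".jpg", ".jpeg", ".png", ".heic", ".mp4", ".mov", ".mkv", ".pdf"}
)
OFFICE_EXTENSIONS = frozenset({".docx", ".xlsx", ".pptx"})


def choose_backend(filename: str, exiftool_available: bool) -> str:
    """Pick a metadata backend label for a file based on its extension."""
    ext = PurePath(filename).suffix.lower()
    if ext in OFFICE_EXTENSIONS:
        # Office documents are handled by Python libraries in either case.
        return "python-libs"
    if ext in EXIFTOOL_EXTENSIONS:
        # Fall back to the Python libraries when ExifTool is missing.
        return "exiftool" if exiftool_available else "python-libs"
    return "unsupported"


if __name__ == "__main__":
    for name in ("photo.HEIC", "report.docx", "notes.txt"):
        print(name, "->", choose_backend(name, exiftool_available=True))
```

Keeping the decision in one small function makes the fallback behavior easy to test without ExifTool installed.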

License

Proprietary - Solventum
