Added ExifTool integration to support 300+ file formats with improved performance and unified API for metadata operations.

Changes:
- Added PyExifTool>=0.5.6 to requirements.txt
- Created comprehensive ExifTool setup guide (docs/EXIFTOOL_SETUP.md)
- Created ExifToolExtractor for reading metadata from images/video/PDF
- Created ExifToolUpdater for writing metadata to images/video/PDF
- Updated README with ExifTool installation instructions

ExifTool Benefits:
- Unified API for images, videos, PDFs (vs 5+ separate libraries)
- Support for 300+ formats (HEIC, RAW, MKV, and more)
- 10-60x faster batch operations with stay_open mode
- Better PDF metadata writing (current pypdf is read-only)
- Battle-tested tool with 20+ years of development

Architecture:
- Hybrid approach: ExifTool for images/video/PDF, Python libs for Office
- Graceful fallback if ExifTool not installed
- Automatic detection on startup with helpful messages
- Tag mapping from ExifTool tags to standard fields (title/subject/keywords)

Implementation follows existing extractor/updater patterns for consistency.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
# Oliver Metadata Tool

Universal metadata creation and management tool for all file types. Create, import, and manage metadata from multiple sources with an intuitive web interface.

## Features

- **Excel-based metadata lookup**: Reads metadata from "Celum ID to Adobe Asset Path Mapping Spreadsheet"
- **Multi-format support**: PDF, images (JPG, PNG, etc.), Office documents (Word, Excel, PowerPoint), video files
- **Unicode support**: Full support for Chinese, Japanese, Korean characters (CGA region)
- **OCR capabilities**: Multi-language text extraction with Tesseract
- **Web interface**: Flask-based UI for easy batch processing
- **Dual-sheet Excel lookup**: Primary lookup from DSB sheet, fallback to Medsurg sheet

## Requirements

- Python 3.8+
- Tesseract OCR (for image text extraction)
- Poppler (for PDF processing)
- **ExifTool 12.15+** (recommended; enables 300+ file formats and improved performance)
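
Because ExifTool is an optional requirement with a minimum version, a startup check can decide whether to enable it. The sketch below is illustrative only: `parse_version`, `find_exiftool`, and `exiftool_ok` are hypothetical names, not functions from this repository. It relies on the standard `exiftool -ver` command, which prints the installed version number.

```python
import shutil
import subprocess
from typing import Optional, Tuple

# Minimum version listed under Requirements.
MIN_EXIFTOOL_VERSION = (12, 15)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '12.76' into a tuple like (12, 76)."""
    return tuple(int(part) for part in text.strip().split("."))


def find_exiftool() -> Optional[str]:
    """Return the path of the exiftool binary, or None if it is not on PATH."""
    return shutil.which("exiftool")


def exiftool_ok(minimum: Tuple[int, ...] = MIN_EXIFTOOL_VERSION) -> bool:
    """Return True when exiftool is installed and at least `minimum`."""
    path = find_exiftool()
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "-ver"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
        return parse_version(result.stdout) >= minimum
    except (OSError, subprocess.SubprocessError, ValueError):
        # Treat any failure to run or parse as "not available".
        return False


if __name__ == "__main__":
    print("ExifTool available:", exiftool_ok())
```

When `exiftool_ok()` returns `False`, the application could fall back to its Python-only code paths instead of failing.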

## Installation

1. Install system dependencies:

```bash
# macOS
brew install tesseract tesseract-lang poppler exiftool

# Linux (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-jpn tesseract-ocr-kor poppler-utils libimage-exiftool-perl
```

**Note:** ExifTool is optional but highly recommended. It provides:

- Support for 300+ file formats
- 10-60x faster batch operations
- Better PDF metadata writing

See [docs/EXIFTOOL_SETUP.md](docs/EXIFTOOL_SETUP.md) for detailed setup instructions.

2. Create a virtual environment and install Python packages:

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

3. Set up environment variables (create a `.env` file):

```
UPLOAD_FOLDER=uploads
OUTPUT_FOLDER=output
TESSERACT_PATH=/opt/homebrew/bin/tesseract
OCR_LANGUAGES=eng+chi_sim+chi_tra+jpn+kor
```
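
As a rough illustration of how these variables might be consumed, the sketch below reads them from the process environment into a small settings object. The `Settings` class, the `load_settings` function, and the default values are assumptions made for this example; they are not taken from the project's `config.py`. The sketch also assumes the `.env` file has already been loaded into the environment by some other mechanism.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables shown in the .env example."""
    upload_folder: str
    output_folder: str
    tesseract_path: str
    ocr_languages: List[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, using assumed defaults."""
    languages = env.get("OCR_LANGUAGES", "eng")
    return Settings(
        upload_folder=env.get("UPLOAD_FOLDER", "uploads"),
        output_folder=env.get("OUTPUT_FOLDER", "output"),
        tesseract_path=env.get("TESSERACT_PATH", "tesseract"),
        # Tesseract joins language codes with '+', e.g. 'eng+jpn'.
        ocr_languages=[code for code in languages.split("+") if code],
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary.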

## Usage

### Web Interface

```bash
python web_app.py
```

Then open `http://localhost:5001` in a browser.

### GUI Application

```bash
python run_gui.py
```

## Excel Data Structure

The tool reads metadata from an Excel file with two sheets:

### Sheet 1: DSB Celum ID to Path mapping (Primary)

- Column B: Celum ID
- Column E: Title
- Column F: External Description/Alt Text

### Sheet 2: Medsurg Metadata Cheat (Fallback)

- Solventum DAM Asset Path column (contains the filename)
- Metadata columns for Title and Description

Lookup is performed by filename (without extension) and is case-insensitive.
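
The lookup rule above (strip the extension, ignore case, try the primary sheet before the fallback) can be sketched as follows. This is a minimal illustration, not the code in `excel_metadata_lookup.py`: `lookup_key` and `find_metadata` are hypothetical names, and each sheet is assumed to have been loaded already into a dictionary keyed by the normalized filename.

```python
from pathlib import PurePath
from typing import Dict, Optional

Record = Dict[str, str]


def lookup_key(filename: str) -> str:
    """Normalize a filename or path: drop directories and extension, lowercase."""
    return PurePath(filename).stem.lower()


def find_metadata(
    filename: str,
    primary: Dict[str, Record],
    fallback: Dict[str, Record],
) -> Optional[Record]:
    """Return the primary-sheet record if present, else the fallback record."""
    key = lookup_key(filename)
    if key in primary:
        return primary[key]
    return fallback.get(key)


if __name__ == "__main__":
    # Made-up sample rows standing in for the two sheets.
    dsb = {"asset_001": {"title": "Primary title", "description": "Alt text"}}
    medsurg = {
        lookup_key("/dam/assets/Brochure_01.pdf"): {
            "title": "Fallback title",
            "description": "Fallback description",
        }
    }
    print(find_metadata("Asset_001.JPG", dsb, medsurg))
    print(find_metadata("brochure_01.PDF", dsb, medsurg))
```

Normalizing both the sheet keys and the incoming filename with the same function keeps the comparison consistent across the two sheets.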

## Architecture

- `web_app.py` - Flask web application
- `run_gui.py` - GUI launcher
- `src/` - Core modules
- `extractors/` - Content extraction for different file types
- `updaters/` - Metadata update for different file types
- `excel_metadata_lookup.py` - Excel-based metadata lookup
- `main.py` - Core processing logic
- `config.py` - Configuration management
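
Since `extractors/` and `updaters/` each cover several file types, and ExifTool is optional, some step has to choose a backend per file. The sketch below shows one possible way to do that by file extension. The extension sets, the backend labels, and the `choose_backend` function are all assumptions made for illustration and do not describe the actual modules in `src/`.

```python
from pathlib import PurePath

# Illustrative extension sets; the real tool may support many more formats.
EXIFTOOL_EXTENSIONS = {".jpg", ".jpeg", ".png", ".heic", ".mp4", ".mkv", ".pdf"}
OFFICE_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def choose_backend(filename: str, exiftool_available: bool) -> str:
    """Pick a (hypothetical) backend label for a file based on its extension."""
    ext = PurePath(filename).suffix.lower()
    if ext in OFFICE_EXTENSIONS:
        # Office documents are handled by Python libraries in this sketch.
        return "office"
    if ext in EXIFTOOL_EXTENSIONS:
        # Fall back to Python-only handling when ExifTool is missing.
        return "exiftool" if exiftool_available else "python-fallback"
    return "unsupported"


if __name__ == "__main__":
    for name in ("photo.HEIC", "report.docx", "notes.txt"):
        print(name, "->", choose_backend(name, exiftool_available=True))
```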

## License

Proprietary - Solventum