# Oliver Metadata Tool

A tool for creating, importing, and managing file metadata from multiple sources across a wide range of file types, through a web interface.

## Features

- **Excel-based metadata lookup**: Reads metadata from the "Celum ID to Adobe Asset Path Mapping Spreadsheet"
- **Multi-format support**: PDF, images (JPG, PNG, etc.), Office documents (Word, Excel, PowerPoint), video files
- **Unicode support**: Full support for Chinese, Japanese, Korean characters (CGA region)
- **OCR capabilities**: Multi-language text extraction with Tesseract
- **Web interface**: Flask-based UI for easy batch processing
- **Dual-sheet Excel lookup**: Primary lookup from DSB sheet, fallback to Medsurg sheet

## Requirements

- Python 3.8+
- Tesseract OCR (for image text extraction)
- Poppler (for PDF processing)
- **ExifTool 12.15+** (recommended - enables 300+ file formats and improved performance)

## Installation

1. Install system dependencies:
```bash
# macOS
brew install tesseract tesseract-lang poppler exiftool

# Linux (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-jpn tesseract-ocr-kor poppler-utils libimage-exiftool-perl
```

**Note:** ExifTool is optional but highly recommended. It provides:
- Support for 300+ file formats
- 10-60x faster batch operations
- Better PDF metadata writing

See [docs/EXIFTOOL_SETUP.md](docs/EXIFTOOL_SETUP.md) for detailed setup instructions.
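To confirm that the system dependencies are visible on `PATH` before continuing, a small check like the following may help. It is a minimal sketch, not part of the tool: the helper names `check_tools` and `missing_required` are hypothetical, and it assumes the executables are called `tesseract`, `pdftoppm` (shipped with Poppler), and `exiftool`.

```python
import shutil

# Executable names are assumptions: `pdftoppm` stands in for Poppler.
# The boolean marks whether this README treats the tool as required.
TOOLS = {
    "tesseract": True,
    "pdftoppm": True,
    "exiftool": False,  # optional but recommended
}


def check_tools(tools=TOOLS, which=shutil.which):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: which(name) for name in tools}


def missing_required(found, tools=TOOLS):
    """Return the sorted names of required tools that were not found."""
    return sorted(
        name for name, required in tools.items()
        if required and found.get(name) is None
    )


if __name__ == "__main__":
    found = check_tools()
    for name, path in found.items():
        print(f"{name}: {path or 'not found'}")
    missing = missing_required(found)
    if missing:
        print("Missing required tools: " + ", ".join(missing))
```

A missing `exiftool` is reported but not treated as an error, matching its optional status above.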

2. Create virtual environment and install Python packages:
```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

3. Set up environment variables by creating a `.env` file:
```
UPLOAD_FOLDER=uploads
OUTPUT_FOLDER=output
TESSERACT_PATH=/opt/homebrew/bin/tesseract
OCR_LANGUAGES=eng+chi_sim+chi_tra+jpn+kor
```
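As an illustration of how these variables might be consumed, the sketch below reads them into a small settings object. It assumes the values are already present in the process environment (this README does not say how the `.env` file is loaded), and the names `Settings` and `load_settings`, as well as the default values, are hypothetical; the actual `config.py` may differ.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    upload_folder: str
    output_folder: str
    tesseract_path: str
    ocr_languages: List[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables; defaults are assumptions."""
    langs = env.get("OCR_LANGUAGES", "eng")
    return Settings(
        upload_folder=env.get("UPLOAD_FOLDER", "uploads"),
        output_folder=env.get("OUTPUT_FOLDER", "output"),
        tesseract_path=env.get("TESSERACT_PATH", "tesseract"),
        # Tesseract joins language codes with "+", e.g. "eng+jpn"
        ocr_languages=[code for code in langs.split("+") if code],
    )
```

Splitting `OCR_LANGUAGES` on `+` makes it easy to check which language packs are expected, while the original `+`-joined string is the form Tesseract itself accepts.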

## Usage

### Web Interface

```bash
python web_app.py
```

Then open `http://localhost:5001` in a browser.

### GUI Application

```bash
python run_gui.py
```

## Excel Data Structure

The tool reads metadata from an Excel file with two sheets:

### Sheet 1: DSB Celum ID to Path mapping (Primary)
- Column B: Celum ID
- Column E: Title
- Column F: External Description/Alt Text

### Sheet 2: Medsurg Metadata Cheat (Fallback)
- Column: Solventum DAM Asset Path (contains filename)
- Metadata columns for Title and Description

Lookup is performed by filename (without extension), case-insensitive.
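The primary/fallback lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `excel_metadata_lookup.py`: it assumes each sheet has already been read into `(name, title, description)` rows, that the name column holds either a bare identifier or a full asset path, and that the first matching row wins. All function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, Optional, Tuple

Metadata = Tuple[str, str]  # (title, description)


def lookup_key(name_or_path: str) -> str:
    """Reduce a filename or asset path to its lowercased stem."""
    normalized = name_or_path.replace("\\", "/")
    return PurePosixPath(normalized).stem.lower()


def build_index(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, Metadata]:
    """Index rows by lookup key; the first row for a key is kept (assumption)."""
    index: Dict[str, Metadata] = {}
    for name, title, description in rows:
        if name:
            index.setdefault(lookup_key(name), (title, description))
    return index


def find_metadata(
    filename: str,
    primary: Dict[str, Metadata],
    fallback: Dict[str, Metadata],
) -> Optional[Metadata]:
    """Try the primary (DSB) index first, then the fallback (Medsurg) index."""
    key = lookup_key(filename)
    if key in primary:
        return primary[key]
    return fallback.get(key)
```

Normalizing both the spreadsheet values and the incoming filename through the same `lookup_key` function keeps the comparison case-insensitive and extension-agnostic on both sides.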

## Architecture

- `web_app.py` - Flask web application
- `run_gui.py` - GUI launcher
- `src/` - Core modules
  - `extractors/` - Content extraction for different file types
  - `updaters/` - Metadata update for different file types
  - `excel_metadata_lookup.py` - Excel-based metadata lookup
  - `main.py` - Core processing logic
  - `config.py` - Configuration management

## License

Proprietary - Solventum