Created comprehensive FieldMapper module (400+ lines):
- Fuzzy field matching with SequenceMatcher (60% similarity threshold)
- 10+ aliases per standard field (title, subject, keywords, description)
- Auto-mapping with confidence scores (0.0 to 1.0)
- Mapping suggestions with alternatives (top 2 per field)
- Exact match detection (score 1.0) and substring bonuses (0.85)
- Preset save/load/delete for reusable mappings
- Mapping validation (duplicate targets, coverage stats)
- Unmapped field detection and coverage percentage

FieldMapper features:
- auto_map(): Generate mapping from source fields
- suggest_mapping(): Get best match + alternatives for each field
- validate_mapping(): Check for conflicts and warnings
- apply_mapping(): Transform data using field mapping
- get_mapping_coverage(): Calculate mapping completeness
- Preset management: save, load, list, delete

MetadataImporter enhancements:
- preview_file_structure(): Preview columns and suggest mappings
- import_with_mapping(): Import with custom field mapping
- Integration with FieldMapper for smart detection
- Sample row preview (5 rows) before import

Web API additions:
- /preview-import endpoint: Preview file structure and field suggestions
- Returns: columns, sample rows, mapping suggestions with confidence
- Supports CSV, Excel, JSON format detection

Field mapping workflow:
1. User uploads import file for preview
2. System analyzes columns and suggests mappings
3. User reviews/adjusts mappings (confidence scores shown)
4. User confirms and imports with mapping
5. Optional: Save mapping as preset for reuse

Technical highlights:
- SequenceMatcher from difflib for fuzzy string matching
- Normalize field names (lowercase, underscores)
- Multiple alias sets per target field
- Confidence-based ranking of matches
- Preset persistence via JSON file

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- docs
- src
- templates
- .gitignore
- README.md
- requirements.txt
- run_gui.py
- web_app.py
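The fuzzy matching described in the FieldMapper commit notes (SequenceMatcher similarity, a 60% threshold, exact matches scored 1.0, substring matches scored 0.85) could look roughly like the sketch below. It is not the module's actual code: the alias lists, the normalize and score helpers, and the return shape of auto_map are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical alias table; the real module reportedly has 10+ aliases per field.
ALIASES = {
    "title": ["title", "name", "asset_title", "headline"],
    "subject": ["subject", "topic", "category"],
    "keywords": ["keywords", "tags", "labels"],
    "description": ["description", "desc", "alt_text", "summary"],
}

THRESHOLD = 0.6         # 60% similarity threshold from the commit notes
SUBSTRING_SCORE = 0.85  # substring bonus from the commit notes


def normalize(name):
    """Lowercase and turn spaces/hyphens into underscores."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def score(source, alias):
    """Return a confidence in [0.0, 1.0] that source names the same field as alias."""
    s, a = normalize(source), normalize(alias)
    if s == a:
        return 1.0  # exact match after normalization
    ratio = SequenceMatcher(None, s, a).ratio()
    if a in s or s in a:
        return max(ratio, SUBSTRING_SCORE)  # one name contains the other
    return ratio


def auto_map(source_fields):
    """Map each source field to its best target when the score meets THRESHOLD."""
    mapping = {}
    for field in source_fields:
        best_target, best_score = None, 0.0
        for target, aliases in ALIASES.items():
            candidate = max(score(field, alias) for alias in aliases)
            if candidate > best_score:
                best_target, best_score = target, candidate
        if best_score >= THRESHOLD:
            mapping[field] = (best_target, round(best_score, 2))
    return mapping


mapping = auto_map(["Title", "Asset Description", "xyz123"])
```

In this sketch, fields that score below the threshold are simply left out of the result, which is one way the "unmapped field detection" mentioned in the notes could be derived.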
Oliver Metadata Tool
Universal metadata creation and management tool for all file types. Create, import, and manage metadata from multiple sources with an intuitive web interface.
Features
- Excel-based metadata lookup: Reads metadata from "Celum ID to Adobe Asset Path Mapping Spreadsheet"
- Multi-format support: PDF, images (JPG, PNG, etc.), Office documents (Word, Excel, PowerPoint), video files
- Unicode support: Full support for Chinese, Japanese, Korean characters (CGA region)
- OCR capabilities: Multi-language text extraction with Tesseract
- Web interface: Flask-based UI for easy batch processing
- Dual-sheet Excel lookup: Primary lookup from DSB sheet, fallback to Medsurg sheet
Requirements
- Python 3.8+
- Tesseract OCR (for image text extraction)
- Poppler (for PDF processing)
- ExifTool 12.15+ (recommended - enables 300+ file formats and improved performance)
Installation
- Install system dependencies:
# macOS
brew install tesseract tesseract-lang poppler exiftool
# Linux (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-jpn tesseract-ocr-kor poppler-utils libimage-exiftool-perl
Note: ExifTool is optional but highly recommended. It provides:
- Support for 300+ file formats
- 10-60x faster batch operations
- Better PDF metadata writing
- See docs/EXIFTOOL_SETUP.md for detailed setup instructions
- Create virtual environment and install Python packages:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
- Set up environment variables (create a .env file):
UPLOAD_FOLDER=uploads
OUTPUT_FOLDER=output
TESSERACT_PATH=/opt/homebrew/bin/tesseract
OCR_LANGUAGES=eng+chi_sim+chi_tra+jpn+kor
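As a rough illustration of how these settings might be consumed, the sketch below reads them from the process environment with fallback defaults. The load_settings function and its default values are assumptions, not the tool's actual config.py. Note that Python does not read a .env file on its own; a loader such as python-dotenv (an assumption here, check requirements.txt) would need to populate the environment first.

```python
import os


def load_settings(env=None):
    """Read tool settings from environment variables, with fallback defaults."""
    env = os.environ if env is None else env
    return {
        "upload_folder": env.get("UPLOAD_FOLDER", "uploads"),
        "output_folder": env.get("OUTPUT_FOLDER", "output"),
        # None here means "no explicit path configured" (assumed behavior).
        "tesseract_path": env.get("TESSERACT_PATH"),
        # Tesseract joins language codes with "+", e.g. eng+jpn.
        "ocr_languages": env.get("OCR_LANGUAGES", "eng").split("+"),
    }


settings = load_settings({"OCR_LANGUAGES": "eng+chi_sim+chi_tra+jpn+kor"})
```

Passing a plain dictionary instead of os.environ, as in the last line, keeps the function easy to test without touching the real environment.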
Usage
Web Interface
python web_app.py
Open a browser at http://localhost:5001
GUI Application
python run_gui.py
Excel Data Structure
The tool reads metadata from an Excel file with two sheets:
Sheet 1: DSB Celum ID to Path mapping (Primary)
- Column B: Celum ID
- Column E: Title
- Column F: External Description/Alt Text
Sheet 2: Medsurg Metadata Cheat (Fallback)
- Column: Solventum DAM Asset Path (contains filename)
- Metadata columns for Title and Description
Lookup is performed by filename (without extension), case-insensitive.
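The lookup rule above (filename without extension, case-insensitive, primary sheet first and fallback sheet second) can be sketched as follows. The function names and the dictionary-based sheet representation are hypothetical; the actual excel_metadata_lookup.py may load and index the sheets differently.

```python
from pathlib import Path


def lookup_key(filename):
    """Filename without its extension, lowercased for case-insensitive lookup."""
    return Path(filename).stem.lower()


def find_metadata(filename, primary, fallback):
    """Return the primary-sheet record if present, otherwise the fallback record."""
    key = lookup_key(filename)
    if key in primary:
        return primary[key]
    return fallback.get(key)  # None when neither sheet has the file


# Hypothetical sample rows, keyed the same way as lookup_key().
dsb = {"brochure_01": {"title": "Brochure", "description": "Product brochure"}}
medsurg = {"poster_02": {"title": "Poster", "description": "Conference poster"}}

hit_primary = find_metadata("Brochure_01.PDF", dsb, medsurg)
hit_fallback = find_metadata("POSTER_02.jpg", dsb, medsurg)
miss = find_metadata("missing.png", dsb, medsurg)
```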
Architecture
- web_app.py - Flask web application
- run_gui.py - GUI launcher
- src/ - Core modules
  - extractors/ - Content extraction for different file types
  - updaters/ - Metadata update for different file types
  - excel_metadata_lookup.py - Excel-based metadata lookup
  - main.py - Core processing logic
  - config.py - Configuration management
License
Proprietary - Solventum