diff --git a/.gitignore b/.gitignore index 02c1146..34688ff 100644 --- a/.gitignore +++ b/.gitignore @@ -51,6 +51,7 @@ Thumbs.db # Python virtual environments venv/ venv_new/ +venv_local/ env/ ENV/ .venv/ @@ -76,3 +77,19 @@ Files/ .vscode/ .claude/ +# Database files +*.db +*.sqlite +*.sqlite3 + +# Server files +server.pid +server.log +nohup.out + +# Test files +test_*.csv +test_*.xlsx +test_*.json +TEST_REPORT.md + diff --git a/README.md b/README.md index 679d9ff..a7c983e 100644 --- a/README.md +++ b/README.md @@ -1,97 +1,486 @@ -# Oliver Metadata Tool +# Oliver Metadata Tool v3.1 Enterprise Edition -Universal metadata creation and management tool for all file types. Create, import, and manage metadata from multiple sources with an intuitive web interface. +Universal metadata creation and management tool for all file types. Create, import, and manage metadata from multiple sources with an intuitive web interface, user authentication, and AI-powered metadata generation. + +**Developer:** Vadym Samoilenko +**License:** Corporate License - Oliver Marketing +**Version:** 3.1 (Enterprise Edition) + +--- ## Features -- **Excel-based metadata lookup**: Reads metadata from "Celum ID to Adobe Asset Path Mapping Spreadsheet" -- **Multi-format support**: PDF, images (JPG, PNG, etc.), Office documents (Word, Excel, PowerPoint), video files -- **Unicode support**: Full support for Chinese, Japanese, Korean characters (CGA region) -- **OCR capabilities**: Multi-language text extraction with Tesseract -- **Web interface**: Flask-based UI for easy batch processing -- **Dual-sheet Excel lookup**: Primary lookup from DSB sheet, fallback to Medsurg sheet +### Multiple Metadata Sources +- **📊 Excel Lookup**: Configure custom Excel files with column mapping +- **🤖 AI Generation**: OpenAI-powered intelligent metadata generation +- **✏️ Manual Entry**: Direct editing with real-time validation +- **📂 File Import**: Import from CSV, Excel, or JSON with custom mapping +- **📋
Templates**: Reusable metadata templates with variables + +### Enterprise Features +- **🔐 Authentication**: Local user authentication + Microsoft SSO support +- **👥 User Management**: SQLite database for users and sessions +- **📊 Audit Logging**: Track all user actions and metadata changes +- **🔍 AI Usage Tracking**: Monitor OpenAI token usage and costs + +### File Support +- **300+ File Formats** via ExifTool integration +- **PDF Files**: Full metadata support (title, subject, keywords, author, copyright) +- **Images**: JPEG, PNG, GIF, HEIC, TIFF, RAW formats +- **Office Documents**: Word, Excel, PowerPoint +- **Video Files**: MP4, MOV, AVI, MKV +- **Unicode Support**: Full support for Chinese, Japanese, Korean characters + +### Advanced Capabilities +- **Smart Field Mapping**: Auto-detect columns with fuzzy matching +- **Batch Processing**: Process multiple files with selective updates +- **Custom Metadata Fields**: Add unlimited custom fields +- **CSV Export**: Export metadata and processing results +- **Template Variables**: {filename}, {date}, {user}, custom variables + +--- ## Requirements -- Python 3.8+ -- Tesseract OCR (for image text extraction) -- Poppler (for PDF processing) -- **ExifTool 12.15+** (recommended - enables 300+ file formats and improved performance) +### System Dependencies +- **Python 3.8+** +- **ExifTool 12.15+** (required for 300+ format support) +- **Tesseract OCR** (optional - for image text extraction) +- **Poppler** (optional - for PDF content extraction) + +### Python Dependencies +All listed in `requirements.txt`: +- Flask 2.3.0+ (Web framework) +- pandas, openpyxl (Excel/CSV processing) +- PyExifTool 0.5.6+ (Metadata operations) +- openai 1.0.0+ (AI generation) +- tiktoken 0.5.0+ (Token counting) +- tenacity 8.2.0+ (Retry logic) +- msal (Microsoft SSO - optional) + +--- ## Installation -1. Install system dependencies: -```bash -# macOS -brew install tesseract tesseract-lang poppler exiftool +### 1.
Install System Dependencies -# Linux (Ubuntu/Debian) -sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-jpn tesseract-ocr-kor poppler-utils libimage-exiftool-perl +**macOS:** +```bash +brew install exiftool tesseract tesseract-lang poppler ``` -**Note:** ExifTool is optional but highly recommended. It provides: -- Support for 300+ file formats -- 10-60x faster batch operations -- Better PDF metadata writing -- See [docs/EXIFTOOL_SETUP.md](docs/EXIFTOOL_SETUP.md) for detailed setup instructions - -2. Create virtual environment and install Python packages: +**Linux (Ubuntu/Debian):** +```bash +sudo apt-get install libimage-exiftool-perl tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-jpn tesseract-ocr-kor poppler-utils +``` + +**Windows:** +```bash +# Install ExifTool from: https://exiftool.org/ +choco install exiftool tesseract +``` + +**Verify ExifTool Installation:** +```bash +exiftool -ver +# Should show version 12.15 or higher +``` + +See [docs/EXIFTOOL_SETUP.md](docs/EXIFTOOL_SETUP.md) for detailed setup instructions. + +### 2. Create Virtual Environment + +```bash +python3 -m venv venv_local +source venv_local/bin/activate # On Windows: venv_local\Scripts\activate +``` + +### 3. Install Python Dependencies + ```bash -python3 -m venv venv -source venv/bin/activate # On Windows: venv\Scripts\activate pip install -r requirements.txt ``` -3. Set up environment variables (create `.env` file): +### 4. 
Configure Environment Variables + +Create a `.env` file in the project root: + +```env +# Required: OpenAI API Key (for AI metadata generation) +OPENAI_API_KEY=your-openai-api-key-here + +# Optional: Microsoft SSO (for enterprise authentication) +# AZURE_CLIENT_ID=your-azure-client-id +# AZURE_CLIENT_SECRET=your-azure-client-secret +# AZURE_TENANT_ID=your-azure-tenant-id +# REDIRECT_URI=http://localhost:5001/auth/callback + +# Optional: Flask secret key (auto-generated if not set) +# SECRET_KEY=your-secret-key-here + +# Optional: AI settings (defaults shown) +# AI_MODEL=gpt-4o-mini +# MAX_TOKENS=500 +# TEMPERATURE=0.5 +# API_TIMEOUT=30 +# API_MAX_RETRIES=3 ``` -UPLOAD_FOLDER=uploads -OUTPUT_FOLDER=output -TESSERACT_PATH=/opt/homebrew/bin/tesseract -OCR_LANGUAGES=eng+chi_sim+chi_tra+jpn+kor + +### 5. Initialize Database + +The database will be created automatically on first run. To manually initialize: + +```bash +python -c "from src.database import Database; db = Database(); print('Database initialized')" ``` +--- + ## Usage -### Web Interface +### Starting the Web Application ```bash python web_app.py ``` -Open browser at `http://localhost:5001` +The application will: +1. ✅ Check for ExifTool availability +2. ✅ Initialize SQLite database (users, sessions, audit_log) +3. ✅ Start Flask server on http://localhost:5001 +4. 🌐 Open browser automatically -### GUI Application +### Login -```bash -python run_gui.py +**Test Account:** +- Username: `tester` +- Password: `oliveradmin` + +**Microsoft SSO** (if configured): +- Click "Sign in with Microsoft" button +- Authenticate via Azure AD +- Users auto-created on first login + +### Using Metadata Sources + +#### 1. Excel Lookup +1. Click "Upload Excel File" +2. Configure mapping modal: + - Select sheet name + - Map columns: Filename (required), Title, Description, Keywords + - Preview first 3 rows +3. Confirm mapping +4. Upload files to process + +#### 2. AI Generation +1. 
Select "AI Generation" from metadata source dropdown +2. Upload files +3. AI generates metadata (10-30 seconds per file) +4. Review and edit generated metadata +5. Save changes + +#### 3. Manual Entry +1. Select "Manual Entry" +2. Upload files +3. Fill in metadata fields manually +4. Save changes + +#### 4. Import from File +1. Click "Import from File" +2. Upload CSV/Excel/JSON file +3. Configure column mapping (same as Excel) +4. Upload files to match metadata + +#### 5. Templates +1. Create template with variables +2. Select template from dropdown +3. Apply to selected files +4. Review and save + +### Batch Operations + +1. Upload multiple files +2. Use checkboxes to select files +3. "Select All" / "Deselect All" buttons +4. Edit metadata individually +5. Click "Update Selected Files" to save all at once +6. Export results to CSV + +--- + +## Configuration + +### Database Schema + +**Users Table:** +- id, username, password_hash, email, full_name +- auth_method (local/sso) +- created_at, last_login, is_active + +**Sessions Table:** +- session_id, user_id, created_at, expires_at +- ip_address, user_agent + +**Audit Log Table:** +- id, user_id, action, details, timestamp + +### AI Usage Tracking + +Every AI metadata generation is logged with: +- User ID +- Timestamp +- Tokens used (prompt + completion) +- Cost estimate (based on gpt-4o-mini pricing) + +View logs in database: +```sql +SELECT * FROM audit_log WHERE action = 'ai_generation' ORDER BY timestamp DESC; ``` -## Excel Data Structure +### User Management -The tool reads metadata from Excel file with two sheets: +**Create New User:** +```python +from src.database import Database +db = Database() +db.create_user( + username='newuser', + password='password123', + email='user@example.com', + full_name='New User', + auth_method='local' +) +``` -### Sheet 1: DSB Celum ID to Path mapping (Primary) -- Column B: Celum ID -- Column E: Title -- Column F: External Description/Alt Text +**List All Users:** +```python 
+users = db.get_all_users() +for user in users: + print(f"{user['username']} - Last login: {user['last_login']}") +``` -### Sheet 2: Medsurg Metadata Cheat (Fallback) -- Column: Solventum DAM Asset Path (contains filename) -- Metadata columns for Title and Description - -Lookup is performed by filename (without extension), case-insensitive. +--- ## Architecture -- `web_app.py` - Flask web application -- `run_gui.py` - GUI launcher -- `src/` - Core modules - - `extractors/` - Content extraction for different file types - - `updaters/` - Metadata update for different file types - - `excel_metadata_lookup.py` - Excel-based metadata lookup - - `main.py` - Core processing logic - - `config.py` - Configuration management +### File Structure -## License +``` +oliver-metadata-tool/ +├── web_app.py # Flask web application (main entry point) +├── requirements.txt # Python dependencies +├── .env # Environment configuration +├── oliver_metadata.db # SQLite database (auto-created) +├── src/ +│ ├── config.py # Configuration management +│ ├── database.py # Database operations +│ ├── auth.py # Authentication logic +│ ├── metadata_analyzer.py # AI metadata generation +│ ├── metadata_importer.py # Import from files +│ ├── template_manager.py # Template system +│ ├── field_mapper.py # Column mapping +│ ├── excel_metadata_lookup.py # Excel lookup +│ ├── extractors/ +│ │ ├── pdf_extractor.py +│ │ ├── image_extractor.py +│ │ ├── office_extractor.py +│ │ ├── video_extractor.py +│ │ └── exiftool_extractor.py +│ └── updaters/ +│ ├── pdf_updater.py +│ ├── image_updater.py +│ ├── office_updater.py +│ ├── video_updater.py +│ └── exiftool_updater.py +├── templates/ +│ ├── index.html # Main UI +│ └── login.html # Login page +└── docs/ + └── EXIFTOOL_SETUP.md # ExifTool setup guide +``` -Proprietary - Solventum +### Technology Stack + +- **Backend:** Flask (Python) +- **Database:** SQLite +- **Frontend:** HTML5, CSS3, JavaScript (Vanilla) +- **Design:** Montserrat font, Dark & Gold theme +- 
**Authentication:** Flask-Session, werkzeug.security, MSAL +- **AI:** OpenAI API (gpt-4o-mini) +- **Metadata:** PyExifTool, pypdf, python-docx, openpyxl + +--- + +## API Endpoints + +### Authentication +- `GET /login` - Login page +- `POST /login` - Authenticate user +- `GET /logout` - Destroy session +- `GET /login/microsoft` - Microsoft SSO redirect +- `GET /auth/callback` - SSO callback + +### File Operations +- `POST /upload` - Upload files and generate metadata +- `POST /update-manual` - Update file metadata manually +- `GET /download/` - Download processed file + +### Metadata Sources +- `POST /upload-excel` - Upload Excel file for mapping +- `POST /preview-excel-sheet` - Preview Excel sheet structure +- `POST /configure-excel-mapping` - Configure Excel column mapping +- `POST /import-metadata` - Upload import file for mapping +- `POST /configure-import-mapping` - Configure import column mapping + +### Templates +- `GET /templates/list` - List all templates +- `POST /templates/save` - Save new template +- `POST /templates/load` - Load template by name +- `DELETE /templates/delete` - Delete template +- `POST /templates/apply` - Apply template to files +- `POST /templates/preview` - Preview template output + +--- + +## Security & Privacy + +### Authentication +- Passwords hashed with werkzeug.security (pbkdf2:sha256) +- Session tokens: 32-byte cryptographically secure random strings +- Sessions expire after 24 hours +- Microsoft SSO via OAuth2 + Azure AD + +### Data Protection +- All credentials stored in `.env` (excluded from git) +- Database file excluded from git +- API keys never logged or exposed to frontend +- Audit trail for all user actions + +### Production Recommendations +1. **HTTPS:** Use SSL/TLS certificates in production +2. **Database:** Migrate to PostgreSQL for better concurrency +3. **Rate Limiting:** Add rate limits to prevent abuse +4. **CSRF Protection:** Enable Flask-WTF for form security +5. 
**Error Tracking:** Integrate Sentry or similar service +6. **Backups:** Regular database backups +7. **Monitoring:** Track AI token usage for cost management + +--- + +## Troubleshooting + +### Common Issues + +**ExifTool not found:** +```bash +# Verify installation +exiftool -ver + +# macOS: Reinstall with Homebrew +brew reinstall exiftool + +# Linux: Reinstall with apt +sudo apt-get install --reinstall libimage-exiftool-perl +``` + +**Database locked error:** +```bash +# Stop all instances +lsof -ti:5001 | xargs kill -9 + +# Restart application +python web_app.py +``` + +**OpenAI API errors:** +- Check API key in `.env` file +- Verify API key is valid at https://platform.openai.com/api-keys +- Check token usage limits on OpenAI dashboard + +**Import failed - column not found:** +- Use the mapping modal to manually select columns +- Check that your file has headers in the first row +- Verify file encoding is UTF-8 + +--- + +## Development + +### Running Tests + +```bash +# Unit tests (if implemented) +pytest tests/ + +# Manual integration test +python -c "from src.database import Database; from src.config import Config; print('✅ All imports successful')" +``` + +### Git Workflow + +```bash +# Check status +git status + +# Add changes +git add . + +# Commit with message +git commit -m "Your commit message" + +# Push to remote +git push origin main +``` + +--- + +## License & Credits + +**License:** Corporate License - Oliver Marketing +All rights reserved. Unauthorized copying, distribution, or modification is prohibited. 
+ +**Developer:** Vadym Samoilenko +**Company:** Oliver Marketing +**Version:** 3.1 Enterprise Edition +**Release Date:** January 2026 + +**Third-Party Software:** +- ExifTool by Phil Harvey (Perl Artistic License) +- Flask by Pallets (BSD License) +- OpenAI API (Commercial License) +- PyExifTool (LGPL License) + +--- + +## Support + +For issues, questions, or feature requests: +- **Internal Support:** Contact IT department +- **Developer:** Vadym Samoilenko +- **Documentation:** See `docs/` folder + +--- + +## Changelog + +### v3.1 (January 2026) - Enterprise Edition +- ✅ User authentication (local + Microsoft SSO) +- ✅ SQLite database with audit logging +- ✅ Smart column mapping for Excel/CSV import +- ✅ Custom metadata fields support +- ✅ AI usage tracking and cost monitoring +- ✅ Dark & Gold UI redesign +- ✅ Template variables and preview +- ✅ Batch selection and CSV export + +### v3.0 (January 2026) +- ✅ ExifTool integration (300+ formats) +- ✅ Multiple metadata sources (Excel, AI, Manual, Import) +- ✅ Field mapping with fuzzy matching +- ✅ Metadata templates system +- ✅ Rebranded to Oliver Metadata Tool + +### v2.x (Prior) +- Basic Excel lookup functionality +- Multi-format file support +- Web and GUI interfaces diff --git a/templates/index.html b/templates/index.html index 22f2cd8..ef31f23 100644 --- a/templates/index.html +++ b/templates/index.html @@ -4,51 +4,174 @@ Oliver Metadata Tool +