v3.1 Enterprise Edition: Excel/Import mapping, UI fixes, documentation update

Features:
- Smart column mapping for Excel and Import files (CSV/Excel/JSON)
- Modal dialogs for configuring sheet and column mappings
- Auto-detection of common column names (filename, title, description, keywords)
- Preview of first 3 rows before confirming mapping
- Case-insensitive filename matching without extension

UI Improvements:
- Fixed output folder selection (now uses text input instead of folder browser)
- Removed non-functional Reset button from metadata editor
- Clear button for output folder path

Documentation:
- Updated README.md with v3.1 Enterprise Edition information
- Developer: Vadym Samoilenko
- License: Corporate License - Oliver Marketing
- Added AI usage tracking and logging documentation
- Complete installation guide with all dependencies
- API endpoint documentation
- Security and privacy section
- Troubleshooting guide

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
SamoilenkoVadym 2026-01-25 17:06:18 +00:00
parent e9784d7da8
commit 804c8acbbb
5 changed files with 1849 additions and 221 deletions

17
.gitignore vendored

@@ -51,6 +51,7 @@ Thumbs.db
# Python virtual environments
venv/
venv_new/
venv_local/
env/
ENV/
.venv/
@@ -76,3 +77,19 @@ Files/
.vscode/
.claude/
# Database files
*.db
*.sqlite
*.sqlite3
# Server files
server.pid
server.log
nohup.out
# Test files
test_*.csv
test_*.xlsx
test_*.json
TEST_REPORT.md

505
README.md

@@ -1,97 +1,486 @@
# Oliver Metadata Tool
# Oliver Metadata Tool v3.1 Enterprise Edition
Universal metadata creation and management tool for all file types. Create, import, and manage metadata from multiple sources with an intuitive web interface.
Universal metadata creation and management tool for all file types. Create, import, and manage metadata from multiple sources with an intuitive web interface, user authentication, and AI-powered metadata generation.
**Developer:** Vadym Samoilenko
**License:** Corporate License - Oliver Marketing
**Version:** 3.1 (Enterprise Edition)
---
## Features
- **Excel-based metadata lookup**: Reads metadata from "Celum ID to Adobe Asset Path Mapping Spreadsheet"
- **Multi-format support**: PDF, images (JPG, PNG, etc.), Office documents (Word, Excel, PowerPoint), video files
- **Unicode support**: Full support for Chinese, Japanese, Korean characters (CGA region)
- **OCR capabilities**: Multi-language text extraction with Tesseract
- **Web interface**: Flask-based UI for easy batch processing
- **Dual-sheet Excel lookup**: Primary lookup from DSB sheet, fallback to Medsurg sheet
### Multiple Metadata Sources
- **📊 Excel Lookup**: Configure custom Excel files with column mapping
- **🤖 AI Generation**: OpenAI-powered intelligent metadata generation
- **✏️ Manual Entry**: Direct editing with real-time validation
- **📂 File Import**: Import from CSV, Excel, or JSON with custom mapping
- **📋 Templates**: Reusable metadata templates with variables
### Enterprise Features
- **🔐 Authentication**: Local user authentication + Microsoft SSO support
- **👥 User Management**: SQLite database for users and sessions
- **📊 Audit Logging**: Track all user actions and metadata changes
- **🔍 AI Usage Tracking**: Monitor OpenAI token usage and costs
### File Support
- **300+ File Formats** via ExifTool integration
- **PDF Files**: Full metadata support (title, subject, keywords, author, copyright)
- **Images**: JPEG, PNG, GIF, HEIC, TIFF, RAW formats
- **Office Documents**: Word, Excel, PowerPoint
- **Video Files**: MP4, MOV, AVI, MKV
- **Unicode Support**: Full support for Chinese, Japanese, Korean characters
### Advanced Capabilities
- **Smart Field Mapping**: Auto-detect columns with fuzzy matching
- **Batch Processing**: Process multiple files with selective updates
- **Custom Metadata Fields**: Add unlimited custom fields
- **CSV Export**: Export metadata and processing results
- **Template Variables**: {filename}, {date}, {user}, custom variables
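The column auto-detection described above could be approximated along these lines. This is a minimal sketch using only the standard library, not the code in `src/field_mapper.py`: the alias table, the `guess_mapping` name, and the 0.8 similarity cutoff are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical alias table; the tool's real list of recognized names is not shown here.
FIELD_ALIASES = {
    "filename": ["filename", "file name", "file", "asset name"],
    "title": ["title", "name", "headline"],
    "description": ["description", "alt text", "summary"],
    "keywords": ["keywords", "tags"],
}


def guess_mapping(columns, cutoff=0.8):
    """Return {field: column} for every field that has a close-enough column name."""
    # Compare on trimmed, lowercased names but report the original header text.
    normalized = {str(col).strip().lower(): col for col in columns}
    mapping = {}
    for field, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            match = get_close_matches(alias, list(normalized), n=1, cutoff=cutoff)
            if match:
                mapping[field] = normalized[match[0]]
                break  # first alias that matches wins
    return mapping


print(guess_mapping(["File Name", "TITLE", "Key words", "Notes"]))
```

Fields without a close match (here, description) are simply left out, which is where a manual mapping dialog would take over.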
---
## Requirements
- Python 3.8+
- Tesseract OCR (for image text extraction)
- Poppler (for PDF processing)
- **ExifTool 12.15+** (recommended - enables 300+ file formats and improved performance)
### System Dependencies
- **Python 3.8+**
- **ExifTool 12.15+** (required for 300+ format support)
- **Tesseract OCR** (optional - for image text extraction)
- **Poppler** (optional - for PDF content extraction)
### Python Dependencies
All listed in `requirements.txt`:
- Flask 2.3.0+ (Web framework)
- pandas, openpyxl (Excel/CSV processing)
- PyExifTool 0.5.6+ (Metadata operations)
- openai 1.0.0+ (AI generation)
- tiktoken 0.5.0+ (Token counting)
- tenacity 8.2.0+ (Retry logic)
- msal (Microsoft SSO - optional)
---
## Installation
1. Install system dependencies:
```bash
# macOS
brew install tesseract tesseract-lang poppler exiftool
### 1. Install System Dependencies
# Linux (Ubuntu/Debian)
sudo apt-get install tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-jpn tesseract-ocr-kor poppler-utils libimage-exiftool-perl
**macOS:**
```bash
brew install exiftool tesseract tesseract-lang poppler
```
**Note:** ExifTool is optional but highly recommended. It provides:
- Support for 300+ file formats
- 10-60x faster batch operations
- Better PDF metadata writing
- See [docs/EXIFTOOL_SETUP.md](docs/EXIFTOOL_SETUP.md) for detailed setup instructions
2. Create virtual environment and install Python packages:
**Linux (Ubuntu/Debian):**
```bash
sudo apt-get install libimage-exiftool-perl tesseract-ocr tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-jpn tesseract-ocr-kor poppler-utils
```
**Windows:**
```bash
# Install ExifTool from: https://exiftool.org/
choco install exiftool tesseract
```
**Verify ExifTool Installation:**
```bash
exiftool -ver
# Should show version 12.15 or higher
```
See [docs/EXIFTOOL_SETUP.md](docs/EXIFTOOL_SETUP.md) for detailed setup instructions.
### 2. Create Virtual Environment
```bash
python3 -m venv venv_local
source venv_local/bin/activate # On Windows: venv_local\Scripts\activate
```
### 3. Install Python Dependencies
```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
3. Set up environment variables (create `.env` file):
### 4. Configure Environment Variables
Create a `.env` file in the project root:
```env
# Required: OpenAI API Key (for AI metadata generation)
OPENAI_API_KEY=your-openai-api-key-here
# Optional: Microsoft SSO (for enterprise authentication)
# AZURE_CLIENT_ID=your-azure-client-id
# AZURE_CLIENT_SECRET=your-azure-client-secret
# AZURE_TENANT_ID=your-azure-tenant-id
# REDIRECT_URI=http://localhost:5001/auth/callback
# Optional: Flask secret key (auto-generated if not set)
# SECRET_KEY=your-secret-key-here
# Optional: AI settings (defaults shown)
# AI_MODEL=gpt-4o-mini
# MAX_TOKENS=500
# TEMPERATURE=0.5
# API_TIMEOUT=30
# API_MAX_RETRIES=3
```
UPLOAD_FOLDER=uploads
OUTPUT_FOLDER=output
TESSERACT_PATH=/opt/homebrew/bin/tesseract
OCR_LANGUAGES=eng+chi_sim+chi_tra+jpn+kor
### 5. Initialize Database
The database will be created automatically on first run. To manually initialize:
```bash
python -c "from src.database import Database; db = Database(); print('Database initialized')"
```
---
## Usage
### Web Interface
### Starting the Web Application
```bash
python web_app.py
```
Open browser at `http://localhost:5001`
The application will:
1. ✅ Check for ExifTool availability
2. ✅ Initialize SQLite database (users, sessions, audit_log)
3. ✅ Start Flask server on http://localhost:5001
4. 🌐 Open browser automatically
### GUI Application
### Login
```bash
python run_gui.py
**Test Account:**
- Username: `tester`
- Password: `oliveradmin`
**Microsoft SSO** (if configured):
- Click "Sign in with Microsoft" button
- Authenticate via Azure AD
- Users auto-created on first login
### Using Metadata Sources
#### 1. Excel Lookup
1. Click "Upload Excel File"
2. Configure mapping modal:
- Select sheet name
- Map columns: Filename (required), Title, Description, Keywords
- Preview first 3 rows
3. Confirm mapping
4. Upload files to process
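Uploaded files are matched to spreadsheet rows by filename without extension, case-insensitively. The sketch below mirrors the `Path(...).stem.lower()` normalization used in `web_app.py`; the sample rows and the helper names `lookup_key` and `find_metadata` are illustrative assumptions.

```python
from pathlib import Path


def lookup_key(name):
    """Normalize a filename or spreadsheet cell to the key used for matching."""
    return Path(str(name).strip()).stem.lower()


# Hypothetical rows, as they might come from the mapped Filename and Title columns.
rows = [
    {"filename": "Brochure_2024.PDF", "title": "Product brochure"},
    {"filename": "logo", "title": "Company logo"},
]
metadata_by_key = {lookup_key(row["filename"]): row for row in rows}


def find_metadata(uploaded_name):
    """Return the matching row, or None when the spreadsheet has no entry."""
    return metadata_by_key.get(lookup_key(uploaded_name))


print(find_metadata("brochure_2024.pdf"))
```

Two consequences of this scheme are worth keeping in mind: files that differ only by extension share one spreadsheet row, and only the last suffix is stripped, so `archive.tar.gz` is keyed as `archive.tar`.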
#### 2. AI Generation
1. Select "AI Generation" from metadata source dropdown
2. Upload files
3. AI generates metadata (10-30 seconds per file)
4. Review and edit generated metadata
5. Save changes
#### 3. Manual Entry
1. Select "Manual Entry"
2. Upload files
3. Fill in metadata fields manually
4. Save changes
#### 4. Import from File
1. Click "Import from File"
2. Upload CSV/Excel/JSON file
3. Configure column mapping (same as Excel)
4. Upload files to match metadata
#### 5. Templates
1. Create template with variables
2. Select template from dropdown
3. Apply to selected files
4. Review and save
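Variable substitution in a template might look roughly like the following. This is a sketch under stated assumptions, not the code in `src/template_manager.py`: the `render_template` function, the ISO date format, and the choice to leave unknown variables untouched are all hypothetical.

```python
from datetime import date


class _KeepUnknown(dict):
    """Dictionary that leaves unknown {variables} in place instead of raising KeyError."""

    def __missing__(self, key):
        return "{" + key + "}"


def render_template(template, filename, user, extra=None, today=None):
    """Fill {filename}, {date}, {user} and any custom variables passed in `extra`."""
    values = _KeepUnknown(
        filename=filename,
        date=(today or date.today()).isoformat(),  # assumed date format
        user=user,
    )
    values.update(extra or {})
    return template.format_map(values)


print(render_template("{filename} | edited by {user} on {date}", "report", "tester"))
```

Leaving unresolved placeholders visible makes a missing custom variable easy to spot during the review step before saving.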
### Batch Operations
1. Upload multiple files
2. Use checkboxes to select files
3. "Select All" / "Deselect All" buttons
4. Edit metadata individually
5. Click "Update Selected Files" to save all at once
6. Export results to CSV
---
## Configuration
### Database Schema
**Users Table:**
- id, username, password_hash, email, full_name
- auth_method (local/sso)
- created_at, last_login, is_active
**Sessions Table:**
- session_id, user_id, created_at, expires_at
- ip_address, user_agent
**Audit Log Table:**
- id, user_id, action, details, timestamp
### AI Usage Tracking
Every AI metadata generation is logged with:
- User ID
- Timestamp
- Tokens used (prompt + completion)
- Cost estimate (based on gpt-4o-mini pricing)
View logs in database:
```sql
SELECT * FROM audit_log WHERE action = 'ai_generation' ORDER BY timestamp DESC;
```
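The same query can be run from Python to total token usage per user. The sketch below builds a throwaway in-memory table with the audit log columns listed above; the JSON layout of the `details` column and the sample rows are assumptions, so adjust the parsing to whatever the application actually stores.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # the real database file would be oliver_metadata.db
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE audit_log ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, action TEXT, details TEXT, timestamp TEXT)"
)
# Hypothetical rows; the keys inside 'details' are assumed for illustration.
conn.executemany(
    "INSERT INTO audit_log (user_id, action, details, timestamp) VALUES (?, ?, ?, ?)",
    [
        (1, "ai_generation", json.dumps({"prompt_tokens": 300, "completion_tokens": 120}), "2026-01-25T10:00:00"),
        (1, "login", "{}", "2026-01-25T09:59:00"),
        (2, "ai_generation", json.dumps({"prompt_tokens": 200, "completion_tokens": 80}), "2026-01-25T11:00:00"),
    ],
)


def tokens_per_user(connection):
    """Sum prompt and completion tokens for every ai_generation entry, grouped by user."""
    totals = {}
    rows = connection.execute(
        "SELECT user_id, details FROM audit_log WHERE action = ? ORDER BY timestamp DESC",
        ("ai_generation",),
    )
    for row in rows:
        details = json.loads(row["details"])
        used = details["prompt_tokens"] + details["completion_tokens"]
        totals[row["user_id"]] = totals.get(row["user_id"], 0) + used
    return totals


print(tokens_per_user(conn))
```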
## Excel Data Structure
### User Management
The tool reads metadata from Excel file with two sheets:
**Create New User:**
```python
from src.database import Database
db = Database()
db.create_user(
    username='newuser',
    password='password123',
    email='user@example.com',
    full_name='New User',
    auth_method='local'
)
```
### Sheet 1: DSB Celum ID to Path mapping (Primary)
- Column B: Celum ID
- Column E: Title
- Column F: External Description/Alt Text
**List All Users:**
```python
users = db.get_all_users()
for user in users:
    print(f"{user['username']} - Last login: {user['last_login']}")
```
### Sheet 2: Medsurg Metadata Cheat (Fallback)
- Column: Solventum DAM Asset Path (contains filename)
- Metadata columns for Title and Description
Lookup is performed by filename (without extension), case-insensitive.
---
## Architecture
- `web_app.py` - Flask web application
- `run_gui.py` - GUI launcher
- `src/` - Core modules
- `extractors/` - Content extraction for different file types
- `updaters/` - Metadata update for different file types
- `excel_metadata_lookup.py` - Excel-based metadata lookup
- `main.py` - Core processing logic
- `config.py` - Configuration management
### File Structure
## License
```
oliver-metadata-tool/
├── web_app.py # Flask web application (main entry point)
├── requirements.txt # Python dependencies
├── .env # Environment configuration
├── oliver_metadata.db # SQLite database (auto-created)
├── src/
│ ├── config.py # Configuration management
│ ├── database.py # Database operations
│ ├── auth.py # Authentication logic
│ ├── metadata_analyzer.py # AI metadata generation
│ ├── metadata_importer.py # Import from files
│ ├── template_manager.py # Template system
│ ├── field_mapper.py # Column mapping
│ ├── excel_metadata_lookup.py # Excel lookup
│ ├── extractors/
│ │ ├── pdf_extractor.py
│ │ ├── image_extractor.py
│ │ ├── office_extractor.py
│ │ ├── video_extractor.py
│ │ └── exiftool_extractor.py
│ └── updaters/
│ ├── pdf_updater.py
│ ├── image_updater.py
│ ├── office_updater.py
│ ├── video_updater.py
│ └── exiftool_updater.py
├── templates/
│ ├── index.html # Main UI
│ └── login.html # Login page
└── docs/
└── EXIFTOOL_SETUP.md # ExifTool setup guide
```
Proprietary - Solventum
### Technology Stack
- **Backend:** Flask (Python)
- **Database:** SQLite
- **Frontend:** HTML5, CSS3, JavaScript (Vanilla)
- **Design:** Montserrat font, Dark & Gold theme
- **Authentication:** Flask-Session, werkzeug.security, MSAL
- **AI:** OpenAI API (gpt-4o-mini)
- **Metadata:** PyExifTool, pypdf, python-docx, openpyxl
---
## API Endpoints
### Authentication
- `GET /login` - Login page
- `POST /login` - Authenticate user
- `GET /logout` - Destroy session
- `GET /login/microsoft` - Microsoft SSO redirect
- `GET /auth/callback` - SSO callback
### File Operations
- `POST /upload` - Upload files and generate metadata
- `POST /update-manual` - Update file metadata manually
- `GET /download/<filename>` - Download processed file
### Metadata Sources
- `POST /upload-excel` - Upload Excel file for mapping
- `POST /preview-excel-sheet` - Preview Excel sheet structure
- `POST /configure-excel-mapping` - Configure Excel column mapping
- `POST /import-metadata` - Upload import file for mapping
- `POST /configure-import-mapping` - Configure import column mapping
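As an illustration of the mapping step, the sketch below builds (but does not send) a JSON request for `/configure-excel-mapping`. The payload keys follow the handler in `web_app.py`; the session ID, sheet name, and column names are placeholders, and a real call would also need the session cookie from a prior login because the route requires authentication.

```python
import json
from urllib import request

BASE_URL = "http://localhost:5001"  # default address from the Usage section


def build_mapping_request(excel_session_id, sheet_name, column_mapping):
    """Build a POST request carrying the sheet and column mapping as JSON."""
    payload = {
        "excel_session_id": excel_session_id,
        "sheet_name": sheet_name,
        "column_mapping": column_mapping,  # 'filename' is required by the endpoint
    }
    return request.Request(
        BASE_URL + "/configure-excel-mapping",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values; the real session ID comes from the /upload-excel response.
req = build_mapping_request(
    "excel_abc123", "Sheet1", {"filename": "File Name", "title": "Title"}
)
print(req.get_method(), req.full_url)
```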
### Templates
- `GET /templates/list` - List all templates
- `POST /templates/save` - Save new template
- `POST /templates/load` - Load template by name
- `DELETE /templates/delete` - Delete template
- `POST /templates/apply` - Apply template to files
- `POST /templates/preview` - Preview template output
---
## Security & Privacy
### Authentication
- Passwords hashed with werkzeug.security (pbkdf2:sha256)
- Session tokens: 32-byte cryptographically secure random strings
- Sessions expire after 24 hours
- Microsoft SSO via OAuth2 + Azure AD
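The points above can be sketched with the standard library alone. Note that this is a stand-in, not the application's code: the tool itself relies on `werkzeug.security`, whereas the example uses `hashlib.pbkdf2_hmac` directly, and the salt size, iteration count, and function names are assumptions.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta, timezone

SESSION_LIFETIME = timedelta(hours=24)


def hash_password(password, salt=None, iterations=600_000):
    """Derive a PBKDF2-HMAC-SHA256 digest; iteration count is an assumed value."""
    salt = salt or secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password, salt, expected, iterations=600_000):
    """Recompute the digest and compare in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)


def new_session(now=None):
    """Create a session record with a token built from 32 random bytes."""
    now = now or datetime.now(timezone.utc)
    return {
        "session_id": secrets.token_urlsafe(32),
        "created_at": now,
        "expires_at": now + SESSION_LIFETIME,
    }


def is_expired(session, now=None):
    """A session is expired once the current time reaches expires_at."""
    return (now or datetime.now(timezone.utc)) >= session["expires_at"]
```

Storing an explicit `expires_at` alongside each session keeps the 24-hour rule checkable with a single comparison at request time.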
### Data Protection
- All credentials stored in `.env` (excluded from git)
- Database file excluded from git
- API keys never logged or exposed to frontend
- Audit trail for all user actions
### Production Recommendations
1. **HTTPS:** Use SSL/TLS certificates in production
2. **Database:** Migrate to PostgreSQL for better concurrency
3. **Rate Limiting:** Add rate limits to prevent abuse
4. **CSRF Protection:** Enable Flask-WTF for form security
5. **Error Tracking:** Integrate Sentry or similar service
6. **Backups:** Regular database backups
7. **Monitoring:** Track AI token usage for cost management
---
## Troubleshooting
### Common Issues
**ExifTool not found:**
```bash
# Verify installation
exiftool -ver
# macOS: Reinstall with Homebrew
brew reinstall exiftool
# Linux: Reinstall with apt
sudo apt-get install --reinstall libimage-exiftool-perl
```
**Database locked error:**
```bash
# Stop all instances
lsof -ti:5001 | xargs kill -9
# Restart application
python web_app.py
```
**OpenAI API errors:**
- Check API key in `.env` file
- Verify API key is valid at https://platform.openai.com/api-keys
- Check token usage limits on OpenAI dashboard
**Import failed - column not found:**
- Use the mapping modal to manually select columns
- Check that your file has headers in the first row
- Verify file encoding is UTF-8
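A quick pre-flight check for the last two points could look like this. It is a sketch, not part of the tool: `read_header` is a hypothetical helper that fails loudly on non-UTF-8 input and returns whatever the first row contains, so an unexpected header is visible before upload.

```python
import csv
import io


def read_header(raw_bytes):
    """Decode CSV bytes as UTF-8 and return the first row as column names."""
    try:
        text = raw_bytes.decode("utf-8-sig")  # also drops a leading BOM if present
    except UnicodeDecodeError as exc:
        raise ValueError(f"File is not valid UTF-8: {exc}") from exc
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader, [])  # empty list for an empty file
    return [col.strip() for col in header]


print(read_header("filename,title,keywords\nreport.pdf,Report,annual\n".encode("utf-8")))
```

To check a file on disk, pass `Path("your_file.csv").read_bytes()`; a file saved in a legacy encoding such as Latin-1 will usually raise `ValueError` here when it contains accented characters.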
---
## Development
### Running Tests
```bash
# Unit tests (if implemented)
pytest tests/
# Manual integration test
python -c "from src.database import Database; from src.config import Config; print('✅ All imports successful')"
```
### Git Workflow
```bash
# Check status
git status
# Add changes
git add .
# Commit with message
git commit -m "Your commit message"
# Push to remote
git push origin main
```
---
## License & Credits
**License:** Corporate License - Oliver Marketing
All rights reserved. Unauthorized copying, distribution, or modification is prohibited.
**Developer:** Vadym Samoilenko
**Company:** Oliver Marketing
**Version:** 3.1 Enterprise Edition
**Release Date:** January 2026
**Third-Party Software:**
- ExifTool by Phil Harvey (Perl Artistic License)
- Flask by Pallets (BSD License)
- OpenAI API (Commercial License)
- PyExifTool (LGPL License)
---
## Support
For issues, questions, or feature requests:
- **Internal Support:** Contact IT department
- **Developer:** Vadym Samoilenko
- **Documentation:** See `docs/` folder
---
## Changelog
### v3.1 (January 2026) - Enterprise Edition
- ✅ User authentication (local + Microsoft SSO)
- ✅ SQLite database with audit logging
- ✅ Smart column mapping for Excel/CSV import
- ✅ Custom metadata fields support
- ✅ AI usage tracking and cost monitoring
- ✅ Dark & Gold UI redesign
- ✅ Template variables and preview
- ✅ Batch selection and CSV export
### v3.0 (January 2026)
- ✅ ExifTool integration (300+ formats)
- ✅ Multiple metadata sources (Excel, AI, Manual, Import)
- ✅ Field mapping with fuzzy matching
- ✅ Metadata templates system
- ✅ Rebranded to Oliver Metadata Tool
### v2.x (Prior)
- Basic Excel lookup functionality
- Multi-format file support
- Web and GUI interfaces

File diff suppressed because it is too large


@@ -4,11 +4,41 @@
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Login - Oliver Metadata Tool</title>
<link href="https://fonts.googleapis.com/css2?family=Montserrat:wght@300;400;500;600;700&display=swap" rel="stylesheet">
<style>
:root {
--primary-gold: #FFC407;
--primary-gold-dark: #e6b007;
--primary-gold-light: #ffcf33;
--dark-primary: #2c2c2c;
--dark-secondary: #1a1a1a;
--white: #ffffff;
--text-primary: #1f2937;
--text-muted: #6b7280;
--overlay-light: rgba(255, 255, 255, 0.95);
--border-light: rgba(255, 255, 255, 0.2);
--shadow-lg: 0 20px 40px rgba(0, 0, 0, 0.1);
--radius-md: 12px;
--radius-xl: 20px;
--font-family: 'Montserrat', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
--transition-fast: 0.15s ease;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
@keyframes shimmer {
0% { transform: translateX(-100%); }
100% { transform: translateX(100%); }
}
@keyframes pulse {
0%, 100% { transform: scale(1); }
50% { transform: scale(1.05); }
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
font-family: var(--font-family);
background: linear-gradient(135deg, var(--dark-primary) 0%, var(--dark-secondary) 100%);
min-height: 100vh;
display: flex;
align-items: center;
@@ -17,9 +47,11 @@
}
.login-container {
background: white;
border-radius: 20px;
box-shadow: 0 20px 60px rgba(0,0,0,0.3);
background: var(--overlay-light);
backdrop-filter: blur(20px);
border-radius: var(--radius-xl);
box-shadow: var(--shadow-lg);
border: 1px solid var(--border-light);
width: 100%;
max-width: 450px;
padding: 40px;
@@ -28,17 +60,21 @@
.logo {
text-align: center;
margin-bottom: 30px;
position: relative;
}
.logo h1 {
color: #667eea;
font-size: 28px;
color: var(--primary-gold-dark);
font-size: 32px;
margin-bottom: 10px;
font-weight: 700;
text-shadow: 0 2px 4px rgba(255, 196, 7, 0.2);
}
.logo p {
color: #6c757d;
color: var(--text-muted);
font-size: 14px;
font-weight: 500;
}
.divider {
@@ -53,15 +89,16 @@
left: 0;
right: 0;
top: 50%;
height: 1px;
background: #dee2e6;
height: 2px;
background: linear-gradient(90deg, transparent, var(--primary-gold-light), transparent);
}
.divider span {
background: white;
background: var(--overlay-light);
padding: 0 15px;
color: #6c757d;
color: var(--text-muted);
font-size: 13px;
font-weight: 600;
position: relative;
z-index: 1;
}
@@ -73,7 +110,7 @@
.form-group label {
display: block;
font-weight: 600;
color: #495057;
color: var(--text-primary);
margin-bottom: 8px;
font-size: 14px;
}
@@ -82,25 +119,28 @@
width: 100%;
padding: 12px;
border: 2px solid #dee2e6;
border-radius: 8px;
border-radius: var(--radius-md);
font-size: 14px;
transition: border-color 0.3s;
font-family: var(--font-family);
transition: all var(--transition-fast);
}
.form-group input:focus {
outline: none;
border-color: #667eea;
border-color: var(--primary-gold);
box-shadow: 0 0 0 3px rgba(255, 196, 7, 0.1);
}
.btn {
width: 100%;
padding: 14px;
border: none;
border-radius: 8px;
border-radius: var(--radius-md);
font-size: 16px;
font-weight: 600;
font-family: var(--font-family);
cursor: pointer;
transition: transform 0.2s;
transition: all var(--transition-fast);
}
.btn:hover {
@@ -108,60 +148,79 @@
}
.btn-primary {
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
background: linear-gradient(135deg, var(--primary-gold), var(--primary-gold-dark));
color: var(--dark-secondary);
margin-bottom: 15px;
box-shadow: 0 4px 12px rgba(255, 196, 7, 0.3);
}
.btn-primary:hover {
box-shadow: 0 6px 16px rgba(255, 196, 7, 0.4);
}
.btn-sso {
background: white;
color: #495057;
border: 2px solid #dee2e6;
background: var(--white);
color: var(--text-primary);
border: 2px solid var(--primary-gold);
}
.btn-sso:hover {
border-color: #667eea;
color: #667eea;
border-color: var(--primary-gold-dark);
background: #fffbf0;
color: var(--primary-gold-dark);
}
.alert {
padding: 12px;
border-radius: 8px;
border-radius: var(--radius-md);
margin-bottom: 20px;
font-size: 14px;
font-weight: 500;
}
.alert-error {
background: #f8d7da;
color: #721c24;
border: 1px solid #f5c6cb;
background: #fee;
color: #c33;
border: 2px solid #fcc;
}
.alert-info {
background: #d1ecf1;
color: #0c5460;
border: 1px solid #bee5eb;
background: #fffbf0;
color: var(--primary-gold-dark);
border: 2px solid var(--primary-gold-light);
}
.test-user-info {
background: #f8f9ff;
border: 2px dashed #667eea;
border-radius: 8px;
background: #fffbf0;
border: 2px dashed var(--primary-gold);
border-radius: var(--radius-md);
padding: 15px;
margin-bottom: 20px;
font-size: 13px;
color: #495057;
color: var(--text-primary);
animation: pulse 3s infinite;
}
.test-user-info strong {
color: #667eea;
color: var(--primary-gold-dark);
font-weight: 600;
}
.test-user-info code {
background: rgba(255, 196, 7, 0.15);
padding: 2px 6px;
border-radius: 4px;
font-family: 'Courier New', monospace;
color: var(--primary-gold-dark);
font-weight: 600;
}
.footer-text {
text-align: center;
margin-top: 20px;
font-size: 12px;
color: #6c757d;
color: var(--text-muted);
font-weight: 500;
}
.microsoft-icon {


@@ -6,7 +6,7 @@ Flask-based web app for local or server deployment.
Supports multiple metadata sources: Excel, AI, manual entry, and file import.
"""
from flask import Flask, render_template, request, jsonify, send_file
from flask import Flask, render_template, request, jsonify, send_file, session, redirect, url_for
from werkzeug.utils import secure_filename # noqa: F401 - kept as fallback
from pathlib import Path
import os
@@ -259,7 +259,19 @@ def upload_file():
}
# Get metadata lookup (only if using Excel source)
lookup = get_metadata_lookup() if metadata_source == 'excel' else None
    excel_session_id = request.form.get('excel_session_id')
    lookup = None
    if metadata_source == 'excel':
        if excel_session_id and excel_session_id in imported_metadata:
            # Use uploaded Excel file
            lookup = imported_metadata[excel_session_id]
        else:
            # Try default Excel file if available
            try:
                lookup = get_metadata_lookup()
            except Exception:
                return jsonify({'error': 'Please upload an Excel file first using the Upload Excel File button'}), 400
# Get imported metadata (only if using import source)
import_map = None
@@ -504,9 +516,22 @@ def update_manual_metadata():
custom_metadata = {
'title': data.get('title', '').strip()[:200],
'subject': data.get('subject', '').strip()[:300],
'keywords': data.get('keywords', '').strip()[:500]
'keywords': data.get('keywords', '').strip()[:500],
'author': data.get('author', '').strip()[:100],
'copyright': data.get('copyright', '').strip()[:150],
'comments': data.get('comments', '').strip()[:500]
}
    # Add custom fields if provided
    custom_fields = data.get('custom_fields', {})
    if custom_fields and isinstance(custom_fields, dict):
        for field_name, field_value in custom_fields.items():
            # Sanitize custom field names and values
            safe_name = str(field_name).strip()[:50]
            safe_value = str(field_value).strip()[:200]
            if safe_name and safe_value:
                custom_metadata[safe_name] = safe_value
# Validate session
if not session_id or session_id not in sessions:
return jsonify({'error': 'Invalid or expired session'}), 400
@@ -566,10 +591,178 @@ def download_file(filename):
return send_file(filepath, as_attachment=True)
return jsonify({'error': 'File not found'}), 404
@app.route('/upload-excel', methods=['POST'])
@login_required
def upload_excel():
    """Upload Excel file for Excel Lookup metadata source."""
    if 'excel_file' not in request.files:
        return jsonify({'error': 'No file provided'}), 400
    file = request.files['excel_file']
    if file.filename == '':
        return jsonify({'error': 'No file selected'}), 400
    try:
        import pandas as pd
        # Save temp file
        excel_filename = safe_filename(file.filename)
        temp_path = Path(app.config['UPLOAD_FOLDER']) / excel_filename
        file.save(str(temp_path))
        # Preview Excel structure instead of loading directly
        excel_file = pd.ExcelFile(str(temp_path))
        sheet_names = excel_file.sheet_names
        # Get columns and sample data from up to the first 5 sheets
        preview_data = {}
        for sheet_name in sheet_names[:5]:  # Limit to first 5 sheets
            df = pd.read_excel(excel_file, sheet_name=sheet_name, nrows=5)
            preview_data[sheet_name] = {
                'columns': df.columns.tolist(),
                'sample_data': df.head(3).fillna('').to_dict('records')
            }
        # Store file path temporarily for later configuration
        excel_session_id = f"excel_{secrets.token_urlsafe(8)}"
        if 'excel_files' not in imported_metadata:
            imported_metadata['excel_files'] = {}
        imported_metadata['excel_files'][excel_session_id] = {
            'path': str(temp_path),
            'filename': excel_filename,
            'sheet_names': sheet_names
        }
        return jsonify({
            'success': True,
            'excel_session_id': excel_session_id,
            'filename': excel_filename,
            'sheets': sheet_names,
            'preview': preview_data,
            'message': 'Excel file uploaded. Please configure column mapping.'
        })
    except Exception as e:
        import logging
        logging.getLogger(__name__).error(f"Excel upload failed: {e}")
        return jsonify({'error': f'Excel upload failed: {str(e)}'}), 500
@app.route('/preview-excel-sheet', methods=['POST'])
@login_required
def preview_excel_sheet():
    """Preview a specific sheet from uploaded Excel file."""
    try:
        import pandas as pd
        data = request.json
        excel_session_id = data.get('excel_session_id')
        sheet_name = data.get('sheet_name')
        if not excel_session_id or excel_session_id not in imported_metadata.get('excel_files', {}):
            return jsonify({'error': 'Invalid session ID'}), 400
        excel_info = imported_metadata['excel_files'][excel_session_id]
        excel_path = excel_info['path']
        # Read the specific sheet
        df = pd.read_excel(excel_path, sheet_name=sheet_name, nrows=10)
        return jsonify({
            'success': True,
            'columns': df.columns.tolist(),
            'sample_data': df.head(5).fillna('').to_dict('records')
        })
    except Exception as e:
        import logging
        logging.getLogger(__name__).error(f"Sheet preview failed: {e}")
        return jsonify({'error': f'Sheet preview failed: {str(e)}'}), 500
@app.route('/configure-excel-mapping', methods=['POST'])
@login_required
def configure_excel_mapping():
    """Configure Excel column mapping and load metadata."""
    try:
        import pandas as pd
        data = request.json
        excel_session_id = data.get('excel_session_id')
        sheet_name = data.get('sheet_name')
        column_mapping = data.get('column_mapping', {})  # {filename: 'col', title: 'col', ...}
        if not excel_session_id or excel_session_id not in imported_metadata.get('excel_files', {}):
            return jsonify({'error': 'Invalid session ID'}), 400
        excel_info = imported_metadata['excel_files'][excel_session_id]
        excel_path = excel_info['path']
        # Read the configured sheet
        df = pd.read_excel(excel_path, sheet_name=sheet_name)
        # Build metadata map using configured columns
        metadata_map = {}
        filename_col = column_mapping.get('filename')
        title_col = column_mapping.get('title')
        description_col = column_mapping.get('description')
        keywords_col = column_mapping.get('keywords')
        if not filename_col:
            return jsonify({'error': 'Filename column is required'}), 400
        for _, row in df.iterrows():
            filename = row.get(filename_col)
            if pd.notna(filename) and str(filename).strip():
                # Get filename without extension for indexing (case-insensitive)
                filename_stem = Path(str(filename).strip()).stem.lower()
                metadata = {
                    'title': str(row.get(title_col, '')).strip() if title_col and pd.notna(row.get(title_col)) else '',
                    'description': str(row.get(description_col, '')).strip() if description_col and pd.notna(row.get(description_col)) else '',
                    'keywords': str(row.get(keywords_col, '')).strip() if keywords_col and pd.notna(row.get(keywords_col)) else '',
                    'original_filename': str(filename).strip()
                }
                metadata_map[filename_stem] = metadata
        # Create a simple lookup object
        class ConfiguredExcelLookup:
            def __init__(self, metadata_map):
                self.metadata_map = metadata_map
                self.filename_to_metadata = metadata_map
            def lookup_by_filename(self, filename: str):
                filename_stem = Path(filename).stem.lower()
                return self.metadata_map.get(filename_stem)
        lookup = ConfiguredExcelLookup(metadata_map)
        # Store configured lookup
        imported_metadata[excel_session_id] = lookup
        # Get stats
        stats = {
            'total_records': len(metadata_map),
            'with_title': sum(1 for v in metadata_map.values() if v.get('title')),
            'with_description': sum(1 for v in metadata_map.values() if v.get('description')),
            'with_keywords': sum(1 for v in metadata_map.values() if v.get('keywords'))
        }
        return jsonify({
            'success': True,
            'excel_session_id': excel_session_id,
            'stats': stats,
            'message': f'Configured mapping for {stats["total_records"]} records from sheet "{sheet_name}"'
        })
    except Exception as e:
        import logging
        logging.getLogger(__name__).error(f"Excel configuration failed: {e}")
        return jsonify({'error': f'Excel configuration failed: {str(e)}'}), 500
@app.route('/import-metadata', methods=['POST'])
@login_required
def import_metadata():
    """Import metadata from external file (CSV, Excel, JSON)."""
    """Upload import file and preview structure for mapping."""
    if 'import_file' not in request.files:
        return jsonify({'error': 'No file provided'}), 400
@@ -578,45 +771,142 @@ def import_metadata():
        return jsonify({'error': 'No file selected'}), 400
    try:
        import pandas as pd
        # Save temp file
        import_filename = safe_filename(file.filename)
        temp_path = Path(app.config['UPLOAD_FOLDER']) / import_filename
        file.save(str(temp_path))
        # Import based on file type
        importer = MetadataImporter()
        file_ext = temp_path.suffix.lower()
        # Read file and get structure
        if file_ext == '.csv':
            metadata_map = importer.import_from_csv(str(temp_path))
            df = pd.read_csv(str(temp_path), nrows=5, encoding='utf-8')
        elif file_ext in ['.xlsx', '.xls']:
            metadata_map = importer.import_from_excel(str(temp_path))
            df = pd.read_excel(str(temp_path), nrows=5)
        elif file_ext == '.json':
            metadata_map = importer.import_from_json(str(temp_path))
            import json
            with open(str(temp_path), 'r', encoding='utf-8') as f:
                data = json.load(f)
            # Convert to DataFrame
            if isinstance(data, list):
                df = pd.DataFrame(data[:5])
            elif isinstance(data, dict):
                df = pd.DataFrame([data])
            else:
                return jsonify({'error': 'Invalid JSON format'}), 400
        else:
            return jsonify({'error': f'Unsupported file format: {file_ext}. Supported: .csv, .xlsx, .xls, .json'}), 400
            return jsonify({'error': f'Unsupported file format: {file_ext}'}), 400
        # Validate import
        stats = importer.validate_import(metadata_map)
        columns = df.columns.tolist()
        sample_data = df.fillna('').to_dict('records')
        # Store in global dict with unique session ID
        import_session_id = f"import_{len(imported_metadata) + 1}"
        # Store file path for later configuration
        import_session_id = f"import_{secrets.token_urlsafe(8)}"
        if 'import_files' not in imported_metadata:
            imported_metadata['import_files'] = {}
        imported_metadata['import_files'][import_session_id] = {
            'path': str(temp_path),
            'filename': import_filename,
            'file_type': file_ext
        }
        return jsonify({
            'success': True,
            'import_session_id': import_session_id,
            'filename': import_filename,
            'columns': columns,
            'sample_data': sample_data,
            'message': f'Import file uploaded. Please configure column mapping.'
        })
    except Exception as e:
        import logging
        logging.getLogger(__name__).error(f"Import upload failed: {e}")
        return jsonify({'error': f'Import upload failed: {str(e)}'}), 500
@app.route('/configure-import-mapping', methods=['POST'])
@login_required
def configure_import_mapping():
    """Configure import column mapping and load metadata."""
    try:
        import pandas as pd
        import json
        data = request.json
        import_session_id = data.get('import_session_id')
        column_mapping = data.get('column_mapping', {})
        if not import_session_id or import_session_id not in imported_metadata.get('import_files', {}):
            return jsonify({'error': 'Invalid session ID'}), 400
        import_info = imported_metadata['import_files'][import_session_id]
        import_path = import_info['path']
        file_ext = import_info['file_type']
        # Read the full file
        if file_ext == '.csv':
            df = pd.read_csv(import_path, encoding='utf-8')
        elif file_ext in ['.xlsx', '.xls']:
            df = pd.read_excel(import_path)
        elif file_ext == '.json':
            with open(import_path, 'r', encoding='utf-8') as f:
                json_data = json.load(f)
            if isinstance(json_data, list):
                df = pd.DataFrame(json_data)
            else:
                df = pd.DataFrame([json_data])
        # Build metadata map using configured columns
        metadata_map = {}
        filename_col = column_mapping.get('filename')
        title_col = column_mapping.get('title')
        subject_col = column_mapping.get('subject')
        keywords_col = column_mapping.get('keywords')
        if not filename_col:
            return jsonify({'error': 'Filename column is required'}), 400
        for _, row in df.iterrows():
            filename = row.get(filename_col)
            if pd.notna(filename) and str(filename).strip():
                filename_stem = Path(str(filename).strip()).stem.lower()
                metadata = {
                    'title': str(row.get(title_col, '')).strip() if title_col and pd.notna(row.get(title_col)) else '',
                    'subject': str(row.get(subject_col, '')).strip() if subject_col and pd.notna(row.get(subject_col)) else '',
                    'keywords': str(row.get(keywords_col, '')).strip() if keywords_col and pd.notna(row.get(keywords_col)) else '',
                    'original_filename': str(filename).strip()
                }
                metadata_map[filename_stem] = metadata
        # Store configured metadata map
        imported_metadata[import_session_id] = metadata_map
        # Clean up temp file
        temp_path.unlink()
        Path(import_path).unlink(missing_ok=True)
        # Get stats
        stats = {
            'total_records': len(metadata_map),
            'with_title': sum(1 for v in metadata_map.values() if v.get('title')),
            'with_subject': sum(1 for v in metadata_map.values() if v.get('subject')),
            'with_keywords': sum(1 for v in metadata_map.values() if v.get('keywords'))
        }
        return jsonify({
            'success': True,
            'import_session_id': import_session_id,
            'stats': stats,
            'message': f'Imported {stats["total_records"]} metadata records from {import_filename}'
            'message': f'Configured mapping for {stats["total_records"]} records'
        })
    except Exception as e:
        import logging
        logging.getLogger(__name__).error(f"Import failed: {e}")
        return jsonify({'error': f'Import failed: {str(e)}'}), 500
        logging.getLogger(__name__).error(f"Import configuration failed: {e}")
        return jsonify({'error': f'Import configuration failed: {str(e)}'}), 500
@app.route('/preview-import', methods=['POST'])
@login_required