updated readme

This commit is contained in:
michael 2025-08-14 15:15:47 -05:00
parent 9c01bcba81
commit 10a8e985a8

630
README.md
View file

@ -1,349 +1,529 @@
# Contract Analysis Tool v2.0
A modern, production-ready Retrieval-Augmented Generation (RAG) application for intelligent contract analysis and document Q&A. Built with FastAPI backend and React frontend.
A modern, production-ready Retrieval-Augmented Generation (RAG) application for intelligent contract analysis and document Q&A. Built with FastAPI backend and React frontend with advanced features including SSO integration, context-aware chat, and comprehensive document processing.
![Architecture](https://img.shields.io/badge/Backend-FastAPI-009688)
![Frontend](https://img.shields.io/badge/Frontend-React-61DAFB)
![Database](https://img.shields.io/badge/Database-MongoDB-47A248)
![Cache](https://img.shields.io/badge/Cache-Redis-DC382D)
![Vector Store](https://img.shields.io/badge/VectorDB-ChromaDB-FF6B35)
## 🚀 Features
## 🚀 Key Features
- **Modern Architecture**: FastAPI + React + MongoDB + Redis
- **AI-Powered Analysis**: GPT-4 integration for contract analysis
- **Document Q&A**: Natural language queries with RAG
- **User Management**: Role-based access control
- **Real-time Processing**: Async document processing
- **Intelligent Caching**: Redis-based response caching
- **Scalable Design**: Microservice-ready architecture
### Core Functionality
- **Modern Architecture**: FastAPI + React + MongoDB + Redis + ChromaDB
- **AI-Powered Analysis**: OpenAI GPT-4 integration with contract summarization
- **Advanced RAG System**: Context-aware document Q&A with source citations
- **Document Processing**: Multi-format support (PDF, DOCX, DOC, TXT, CSV, JSON, HTML, MD, RTF)
- **Vector Search**: ChromaDB for semantic similarity search
### Authentication & Security
- **Dual Authentication**: Local JWT + Azure AD/SSO integration
- **Role-Based Access Control**: Admin/User permissions
- **JWT Token Management**: Automatic refresh with 3-hour expiration
- **Secure File Upload**: Validation, sanitization, and size limits
### Advanced Chat System
- **Context-Aware Conversations**: 24-hour rolling context window (max 10 messages)
- **Smart Caching**: Context-dependent responses aren't cached, simple queries are
- **Real-time Statistics**: Response times, cache hit rates, message counts
- **Proper Message Ordering**: Chronological display with accurate timestamps
- **Source Citations**: Direct references to document chunks in responses
### Document Management
- **Batch Processing**: Multiple document uploads with progress tracking
- **Index Organization**: Create themed document collections
- **Processing Pipeline**: PDF parsing → chunking → embedding → vector storage
- **Status Tracking**: Real-time processing and embedding status
- **Contract Summaries**: Automated contract analysis and key point extraction
### Admin Features
- **System Statistics**: Monitor usage, performance, and system health
- **User Management**: Create, edit, and manage user accounts
- **Document Reprocessing**: Retry failed documents
- **Index Management**: Create and manage document indices
- **Advanced RAG Interface**: Admin-specific query tools
## 🏗️ Architecture
```
React Frontend → FastAPI Backend → MongoDB + ChromaDB → OpenAI API
Redis Cache
React Frontend (Vite + Tailwind) → FastAPI Backend → MongoDB + ChromaDB → OpenAI API
Redis Cache
Azure AD/SSO (Optional)
```
**Data Flow:**
1. Documents uploaded through React frontend
2. FastAPI processes with LlamaIndex (chunking, parsing)
3. OpenAI embeddings stored in ChromaDB
4. Metadata and user data in MongoDB
5. RAG queries combine vector search + GPT-4 generation
6. Redis caches responses for performance
## 📋 Prerequisites
- **Python 3.11+**
- **Node.js 18+**
- **MongoDB 7+**
- **Redis 7+**
- **OpenAI API Key**
- **LlamaParse API Key** (optional)
- **Python 3.11+** (Backend)
- **Node.js 18+** (Frontend)
- **MongoDB 7+** (Document metadata)
- **Redis 7+** (Caching - optional)
- **OpenAI API Key** (Required)
- **LlamaParse API Key** (Optional - enhanced PDF processing)
## 🛠️ Installation
## 🛠️ Quick Start
### Option 1: Docker (Recommended)
1. **Clone the repository**
```bash
git clone <repository-url>
cd llama-contracts-master
```
```bash
# Clone repository
git clone <repository-url>
cd llama-contracts-master
2. **Set up environment variables**
```bash
# Backend
cp backend/.env.example backend/.env
# Edit backend/.env with your API keys
# Frontend
cp frontend/.env.example frontend/.env
```
# Configure environment
cp backend/.env.example backend/.env
# Edit backend/.env with your API keys
3. **Start with Docker Compose**
```bash
cd backend
docker-compose up -d
```
# Start backend services
cd backend
docker-compose up -d
4. **Start the frontend**
```bash
cd frontend
npm install
npm run dev
```
# Start frontend
cd ../frontend
npm install
npm run dev
```
### Option 2: Manual Setup
### Option 2: Manual Development Setup
#### Backend Setup
```bash
cd backend
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
1. **Create Python virtual environment**
```bash
cd backend
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
# Configure environment
cp .env.example .env
# Edit .env file with your settings
2. **Install dependencies**
```bash
pip install -r requirements.txt
```
# Start services (macOS with Homebrew)
brew services start mongodb-community
brew services start redis
3. **Set up environment variables**
```bash
cp .env.example .env
# Edit .env with your configuration
```
4. **Start MongoDB and Redis**
```bash
# MongoDB
brew services start mongodb/brew/mongodb-community
# Redis
brew services start redis
```
5. **Start the backend**
```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
# Initialize database and start server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
#### Frontend Setup
```bash
cd frontend
npm install
1. **Install dependencies**
```bash
cd frontend
npm install
```
# Configure environment
cp .env.example .env
# Edit frontend/.env
2. **Set up environment variables**
```bash
cp .env.example .env
```
# Start development server
npm run dev
```
3. **Start the development server**
```bash
npm run dev
```
## 🔧 Configuration
### Backend Environment Variables
## ⚙️ Configuration
### Backend Environment Variables (.env)
```env
# Database
# Database Configuration
MONGODB_URL=mongodb://localhost:27017
DATABASE_NAME=contract_analysis
# Redis
# Redis Cache (Optional)
REDIS_URL=redis://localhost:6379
# Authentication
JWT_SECRET_KEY=your-super-secret-jwt-key
JWT_SECRET_KEY=your-super-secret-jwt-key-change-in-production
JWT_ALGORITHM=HS256
JWT_EXPIRE_MINUTES=30
JWT_EXPIRE_MINUTES=180
# OpenAI
# Azure AD/SSO (Optional)
AZURE_CLIENT_ID=your-azure-client-id
AZURE_TENANT_ID=your-azure-tenant-id
AZURE_REDIRECT_URI=http://localhost:3000/auth/callback
SSO_ENABLED=false
ALLOW_LOCAL_ADMIN=true
# AI Services (Required)
OPENAI_API_KEY=your-openai-api-key
LLAMAPARSE_API_KEY=your-llamaparse-api-key
# Application
DEBUG=false
CORS_ORIGINS=["http://localhost:3000"]
# Application Settings
DEBUG=true
CORS_ORIGINS=["http://localhost:3000","http://localhost:3002"]
UPLOAD_DIR=./uploads
INDICES_DIR=./indices
# Cache
CACHE_ENABLED=true
# Performance Settings
CACHE_ENABLED=false # Disabled for development
CACHE_TTL=3600
MAX_DOCUMENT_CHARS=1000000
MAX_SUMMARY_CHARS=100000
```
### Frontend Environment Variables
### Frontend Environment Variables (.env)
```env
VITE_API_URL=http://localhost:8000
VITE_APP_NAME=Contract Analysis Tool
```
## 🚀 Usage
## 🚀 Getting Started
### Swagger Testing
Method 1: Get Token via Swagger UI (Easiest)
1. Go to http://localhost:8000/docs
2. First, initialize the default users by calling
the /api/v1/auth/init-users endpoint
3. Then use the /api/v1/auth/login endpoint with
these credentials:
Admin User:
- email: admin@oliver.agency
- password: admin123
Regular User:
- email: user@oliver.agency
- password: user123
4. Copy the access_token from the response
5. Click the "Authorize" button at the top of
Swagger UI
6. Enter: Bearer YOUR_TOKEN_HERE
### Initial Setup
1. **Access the application**: http://localhost:3000
2. **Initialize default users** (first time only):
```bash
curl -X POST http://localhost:8000/api/v1/auth/init-users
```
### Default Credentials
### 1. Initialize System
```bash
# Create default users (first time only)
curl -X POST http://localhost:8000/api/v1/auth/init-users
```
### 2. Default Credentials
- **Admin**: `admin@oliver.agency` / `admin123`
- **User**: `user@oliver.agency` / `user123`
### Workflow
1. **Login** with admin or user credentials
2. **Create an Index** for your document collection
3. **Upload Documents** to the index
4. **Chat** with your documents using natural language
5. **Manage Users** (admin only)
### 3. Basic Workflow
1. **Login** at http://localhost:3000/login
2. **Create Index** for your document collection
3. **Upload Documents** (supports drag-and-drop)
4. **Wait for Processing** (documents → chunks → embeddings)
5. **Start Chatting** with natural language queries
6. **Review Sources** cited in AI responses
## 📚 API Documentation
- **FastAPI Docs**: http://localhost:8000/docs (development only)
- **ReDoc**: http://localhost:8000/redoc (development only)
### Development URLs
- **Application**: http://localhost:3000
- **API Docs**: http://localhost:8000/docs
- **ReDoc**: http://localhost:8000/redoc
- **Health Check**: http://localhost:8000/health
### Key Endpoints
### Core API Endpoints
- `POST /api/v1/auth/login` - User authentication
#### Authentication
- `POST /api/v1/auth/login` - Local login
- `POST /api/v1/auth/register` - User registration
- `GET /api/v1/auth/me` - Current user info
- `POST /api/v1/auth/refresh` - Token refresh
- `POST /api/v1/auth/sso/validate` - SSO login
- `GET /api/v1/auth/sso/config` - SSO configuration
#### Document Management
- `POST /api/v1/documents/upload` - Upload documents to index
- `GET /api/v1/documents/index/{index_id}` - List documents in index
- `GET /api/v1/documents/{document_id}` - Document details
- `DELETE /api/v1/documents/{document_id}` - Delete document
- `GET /api/v1/documents/{document_id}/summary` - Contract summary
- `POST /api/v1/documents/{document_id}/summary/reprocess` - Regenerate summary
#### Index Management
- `POST /api/v1/indices/create` - Create document index
- `POST /api/v1/documents/upload` - Upload documents
- `POST /api/v1/chat/query` - Query documents
- `GET /api/v1/indices/` - List user indices
#### Chat System
- `POST /api/v1/chat/query` - RAG query with context
- `GET /api/v1/chat/history/{index_id}` - Conversation history
- `DELETE /api/v1/chat/history/{index_id}` - Clear chat history
#### Admin Operations
- `GET /api/v1/admin/stats` - System statistics
- `POST /api/v1/admin/documents/upload-single` - Single document upload
- `POST /api/v1/admin/documents/upload-multiple` - Batch upload
- `POST /api/v1/admin/documents/{document_id}/reprocess` - Reprocess document
- `GET /api/v1/admin/indices` - All system indices
- `POST /api/v1/admin/chat/query` - Admin RAG interface
## 🔒 Security Features
## 🔧 Advanced Features
- **JWT Authentication** with role-based access
- **Input Validation** with Pydantic schemas
- **CORS Protection** for frontend integration
- **File Upload Validation** with type/size checks
- **Rate Limiting** (configurable)
- **Environment Variable Protection**
### Context-Aware Chat System
- **Conversation Memory**: AI remembers last 10 messages within 24 hours
- **Smart Context Usage**: Follow-up questions reference previous conversation
- **Context Indicators**: UI shows when AI uses conversation history
- **Session Statistics**: Track response times and context usage
## ⚡ Performance Features
### Document Processing Pipeline
1. **Upload**: Drag-and-drop or browse files
2. **Validation**: File type, size, and content checks
3. **Processing**: LlamaIndex parsing and chunking
4. **Embedding**: OpenAI embeddings generation
5. **Storage**: ChromaDB vector storage + MongoDB metadata
6. **Indexing**: Ready for semantic search
- **Async Processing** throughout the backend
- **Redis Caching** for API responses
- **Vector Search** with ChromaDB
- **Connection Pooling** for databases
- **Optimized Queries** with MongoDB indexes
### SSO Integration (Optional)
- **Azure AD Support**: Enterprise authentication
- **Local Fallback**: Admin accounts always available
- **Role Mapping**: Automatic role assignment from SSO claims
- **Session Management**: Unified token handling
## 🧪 Development
### Advanced Caching Strategy
- **Smart Cache Logic**: Context-dependent queries bypass cache
- **Simple Query Cache**: Repeated questions served from Redis
- **TTL Management**: Configurable cache expiration
- **Cache Statistics**: Monitor hit rates and performance
### Backend Development
## 🏗️ Project Structure
### Backend (`/backend`)
```
app/
├── main.py # FastAPI application entry point
├── config/
│ ├── settings.py # Environment configuration
│ └── database.py # Database connections
├── api/v1/ # API endpoints
│ ├── auth.py # Authentication routes
│ ├── documents.py # Document management
│ ├── indices.py # Index operations
│ ├── chat.py # Chat/RAG system
│ └── admin.py # Admin operations
├── models/ # Data models
│ ├── user.py # User models
│ ├── document.py # Document models
│ ├── index.py # Index models
│ ├── chat.py # Chat models
│ └── contract_summary.py # Summary models
├── services/ # Business logic
│ ├── document_processor.py # Document processing
│ ├── rag_service.py # RAG implementation
│ ├── chat_context_service.py # Context management
│ ├── contract_summary_service.py # Summarization
│ └── sso_service.py # SSO integration
├── core/ # Core utilities
│ ├── auth.py # Authentication logic
│ ├── security.py # Security utilities
│ ├── cache.py # Caching logic
│ └── chroma_client.py # ChromaDB client
└── utils/ # Helper utilities
└── file_utils.py # File operations
```
### Frontend (`/frontend`)
```
src/
├── App.jsx # Main application with routing
├── pages/ # Page components
│ ├── HomePage.jsx # Landing page
│ ├── Dashboard.jsx # Main dashboard
│ ├── DocumentManager.jsx # Document management
│ ├── ChatInterface.jsx # Chat interface
│ └── AdminPanel.jsx # Admin interface
├── components/ # Reusable components
│ ├── auth/ # Authentication components
│ ├── admin/ # Admin-specific components
│ ├── chat/ # Chat components
│ ├── documents/ # Document components
│ ├── indices/ # Index components
│ └── common/ # Shared components
├── services/ # API service layer
│ ├── authService.js # Authentication API
│ ├── documentService.js # Document API
│ ├── chatService.js # Chat API
│ ├── indexService.js # Index API
│ └── adminService.js # Admin API
├── context/ # React context providers
│ └── AuthContext.jsx # Authentication context
└── utils/ # Frontend utilities
└── constants.js # Application constants
```
## 🧪 Development & Testing
### Backend Testing
```bash
cd backend
source venv/bin/activate
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
# API testing via Swagger
# Visit http://localhost:8000/docs
# Manual testing
python test_chat_fixes.py
```
### Frontend Development
```bash
cd frontend
# Development server with hot reload
npm run dev
# Build for production
npm run build
# Preview production build
npm run preview
# Lint code
npm run lint
```
### Database Migration
The application automatically creates database collections and indexes on startup.
## 📊 Monitoring
### Health Check
### Database Management
```bash
# View MongoDB collections
mongo contract_analysis
# View Redis cache
redis-cli
> KEYS *
# Clear ChromaDB indices
rm -rf backend/indices/chroma_db/
```
## 📊 Monitoring & Health Checks
### System Health
```bash
# Backend health
curl http://localhost:8000/health
# Database connectivity
curl http://localhost:8000/api/v1/admin/stats \
-H "Authorization: Bearer <admin-token>"
```
### System Stats (Admin)
```bash
curl -H "Authorization: Bearer <token>" http://localhost:8000/api/v1/admin/stats
```
### Performance Monitoring
- **Response Times**: Tracked per request with `X-Process-Time` header
- **Cache Hit Rates**: Monitor Redis performance
- **Document Processing**: Track success/failure rates
- **Vector Search**: Monitor ChromaDB query performance
## 🐛 Troubleshooting
### Common Issues
1. **MongoDB Connection Failed**
- Ensure MongoDB is running: `brew services start mongodb-community`
- Check connection string in `.env`
**1. MongoDB Connection Issues**
```bash
# Check if MongoDB is running
brew services list | grep mongodb
brew services start mongodb-community
2. **Redis Connection Failed**
- Ensure Redis is running: `brew services start redis`
- Application will continue without caching if Redis is unavailable
# Check connection string in .env
MONGODB_URL=mongodb://localhost:27017
```
3. **OpenAI API Errors**
- Verify API key in backend `.env`
- Check API quota and billing
**2. Redis Connection Issues**
```bash
# Redis is optional - app continues without caching
brew services start redis
4. **File Upload Issues**
- Check file size limits (50MB default)
- Verify file types are supported
- Ensure upload directory permissions
# Test Redis connection
redis-cli ping
```
### Logs
**3. Document Processing Failures**
- Check OpenAI API key validity
- Verify file format support
- Review file size limits (50MB default)
- Check upload directory permissions
- **Backend logs**: Console output from uvicorn
- **Frontend logs**: Browser console
- **Database logs**: MongoDB logs in data directory
**4. ChromaDB Issues**
```bash
# Clear and reinitialize indices
rm -rf backend/indices/chroma_db/
# Restart backend to recreate
```
## 🔄 Migration from v1.0
**5. Frontend Build Issues**
```bash
cd frontend
rm -rf node_modules package-lock.json
npm install
npm run dev
```
The new system provides complete feature parity with the original PHP application:
### Log Analysis
- **Backend**: Console output from uvicorn
- **Frontend**: Browser developer console
- **Database**: MongoDB logs in system logs
- **Processing**: Check document status in MongoDB
- ✅ All PHP functionality migrated to FastAPI
- ✅ SQLite data can be migrated to MongoDB
- ✅ Existing ChromaDB indices are compatible
- ✅ All document processing features preserved
- ✅ Enhanced performance and security
## 🚀 Production Deployment
## 🚀 Deployment
### Production Deployment
1. **Set production environment variables**
2. **Use production database URLs**
3. **Enable HTTPS with SSL certificates**
4. **Configure reverse proxy (Nginx)**
5. **Set up monitoring and logging**
6. **Regular backups of MongoDB**
### Production Checklist
- [ ] Set strong JWT secret key
- [ ] Configure production database URLs
- [ ] Enable HTTPS with SSL certificates
- [ ] Set up reverse proxy (Nginx/Apache)
- [ ] Configure monitoring and logging
- [ ] Set up regular MongoDB backups
- [ ] Disable debug mode
- [ ] Configure proper CORS origins
- [ ] Set up log rotation
### Docker Production
```bash
# Use production compose file
docker-compose -f docker-compose.prod.yml up -d
# Or build custom images
docker build -t contract-analysis-backend ./backend
docker build -t contract-analysis-frontend ./frontend
```
### Environment Variables for Production
```env
DEBUG=false
JWT_SECRET_KEY=your-ultra-secure-secret-key-here
CORS_ORIGINS=["https://yourdomain.com"]
CACHE_ENABLED=true
```
## 🔐 Security Considerations
### Implemented Security
- **JWT Authentication** with configurable expiration
- **Role-based Authorization** (Admin/User)
- **Input Validation** with Pydantic schemas
- **File Upload Validation** (type, size, content)
- **CORS Protection** with configurable origins
- **Environment Variable Protection**
- **SQL Injection Prevention** (NoSQL with validation)
- **XSS Prevention** (React built-in protection)
### Best Practices
- Regularly rotate JWT secret keys
- Use HTTPS in production
- Keep dependencies updated
- Monitor for security vulnerabilities
- Implement rate limiting for production
- Regular security audits
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Commit** changes (`git commit -m 'Add amazing feature'`)
4. **Push** to branch (`git push origin feature/amazing-feature`)
5. **Open** a Pull Request
### Development Guidelines
- Follow existing code style and conventions
- Add tests for new features
- Update documentation as needed
- Ensure all tests pass before submitting PR
## 📄 License
This project is licensed under the MIT License.
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- **OpenAI** - GPT-4 and embedding models
- **LlamaIndex** - RAG framework
- **ChromaDB** - Vector storage
- **FastAPI** - Modern Python web framework
- **React** - Frontend framework
- **[OpenAI](https://openai.com/)** - GPT-4 and embedding models
- **[LlamaIndex](https://www.llamaindex.ai/)** - RAG framework and document processing
- **[ChromaDB](https://www.trychroma.com/)** - Vector database for semantic search
- **[FastAPI](https://fastapi.tiangolo.com/)** - Modern Python web framework
- **[React](https://react.dev/)** - Frontend framework
- **[Tailwind CSS](https://tailwindcss.com/)** - Utility-first CSS framework
- **[Vite](https://vitejs.dev/)** - Fast frontend build tool
---
**Built with ❤️ for intelligent contract analysis**
**Built with ❤️ for intelligent contract analysis and document Q&A**
*For detailed migration information from v1.0, see [MIGRATION_PLAN.md](MIGRATION_PLAN.md)*
*For API testing guidance, see [API_TESTING_GUIDE.md](API_TESTING_GUIDE.md)*