diff --git a/README.md b/README.md
index 8c46023..369e752 100644
--- a/README.md
+++ b/README.md
@@ -1,349 +1,529 @@
 # Contract Analysis Tool v2.0

-A modern, production-ready Retrieval-Augmented Generation (RAG) application for intelligent contract analysis and document Q&A. Built with FastAPI backend and React frontend.
+A modern, production-ready Retrieval-Augmented Generation (RAG) application for intelligent contract analysis and document Q&A. Built with FastAPI backend and React frontend with advanced features including SSO integration, context-aware chat, and comprehensive document processing.

 ![Architecture](https://img.shields.io/badge/Backend-FastAPI-009688)
 ![Frontend](https://img.shields.io/badge/Frontend-React-61DAFB)
 ![Database](https://img.shields.io/badge/Database-MongoDB-47A248)
 ![Cache](https://img.shields.io/badge/Cache-Redis-DC382D)
+![Vector Store](https://img.shields.io/badge/VectorDB-ChromaDB-FF6B35)

-## 🚀 Features
+## 🚀 Key Features

-- **Modern Architecture**: FastAPI + React + MongoDB + Redis
-- **AI-Powered Analysis**: GPT-4 integration for contract analysis
-- **Document Q&A**: Natural language queries with RAG
-- **User Management**: Role-based access control
-- **Real-time Processing**: Async document processing
-- **Intelligent Caching**: Redis-based response caching
-- **Scalable Design**: Microservice-ready architecture
+### Core Functionality
+- **Modern Architecture**: FastAPI + React + MongoDB + Redis + ChromaDB
+- **AI-Powered Analysis**: OpenAI GPT-4 integration with contract summarization
+- **Advanced RAG System**: Context-aware document Q&A with source citations
+- **Document Processing**: Multi-format support (PDF, DOCX, DOC, TXT, CSV, JSON, HTML, MD, RTF)
+- **Vector Search**: ChromaDB for semantic similarity search
+
+### Authentication & Security
+- **Dual Authentication**: Local JWT + Azure AD/SSO integration
+- **Role-Based Access Control**: Admin/User permissions
+- **JWT Token Management**: Automatic refresh with 3-hour expiration
+- **Secure File Upload**: Validation, sanitization, and size limits
+
+### Advanced Chat System
+- **Context-Aware Conversations**: 24-hour rolling context window (max 10 messages)
+- **Smart Caching**: Context-dependent responses aren't cached, simple queries are
+- **Real-time Statistics**: Response times, cache hit rates, message counts
+- **Proper Message Ordering**: Chronological display with accurate timestamps
+- **Source Citations**: Direct references to document chunks in responses
+
+### Document Management
+- **Batch Processing**: Multiple document uploads with progress tracking
+- **Index Organization**: Create themed document collections
+- **Processing Pipeline**: PDF parsing → chunking → embedding → vector storage
+- **Status Tracking**: Real-time processing and embedding status
+- **Contract Summaries**: Automated contract analysis and key point extraction
+
+### Admin Features
+- **System Statistics**: Monitor usage, performance, and system health
+- **User Management**: Create, edit, and manage user accounts
+- **Document Reprocessing**: Retry failed documents
+- **Index Management**: Create and manage document indices
+- **Advanced RAG Interface**: Admin-specific query tools

 ## 🏗️ Architecture

 ```
-React Frontend → FastAPI Backend → MongoDB + ChromaDB → OpenAI API
-                        ↓
-                   Redis Cache
+React Frontend (Vite + Tailwind) → FastAPI Backend → MongoDB + ChromaDB → OpenAI API
+                                          ↓
+                                     Redis Cache
+                                          ↓
+                               Azure AD/SSO (Optional)
 ```

+**Data Flow:**
+1. Documents uploaded through React frontend
+2. FastAPI processes with LlamaIndex (chunking, parsing)
+3. OpenAI embeddings stored in ChromaDB
+4. Metadata and user data in MongoDB
+5. RAG queries combine vector search + GPT-4 generation
+6. Redis caches responses for performance
+
 ## 📋 Prerequisites

-- **Python 3.11+**
-- **Node.js 18+**
-- **MongoDB 7+**
-- **Redis 7+**
-- **OpenAI API Key**
-- **LlamaParse API Key** (optional)
+- **Python 3.11+** (Backend)
+- **Node.js 18+** (Frontend)
+- **MongoDB 7+** (Document metadata)
+- **Redis 7+** (Caching - optional)
+- **OpenAI API Key** (Required)
+- **LlamaParse API Key** (Optional - enhanced PDF processing)

-## 🛠️ Installation
+## 🛠️ Quick Start

 ### Option 1: Docker (Recommended)

-1. **Clone the repository**
-   ```bash
-   git clone
-   cd llama-contracts-master
-   ```
+```bash
+# Clone repository
+git clone
+cd llama-contracts-master

-2. **Set up environment variables**
-   ```bash
-   # Backend
-   cp backend/.env.example backend/.env
-   # Edit backend/.env with your API keys
-
-   # Frontend
-   cp frontend/.env.example frontend/.env
-   ```
+# Configure environment
+cp backend/.env.example backend/.env
+# Edit backend/.env with your API keys

-3. **Start with Docker Compose**
-   ```bash
-   cd backend
-   docker-compose up -d
-   ```
+# Start backend services
+cd backend
+docker-compose up -d

-4. **Start the frontend**
-   ```bash
-   cd frontend
-   npm install
-   npm run dev
-   ```
+# Start frontend
+cd ../frontend
+npm install
+npm run dev
+```

-### Option 2: Manual Setup
+### Option 2: Manual Development Setup

 #### Backend Setup
+```bash
+cd backend
+python3 -m venv venv
+source venv/bin/activate # Windows: venv\Scripts\activate
+pip install -r requirements.txt

-1. **Create Python virtual environment**
-   ```bash
-   cd backend
-   python3 -m venv venv
-   source venv/bin/activate # On Windows: venv\Scripts\activate
-   ```
+# Configure environment
+cp .env.example .env
+# Edit .env file with your settings

-2. **Install dependencies**
-   ```bash
-   pip install -r requirements.txt
-   ```
+# Start services (macOS with Homebrew)
+brew services start mongodb-community
+brew services start redis

-3. **Set up environment variables**
-   ```bash
-   cp .env.example .env
-   # Edit .env with your configuration
-   ```
-
-4. **Start MongoDB and Redis**
-   ```bash
-   # MongoDB
-   brew services start mongodb/brew/mongodb-community
-
-   # Redis
-   brew services start redis
-   ```
-
-5. **Start the backend**
-   ```bash
-   uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
-   ```
+# Initialize database and start server
+uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
+```

 #### Frontend Setup
+```bash
+cd frontend
+npm install

-1. **Install dependencies**
-   ```bash
-   cd frontend
-   npm install
-   ```
+# Configure environment
+cp .env.example .env
+# Edit frontend/.env

-2. **Set up environment variables**
-   ```bash
-   cp .env.example .env
-   ```
+# Start development server
+npm run dev
+```

-3. **Start the development server**
-   ```bash
-   npm run dev
-   ```
-
-## 🔧 Configuration
-
-### Backend Environment Variables
+## ⚙️ Configuration
+### Backend Environment Variables (.env)

 ```env
-# Database
+# Database Configuration
 MONGODB_URL=mongodb://localhost:27017
 DATABASE_NAME=contract_analysis

-# Redis
+# Redis Cache (Optional)
 REDIS_URL=redis://localhost:6379

 # Authentication
-JWT_SECRET_KEY=your-super-secret-jwt-key
+JWT_SECRET_KEY=your-super-secret-jwt-key-change-in-production
 JWT_ALGORITHM=HS256
-JWT_EXPIRE_MINUTES=30
+JWT_EXPIRE_MINUTES=180

-# OpenAI
+# Azure AD/SSO (Optional)
+AZURE_CLIENT_ID=your-azure-client-id
+AZURE_TENANT_ID=your-azure-tenant-id
+AZURE_REDIRECT_URI=http://localhost:3000/auth/callback
+SSO_ENABLED=false
+ALLOW_LOCAL_ADMIN=true
+
+# AI Services (Required)
 OPENAI_API_KEY=your-openai-api-key
 LLAMAPARSE_API_KEY=your-llamaparse-api-key

-# Application
-DEBUG=false
-CORS_ORIGINS=["http://localhost:3000"]
+# Application Settings
+DEBUG=true
+CORS_ORIGINS=["http://localhost:3000","http://localhost:3002"]
 UPLOAD_DIR=./uploads
 INDICES_DIR=./indices

-# Cache
-CACHE_ENABLED=true
+# Performance Settings
+CACHE_ENABLED=false # Disabled for development
 CACHE_TTL=3600
+MAX_DOCUMENT_CHARS=1000000
+MAX_SUMMARY_CHARS=100000
 ```

-### Frontend Environment Variables
-
+### Frontend Environment Variables (.env)
 ```env
 VITE_API_URL=http://localhost:8000
 VITE_APP_NAME=Contract Analysis Tool
 ```

-## 🚀 Usage
+## 🚀 Getting Started

-### Swagger Testing
-  Method 1: Get Token via Swagger UI (Easiest)
-
-  1. Go to http://localhost:8000/docs
-  2. First, initialize the default users by calling
-  the /api/v1/auth/init-users endpoint
-  3. Then use the /api/v1/auth/login endpoint with
-  these credentials:
-
-  Admin User:
-  - email: admin@oliver.agency
-  - password: admin123
-
-  Regular User:
-  - email: user@oliver.agency
-  - password: user123
-
-  4. Copy the access_token from the response
-  5. Click the "Authorize" button at the top of
-  Swagger UI
-  6. Enter: Bearer YOUR_TOKEN_HERE
-
-### Initial Setup
-
-1. **Access the application**: http://localhost:3000
-2. **Initialize default users** (first time only):
-   ```bash
-   curl -X POST http://localhost:8000/api/v1/auth/init-users
-   ```
-
-### Default Credentials
+### 1. Initialize System
+```bash
+# Create default users (first time only)
+curl -X POST http://localhost:8000/api/v1/auth/init-users
+```
+### 2. Default Credentials

 - **Admin**: `admin@oliver.agency` / `admin123`
 - **User**: `user@oliver.agency` / `user123`

-### Workflow
-
-1. **Login** with admin or user credentials
-2. **Create an Index** for your document collection
-3. **Upload Documents** to the index
-4. **Chat** with your documents using natural language
-5. **Manage Users** (admin only)
+### 3. Basic Workflow
+1. **Login** at http://localhost:3000/login
+2. **Create Index** for your document collection
+3. **Upload Documents** (supports drag-and-drop)
+4. **Wait for Processing** (documents → chunks → embeddings)
+5. **Start Chatting** with natural language queries
+6. **Review Sources** cited in AI responses

 ## 📚 API Documentation

-- **FastAPI Docs**: http://localhost:8000/docs (development only)
-- **ReDoc**: http://localhost:8000/redoc (development only)
+### Development URLs
+- **Application**: http://localhost:3000
+- **API Docs**: http://localhost:8000/docs
+- **ReDoc**: http://localhost:8000/redoc
+- **Health Check**: http://localhost:8000/health

-### Key Endpoints
+### Core API Endpoints

-- `POST /api/v1/auth/login` - User authentication
+#### Authentication
+- `POST /api/v1/auth/login` - Local login
+- `POST /api/v1/auth/register` - User registration
+- `GET /api/v1/auth/me` - Current user info
+- `POST /api/v1/auth/refresh` - Token refresh
+- `POST /api/v1/auth/sso/validate` - SSO login
+- `GET /api/v1/auth/sso/config` - SSO configuration
+
+#### Document Management
+- `POST /api/v1/documents/upload` - Upload documents to index
+- `GET /api/v1/documents/index/{index_id}` - List documents in index
+- `GET /api/v1/documents/{document_id}` - Document details
+- `DELETE /api/v1/documents/{document_id}` - Delete document
+- `GET /api/v1/documents/{document_id}/summary` - Contract summary
+- `POST /api/v1/documents/{document_id}/summary/reprocess` - Regenerate summary
+
+#### Index Management
 - `POST /api/v1/indices/create` - Create document index
-- `POST /api/v1/documents/upload` - Upload documents
-- `POST /api/v1/chat/query` - Query documents
+- `GET /api/v1/indices/` - List user indices
+
+#### Chat System
+- `POST /api/v1/chat/query` - RAG query with context
+- `GET /api/v1/chat/history/{index_id}` - Conversation history
+- `DELETE /api/v1/chat/history/{index_id}` - Clear chat history
+
+#### Admin Operations
 - `GET /api/v1/admin/stats` - System statistics
+- `POST /api/v1/admin/documents/upload-single` - Single document upload
+- `POST /api/v1/admin/documents/upload-multiple` - Batch upload
+- `POST /api/v1/admin/documents/{document_id}/reprocess` - Reprocess document
+- `GET /api/v1/admin/indices` - All system indices
+- `POST /api/v1/admin/chat/query` - Admin RAG interface

-## 🔒 Security Features
+## 🔧 Advanced Features

-- **JWT Authentication** with role-based access
-- **Input Validation** with Pydantic schemas
-- **CORS Protection** for frontend integration
-- **File Upload Validation** with type/size checks
-- **Rate Limiting** (configurable)
-- **Environment Variable Protection**
+### Context-Aware Chat System
+- **Conversation Memory**: AI remembers last 10 messages within 24 hours
+- **Smart Context Usage**: Follow-up questions reference previous conversation
+- **Context Indicators**: UI shows when AI uses conversation history
+- **Session Statistics**: Track response times and context usage

-## ⚡ Performance Features
+### Document Processing Pipeline
+1. **Upload**: Drag-and-drop or browse files
+2. **Validation**: File type, size, and content checks
+3. **Processing**: LlamaIndex parsing and chunking
+4. **Embedding**: OpenAI embeddings generation
+5. **Storage**: ChromaDB vector storage + MongoDB metadata
+6. **Indexing**: Ready for semantic search

-- **Async Processing** throughout the backend
-- **Redis Caching** for API responses
-- **Vector Search** with ChromaDB
-- **Connection Pooling** for databases
-- **Optimized Queries** with MongoDB indexes
+### SSO Integration (Optional)
+- **Azure AD Support**: Enterprise authentication
+- **Local Fallback**: Admin accounts always available
+- **Role Mapping**: Automatic role assignment from SSO claims
+- **Session Management**: Unified token handling

-## 🧪 Development
+### Advanced Caching Strategy
+- **Smart Cache Logic**: Context-dependent queries bypass cache
+- **Simple Query Cache**: Repeated questions served from Redis
+- **TTL Management**: Configurable cache expiration
+- **Cache Statistics**: Monitor hit rates and performance

-### Backend Development
+## 🏗️ Project Structure
+### Backend (`/backend`)
+```
+app/
+├── main.py  # FastAPI application entry point
+├── config/
+│   ├── settings.py  # Environment configuration
+│   └── database.py  # Database connections
+├── api/v1/  # API endpoints
+│   ├── auth.py  # Authentication routes
+│   ├── documents.py  # Document management
+│   ├── indices.py  # Index operations
+│   ├── chat.py  # Chat/RAG system
+│   └── admin.py  # Admin operations
+├── models/  # Data models
+│   ├── user.py  # User models
+│   ├── document.py  # Document models
+│   ├── index.py  # Index models
+│   ├── chat.py  # Chat models
+│   └── contract_summary.py  # Summary models
+├── services/  # Business logic
+│   ├── document_processor.py  # Document processing
+│   ├── rag_service.py  # RAG implementation
+│   ├── chat_context_service.py  # Context management
+│   ├── contract_summary_service.py  # Summarization
+│   └── sso_service.py  # SSO integration
+├── core/  # Core utilities
+│   ├── auth.py  # Authentication logic
+│   ├── security.py  # Security utilities
+│   ├── cache.py  # Caching logic
+│   └── chroma_client.py  # ChromaDB client
+└── utils/  # Helper utilities
+    └── file_utils.py  # File operations
+```
+
+### Frontend (`/frontend`)
+```
+src/
+├── App.jsx  # Main application with routing
+├── pages/  # Page components
+│   ├── HomePage.jsx  # Landing page
+│   ├── Dashboard.jsx  # Main dashboard
+│   ├── DocumentManager.jsx  # Document management
+│   ├── ChatInterface.jsx  # Chat interface
+│   └── AdminPanel.jsx  # Admin interface
+├── components/  # Reusable components
+│   ├── auth/  # Authentication components
+│   ├── admin/  # Admin-specific components
+│   ├── chat/  # Chat components
+│   ├── documents/  # Document components
+│   ├── indices/  # Index components
+│   └── common/  # Shared components
+├── services/  # API service layer
+│   ├── authService.js  # Authentication API
+│   ├── documentService.js  # Document API
+│   ├── chatService.js  # Chat API
+│   ├── indexService.js  # Index API
+│   └── adminService.js  # Admin API
+├── context/  # React context providers
+│   └── AuthContext.jsx  # Authentication context
+└── utils/  # Frontend utilities
+    └── constants.js  # Application constants
+```
+
+## 🧪 Development & Testing
+
+### Backend Testing

 ```bash
 cd backend
 source venv/bin/activate
-uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
+
+# API testing via Swagger
+# Visit http://localhost:8000/docs
+
+# Manual testing
+python test_chat_fixes.py
 ```

 ### Frontend Development
-
 ```bash
 cd frontend
+
+# Development server with hot reload
 npm run dev
+
+# Build for production
+npm run build
+
+# Preview production build
+npm run preview
+
+# Lint code
+npm run lint
 ```

-### Database Migration
-
-The application automatically creates database collections and indexes on startup.
-
-## 📊 Monitoring
-
-### Health Check
-
+### Database Management
 ```bash
+# View MongoDB collections
+mongo contract_analysis
+
+# View Redis cache
+redis-cli
+> KEYS *
+
+# Clear ChromaDB indices
+rm -rf backend/indices/chroma_db/
+```
+
+## 📊 Monitoring & Health Checks
+
+### System Health
+```bash
+# Backend health
 curl http://localhost:8000/health
+
+# Database connectivity
+curl http://localhost:8000/api/v1/admin/stats \
+  -H "Authorization: Bearer "
 ```

-### System Stats (Admin)
-
-```bash
-curl -H "Authorization: Bearer " http://localhost:8000/api/v1/admin/stats
-```
+### Performance Monitoring
+- **Response Times**: Tracked per request with `X-Process-Time` header
+- **Cache Hit Rates**: Monitor Redis performance
+- **Document Processing**: Track success/failure rates
+- **Vector Search**: Monitor ChromaDB query performance

 ## 🐛 Troubleshooting

 ### Common Issues

-1. **MongoDB Connection Failed**
-   - Ensure MongoDB is running: `brew services start mongodb-community`
-   - Check connection string in `.env`
+**1. MongoDB Connection Issues**
+```bash
+# Check if MongoDB is running
+brew services list | grep mongodb
+brew services start mongodb-community

-2. **Redis Connection Failed**
-   - Ensure Redis is running: `brew services start redis`
-   - Application will continue without caching if Redis is unavailable
+# Check connection string in .env
+MONGODB_URL=mongodb://localhost:27017
+```

-3. **OpenAI API Errors**
-   - Verify API key in backend `.env`
-   - Check API quota and billing
+**2. Redis Connection Issues**
+```bash
+# Redis is optional - app continues without caching
+brew services start redis

-4. **File Upload Issues**
-   - Check file size limits (50MB default)
-   - Verify file types are supported
-   - Ensure upload directory permissions
+# Test Redis connection
+redis-cli ping
+```

-### Logs
+**3. Document Processing Failures**
+- Check OpenAI API key validity
+- Verify file format support
+- Review file size limits (50MB default)
+- Check upload directory permissions

-- **Backend logs**: Console output from uvicorn
-- **Frontend logs**: Browser console
-- **Database logs**: MongoDB logs in data directory
+**4. ChromaDB Issues**
+```bash
+# Clear and reinitialize indices
+rm -rf backend/indices/chroma_db/
+# Restart backend to recreate
+```

-## 🔄 Migration from v1.0
+**5. Frontend Build Issues**
+```bash
+cd frontend
+rm -rf node_modules package-lock.json
+npm install
+npm run dev
+```

-The new system provides complete feature parity with the original PHP application:
+### Log Analysis
+- **Backend**: Console output from uvicorn
+- **Frontend**: Browser developer console
+- **Database**: MongoDB logs in system logs
+- **Processing**: Check document status in MongoDB

-- ✅ All PHP functionality migrated to FastAPI
-- ✅ SQLite data can be migrated to MongoDB
-- ✅ Existing ChromaDB indices are compatible
-- ✅ All document processing features preserved
-- ✅ Enhanced performance and security
+## 🚀 Production Deployment

-## 🚀 Deployment
-
-### Production Deployment
-
-1. **Set production environment variables**
-2. **Use production database URLs**
-3. **Enable HTTPS with SSL certificates**
-4. **Configure reverse proxy (Nginx)**
-5. **Set up monitoring and logging**
-6. **Regular backups of MongoDB**
+### Production Checklist
+- [ ] Set strong JWT secret key
+- [ ] Configure production database URLs
+- [ ] Enable HTTPS with SSL certificates
+- [ ] Set up reverse proxy (Nginx/Apache)
+- [ ] Configure monitoring and logging
+- [ ] Set up regular MongoDB backups
+- [ ] Disable debug mode
+- [ ] Configure proper CORS origins
+- [ ] Set up log rotation

 ### Docker Production
-
 ```bash
+# Use production compose file
 docker-compose -f docker-compose.prod.yml up -d
+
+# Or build custom images
+docker build -t contract-analysis-backend ./backend
+docker build -t contract-analysis-frontend ./frontend
 ```

+### Environment Variables for Production
+```env
+DEBUG=false
+JWT_SECRET_KEY=your-ultra-secure-secret-key-here
+CORS_ORIGINS=["https://yourdomain.com"]
+CACHE_ENABLED=true
+```
+
+## 🔐 Security Considerations
+
+### Implemented Security
+- **JWT Authentication** with configurable expiration
+- **Role-based Authorization** (Admin/User)
+- **Input Validation** with Pydantic schemas
+- **File Upload Validation** (type, size, content)
+- **CORS Protection** with configurable origins
+- **Environment Variable Protection**
+- **SQL Injection Prevention** (NoSQL with validation)
+- **XSS Prevention** (React built-in protection)
+
+### Best Practices
+- Regularly rotate JWT secret keys
+- Use HTTPS in production
+- Keep dependencies updated
+- Monitor for security vulnerabilities
+- Implement rate limiting for production
+- Regular security audits
+
 ## 🤝 Contributing

-1. Fork the repository
-2. Create a feature branch
-3. Make your changes
-4. Add tests if applicable
-5. Submit a pull request
+1. **Fork** the repository
+2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
+3. **Commit** changes (`git commit -m 'Add amazing feature'`)
+4. **Push** to branch (`git push origin feature/amazing-feature`)
+5. **Open** a Pull Request
+
+### Development Guidelines
+- Follow existing code style and conventions
+- Add tests for new features
+- Update documentation as needed
+- Ensure all tests pass before submitting PR

 ## 📄 License

-This project is licensed under the MIT License.
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

 ## 🙏 Acknowledgments

-- **OpenAI** - GPT-4 and embedding models
-- **LlamaIndex** - RAG framework
-- **ChromaDB** - Vector storage
-- **FastAPI** - Modern Python web framework
-- **React** - Frontend framework
+- **[OpenAI](https://openai.com/)** - GPT-4 and embedding models
+- **[LlamaIndex](https://www.llamaindex.ai/)** - RAG framework and document processing
+- **[ChromaDB](https://www.trychroma.com/)** - Vector database for semantic search
+- **[FastAPI](https://fastapi.tiangolo.com/)** - Modern Python web framework
+- **[React](https://react.dev/)** - Frontend framework
+- **[Tailwind CSS](https://tailwindcss.com/)** - Utility-first CSS framework
+- **[Vite](https://vitejs.dev/)** - Fast frontend build tool

 ---

-**Built with ❤️ for intelligent contract analysis**
\ No newline at end of file
+**Built with ❤️ for intelligent contract analysis and document Q&A**
+
+*For detailed migration information from v1.0, see [MIGRATION_PLAN.md](MIGRATION_PLAN.md)*
+*For API testing guidance, see [API_TESTING_GUIDE.md](API_TESTING_GUIDE.md)*
\ No newline at end of file
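
The chat rules that the new README text describes (a 24-hour rolling context window capped at 10 messages, with caching skipped for context-dependent replies) can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's `chat_context_service.py`: the names `ChatMessage`, `select_context`, and `is_cacheable` are hypothetical, and treating any non-empty context as "context-dependent" is an assumption about how the rule is applied.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Limits quoted in the README text; the constant names are hypothetical.
CONTEXT_WINDOW = timedelta(hours=24)
MAX_CONTEXT_MESSAGES = 10


@dataclass(frozen=True)
class ChatMessage:
    role: str            # "user" or "assistant"
    content: str
    created_at: datetime


def select_context(history, now):
    """Return the newest messages inside the rolling window, oldest first."""
    cutoff = now - CONTEXT_WINDOW
    recent = [m for m in history if m.created_at >= cutoff]
    recent.sort(key=lambda m: m.created_at)
    return recent[-MAX_CONTEXT_MESSAGES:]


def is_cacheable(context):
    """Assumed rule: cache a reply only when no conversation context was used."""
    return len(context) == 0


# Thirty hourly messages: "question 0" is the newest, "question 29" the oldest.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
history = [
    ChatMessage("user", f"question {i}", now - timedelta(hours=i))
    for i in range(30)
]
context = select_context(history, now)
print(len(context), context[0].content, context[-1].content)
# prints: 10 question 9 question 0
```

Keeping the window filter and the message cap in one pure function makes the behaviour easy to unit test without Redis or MongoDB; a real service would presumably load `history` from the stored chat records.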