
# HP GraphRAG Chatbot - Technical Documentation

## Table of Contents

1. [System Overview](#system-overview)
2. [Architecture](#architecture)
3. [Technology Stack](#technology-stack)
4. [Data Flow](#data-flow)
5. [Database Design](#database-design)
6. [API Reference](#api-reference)
7. [User Flow](#user-flow)
8. [Security](#security)
9. [Deployment](#deployment)
10. [Development](#development)
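11. [Troubleshooting](#troubleshooting)
12. [Performance Considerations](#performance-considerations)
13. [Future Enhancements](#future-enhancements)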

---

## System Overview

The HP GraphRAG Chatbot is a conversational AI system that combines vector search with a knowledge graph to answer questions about HP marketing materials and brand guidelines. It processes multimodal documents (text and images) and uses a custom ReAct agent with hybrid vector and graph retrieval to find relevant content and generate responses.

### Key Features

- **Multimodal Document Processing**: Extracts text and images from PDFs, PowerPoint, and other marketing documents
- **GraphRAG Architecture**: Combines vector similarity search with knowledge graph community detection
- **Custom ReAct Agent**: Implements reasoning and action patterns for intelligent query processing
- **Session Management**: Maintains conversation context across multiple interactions
- **Image Display**: Shows relevant document screenshots alongside responses
- **Authentication**: Microsoft Authentication Library (MSAL) integration
- **Conversation History**: Persistent storage and retrieval of chat sessions

---

## Architecture

```mermaid
graph TB
    subgraph "Frontend (React)"
        FE[Chat Interface]
        AUTH[MSAL Auth]
        CONV[Conversation Manager]
        UI[UI Components]
    end

    subgraph "Backend (Python/Flask)"
        API[Flask Routes]
        AGENT[ReAct Agent]
        GRAPH[GraphRAG Engine]
        SESSION[Session Manager]
        PARSE[Document Parser]
    end

    subgraph "AI/ML Layer"
        LLM[OpenAI GPT-4]
        EMBED[Text Embeddings]
        LLAMAPARSE[LlamaParse]
    end

    subgraph "Data Storage"
        NEO4J[(Neo4j<br/>Knowledge Graph)]
        MONGO[(MongoDB<br/>Conversations)]
        VECTOR[(Vector Index<br/>LlamaIndex)]
        FILES[File Storage<br/>Images/Documents]
    end

    FE --> API
    AUTH --> API
    CONV --> API

    API --> AGENT
    API --> SESSION
    API --> PARSE

    AGENT --> GRAPH
    GRAPH --> NEO4J
    GRAPH --> VECTOR
    AGENT --> LLM

    PARSE --> LLAMAPARSE
    LLAMAPARSE --> FILES
    PARSE --> EMBED
    EMBED --> VECTOR

    SESSION --> MONGO

    style FE fill:#e1f5fe
    style API fill:#f3e5f5
    style AGENT fill:#e8f5e8
    style NEO4J fill:#fff3e0
    style MONGO fill:#f1f8e9
```

### Component Breakdown

#### Frontend (React)
- **Chat Interface**: Main conversational UI with message bubbles, image viewing, and input handling
- **Authentication**: MSAL-based Microsoft authentication
- **Conversation Manager**: Handles multiple conversation sessions and history
- **Theme Toggle**: Dark/light mode support

#### Backend (Python/Flask)
- **Flask Routes**: RESTful API endpoints for chat, authentication, file serving
- **ReAct Agent**: Custom implementation with reasoning, action, and observation cycles
- **GraphRAG Engine**: Hybrid retrieval combining vector search with graph-based community detection
- **Session Manager**: Maps frontend sessions to database conversations
- **Document Parser**: LlamaParse integration for multimodal document processing
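
The reasoning, action, and observation cycle of a ReAct agent can be illustrated with a minimal sketch. This is not the project's agent code: `react_loop`, `Step`, the `policy` callable, and the `tools` mapping are hypothetical names chosen for illustration, and the language model is replaced by a plain function so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One reasoning cycle: the action taken and what was observed."""
    action: str
    action_input: str
    observation: str


# A policy inspects the query and prior steps and returns either
# ("act", tool_name, tool_input) or ("finish", answer, "").
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def react_loop(query: str, policy: Policy,
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Tuple[Optional[str], List[Step]]:
    """Alternate between choosing an action and recording its observation."""
    steps: List[Step] = []
    for _ in range(max_steps):
        kind, name_or_answer, tool_input = policy(query, steps)
        if kind == "finish":
            return name_or_answer, steps
        observation = tools[name_or_answer](tool_input)
        steps.append(Step(name_or_answer, tool_input, observation))
    return None, steps  # step budget exhausted without a final answer


def demo_policy(query: str, steps: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: search once, then answer from the observation."""
    if not steps:
        return ("act", "search", query)
    return ("finish", f"Based on: {steps[-1].observation}", "")


# Hypothetical tool registry with a single stubbed search tool.
demo_tools = {"search": lambda q: f"stub result for '{q}'"}
answer, trace = react_loop("logo clear space", demo_policy, demo_tools)
```

The `max_steps` bound is a design choice in this sketch: it keeps a misbehaving policy from looping indefinitely, and the recorded `trace` plays the role of the reasoning steps returned to the client.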

#### Data Layer
- **Neo4j**: Stores knowledge graph with entities, relationships, and communities
- **MongoDB**: Persists user conversations, messages, and session state
- **Vector Index**: LlamaIndex-based semantic search capabilities
- **File Storage**: Local filesystem for processed images and documents

---

## Technology Stack

### Backend
- **Framework**: Flask + Hypercorn (ASGI)
- **AI/ML**:
  - OpenAI GPT-4 (LLM)
  - text-embedding-3-small (embeddings)
  - LlamaParse (document processing)
  - LlamaIndex (vector indexing)
- **Databases**:
  - Neo4j (knowledge graph)
  - MongoDB (conversations)
- **Languages**: Python 3.9+

### Frontend
- **Framework**: React 18 + Vite
- **Styling**: TailwindCSS + Shadcn/ui
- **Authentication**: Microsoft Authentication Library (MSAL)
- **Languages**: JavaScript/JSX

### Infrastructure
- **Web Server**: Hypercorn (ASGI server)
- **Containerization**: Docker support
- **Deployment**: Azure/Cloud-based

---

## Data Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant API
    participant Agent
    participant GraphRAG
    participant Neo4j
    participant Vector
    participant OpenAI
    participant MongoDB

    User->>Frontend: Send message
    Frontend->>API: POST /chat
    API->>Agent: Process query

    Agent->>GraphRAG: Retrieve context
    GraphRAG->>Vector: Vector similarity search
    GraphRAG->>Neo4j: Community detection
    GraphRAG->>OpenAI: Generate synthesis
    GraphRAG->>Agent: Combined context

    Agent->>OpenAI: Generate response
    OpenAI->>Agent: Response + reasoning

    Agent->>API: Structured response
    API->>MongoDB: Store conversation
    API->>Frontend: Response with sources/images
    Frontend->>User: Display response
```
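
The retrieval step in the sequence above can be sketched as a merge of vector hits and community summaries. The function name `hybrid_context` and the input shapes are assumptions made for this example; the actual GraphRAG engine may select and combine context differently. The `retrieval_method` labels reuse the values documented for the `/chat` response.

```python
from typing import Dict, List, Set


def hybrid_context(chunks: List[Dict], communities: List[Dict],
                   top_k: int = 3) -> Dict:
    """Combine vector hits with summaries of the communities they touch.

    chunks: [{"text": str, "score": float, "entity_ids": [str, ...]}]
    communities: [{"community_id": int, "summary": str, "entity_ids": [...]}]
    """
    # Keep the highest-scoring chunks from the vector search.
    top = sorted(chunks, key=lambda c: c["score"], reverse=True)[:top_k]
    seen: Set[str] = set()
    for chunk in top:
        seen.update(chunk["entity_ids"])
    # Select communities that share at least one entity with those chunks.
    related = [com for com in communities if seen & set(com["entity_ids"])]
    method = "graphrag_hybrid" if related else "vector_only"
    return {
        "chunks": [c["text"] for c in top],
        "summaries": [com["summary"] for com in related],
        "retrieval_method": method,
    }
```

When no community overlaps the retrieved chunks, the sketch falls back to the vector results alone and labels the context accordingly.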

### Document Processing Flow

```mermaid
flowchart TD
    UPLOAD[Document Upload] --> PARSE[LlamaParse Processing]
    PARSE --> EXTRACT[Extract Text + Images]
    EXTRACT --> SPLIT[Semantic Splitting]
    SPLIT --> EMBED[Generate Embeddings]
    EMBED --> VECTOR[Store in Vector Index]
    SPLIT --> GRAPH[Extract Entities/Relations]
    GRAPH --> NEO4J[Store in Neo4j]
    EXTRACT --> IMAGES[Save Page Images]
    IMAGES --> STORAGE[File Storage]
    NEO4J --> COMMUNITY[Community Detection]
    COMMUNITY --> CACHE[Cache Communities]
```
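
The fan-out after the splitting step, where each chunk feeds both the embedding branch and the entity-extraction branch, can be sketched as follows. This is a simplified illustration: `split_paragraphs` is a paragraph-packing stand-in for semantic splitting, and `embed` and `extract` are hypothetical callables standing in for the embedding model and the entity extractor.

```python
from typing import Callable, Dict, List


def split_paragraphs(text: str, max_chars: int = 200) -> List[str]:
    """Stand-in for semantic splitting: pack paragraphs into bounded chunks."""
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def process_document(text: str,
                     embed: Callable[[str], List[float]],
                     extract: Callable[[str], List[str]]) -> Dict:
    """Fan each chunk out to the embedding and entity-extraction branches."""
    chunks = split_paragraphs(text)
    return {
        "vectors": [(chunk, embed(chunk)) for chunk in chunks],
        "entities": sorted({e for chunk in chunks for e in extract(chunk)}),
    }
```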

---

## Database Design

### Neo4j Knowledge Graph Schema

```mermaid
erDiagram
    Entity {
        string name
        string label
        string description
        dict properties
    }

    Relation {
        string label
        string source_id
        string target_id
        string description
        dict properties
    }

    Community {
        int community_id
        text summary
        list entity_ids
    }

    Entity ||--o{ Relation : participates_in
    Community ||--o{ Entity : contains
```
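
For readers who prefer code to diagrams, the schema above maps naturally onto plain data classes. This is only an in-memory mirror of the diagram, not the project's Neo4j access layer; the helper `relations_of` and the assumption that entities are referenced by a string ID are illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Entity:
    name: str
    label: str
    description: str = ""
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relation:
    label: str
    source_id: str
    target_id: str
    description: str = ""
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Community:
    community_id: int
    summary: str
    entity_ids: List[str] = field(default_factory=list)


def relations_of(entity_id: str, relations: List[Relation]) -> List[Relation]:
    """Return relations in which the entity is the source or the target."""
    return [r for r in relations if entity_id in (r.source_id, r.target_id)]
```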

### MongoDB Collections Schema

```mermaid
erDiagram
    Users {
        ObjectId _id
        string username
        string email
        datetime created_at
        datetime last_login
    }

    Conversations {
        ObjectId _id
        string session_id
        ObjectId user_id
        string title
        datetime created_at
        datetime last_updated
        boolean is_deleted
    }

    Messages {
        ObjectId _id
        ObjectId conversation_id
        string role
        text content
        array sources
        array reasoning
        array images
        datetime timestamp
    }

    Users ||--o{ Conversations : owns
    Conversations ||--o{ Messages : contains
```
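
A document for the `Messages` collection could be assembled as in the sketch below. The helper name `build_message` is hypothetical, and the role check assumes the two roles listed in the API reference (`user` and `assistant`); the real persistence code may differ.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def build_message(conversation_id: Any, role: str, content: str,
                  sources: Optional[List[dict]] = None,
                  reasoning: Optional[List[dict]] = None,
                  images: Optional[List[dict]] = None) -> Dict[str, Any]:
    """Assemble a Messages document; MongoDB assigns `_id` on insert."""
    if role not in ("user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    return {
        "conversation_id": conversation_id,
        "role": role,
        "content": content,
        "sources": sources or [],
        "reasoning": reasoning or [],
        "images": images or [],
        "timestamp": datetime.now(timezone.utc),
    }
```

Defaulting the array fields to empty lists keeps user messages, which carry no sources or images, structurally identical to assistant messages.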

---

## API Reference

### Authentication
All API endpoints require authentication via the `X-MS-USERNAME` header (except in development mode).
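
As an illustration, a client could attach the header as shown below. The helper `build_request` is a hypothetical name, and the base URL in the usage note is an assumption derived from the `PORT` value in the deployment section. The sketch only prepares the request and does not send it.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_request(base_url: str, path: str, username: str,
                  payload: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Prepare (but do not send) an API request carrying the auth header."""
    headers = {"X-MS-USERNAME": username}
    data = None
    if payload is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/" + path.lstrip("/"),
        data=data,
        headers=headers,
        # Requests with a JSON body are sent as POST, others as GET.
        method="POST" if payload is not None else "GET",
    )
```

For example, `build_request("http://localhost:8746", "/conversations", "user@example.com")` yields a GET request that can be passed to `urllib.request.urlopen`.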

### Core Endpoints

#### POST /chat
Processes chat messages and returns AI responses.

**Request:**
```json
{
    "message": "string",
    "sessionId": "string"
}
```

**Response:**
```json
{
    "status": "success",
    "data": {
        "response": "string",
        "sources": [
            {
                "content": "string",
                "tool_name": "string",
                "retrieval_method": "vector_only|graphrag_hybrid"
            }
        ],
        "reasoning": [
            {
                "type": "ActionReasoningStep|ObservationReasoningStep",
                "action": "string",
                "observation": "string"
            }
        ],
        "images": [
            {
                "filename": "string",
                "document": "string",
                "page": "number",
                "url_encoded_filename": "string"
            }
        ]
    }
}
```
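
A client might turn the `images` entries of this response into fetchable URLs as sketched below. The function name `image_urls` is hypothetical, and the sketch assumes that `url_encoded_filename` is the value expected in the path of `GET /images/{filename}`.

```python
from typing import Any, Dict, List


def image_urls(payload: Dict[str, Any], base_url: str) -> List[str]:
    """Map the `images` entries of a /chat response to /images/ URLs."""
    if payload.get("status") != "success":
        return []
    images = payload.get("data", {}).get("images", [])
    root = base_url.rstrip("/")
    return [f"{root}/images/{img['url_encoded_filename']}" for img in images]
```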

#### GET /status
Returns system initialization status.

**Response:**
```json
{
    "global_status": "initialized",
    "initialized": true,
    "timestamp": "2024-01-01T00:00:00.000Z"
}
```

#### GET /conversations
Retrieves user's conversation history.

**Response:**
```json
{
    "status": "success",
    "conversations": [
        {
            "id": "string",
            "title": "string",
            "created_at": "datetime",
            "last_updated": "datetime",
            "session_id": "string"
        }
    ]
}
```

#### GET /conversations/{id}/messages
Retrieves messages for a specific conversation.

**Response:**
```json
{
    "status": "success",
    "conversation_title": "string",
    "messages": [
        {
            "id": "string",
            "role": "user|assistant",
            "content": "string",
            "timestamp": "datetime",
            "sources": [],
            "reasoning": [],
            "images": []
        }
    ]
}
```

#### GET /images/{filename}
Serves processed document images.

#### POST /reset
Resets the global agent's conversation memory.

#### DELETE /conversations/{id}
Deletes a conversation (soft delete by default).
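
Soft deletion means the record is flagged rather than removed, which matches the `is_deleted` field in the `Conversations` schema. The in-memory sketch below illustrates the idea; the function names and the list-of-dicts storage are assumptions, not the project's MongoDB code.

```python
from datetime import datetime, timezone
from typing import Dict, List


def soft_delete(conversations: List[Dict], conversation_id: str) -> bool:
    """Flag a conversation as deleted instead of removing it."""
    for conv in conversations:
        if conv["id"] == conversation_id and not conv.get("is_deleted", False):
            conv["is_deleted"] = True
            conv["last_updated"] = datetime.now(timezone.utc)
            return True
    return False  # unknown ID or already deleted


def visible(conversations: List[Dict]) -> List[Dict]:
    """Return only conversations that have not been soft-deleted."""
    return [c for c in conversations if not c.get("is_deleted", False)]
```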

---

## User Flow

```mermaid
flowchart TD
    START([User Access]) --> AUTH{Authenticated?}
    AUTH -->|No| LOGIN[MSAL Login]
    LOGIN --> AUTH
    AUTH -->|Yes| LOAD[Load Conversations]

    LOAD --> NEWCHAT[Create New Chat]
    NEWCHAT --> INTERFACE[Chat Interface]

    INTERFACE --> INPUT[User Input]
    INPUT --> PROCESS[Process with GraphRAG]
    PROCESS --> RETRIEVE[Hybrid Retrieval]
    RETRIEVE --> GENERATE[Generate Response]
    GENERATE --> DISPLAY[Display with Images]
    DISPLAY --> INPUT

    DISPLAY --> SAVE[Save to History]
    SAVE --> UPDATE[Update Conversation]

    INTERFACE --> HISTORY[View History]
    HISTORY --> SELECT[Select Conversation]
    SELECT --> LOAD_MSG[Load Messages]
    LOAD_MSG --> INTERFACE

    INTERFACE --> EXPORT[Export Brief]
    INTERFACE --> DELETE[Delete Conversation]
```

### Detailed User Journey

1. **Authentication**: The user logs in via MSAL
2. **Conversation Creation**: The system creates a new conversation or loads an existing one
3. **Query Processing**:
   - User sends message
   - GraphRAG performs hybrid retrieval
   - Vector similarity search finds relevant chunks
   - Knowledge graph identifies related communities
   - LLM synthesizes response with reasoning
4. **Response Display**:
   - Text response with markdown support
   - Source attribution tooltips
   - Relevant document images
   - Reasoning chain (if available)
5. **History Management**: Conversations are persisted and can be retrieved later

---

## Security

### Authentication
- Microsoft Authentication Library (MSAL) integration
- Azure AD tenant-based access control
- Session-based user identification

### Data Protection
- No sensitive data logged in plain text
- Conversation data encrypted at rest (MongoDB)
- API key management via environment variables
- CORS configuration for cross-origin requests

### Access Control
- User-scoped conversation access
- Session-based authorization
- Development vs production mode differentiation

---

## Deployment

### Environment Configuration

#### Backend (.env)
```bash
# API Keys
OPENAI_API_KEY=your_openai_key
LLAMA_CLOUD_API_KEY=your_llama_cloud_key
ANTHROPIC_API_KEY=your_anthropic_key

# Database Configuration
NEO4J_URL=bolt://localhost:7688
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=hp-graphrag-2024

# Server Configuration
PORT=8746
PRODUCTION=true
LOG_LEVEL=INFO
```
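
One way to consume these variables is to read them once at startup and fail fast when a secret is missing. The sketch below is an assumption about how such a loader could look, not a copy of the project's `config.py`; the `Settings` class, `load_settings`, and the choice of which variables are required are illustrative.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    neo4j_url: str
    neo4j_username: str
    neo4j_password: str
    port: int
    production: bool
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, failing fast on missing secrets."""
    missing = [k for k in ("OPENAI_API_KEY", "NEO4J_PASSWORD") if not env.get(k)]
    if missing:
        raise RuntimeError(f"missing required variables: {', '.join(missing)}")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        neo4j_url=env.get("NEO4J_URL", "bolt://localhost:7688"),
        neo4j_username=env.get("NEO4J_USERNAME", "neo4j"),
        neo4j_password=env["NEO4J_PASSWORD"],
        port=int(env.get("PORT", "8746")),
        # Only the literal string "true" (any case) enables production mode.
        production=env.get("PRODUCTION", "false").lower() == "true",
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Accepting the environment as a parameter makes the loader easy to test with a plain dictionary.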

#### Frontend (.env)
```bash
VITE_BACKEND_URL=https://ai-sandbox.oliver.solutions/hp_chatbot_back
VITE_APP_BASE_URL=/hp_chatbot/
```

### Deployment Steps

1. **Database Setup**:
   - Neo4j instance on port 7688
   - MongoDB with authentication
   - Initialize collections via `init_mongodb.py`

2. **Backend Deployment**:
   ```bash
   pip install -r requirements.txt
   python main.py
   ```

3. **Frontend Build**:
   ```bash
   cd chat-interface
   npm install
   npm run build
   # Deploy dist/ contents to /hp_chatbot/ path
   ```

4. **Web Server Configuration**:
   - Configure reverse proxy (nginx/Apache)
   - Set up SSL certificates
   - Configure CORS origins

---

## Development

### Backend Development

```bash
# Setup virtual environment
python -m venv env
source env/bin/activate  # or env\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Start development server
python main.py
```

### Frontend Development

```bash
cd chat-interface
npm install
npm run dev
```

### Key Development Commands

| Command | Purpose |
|---------|---------|
| `python main.py` | Start backend server |
| `npm run dev` | Start frontend dev server |
| `npm run build` | Build frontend for production |
| `npm run lint` | Lint frontend code |

### Code Structure

```
hp_graphRAG_bot/
├── Backend (Python)
│   ├── main.py                 # Application entry point
│   ├── ai_core.py             # Core AI engine & ReAct agent
│   ├── graph_rag_integration.py # GraphRAG system
│   ├── routes.py              # Flask API routes
│   ├── session_manager.py     # Session management
│   ├── mongodb_utils.py       # MongoDB operations
│   ├── config.py              # Configuration
│   └── shared_state.py        # Global state management
├── Frontend (React)
│   └── chat-interface/
│       ├── src/
│       │   ├── App.jsx        # Main application component
│       │   ├── components/    # React components
│       │   ├── auth.js        # MSAL authentication
│       │   └── lib/           # Utilities
│       └── dist/              # Production build
└── Data Storage
    ├── uploads/images/        # Processed document images
    ├── index_storage/         # Vector index data
    └── supporting_files/      # Source documents
```

---

## Troubleshooting

### Common Issues

#### Backend Issues

**Problem**: `Global workflow agent not initialized`
**Solution**: Check OpenAI API key and Neo4j connectivity
```bash
# Verify environment variables
echo $OPENAI_API_KEY
# Check the Neo4j HTTP endpoint (default port 7474; the Bolt port is set separately via NEO4J_URL)
curl http://localhost:7474
```

**Problem**: `LlamaParse timeout during document processing`
**Solution**: Increase timeout settings in `config.py`
```python
LLAMA_PARSE_MAX_TIMEOUT = 7200  # 2 hours
```

**Problem**: `MongoDB connection failed`
**Solution**: Verify MongoDB service and credentials
```bash
# Check MongoDB status (macOS with Homebrew; on Linux, try: systemctl status mongod)
brew services list | grep mongodb
# Test connection
mongosh mongodb://hp:hp@localhost:27017/hp_chatbot
```
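
When the command-line tools are not available, a basic TCP reachability check from Python can narrow down connection problems. The helper `port_open` is a hypothetical name; it only confirms that something is listening and does not validate credentials or the database protocol.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `port_open("localhost", 7688)` probes the Bolt port from `NEO4J_URL`, and `port_open("localhost", 27017)` probes the MongoDB port used in the connection string above.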

#### Frontend Issues

**Problem**: `CORS policy blocking requests`
**Solution**: Update `CORS_ALLOWED_ORIGINS` in the backend `config.py`

**Problem**: `Authentication failures`
**Solution**: Verify MSAL configuration and Azure AD settings

**Problem**: `Images not loading`
**Solution**: Check image file paths and the backend `/images/` endpoint

### Debug Endpoints

**Development Mode Only:**
- `GET /debug-status` - System state inspection
- `POST /reinitialize` - Force agent reinitialization
- `POST /capture-screenshot` - Manual image extraction testing

### Logging

All components use structured logging:
```python
log_structured('info', 'Event description', {'key': 'value'})
```

Log file locations:
- Backend: `app.log`
- MongoDB operations: `mongodb.log`
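
The project's `log_structured` implementation is not reproduced in this document. A helper with the same call shape could be sketched as follows, assuming JSON-formatted lines and the standard `logging` module; the logger name `hp_chatbot` and the `format_record` helper are hypothetical.

```python
import json
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger("hp_chatbot")  # hypothetical logger name


def format_record(level: str, message: str,
                  context: Optional[Dict[str, Any]] = None) -> str:
    """Serialize one log event as a single JSON line."""
    record = {"level": level.upper(), "message": message,
              "context": context or {}}
    # default=str keeps non-JSON values (e.g. datetimes) from raising.
    return json.dumps(record, sort_keys=True, default=str)


def log_structured(level: str, message: str,
                   context: Optional[Dict[str, Any]] = None) -> None:
    """Emit the JSON line at the matching standard logging level."""
    numeric = logging.getLevelName(level.upper())
    if not isinstance(numeric, int):
        numeric = logging.INFO  # unknown level names fall back to INFO
    logger.log(numeric, format_record(level, message, context))
```

Emitting one JSON object per line keeps the log files easy to filter with standard tools.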

---

## Performance Considerations

### Scaling
- **Vector Index**: Consider PostgreSQL pgvector for large deployments
- **Neo4j**: Implement read replicas for query scaling
- **MongoDB**: Use connection pooling and sharding
- **Caching**: Redis for session and community caches

### Optimization
- **GraphRAG Communities**: Pre-computed and cached
- **Image Processing**: Async processing with queue system
- **Memory Management**: Agent memory reset policies
- **Response Time**: Parallel vector and graph retrieval
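
Parallel vector and graph retrieval can be sketched with a small thread pool, which suits I/O-bound calls to a vector index and a graph database. The function `retrieve_parallel` and its two callable parameters are hypothetical stand-ins for the real retrievers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def retrieve_parallel(query: str,
                      vector_search: Callable[[str], List[str]],
                      graph_search: Callable[[str], List[str]]
                      ) -> Tuple[List[str], List[str]]:
    """Run both retrievers concurrently and wait for both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vector_future = pool.submit(vector_search, query)
        graph_future = pool.submit(graph_search, query)
        # result() re-raises any exception from the worker thread.
        return vector_future.result(), graph_future.result()
```

With this arrangement, total latency is close to that of the slower retriever rather than the sum of both.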

---

## Future Enhancements

### Planned Features
1. **Multi-tenant Architecture**: Support multiple organizations
2. **Advanced Analytics**: Usage metrics and conversation insights
3. **Enhanced Multimodal**: Video and audio processing
4. **Real-time Collaboration**: Multi-user conversations
5. **API Extensions**: Webhook integrations and external tool calling
6. **Advanced Security**: Role-based access control and audit logging

### Technical Debt
- Implement comprehensive test suite
- Add API rate limiting
- Improve error handling consistency
- Optimize database queries
- Add health check endpoints

---

*Documentation Version: 1.0*
*Last Updated: 2024-01-01*
*System Version: HP GraphRAG Chatbot v1.0*