# HP GraphRAG Chatbot - Technical Documentation

## Table of Contents

1. [System Overview](#system-overview)
2. [Architecture](#architecture)
3. [Technology Stack](#technology-stack)
4. [Data Flow](#data-flow)
5. [Database Design](#database-design)
6. [API Reference](#api-reference)
7. [User Flow](#user-flow)
8. [Security](#security)
9. [Deployment](#deployment)
10. [Development](#development)
11. [Troubleshooting](#troubleshooting)
12. [Performance Considerations](#performance-considerations)
13. [Future Enhancements](#future-enhancements)

---

## System Overview

The HP GraphRAG Chatbot is a conversational AI system that combines vector search with a knowledge graph to answer questions about HP marketing materials and brand guidelines. It processes multimodal documents (text and images) and uses a ReAct agent with hybrid retrieval to find relevant content and generate responses.

### Key Features

- **Multimodal Document Processing**: Extracts text and images from PDFs, PowerPoint, and other marketing documents
- **GraphRAG Architecture**: Combines vector similarity search with knowledge graph community detection
- **Custom ReAct Agent**: Implements reasoning and action patterns for query processing
- **Session Management**: Maintains conversation context across multiple interactions
- **Image Display**: Shows relevant document screenshots alongside responses
- **Authentication**: Microsoft Authentication Library (MSAL) integration
- **Conversation History**: Persistent storage and retrieval of chat sessions

---

## Architecture

```mermaid
graph TB
    subgraph "Frontend (React)"
        FE[Chat Interface]
        AUTH[MSAL Auth]
        CONV[Conversation Manager]
        UI[UI Components]
    end

    subgraph "Backend (Python/Flask)"
        API[Flask Routes]
        AGENT[ReAct Agent]
        GRAPH[GraphRAG Engine]
        SESSION[Session Manager]
        PARSE[Document Parser]
    end

    subgraph "AI/ML Layer"
        LLM[OpenAI GPT-4]
        EMBED[Text Embeddings]
        LLAMAPARSE[LlamaParse]
    end

    subgraph "Data Storage"
        NEO4J[(Neo4j<br/>Knowledge Graph)]
        MONGO[(MongoDB<br/>Conversations)]
        VECTOR[(Vector Index<br/>LlamaIndex)]
        FILES[File Storage<br/>Images/Documents]
    end

    FE --> API
    AUTH --> API
    CONV --> API

    API --> AGENT
    API --> SESSION
    API --> PARSE

    AGENT --> GRAPH
    GRAPH --> NEO4J
    GRAPH --> VECTOR
    AGENT --> LLM

    PARSE --> LLAMAPARSE
    LLAMAPARSE --> FILES
    PARSE --> EMBED
    EMBED --> VECTOR

    SESSION --> MONGO

    style FE fill:#e1f5fe
    style API fill:#f3e5f5
    style AGENT fill:#e8f5e8
    style NEO4J fill:#fff3e0
    style MONGO fill:#f1f8e9
```

### Component Breakdown

#### Frontend (React)
- **Chat Interface**: Main conversational UI with message bubbles, image viewing, and input handling
- **Authentication**: MSAL-based Microsoft authentication
- **Conversation Manager**: Handles multiple conversation sessions and history
- **Theme Toggle**: Dark/light mode support

#### Backend (Python/Flask)
- **Flask Routes**: RESTful API endpoints for chat, authentication, file serving
- **ReAct Agent**: Custom implementation with reasoning, action, and observation cycles
- **GraphRAG Engine**: Hybrid retrieval combining vector search with graph-based community detection
- **Session Manager**: Maps frontend sessions to database conversations
- **Document Parser**: LlamaParse integration for multimodal document processing

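
The reasoning, action, and observation cycle of the ReAct agent can be illustrated with a small loop. This is a minimal sketch under stated assumptions, not the code in `ai_core.py`: the `fake_llm` stub, the `search_docs` tool, and the `Action: tool[input]` / `Final Answer:` text format are all hypothetical stand-ins for the real LLM call and tool set.

```python
from typing import Callable, Dict, List, Tuple


def search_docs(query: str) -> str:
    """Hypothetical tool; a real agent would call the GraphRAG engine here."""
    return f"(stub) top passages for '{query}'"


TOOLS: Dict[str, Callable[[str], str]] = {"search_docs": search_docs}


def fake_llm(prompt: str) -> str:
    """Stand-in for the LLM: act once, then answer once an observation exists."""
    if "Observation:" not in prompt:
        return "Thought: I need context.\nAction: search_docs[logo usage]"
    return "Thought: I have enough context.\nFinal Answer: See the retrieved passages."


def run_react(question: str, llm: Callable[[str], str] = fake_llm,
              max_steps: int = 5) -> Tuple[str, List[dict]]:
    """Alternate reasoning, action, and observation until a final answer appears."""
    prompt = f"Question: {question}"
    steps: List[dict] = []
    for _ in range(max_steps):
        reply = llm(prompt)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip(), steps
        # Parse "Action: tool_name[tool input]" from the model reply.
        action_line = next(line for line in reply.splitlines()
                           if line.startswith("Action:"))
        name, _, arg = action_line[len("Action:"):].strip().partition("[")
        observation = TOOLS[name](arg.rstrip("]"))
        steps.append({"action": name, "observation": observation})
        # Feed the observation back so the next reasoning step can use it.
        prompt += f"\n{reply}\nObservation: {observation}"
    return "No answer within the step limit.", steps
```

The `max_steps` cap is a design choice in this sketch: it bounds cost and latency if the model never produces a final answer.
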
#### Data Layer
- **Neo4j**: Stores knowledge graph with entities, relationships, and communities
- **MongoDB**: Persists user conversations, messages, and session state
- **Vector Index**: LlamaIndex-based semantic search capabilities
- **File Storage**: Local filesystem for processed images and documents

---

## Technology Stack

### Backend
- **Framework**: Flask + Hypercorn (ASGI)
- **AI/ML**:
  - OpenAI GPT-4 (LLM)
  - text-embedding-3-small (embeddings)
  - LlamaParse (document processing)
  - LlamaIndex (vector indexing)
- **Databases**:
  - Neo4j (knowledge graph)
  - MongoDB (conversations)
- **Languages**: Python 3.9+

### Frontend
- **Framework**: React 18 + Vite
- **Styling**: TailwindCSS + Shadcn/ui
- **Authentication**: Microsoft Authentication Library (MSAL)
- **Languages**: JavaScript/JSX

### Infrastructure
- **Web Server**: Hypercorn (ASGI server)
- **Containerization**: Docker support
- **Deployment**: Azure/Cloud-based

---

## Data Flow

```mermaid
sequenceDiagram
    participant User
    participant Frontend
    participant API
    participant Agent
    participant GraphRAG
    participant Neo4j
    participant Vector
    participant OpenAI
    participant MongoDB

    User->>Frontend: Send message
    Frontend->>API: POST /chat
    API->>Agent: Process query

    Agent->>GraphRAG: Retrieve context
    GraphRAG->>Vector: Vector similarity search
    GraphRAG->>Neo4j: Community detection
    GraphRAG->>OpenAI: Generate synthesis
    GraphRAG->>Agent: Combined context

    Agent->>OpenAI: Generate response
    OpenAI->>Agent: Response + reasoning

    Agent->>API: Structured response
    API->>MongoDB: Store conversation
    API->>Frontend: Response with sources/images
    Frontend->>User: Display response
```

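
The "Retrieve context" step in the diagram combines vector hits with graph communities. The sketch below shows one plausible way to do that, assuming toy two-dimensional vectors and hand-written community records; the function names and the sample data are hypothetical, and the actual logic in `graph_rag_integration.py` may differ. The `retrieval_method` values match the ones listed in the `/chat` response schema.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query_vec: List[float], chunks: List[dict],
                    communities: List[dict], top_k: int = 2) -> Dict[str, object]:
    """Rank chunks by similarity, then attach summaries of the communities
    whose entities appear in the top-ranked chunks."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vector"]),
                    reverse=True)[:top_k]
    entities = {e for c in ranked for e in c["entities"]}
    summaries = [com["summary"] for com in communities
                 if entities & set(com["entity_ids"])]
    return {
        "chunks": [c["text"] for c in ranked],
        "community_summaries": summaries,
        "retrieval_method": "graphrag_hybrid" if summaries else "vector_only",
    }


# Hypothetical sample data for illustration only.
chunks = [
    {"text": "Logo clear space rules", "vector": [1.0, 0.0], "entities": ["logo"]},
    {"text": "Approved color palette", "vector": [0.0, 1.0], "entities": ["palette"]},
]
communities = [
    {"community_id": 0, "summary": "Brand identity basics",
     "entity_ids": ["logo", "typography"]},
]
result = hybrid_retrieve([0.9, 0.1], chunks, communities, top_k=1)
```

Falling back to `vector_only` when no community matches keeps the response useful even when the graph has no coverage for a query.
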
### Document Processing Flow

```mermaid
flowchart TD
    UPLOAD[Document Upload] --> PARSE[LlamaParse Processing]
    PARSE --> EXTRACT[Extract Text + Images]
    EXTRACT --> SPLIT[Semantic Splitting]
    SPLIT --> EMBED[Generate Embeddings]
    EMBED --> VECTOR[Store in Vector Index]
    SPLIT --> GRAPH[Extract Entities/Relations]
    GRAPH --> NEO4J[Store in Neo4j]
    EXTRACT --> IMAGES[Save Page Images]
    IMAGES --> STORAGE[File Storage]
    NEO4J --> COMMUNITY[Community Detection]
    COMMUNITY --> CACHE[Cache Communities]
```

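
The fan-out after the splitting step (one branch to the vector index, one to the graph) can be sketched as follows. Every function here is a placeholder chosen for illustration: a blank-line splitter stands in for semantic splitting, a character-sum vector stands in for the embedding model, and a capitalized-word regex stands in for entity extraction. None of these reflect how the real pipeline implements those stages.

```python
import re
from typing import Dict, List


def split_paragraphs(text: str) -> List[str]:
    """Naive stand-in for semantic splitting: break on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def toy_embedding(chunk: str, dims: int = 4) -> List[float]:
    """Deterministic placeholder vector, normalized so its components sum to 1."""
    vec = [0.0] * dims
    for i, ch in enumerate(chunk):
        vec[i % dims] += ord(ch)
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def extract_entities(chunk: str) -> List[str]:
    """Placeholder extraction: unique capitalized words, sorted."""
    return sorted(set(re.findall(r"\b[A-Z][a-zA-Z]+\b", chunk)))


def process_document(text: str) -> Dict[str, list]:
    """Split once, then feed the same chunks to both branches of the flow."""
    chunks = split_paragraphs(text)
    return {
        "vector_index": [(c, toy_embedding(c)) for c in chunks],
        "graph_entities": [(c, extract_entities(c)) for c in chunks],
    }
```
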

---

## Database Design

### Neo4j Knowledge Graph Schema

```mermaid
erDiagram
    Entity {
        string name
        string label
        string description
        dict properties
    }

    Relation {
        string label
        string source_id
        string target_id
        string description
        dict properties
    }

    Community {
        int community_id
        text summary
        list entity_ids
    }

    Entity ||--o{ Relation : participates_in
    Community ||--o{ Entity : contains
```

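
For readers who prefer code to diagrams, the same schema can be mirrored as Python dataclasses. This is only a sketch: the field names come from the diagram above, but the assumption that an entity's `name` doubles as its ID, and the `relations_within` helper, are illustrative and not taken from the project.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    name: str  # assumed to double as the entity ID in this sketch
    label: str
    description: str = ""
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    label: str
    source_id: str
    target_id: str
    description: str = ""
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Community:
    community_id: int
    summary: str
    entity_ids: List[str] = field(default_factory=list)


def relations_within(community: Community,
                     relations: List[Relation]) -> List[Relation]:
    """Return the relations whose two endpoints both belong to the community."""
    members = set(community.entity_ids)
    return [r for r in relations
            if r.source_id in members and r.target_id in members]
```
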
### MongoDB Collections Schema

```mermaid
erDiagram
    Users {
        ObjectId _id
        string username
        string email
        datetime created_at
        datetime last_login
    }

    Conversations {
        ObjectId _id
        string session_id
        ObjectId user_id
        string title
        datetime created_at
        datetime last_updated
        boolean is_deleted
    }

    Messages {
        ObjectId _id
        ObjectId conversation_id
        string role
        text content
        array sources
        array reasoning
        array images
        datetime timestamp
    }

    Users ||--o{ Conversations : owns
    Conversations ||--o{ Messages : contains
```

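
The `session_id` and `is_deleted` fields suggest two behaviors: each frontend session maps to one conversation, and deletion hides a conversation rather than removing it. The sketch below illustrates both with an in-memory dictionary. The class and method names are hypothetical, and it deliberately avoids MongoDB so it can run anywhere; it is not the code in `session_manager.py` or `mongodb_utils.py`.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, List


class InMemoryConversationStore:
    """In-memory stand-in for the Conversations collection."""

    def __init__(self) -> None:
        self._by_session: Dict[str, dict] = {}

    def get_or_create(self, session_id: str, user_id: str) -> dict:
        """Map a frontend session ID to exactly one conversation document."""
        conv = self._by_session.get(session_id)
        if conv is None:
            now = datetime.now(timezone.utc)
            conv = {
                "_id": uuid.uuid4().hex,
                "session_id": session_id,
                "user_id": user_id,
                "title": "New conversation",
                "created_at": now,
                "last_updated": now,
                "is_deleted": False,
            }
            self._by_session[session_id] = conv
        return conv

    def soft_delete(self, session_id: str) -> bool:
        """Flag the conversation as deleted; its data stays in the store."""
        conv = self._by_session.get(session_id)
        if conv is None:
            return False
        conv["is_deleted"] = True
        conv["last_updated"] = datetime.now(timezone.utc)
        return True

    def list_for_user(self, user_id: str) -> List[dict]:
        """Return the user's conversations, skipping soft-deleted ones."""
        return [c for c in self._by_session.values()
                if c["user_id"] == user_id and not c["is_deleted"]]
```
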

---

## API Reference

### Authentication
All API endpoints require authentication via the `X-MS-USERNAME` header, except in development mode.

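
A header check of this kind might look like the sketch below. It assumes that a missing header is rejected in production and replaced by a placeholder identity in development; the function name and the `dev-user` fallback are hypothetical, since the actual behavior in `routes.py` is not shown here.

```python
from typing import Mapping, Optional

DEV_FALLBACK_USER = "dev-user"  # hypothetical placeholder identity


def resolve_username(headers: Mapping[str, str], production: bool) -> Optional[str]:
    """Return the caller's username, or None if the request should be rejected."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {k.lower(): v.strip() for k, v in headers.items()}
    username = normalized.get("x-ms-username")
    if username:
        return username
    # No header: reject in production, fall back to a dev identity otherwise.
    return None if production else DEV_FALLBACK_USER
```

A route handler could return HTTP 401 whenever this function returns `None`.
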
### Core Endpoints

#### POST /chat
Processes chat messages and returns AI responses.

**Request:**
```json
{
  "message": "string",
  "sessionId": "string"
}
```

**Response:**
```json
{
  "status": "success",
  "data": {
    "response": "string",
    "sources": [
      {
        "content": "string",
        "tool_name": "string",
        "retrieval_method": "vector_only|graphrag_hybrid"
      }
    ],
    "reasoning": [
      {
        "type": "ActionReasoningStep|ObservationReasoningStep",
        "action": "string",
        "observation": "string"
      }
    ],
    "images": [
      {
        "filename": "string",
        "document": "string",
        "page": "number",
        "url_encoded_filename": "string"
      }
    ]
  }
}
```

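
A client consuming this response usually needs only a few fields. The following sketch, which assumes the response shape documented above, pulls out the answer text, the retrieval methods used, the image filenames, and the tool actions from the reasoning chain. The function name and the strict validation are illustrative choices, not part of the API.

```python
from typing import Any, Dict

KNOWN_METHODS = {"vector_only", "graphrag_hybrid"}


def summarize_chat_response(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Extract display-ready fields from a /chat response body."""
    if payload.get("status") != "success":
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    data = payload["data"]
    methods = sorted({s["retrieval_method"] for s in data.get("sources", [])})
    unknown = set(methods) - KNOWN_METHODS
    if unknown:
        raise ValueError(f"unknown retrieval methods: {sorted(unknown)}")
    return {
        "text": data["response"],
        "retrieval_methods": methods,
        "image_files": [img["filename"] for img in data.get("images", [])],
        # Only action steps name a tool; observation steps carry its output.
        "tool_calls": [step["action"] for step in data.get("reasoning", [])
                       if step["type"] == "ActionReasoningStep"],
    }
```
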
#### GET /status
Returns system initialization status.

**Response:**
```json
{
  "global_status": "initialized",
  "initialized": true,
  "timestamp": "2024-01-01T00:00:00.000Z"
}
```

#### GET /conversations
Retrieves user's conversation history.

**Response:**
```json
{
  "status": "success",
  "conversations": [
    {
      "id": "string",
      "title": "string",
      "created_at": "datetime",
      "last_updated": "datetime",
      "session_id": "string"
    }
  ]
}
```

#### GET /conversations/{id}/messages
Retrieves messages for a specific conversation.

**Response:**
```json
{
  "status": "success",
  "conversation_title": "string",
  "messages": [
    {
      "id": "string",
      "role": "user|assistant",
      "content": "string",
      "timestamp": "datetime",
      "sources": [],
      "reasoning": [],
      "images": []
    }
  ]
}
```

#### GET /images/{filename}
Serves processed document images.

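
Image filenames derived from document titles can contain spaces or other characters that are not valid in a URL path, which is presumably why the `/chat` response includes a `url_encoded_filename` field. A minimal sketch of building such a URL on the client side, using a hypothetical filename and the port from the backend configuration:

```python
from urllib.parse import quote, unquote


def image_url(base_url: str, filename: str) -> str:
    """Build an /images/ URL, percent-encoding the filename as one path segment."""
    # safe="" also encodes "/", so the filename cannot add extra path segments.
    return f"{base_url.rstrip('/')}/images/{quote(filename, safe='')}"


url = image_url("http://localhost:8746/", "Brand Guide_page 3.png")
decoded_name = unquote(url.rsplit("/", 1)[1])
```
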
#### POST /reset
Resets the global agent's conversation memory.

#### DELETE /conversations/{id}
Deletes a conversation (soft delete by default).

---

## User Flow

```mermaid
flowchart TD
    START([User Access]) --> AUTH{Authenticated?}
    AUTH -->|No| LOGIN[MSAL Login]
    LOGIN --> AUTH
    AUTH -->|Yes| LOAD[Load Conversations]

    LOAD --> NEWCHAT[Create New Chat]
    NEWCHAT --> INTERFACE[Chat Interface]

    INTERFACE --> INPUT[User Input]
    INPUT --> PROCESS[Process with GraphRAG]
    PROCESS --> RETRIEVE[Hybrid Retrieval]
    RETRIEVE --> GENERATE[Generate Response]
    GENERATE --> DISPLAY[Display with Images]
    DISPLAY --> INPUT

    DISPLAY --> SAVE[Save to History]
    SAVE --> UPDATE[Update Conversation]

    INTERFACE --> HISTORY[View History]
    HISTORY --> SELECT[Select Conversation]
    SELECT --> LOAD_MSG[Load Messages]
    LOAD_MSG --> INTERFACE

    INTERFACE --> EXPORT[Export Brief]
    INTERFACE --> DELETE[Delete Conversation]
```

### Detailed User Journey

1. **Authentication**: User logs in via Microsoft MSAL
2. **Conversation Creation**: System creates a new conversation or loads an existing one
3. **Query Processing**:
   - User sends message
   - GraphRAG performs hybrid retrieval
   - Vector similarity search finds relevant chunks
   - Knowledge graph identifies related communities
   - LLM synthesizes response with reasoning
4. **Response Display**:
   - Text response with markdown support
   - Source attribution tooltips
   - Relevant document images
   - Reasoning chain (if available)
5. **History Management**: Conversations are persisted and can be retrieved later

---

## Security

### Authentication
- Microsoft Authentication Library (MSAL) integration
- Azure AD tenant-based access control
- Session-based user identification

### Data Protection
- No sensitive data logged in plain text
- Conversation data encrypted at rest (MongoDB)
- API key management via environment variables
- CORS configuration for cross-origin requests

### Access Control
- User-scoped conversation access
- Session-based authorization
- Development vs production mode differentiation

---

## Deployment

### Environment Configuration

#### Backend (.env)
```bash
# API Keys
OPENAI_API_KEY=your_openai_key
LLAMA_CLOUD_API_KEY=your_llama_cloud_key
ANTHROPIC_API_KEY=your_anthropic_key

# Database Configuration
NEO4J_URL=bolt://localhost:7688
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=hp-graphrag-2024

# Server Configuration
PORT=8746
PRODUCTION=true
LOG_LEVEL=INFO
```

#### Frontend (.env)
```bash
VITE_BACKEND_URL=https://ai-sandbox.oliver.solutions/hp_chatbot_back
VITE_APP_BASE_URL=/hp_chatbot/
```

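
On the backend, variables like these are typically read once at startup and converted to typed values. The sketch below shows one way to do that with the standard library. The `Settings` class, the `load_settings` function, and the fallback defaults are assumptions for illustration; the project's `config.py` may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    neo4j_url: str
    neo4j_username: str
    port: int
    production: bool
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the backend variables; the defaults are assumptions for local use."""
    return Settings(
        neo4j_url=env.get("NEO4J_URL", "bolt://localhost:7688"),
        neo4j_username=env.get("NEO4J_USERNAME", "neo4j"),
        port=int(env.get("PORT", "8746")),
        # Environment values are strings, so parse the boolean explicitly.
        production=env.get("PRODUCTION", "false").strip().lower() in {"1", "true", "yes"},
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment. Secrets such as `NEO4J_PASSWORD` and the API keys are left out of this sketch on purpose.
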
### Deployment Steps

1. **Database Setup**:
   - Neo4j instance on port 7688
   - MongoDB with authentication
   - Initialize collections via `init_mongodb.py`

2. **Backend Deployment**:
   ```bash
   pip install -r requirements.txt
   python main.py
   ```

3. **Frontend Build**:
   ```bash
   cd chat-interface
   npm install
   npm run build
   # Deploy dist/ contents to /hp_chatbot/ path
   ```

4. **Web Server Configuration**:
   - Configure reverse proxy (nginx/Apache)
   - Set up SSL certificates
   - Configure CORS origins

---

## Development

### Backend Development

```bash
# Setup virtual environment
python -m venv env
source env/bin/activate  # or env\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Start development server
python main.py
```

### Frontend Development

```bash
cd chat-interface
npm install
npm run dev
```

### Key Development Commands

| Command | Purpose |
|---------|---------|
| `python main.py` | Start backend server |
| `npm run dev` | Start frontend dev server |
| `npm run build` | Build frontend for production |
| `npm run lint` | Lint frontend code |

### Code Structure

```
hp_graphRAG_bot/
├── Backend (Python)
│   ├── main.py                    # Application entry point
│   ├── ai_core.py                 # Core AI engine & ReAct agent
│   ├── graph_rag_integration.py   # GraphRAG system
│   ├── routes.py                  # Flask API routes
│   ├── session_manager.py         # Session management
│   ├── mongodb_utils.py           # MongoDB operations
│   ├── config.py                  # Configuration
│   └── shared_state.py            # Global state management
├── Frontend (React)
│   └── chat-interface/
│       ├── src/
│       │   ├── App.jsx            # Main application component
│       │   ├── components/        # React components
│       │   ├── auth.js            # MSAL authentication
│       │   └── lib/               # Utilities
│       └── dist/                  # Production build
└── Data Storage
    ├── uploads/images/            # Processed document images
    ├── index_storage/             # Vector index data
    └── supporting_files/          # Source documents
```

---

## Troubleshooting

### Common Issues

#### Backend Issues

**Problem**: `Global workflow agent not initialized`
**Solution**: Check OpenAI API key and Neo4j connectivity
```bash
# Verify environment variables
echo $OPENAI_API_KEY
# Check Neo4j's HTTP endpoint (7474 is the Neo4j default HTTP port;
# the backend itself connects over Bolt on port 7688)
curl http://localhost:7474
```

**Problem**: `LlamaParse timeout during document processing`
**Solution**: Increase timeout settings in config.py
```python
LLAMA_PARSE_MAX_TIMEOUT = 7200  # 2 hours
```

**Problem**: `MongoDB connection failed`
**Solution**: Verify MongoDB service and credentials
```bash
# Check MongoDB status (Homebrew installs only)
brew services list | grep mongodb
# Test connection
mongosh mongodb://hp:hp@localhost:27017/hp_chatbot
```

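
When the shell tools above are not available, a plain TCP check from Python can confirm whether the database ports are reachable at all. This is a minimal sketch: the helper names are hypothetical, the URLs reuse the values from the deployment configuration, and a successful TCP connection only shows that something is listening, not that credentials are valid.

```python
import socket
from typing import Tuple
from urllib.parse import urlparse


def endpoint_from_url(url: str, default_port: int) -> Tuple[str, int]:
    """Extract (host, port) from a connection URL such as bolt://localhost:7688."""
    parsed = urlparse(url)
    return parsed.hostname or "localhost", parsed.port or default_port


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services() -> None:
    """Print reachability for the two databases named in this document."""
    targets = [
        ("Neo4j (Bolt)", "bolt://localhost:7688", 7687),
        ("MongoDB", "mongodb://localhost:27017", 27017),
    ]
    for name, url, default in targets:
        host, port = endpoint_from_url(url, default)
        state = "reachable" if port_open(host, port) else "unreachable"
        print(f"{name}: {host}:{port} is {state}")
```

Calling `check_services()` from a Python shell on the backend host prints one line per database.
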
#### Frontend Issues

**Problem**: `CORS policy blocking requests`
**Solution**: Update CORS_ALLOWED_ORIGINS in backend config.py

**Problem**: `Authentication failures`
**Solution**: Verify MSAL configuration and Azure AD settings

**Problem**: `Images not loading`
**Solution**: Check image file paths and backend /images/ endpoint

### Debug Endpoints

**Development Mode Only:**
- `GET /debug-status` - System state inspection
- `POST /reinitialize` - Force agent reinitialization
- `POST /capture-screenshot` - Manual image extraction testing

### Logging

All components use structured logging:
```python
log_structured('info', 'Event description', {'key': 'value'})
```

Log file locations:
- Backend: `app.log`
- MongoDB operations: `mongodb.log`

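
The implementation of `log_structured` is not shown in this document. As a rough illustration of what a helper with that call signature might do, the sketch below writes one JSON object per log line using the standard `logging` module. The logger name, the record fields, and the returned string are all assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict, Optional

logger = logging.getLogger("hp_chatbot")  # hypothetical logger name


def log_structured(level: str, message: str,
                   context: Optional[Dict[str, Any]] = None) -> str:
    """Emit one JSON object per log line and return it (useful in tests)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level.upper(),
        "message": message,
        "context": context or {},
    }
    # default=str keeps non-JSON values (datetimes, ObjectIds) from raising.
    line = json.dumps(record, default=str)
    # Unknown level names fall back to INFO instead of failing.
    logger.log(getattr(logging, level.upper(), logging.INFO), line)
    return line
```
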

---

## Performance Considerations

### Scaling
- **Vector Index**: Consider PostgreSQL pgvector for large deployments
- **Neo4j**: Implement read replicas for query scaling
- **MongoDB**: Use connection pooling and sharding
- **Caching**: Redis for session and community caches

### Optimization
- **GraphRAG Communities**: Pre-computed and cached
- **Image Processing**: Async processing with queue system
- **Memory Management**: Agent memory reset policies
- **Response Time**: Parallel vector and graph retrieval

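
Parallel vector and graph retrieval can be sketched with a thread pool, since both lookups are I/O-bound. The two search functions below are stubs that sleep briefly to simulate latency; their names and return values are hypothetical, and this is not the project's retrieval code.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def vector_search(query: str) -> List[str]:
    time.sleep(0.05)  # simulate I/O latency of a vector index lookup
    return [f"chunk about {query}"]


def graph_search(query: str) -> List[str]:
    time.sleep(0.05)  # simulate I/O latency of a graph query
    return [f"community summary for {query}"]


def retrieve_parallel(query: str) -> Dict[str, List[str]]:
    """Run both retrievers concurrently, so the total wait is close to
    the slower of the two rather than their sum."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        vec_future = pool.submit(vector_search, query)
        graph_future = pool.submit(graph_search, query)
        return {"vector": vec_future.result(), "graph": graph_future.result()}
```
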

---

## Future Enhancements

### Planned Features
1. **Multi-tenant Architecture**: Support multiple organizations
2. **Advanced Analytics**: Usage metrics and conversation insights
3. **Enhanced Multimodal**: Video and audio processing
4. **Real-time Collaboration**: Multi-user conversations
5. **API Extensions**: Webhook integrations and external tool calling
6. **Advanced Security**: Role-based access control and audit logging

### Technical Debt
- Implement comprehensive test suite
- Add API rate limiting
- Improve error handling consistency
- Optimize database queries
- Add health check endpoints

---

*Documentation Version: 1.0*
*Last Updated: 2024-01-01*
*System Version: HP GraphRAG Chatbot v1.0*