hp_chatbot/README.md
michael 594f749d4c Initial commit: HP Marketing Materials GraphRAG Chatbot
Full-stack GraphRAG chatbot for HP marketing materials with:
- Python/Flask backend with custom ReAct agent (LlamaIndex)
- Neo4j knowledge graph + vector search hybrid retrieval
- LlamaParse multimodal document processing (text + images)
- React/Vite frontend with conversation management
- MongoDB conversation persistence
- MSAL authentication support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 08:37:58 -06:00

# HP Marketing Materials Chatbot
A GraphRAG (Graph Retrieval-Augmented Generation) chatbot that answers questions about HP marketing materials and brand guidelines. It combines vector search with a Neo4j knowledge graph for broader retrieval than vector search alone, and it processes multimodal documents (text + images) using LlamaParse.
## Features
- **Hybrid retrieval**: Vector search + knowledge graph community detection for richer context
- **Multimodal document processing**: Extracts text and page images from PDFs via LlamaParse
- **Custom ReAct agent**: LlamaIndex-based workflow with tool use, reasoning steps, and source citations
- **Conversation persistence**: MongoDB-backed chat history with multi-conversation support
- **Image references**: Responses include relevant document page screenshots
- **Brief export**: Download conversation summaries as Word documents
## Prerequisites
- **Python 3.10+**
- **Node.js 18+**
- **Neo4j** (dedicated instance on port 7688)
- **MongoDB** (with authentication configured)
- **API Keys**: OpenAI (`OPENAI_API_KEY`), LlamaCloud (`LLAMA_CLOUD_API_KEY`)
## Quick Start
### 1. Backend Setup
```bash
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
# Install dependencies
pip install -r requirements.txt
# Create .env file at project root
cat > .env << 'EOF'
OPENAI_API_KEY=your_openai_key
LLAMA_CLOUD_API_KEY=your_llama_cloud_key
NEO4J_URL=bolt://localhost:7688
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=hp-graphrag-2024
PORT=8746
PRODUCTION=false
LOG_LEVEL=INFO
EOF
# Start the server
python main.py
```
The backend runs on `http://localhost:8746`. On first startup it will:
1. Initialize MongoDB collections and indexes
2. Load or build the vector index from `supporting_files/files_for_rag_store/`
3. Connect to Neo4j and build/load the knowledge graph
4. Build community summaries (cached to `index_storage/graphrag_cache/`)
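Because a missing key is one of the listed causes of startup failures, a small pre-flight check can be useful before running `python main.py`. The sketch below is not part of the project: `missing_env_vars` and `REQUIRED_VARS` are hypothetical names, and treating all five variables as mandatory is an assumption based on the `.env` example above. It inspects the process environment, so the values from `.env` must already be exported or loaded.

```python
import os

# Names taken from the .env example above. Which of them the backend
# actually treats as mandatory is an assumption made for this sketch.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LLAMA_CLOUD_API_KEY",
    "NEO4J_URL",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.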
### 2. Frontend Setup
```bash
cd chat-interface
# Install dependencies
npm install
# Create .env file
cat > .env << 'EOF'
VITE_BACKEND_URL=http://localhost:8746
VITE_APP_BASE_URL=/
EOF
# Start dev server
npm run dev
```
The frontend runs on `http://localhost:5173`.
### 3. Database Setup
**Neo4j:**
- Run a Neo4j instance on port 7688 (port 7687 is reserved for a separate project)
- Credentials: `neo4j` / `hp-graphrag-2024`
- The application auto-populates the graph on first index build
**MongoDB:**
- Create a user `hp` with password `hp` and `authSource=hp_chatbot`
- Database: `hp_chatbot`
- Collections (`users`, `conversations`, `messages`) are auto-created by `init_mongodb.py` on startup
Example MongoDB user setup:
```javascript
use hp_chatbot
db.createUser({
  user: "hp",
  pwd: "hp",
  roles: [{ role: "readWrite", db: "hp_chatbot" }]
})
```
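On the client side, the user created above is reached through a connection string that names `hp_chatbot` as the authentication database. This is a minimal sketch of assembling such a string in Python. The helper `mongo_uri` is hypothetical, and the host and port (`localhost`, MongoDB's default `27017`) are assumptions, since this README does not state where the project's MongoDB instance listens.

```python
from urllib.parse import quote_plus


def mongo_uri(user, password, database, host="localhost", port=27017):
    """Build a MongoDB URI that authenticates against `database`.

    host and port are assumed defaults, not values taken from the project.
    """
    # Percent-encode the credentials so characters such as '@' or ':' stay valid.
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}?authSource={database}"
    )


print(mongo_uri("hp", "hp", "hp_chatbot"))
# prints mongodb://hp:hp@localhost:27017/hp_chatbot?authSource=hp_chatbot
```

Encoding the credentials matters once the password is changed from the sample value to something containing reserved characters.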
## Project Structure
```
├── main.py                    # Entry point, Hypercorn ASGI server
├── config.py                  # Centralized configuration
├── ai_core.py                 # ReAct agent, document processing, index init
├── graph_rag_integration.py   # GraphRAG: extraction, community detection, query engine
├── routes.py                  # Flask API endpoints
├── shared_state.py            # Global state for agent/index/graph (cross-module)
├── session_manager.py         # Session-to-conversation mapping
├── mongodb_utils.py           # MongoDB CRUD operations
├── json_utils.py              # Custom JSON serialization for LlamaIndex types
├── document_generator.py      # Markdown-to-Word document conversion
├── utils.py                   # Logging and file utilities
├── init_mongodb.py            # Database initialization script
├── requirements.txt           # Python dependencies
├── .env                       # Environment variables (not committed)
├── supporting_files/
│   └── files_for_rag_store/   # HP marketing documents for indexing
├── uploads/
│   └── images/                # Extracted document page images
├── index_storage/
│   ├── hp_docs_index/         # Persisted vector index
│   └── graphrag_cache/        # Cached community summaries (pickle)
└── chat-interface/            # React frontend
    ├── src/
    │   ├── App.jsx            # Main chat interface component
    │   ├── auth.js            # MSAL authentication
    │   ├── components/
    │   │   ├── ChatInterface.jsx
    │   │   ├── ConversationManager.jsx
    │   │   └── ThemeToggle.jsx
    │   └── lib/utils.js
    ├── package.json
    └── dist/                  # Production build output
```
## Architecture Overview
```
┌─────────────┐     POST /chat      ┌──────────────┐
│  React UI   │ ──────────────────► │    Flask/    │
│  (App.jsx)  │ ◄────────────────── │  Hypercorn   │
│             │    JSON response    │ (routes.py)  │
└─────────────┘                     └──────┬───────┘
                                    ┌──────▼───────┐
                                    │ Session Mgr  │──── MongoDB
                                    │              │     (conversations,
                                    └──────┬───────┘      messages, users)
                                    ┌──────▼───────┐
                                    │ ReActAgent2  │
                                    │ (ai_core.py) │
                                    └──────┬───────┘
                              ┌────────────┼────────────┐
                              │                         │
                       ┌──────▼──────┐           ┌──────▼───────┐
                       │   Vector    │           │   GraphRAG   │
                       │ Query Tool  │           │  Query Tool  │
                       │             │           │              │
                       │ LlamaIndex  │           │   Vector +   │
                       │ VectorStore │           │  Community   │
                       │    Index    │           │  Retrieval   │
                       └─────────────┘           └──────┬───────┘
                                                 ┌──────▼───────┐
                                                 │    Neo4j     │
                                                 │  Knowledge   │
                                                 │    Graph     │
                                                 └──────────────┘
```
## API Reference
| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/chat` | Send a chat message. Body: `{message, sessionId}` |
| `GET` | `/status?sessionId=` | Check system initialization status |
| `GET` | `/conversations` | List user's conversations (requires `X-MS-USERNAME` header) |
| `POST` | `/conversations/new` | Create a new conversation |
| `GET` | `/conversations/:id/messages` | Get messages for a conversation |
| `DELETE` | `/conversations/:id` | Soft-delete a conversation |
| `POST` | `/reset` | Reset global agent memory. Body: `{sessionId}` |
| `GET` | `/images/:filename` | Serve a document page image |
| `GET` | `/list-images` | List all available images |
| `POST` | `/download-brief` | Generate Word doc. Body: `{brief_content, sessionId}` |
**Authentication**: The frontend sends the MSAL username in the `X-MS-USERNAME` header. In development mode (`PRODUCTION=false`), a default user, `dev_user@local`, is used instead.
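For calling the API outside the React frontend, the following sketch prepares a `POST /chat` request with the Python standard library. The helper names `build_chat_request` and `send_chat` are hypothetical. The body fields and header name come from the table above. The response is assumed to be JSON, as the architecture diagram suggests, but its exact fields are not documented here, so the sketch returns the decoded payload as is.

```python
import json
import urllib.request


def build_chat_request(base_url, message, session_id, username=None):
    """Prepare (but do not send) a POST /chat request."""
    body = json.dumps({"message": message, "sessionId": session_id}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if username:
        # Header the frontend uses to pass the MSAL username.
        headers["X-MS-USERNAME"] = username
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat", data=body, headers=headers, method="POST"
    )


def send_chat(request, timeout=600):
    """Send the request and decode the JSON reply (response shape is assumed)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running locally, `send_chat(build_chat_request("http://localhost:8746", "Hello", "demo-session"))` would issue the call. The 600-second client timeout is chosen here only to match the documented `AGENT_TIMEOUT`.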
## Configuration
All configuration is centralized in `config.py`. Key settings:
| Setting | Default | Description |
|---------|---------|-------------|
| `LLM_MODEL` | `chatgpt-4o-latest` | Main LLM for the ReAct agent |
| `EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model for vector index |
| `LLM_TEMPERATURE` | `0.3` | LLM temperature |
| `SIMILARITY_TOP_K` | `10` | Number of vector results to retrieve |
| `AGENT_TIMEOUT` | `600s` | Overall agent workflow timeout |
| `LLM_TIMEOUT` | `300s` | Per-LLM-call timeout |
| `SERVER_PORT` | `8746` | Backend server port |
Community summaries use `gpt-4o-mini` for cost efficiency (configured in `graph_rag_integration.py`).
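To illustrate how the defaults in the table relate to each other, here is a hypothetical grouping of the same values in a frozen dataclass. It is not the contents of `config.py`. The class and field names are invented for this sketch, and mapping the `PORT` environment variable onto the server port is an assumption drawn from the `.env` example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the documented defaults."""

    llm_model: str = "chatgpt-4o-latest"
    embedding_model: str = "text-embedding-3-small"
    llm_temperature: float = 0.3
    similarity_top_k: int = 10
    agent_timeout: int = 600  # seconds, whole agent workflow
    llm_timeout: int = 300    # seconds, single LLM call
    server_port: int = 8746


def load_settings(env=None):
    """Apply a PORT override if present (assumed mapping); keep other defaults."""
    env = os.environ if env is None else env
    return Settings(server_port=int(env.get("PORT", Settings.server_port)))


settings = load_settings({"PORT": "9000"})
print(settings.server_port, settings.llm_timeout, settings.agent_timeout)
```

The per-call timeout is half the workflow timeout, which leaves room for the agent to make more than one LLM call in a single request.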
## Adding Documents
Place HP marketing documents (PDF, DOCX, PPTX, TXT) in `supporting_files/files_for_rag_store/`. On the next startup with no existing index, the system will:
1. Parse documents with LlamaParse (text + image extraction)
2. Split into semantic chunks
3. Build a vector index (persisted to `index_storage/hp_docs_index/`)
4. Extract knowledge graph triplets and store in Neo4j
5. Run community detection and cache summaries
To force a full reindex, delete `index_storage/hp_docs_index/` and clear the Neo4j database before restarting.
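The file-system half of a forced reindex can be scripted. This is a minimal sketch with a hypothetical helper, `remove_vector_index`, that deletes only the persisted vector index directory named above. Clearing Neo4j is left as a manual step, and the sketch does not touch `index_storage/graphrag_cache/`, since this README does not say whether that cache must be removed as well.

```python
import shutil
from pathlib import Path


def remove_vector_index(project_root="."):
    """Delete index_storage/hp_docs_index/ and report whether it existed."""
    index_dir = Path(project_root) / "index_storage" / "hp_docs_index"
    if not index_dir.is_dir():
        return False
    shutil.rmtree(index_dir)
    return True


# Neo4j has to be cleared separately before restarting, for example by
# running the Cypher statement `MATCH (n) DETACH DELETE n` against the
# dedicated instance on port 7688.
```

Call `remove_vector_index()` from the project root while the backend is stopped. A `False` result means there was no persisted index to delete.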
## Deployment
### Backend
- Set `PRODUCTION=true` environment variable
- Server binds to `0.0.0.0` in production mode
- Configure `CORS_ALLOWED_ORIGINS` in `config.py`
- Production URL: `https://ai-sandbox.oliver.solutions/hp_chatbot_back`
### Frontend
```bash
cd chat-interface
npm run build
```
- Deploy `dist/` contents to the `/hp_chatbot/` path
- Ensure proper MIME types for `.js` files on the web server
- Configure SPA routing (see `web.config` or `.htaccess`)
- Production URL: `https://ai-sandbox.oliver.solutions/hp_chatbot/`
## Troubleshooting
| Issue | Solution |
|-------|----------|
| Backend won't start | Check that Neo4j and MongoDB are running. Verify `OPENAI_API_KEY` is set in `.env` |
| "Agent unavailable" errors | Check the startup logs for a failed LLM API test. The dev-only `/reinitialize` endpoint can force re-initialization |
| No images in responses | Verify `LLAMA_CLOUD_API_KEY` is set. Check that `uploads/images/` contains extracted images |
| CORS errors | Add the frontend origin to `CORS_ALLOWED_ORIGINS` in `config.py` |
| Slow first startup | Initial document processing and graph building can take significant time depending on document volume |
| Neo4j connection refused | Ensure Neo4j is on port 7688 (not 7687, which is a different project) |
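Several of the issues above come down to a service not listening where the backend expects it. The sketch below checks TCP reachability with the standard library. `is_port_open`, `check_services`, and `SERVICES` are hypothetical names. The Neo4j and backend ports come from this README, while the MongoDB port is assumed to be the default `27017`.

```python
import socket

# Neo4j and backend ports are documented above; 27017 for MongoDB is an assumption.
SERVICES = {
    "neo4j": ("localhost", 7688),
    "backend": ("localhost", 8746),
    "mongodb": ("localhost", 27017),
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services):
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}
```

Running `check_services(SERVICES)` returns a dictionary of booleans. An open port only shows that something is listening there, so the startup logs remain the place to confirm that it is the right instance with the right credentials.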