# HP Marketing Materials Chatbot

A GraphRAG (Graph Retrieval-Augmented Generation) chatbot that answers questions about HP marketing materials and brand guidelines. It combines vector search with a Neo4j knowledge graph for broader retrieval, and it processes multimodal documents (text and images) with LlamaParse.

## Features

- **Hybrid retrieval**: Vector search + knowledge graph community detection for richer context
- **Multimodal document processing**: Extracts text and page images from PDFs via LlamaParse
- **Custom ReAct agent**: LlamaIndex-based workflow with tool use, reasoning steps, and source citations
- **Conversation persistence**: MongoDB-backed chat history with multi-conversation support
- **Image references**: Responses include relevant document page screenshots
- **Brief export**: Download conversation summaries as Word documents

## Prerequisites

- **Python 3.10+**
- **Node.js 18+**
- **Neo4j** (dedicated instance on port 7688)
- **MongoDB** (with authentication configured)
- **API Keys**: OpenAI (`OPENAI_API_KEY`), LlamaCloud (`LLAMA_CLOUD_API_KEY`)

## Quick Start

### 1. Backend Setup

```bash
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

# Create .env file at project root
cat > .env << 'EOF'
OPENAI_API_KEY=your_openai_key
LLAMA_CLOUD_API_KEY=your_llama_cloud_key
NEO4J_URL=bolt://localhost:7688
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=hp-graphrag-2024
PORT=8746
PRODUCTION=false
LOG_LEVEL=INFO
EOF

# Start the server
python main.py
```

The backend runs on `http://localhost:8746`. On first startup it will:
1. Initialize MongoDB collections and indexes
2. Load or build the vector index from `supporting_files/files_for_rag_store/`
3. Connect to Neo4j and build/load the knowledge graph
4. Build community summaries (cached to `index_storage/graphrag_cache/`)
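
Because every startup step above depends on values from `.env`, a quick pre-flight check can save a failed start. The snippet below is a minimal sketch, not part of the project: `REQUIRED_VARS` and `missing_env_vars` are hypothetical names, and treating all five variables as required is an assumption based on the example `.env`. It only inspects the process environment and does not read the `.env` file itself.

```python
import os

# Names copied from the .env example; treating all of them as required is an assumption.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "LLAMA_CLOUD_API_KEY",
    "NEO4J_URL",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not str(env.get(name, "")).strip()]
```

Calling `missing_env_vars()` with no arguments checks the current environment; an empty list means all five variables are present.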

### 2. Frontend Setup

```bash
cd chat-interface

# Install dependencies
npm install

# Create .env file
cat > .env << 'EOF'
VITE_BACKEND_URL=http://localhost:8746
VITE_APP_BASE_URL=/
EOF

# Start dev server
npm run dev
```

The frontend runs on `http://localhost:5173`.

### 3. Database Setup

**Neo4j:**
- Run a Neo4j instance on port 7688 (port 7687 is reserved for a separate project)
- Credentials: `neo4j` / `hp-graphrag-2024`
- The application auto-populates the graph on first index build

**MongoDB:**
- Create a user `hp` with password `hp` in the `hp_chatbot` database, and connect with `authSource=hp_chatbot`
- Database: `hp_chatbot`
- Collections (`users`, `conversations`, `messages`) are auto-created by `init_mongodb.py` on startup

Example MongoDB user setup:
```javascript
use hp_chatbot
db.createUser({
  user: "hp",
  pwd: "hp",
  roles: [{ role: "readWrite", db: "hp_chatbot" }]
})
```
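
For reference, a connection string for that user could be assembled as in the sketch below. `build_mongo_uri` is a hypothetical helper, and the host and port (`localhost:27017`, MongoDB's default) are assumptions; how `mongodb_utils.py` actually connects is not shown here.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host="localhost", port=27017, database="hp_chatbot"):
    """Build a MongoDB URI that authenticates against `database`."""
    # quote_plus escapes characters such as '@' or ':' in credentials.
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}?authSource={database}"
    )


uri = build_mongo_uri("hp", "hp")
```

The resulting string can be passed to a client such as `pymongo.MongoClient(uri)`, assuming `pymongo` is the driver in use.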

## Project Structure

```
├── main.py                      # Entry point, Hypercorn ASGI server
├── config.py                    # Centralized configuration
├── ai_core.py                   # ReAct agent, document processing, index init
├── graph_rag_integration.py     # GraphRAG: extraction, community detection, query engine
├── routes.py                    # Flask API endpoints
├── shared_state.py              # Global state for agent/index/graph (cross-module)
├── session_manager.py           # Session-to-conversation mapping
├── mongodb_utils.py             # MongoDB CRUD operations
├── json_utils.py                # Custom JSON serialization for LlamaIndex types
├── document_generator.py        # Markdown-to-Word document conversion
├── utils.py                     # Logging and file utilities
├── init_mongodb.py              # Database initialization script
├── requirements.txt             # Python dependencies
├── .env                         # Environment variables (not committed)
├── supporting_files/
│   └── files_for_rag_store/     # HP marketing documents for indexing
├── uploads/
│   └── images/                  # Extracted document page images
├── index_storage/
│   ├── hp_docs_index/           # Persisted vector index
│   └── graphrag_cache/          # Cached community summaries (pickle)
└── chat-interface/              # React frontend
    ├── src/
    │   ├── App.jsx              # Main chat interface component
    │   ├── auth.js              # MSAL authentication
    │   ├── components/
    │   │   ├── ChatInterface.jsx
    │   │   ├── ConversationManager.jsx
    │   │   └── ThemeToggle.jsx
    │   └── lib/utils.js
    ├── package.json
    └── dist/                    # Production build output
```

## Architecture Overview

```
┌─────────────┐     POST /chat      ┌──────────────┐
│   React UI  │ ──────────────────► │  Flask/       │
│  (App.jsx)  │ ◄────────────────── │  Hypercorn    │
│             │   JSON response     │  (routes.py)  │
└─────────────┘                     └──────┬───────┘
                                           │
                                    ┌──────▼───────┐
                                    │ Session Mgr   │──── MongoDB
                                    │               │     (conversations,
                                    └──────┬───────┘      messages, users)
                                           │
                                    ┌──────▼───────┐
                                    │ ReActAgent2   │
                                    │ (ai_core.py)  │
                                    └──────┬───────┘
                                           │
                              ┌────────────┼────────────┐
                              │                         │
                       ┌──────▼──────┐          ┌──────▼───────┐
                       │ Vector      │          │ GraphRAG      │
                       │ Query Tool  │          │ Query Tool    │
                       │             │          │               │
                       │ LlamaIndex  │          │ Vector +      │
                       │ VectorStore │          │ Community     │
                       │ Index       │          │ Retrieval     │
                       └─────────────┘          └──────┬───────┘
                                                       │
                                                ┌──────▼───────┐
                                                │   Neo4j       │
                                                │   Knowledge   │
                                                │   Graph       │
                                                └──────────────┘
```
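
To make the two-branch retrieval in the diagram concrete, here is a toy sketch of merging results from a vector tool and a community-summary tool into one ranked context list. It is purely illustrative: `Hit` and `merge_context` are hypothetical names, the scores are assumed to be comparable across both sources, and the real logic in `ai_core.py` and `graph_rag_integration.py` may work quite differently.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    score: float
    source: str  # "vector" or "community" (hypothetical labels)


def merge_context(vector_hits, community_hits, limit=5):
    """Merge both result lists, drop duplicate texts, keep the best-scoring entries."""
    best = {}
    for hit in list(vector_hits) + list(community_hits):
        key = hit.text.strip().lower()  # case-insensitive duplicate detection
        if key not in best or hit.score > best[key].score:
            best[key] = hit
    ranked = sorted(best.values(), key=lambda h: h.score, reverse=True)
    return ranked[:limit]
```

Keeping the `source` label on each hit is one way to preserve the citation trail when both branches return overlapping content.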

## API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/chat` | Send a chat message. Body: `{message, sessionId}` |
| `GET` | `/status?sessionId=` | Check system initialization status |
| `GET` | `/conversations` | List user's conversations (requires `X-MS-USERNAME` header) |
| `POST` | `/conversations/new` | Create a new conversation |
| `GET` | `/conversations/:id/messages` | Get messages for a conversation |
| `DELETE` | `/conversations/:id` | Soft-delete a conversation |
| `POST` | `/reset` | Reset global agent memory. Body: `{sessionId}` |
| `GET` | `/images/:filename` | Serve a document page image |
| `GET` | `/list-images` | List all available images |
| `POST` | `/download-brief` | Generate Word doc. Body: `{brief_content, sessionId}` |

**Authentication**: The frontend sends the MSAL username in the `X-MS-USERNAME` header. In development mode (`PRODUCTION=false`), the default user `dev_user@local` is used.
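
As a rough illustration of calling the API outside the React frontend, the sketch below builds a `POST /chat` request using only the standard library. `build_chat_request` and `send` are hypothetical helpers, the 600-second timeout is an assumption chosen to match `AGENT_TIMEOUT`, and the response is treated as opaque JSON because its fields are not documented here.

```python
import json
from urllib.request import Request, urlopen

BACKEND_URL = "http://localhost:8746"  # matches PORT in the backend .env example


def build_chat_request(message, session_id, username=None, base_url=BACKEND_URL):
    """Build (but do not send) a POST /chat request with a JSON body."""
    headers = {"Content-Type": "application/json"}
    if username:
        headers["X-MS-USERNAME"] = username  # header named in the Authentication note
    body = json.dumps({"message": message, "sessionId": session_id}).encode("utf-8")
    return Request(f"{base_url}/chat", data=body, headers=headers, method="POST")


def send(request, timeout=600):
    """Send the request and decode the JSON response (requires a running backend)."""
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the backend running, `send(build_chat_request("Which fonts are approved?", "session-1"))` would return the decoded JSON reply.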

## Configuration

All configuration is centralized in `config.py`. Key settings:

| Setting | Default | Description |
|---------|---------|-------------|
| `LLM_MODEL` | `chatgpt-4o-latest` | Main LLM for the ReAct agent |
| `EMBEDDING_MODEL` | `text-embedding-3-small` | Embedding model for vector index |
| `LLM_TEMPERATURE` | `0.3` | LLM temperature |
| `SIMILARITY_TOP_K` | `10` | Number of vector results to retrieve |
| `AGENT_TIMEOUT` | `600s` | Overall agent workflow timeout |
| `LLM_TIMEOUT` | `300s` | Per-LLM-call timeout |
| `SERVER_PORT` | `8746` | Backend server port |

Community summaries use `gpt-4o-mini` for cost efficiency (configured in `graph_rag_integration.py`).
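
The table above could map onto a settings object along the lines of the sketch below. This is not the contents of `config.py`: the `Settings` class, its field names, and `load_settings` are hypothetical, and the assumption that only `PORT` and `PRODUCTION` are read from the environment comes from the example `.env`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the configuration table.
    llm_model: str = "chatgpt-4o-latest"
    embedding_model: str = "text-embedding-3-small"
    llm_temperature: float = 0.3
    similarity_top_k: int = 10
    agent_timeout: int = 600  # seconds, whole agent workflow
    llm_timeout: int = 300    # seconds, single LLM call
    server_port: int = 8746
    production: bool = False


def load_settings(env=None):
    """Apply PORT and PRODUCTION overrides on top of the defaults."""
    env = os.environ if env is None else env
    return Settings(
        server_port=int(env.get("PORT", Settings.server_port)),
        production=str(env.get("PRODUCTION", "false")).strip().lower() == "true",
    )
```

Keeping `llm_timeout` below `agent_timeout` leaves room for more than one LLM call within a single agent run.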

## Adding Documents

Place HP marketing documents (PDF, DOCX, PPTX, TXT) in `supporting_files/files_for_rag_store/`. On the next startup with no existing index, the system will:

1. Parse documents with LlamaParse (text + image extraction)
2. Split into semantic chunks
3. Build a vector index (persisted to `index_storage/hp_docs_index/`)
4. Extract knowledge graph triplets and store in Neo4j
5. Run community detection and cache summaries
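
Before step 1 runs, it can help to confirm which files would be picked up. The sketch below lists files with the supported extensions; `find_documents` is a hypothetical helper, and searching subdirectories recursively is an assumption about how the indexer walks the folder.

```python
from pathlib import Path

# Extensions taken from the supported formats listed above.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx", ".txt"}


def find_documents(root="supporting_files/files_for_rag_store"):
    """Return supported files under `root`, sorted for stable ordering."""
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(
        p for p in base.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

An empty result usually means the folder is missing or contains only unsupported formats.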

To force a full reindex, delete `index_storage/hp_docs_index/` and clear the Neo4j database before restarting.
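
The reset described above can be scripted. The following is a minimal sketch under stated assumptions: `remove_vector_index` and `clear_neo4j` are hypothetical helpers, the Cypher statement deletes every node and relationship in the target database, and the `neo4j` Python driver must be installed for the second function.

```python
import shutil
from pathlib import Path

CLEAR_GRAPH_CYPHER = "MATCH (n) DETACH DELETE n"  # removes every node and relationship


def remove_vector_index(index_dir="index_storage/hp_docs_index"):
    """Delete the persisted vector index; return True if something was removed."""
    path = Path(index_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


def clear_neo4j(url, username, password):
    """Wipe the graph. Requires the third-party `neo4j` driver package."""
    from neo4j import GraphDatabase  # imported lazily so the file helper works without it

    with GraphDatabase.driver(url, auth=(username, password)) as driver:
        with driver.session() as session:
            session.run(CLEAR_GRAPH_CYPHER).consume()
```

Run both steps with the backend stopped, then restart so the index and graph are rebuilt from `supporting_files/files_for_rag_store/`.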

## Deployment

### Backend
- Set `PRODUCTION=true` environment variable
- Server binds to `0.0.0.0` in production mode
- Configure `CORS_ALLOWED_ORIGINS` in `config.py`
- Production URL: `https://ai-sandbox.oliver.solutions/hp_chatbot_back`

### Frontend
```bash
cd chat-interface
npm run build
```
- Deploy `dist/` contents to the `/hp_chatbot/` path
- Ensure proper MIME types for `.js` files on the web server
- Configure SPA routing (see `web.config` or `.htaccess`)
- Production URL: `https://ai-sandbox.oliver.solutions/hp_chatbot/`

## Troubleshooting

| Issue | Solution |
|-------|----------|
| Backend won't start | Check that Neo4j and MongoDB are running. Verify that `OPENAI_API_KEY` is set in `.env` |
| "Agent unavailable" errors | Check the startup logs for a failed LLM API test. The dev-only `/reinitialize` endpoint can force re-initialization |
| No images in responses | Verify `LLAMA_CLOUD_API_KEY` is set. Check that `uploads/images/` contains extracted images |
| CORS errors | Add the frontend origin to `CORS_ALLOWED_ORIGINS` in `config.py` |
| Slow first startup | Expected on the first run: document parsing, graph building, and community summarization all scale with document volume |
| Neo4j connection refused | Ensure Neo4j is listening on port 7688 (not 7687, which is used by a different project) |
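
For the connection issues in the table, a plain TCP check can show whether anything is listening before digging into credentials. `port_open` is a hypothetical helper; the default of `7688` matches the Neo4j port used in this README, and the check says nothing about whether the service behind the port is healthy.

```python
import socket


def port_open(host="localhost", port=7688, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unresolved hosts
        return False
```

The same function works for MongoDB by passing its port, for example `port_open(port=27017)` if the default port is in use.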