Vadym Samoilenko ed040ea497 init: add RAG test app with Excel/CSV export - rag_test_app: OpenAI Assistants benchmark tool - TEST_TO_RUN: Barclays test configs (Internal Banners, Social Posts, Display Banners, PPC) - Added report.xlsx + report.csv export alongside HTML report Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>		2026-05-10 13:29:14 +01:00

RAG Testing Application

A comprehensive Python application for automatically testing and evaluating OpenAI assistants with Retrieval-Augmented Generation (RAG) capabilities.

Overview
Features
Installation
Quick Start Guide
Complete User Guide
Command Line Reference
Advanced Usage Examples
Understanding the Results
Troubleshooting

Overview

This tool helps you evaluate and benchmark OpenAI assistants by:

Generating test prompts from your source documents (with multiple prompt styles)
Running prompts against your assistant multiple times to test consistency
Evaluating responses for quality, consistency, accuracy, and completeness
Generating detailed reports with visualizations and metrics

Perfect for: Testing assistants that create content (banners, copy, documents) using RAG knowledge bases.

Features

✅ Multi-document support: Test with individual documents, directories, or specified sets of documents
✅ Multiple prompt types: Generate realistic user tasks, knowledge questions, or business scenarios
✅ Batch processing: Run multiple test configurations in sequence automatically
✅ Timestamped results: Each test run creates a unique timestamped directory - no more overwriting!
✅ Support for DOCX files: Works with both text and Microsoft Word files
✅ Optimized performance: Parallel processing and batch execution for significantly faster testing
✅ Comprehensive evaluation: Assesses responses for quality, accuracy, consistency, and completeness
✅ Interactive reporting: Generates professional HTML reports with detailed visualizations
✅ Performance tracking: Measures and analyzes response times and other key metrics
✅ Data export: Saves all results as JSON for further analysis, with report.xlsx and report.csv exported alongside the HTML report
✅ Config-based workflow: Easy to set up and customize via configuration files

Installation

Clone this repository:

git clone https://github.com/yourusername/rag-test-app.git
cd rag-test-app

Install the required packages:

pip install -r requirements.txt

Set up your OpenAI API key:

export OPENAI_API_KEY="your-api-key-here"

Quick Start Guide

1. Create a Configuration File

The easiest way to get started:

python cli.py --create-config my_test_config.json

This creates a template configuration file.

2. Edit Your Configuration

Open my_test_config.json and update:

assistant_id: Your OpenAI Assistant ID (e.g., "asst_abc123...")
documents: Paths to your RAG documents
api_key: Your OpenAI API key (or use environment variable)

3. Run Your First Test

python cli.py --config my_test_config.json

4. View Results

Open the generated results/my_test_config_YYYYMMDD_HHMMSS/report.html in your browser!

Complete User Guide

Configuration File Reference

Here's a complete configuration file with ALL available options:

{
  "assistant_id": "asst_YourAssistantIdHere",
  "documents": [
    "/path/to/your/document1.txt",
    "/path/to/your/document2.docx"
  ],
  "api_key": "YOUR_OPENAI_API_KEY",
  "output_dir": "results",
  "num_questions": 20,
  "iterations": 3,
  "questions_file": "",
  "generate_only": false,
  "verbose": true,
  "model": "gpt-4o",
  "prompt_type": "task-based",
  "parallel": 10,
  "batch_size": 30
}

Configuration Options Explained

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `assistant_id` | string | (required) | Your OpenAI Assistant ID starting with "asst_" |
| `documents` | array | null | List of document file paths (preferred method) |
| `document` | string | null | Single document or directory path (alternative to `documents`) |
| `api_key` | string | env var | OpenAI API key (can also use `OPENAI_API_KEY` environment variable) |
| `output_dir` | string | "results" | Base directory for saving results |
| `num_questions` | integer | 20 | Number of test prompts to generate |
| `iterations` | integer | 3 | How many times to test each prompt (for consistency checking) |
| `questions_file` | string | null | Path to pre-generated questions JSON file |
| `generate_only` | boolean | false | Only generate questions, don't run tests |
| `verbose` | boolean | false | Enable detailed logging for debugging |
| `model` | string | "gpt-4o" | GPT model for question generation and evaluation |
| `prompt_type` | string | "task-based" | Type of prompts to generate (see below) |
| `parallel` | integer | 5 | Number of parallel workers for running tests |
| `batch_size` | integer | same as parallel | Questions per batch (set to 2-3x parallel for best performance) |
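The defaults in the table can be applied to a loaded config before a run. The sketch below is only an illustration of the documented defaults, not the tool's own loader: `normalize_config`, `DOCUMENTED_DEFAULTS`, and `PROMPT_TYPES` are hypothetical names, and the checks are assumptions drawn from the descriptions above.

```python
from typing import Any

# Defaults copied from the options table; the helper itself is hypothetical.
DOCUMENTED_DEFAULTS: dict[str, Any] = {
    "output_dir": "results",
    "num_questions": 20,
    "iterations": 3,
    "generate_only": False,
    "verbose": False,
    "model": "gpt-4o",
    "prompt_type": "task-based",
    "parallel": 5,
}
PROMPT_TYPES = {"task-based", "content-based", "scenario-based"}


def normalize_config(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of raw with documented defaults filled in and basic checks applied."""
    config = {**DOCUMENTED_DEFAULTS, **raw}
    assistant_id = str(config.get("assistant_id") or "")
    if not assistant_id.startswith("asst_"):
        raise ValueError("assistant_id is required and should start with 'asst_'")
    if config["prompt_type"] not in PROMPT_TYPES:
        raise ValueError(f"unknown prompt_type: {config['prompt_type']!r}")
    # batch_size falls back to the value of parallel when it is not set.
    config.setdefault("batch_size", config["parallel"])
    return config
```

Calling `normalize_config(json.load(f))` on a config file gives a dictionary in which every optional setting has a value, which makes it easier to see what a run will actually use.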

Prompt Types

NEW FEATURE: Choose how test prompts are generated to match your testing needs.

1. `task-based` (Default - Recommended)

Generates realistic user task requests that emulate how real users interact with your assistant.

Best for: Testing assistants that create content (banners, copy, ads, documents)

Example prompts:

"Create a banner for our new credit card offer with 0% APR"
"Write copy for a savings account promotion targeting young professionals"
"Generate headlines for our mobile banking app launch"
"Design promotional text for a balance transfer campaign"

When to use:

Testing content creation assistants
Simulating real user interactions
Evaluating practical usability

Configuration:

{
  "prompt_type": "task-based"
}

Command line:

python cli.py --config myconfig.json --prompt-type task-based

2. `content-based` (Original)

Generates knowledge questions about the content in your RAG documents.

Best for: Testing document understanding and knowledge retrieval

Example prompts:

"What is the FCA Consumer Duty requirement?"
"Explain the principles of clear customer communication"
"What are the considerations for vulnerable customers?"
"List the regulatory guidelines for financial advertising"

When to use:

Verifying RAG knowledge accuracy
Testing document comprehension
Auditing information retrieval

Configuration:

{
  "prompt_type": "content-based"
}

Command line:

python cli.py --config myconfig.json --prompt-type content-based

3. `scenario-based`

Generates realistic business scenarios that combine tasks with context and requirements.

Best for: Testing complex real-world use cases with constraints

Example prompts:

"We're launching a new credit card for students. Create FCA-compliant banner copy that's clear and accessible"
"Our vulnerable customer initiative needs promotional materials. Write banner text that follows Consumer Duty guidelines"
"Create an internal banner for our mobile banking upgrade targeting existing customers aged 50+"
"We have a new savings product for first-time buyers. Generate compliant promotional copy"

When to use:

Testing compliance and constraints
Simulating real business workflows
Evaluating context handling

Configuration:

{
  "prompt_type": "scenario-based"
}

Command line:

python cli.py --config myconfig.json --prompt-type scenario-based

Comparing Prompt Types

| Prompt Type | Use Case | Complexity | Best For |
| --- | --- | --- | --- |
| task-based | Simple user requests | Low | Daily user interactions |
| content-based | Knowledge questions | Medium | RAG accuracy testing |
| scenario-based | Business scenarios | High | Real-world workflows |

Batch Processing

NEW FEATURE: Run multiple test configurations automatically in sequence.

Instead of running tests one at a time, point to a directory containing multiple config files and run them all at once!

Setting Up Batch Tests

Create a directory with multiple configs:

mkdir my_test_suite

Add multiple configuration files:

my_test_suite/
├── test_credit_cards.json
├── test_savings.json
├── test_loans.json
└── test_mobile_banking.json

Run all tests:

python cli.py --config-dir my_test_suite

What Happens

============================================================
BATCH PROCESSING MODE
Found 4 configuration file(s)
============================================================
  • test_credit_cards.json
  • test_savings.json
  • test_loans.json
  • test_mobile_banking.json

>>> Processing 1/4
============================================================
Processing config: test_credit_cards.json
============================================================
[Running tests...]

>>> Processing 2/4
============================================================
Processing config: test_savings.json
============================================================
[Running tests...]

... and so on ...

============================================================
BATCH PROCESSING COMPLETE
============================================================
✓ Successful: 4
Total time: 45.2 minutes

Batch Processing Benefits

✅ Run comprehensive test suites overnight
✅ Compare results across different assistants
✅ Test multiple prompt types automatically
✅ Automated CI/CD testing pipelines
✅ Progress tracking and error reporting

Command Line Options with Batch

You can override settings for all configs:

# Run all configs but use content-based prompts
python cli.py --config-dir my_test_suite --prompt-type content-based

# Run with higher parallelization
python cli.py --config-dir my_test_suite --parallel 15 --batch-size 45

# Generate questions only (no testing)
python cli.py --config-dir my_test_suite --generate-only

Output Directory Structure

NEW FEATURE: Each test run creates a unique timestamped directory - no more overwriting!

Directory Naming

Results are saved as:

{output_dir}/{config_name}_{timestamp}/

Example:

results/
├── test_credit_cards_20251112_143022/
│   ├── report.html
│   ├── test_results.json
│   ├── evaluation.json
│   ├── test_questions.json
│   └── *.png (charts)
├── test_credit_cards_20251112_153045/
│   ├── report.html
│   └── ...
└── test_savings_20251112_160112/
    ├── report.html
    └── ...
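The naming scheme can be reproduced in a few lines of Python. This is a sketch under two assumptions inferred from the example tree: the config name is the config file name without its extension, and the timestamp format is `YYYYMMDD_HHMMSS`. The function name `run_directory` is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def run_directory(output_dir: str, config_path: str, when: datetime) -> Path:
    """Build {output_dir}/{config_name}_{timestamp} for one run."""
    # "my_test_suite/test_credit_cards.json" -> "test_credit_cards"
    config_name = Path(config_path).stem
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{config_name}_{timestamp}"
```

For example, `run_directory("results", "test_credit_cards.json", datetime.now())` yields a path of the same shape as the directories listed above.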

Benefits

✅ Never lose results - each run is preserved
✅ Easy comparison - compare results across test runs
✅ Audit trail - complete history of all tests
✅ Organized - group results by test name and time

Customizing Output Location

In config file:

{
  "output_dir": "my_results"
}

Command line:

python cli.py --config myconfig.json --output-dir my_results

Results will be saved to:

my_results/{config_name}_{timestamp}/

Command Line Reference

Full Command Syntax

usage: cli.py [-h] [--config CONFIG] [--config-dir CONFIG_DIR]
              [--create-config OUTPUT_PATH] [--api-key API_KEY]
              [--assistant-id ASSISTANT_ID] [--document DOCUMENT]
              [--documents DOCUMENTS [DOCUMENTS ...]] [--output-dir OUTPUT_DIR]
              [--num-questions NUM_QUESTIONS] [--iterations ITERATIONS]
              [--questions-file QUESTIONS_FILE] [--generate-only] [--verbose]
              [--model MODEL] [--prompt-type {task-based,content-based,scenario-based}]
              [--parallel PARALLEL] [--batch-size BATCH_SIZE]

Common Commands

# Get help
python cli.py --help

# Create a config template
python cli.py --create-config my_config.json

# Run single test with config file
python cli.py --config my_config.json

# Run batch tests
python cli.py --config-dir my_test_suite/

# Run without config (all command line)
python cli.py --assistant-id asst_abc123 --document myfile.txt

# Generate questions only
python cli.py --config my_config.json --generate-only

# Use pre-generated questions
python cli.py --config my_config.json --questions-file results/my_config_YYYYMMDD_HHMMSS/test_questions.json

# Change prompt type
python cli.py --config my_config.json --prompt-type scenario-based

# High performance mode
python cli.py --config my_config.json --parallel 15 --batch-size 45

Advanced Usage Examples

Example 1: Complete Testing Workflow

# Step 1: Create config
python cli.py --create-config banner_test.json

# Step 2: Edit banner_test.json with your settings

# Step 3: Generate questions first to review
python cli.py --config banner_test.json --generate-only --num-questions 50

# Step 4: Review generated questions in results/*/test_questions.json

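# Step 5: Run the full test with the reviewed questions
# (the glob must match exactly one run directory; otherwise pass the full path)
python cli.py --config banner_test.json --questions-file results/banner_test_*/test_questions.json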

# Step 6: Open report.html to view results
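Once a config has been run more than once, a shell glob over `results/` matches several directories. The sketch below picks the questions file from the newest run instead. It assumes the `{config_name}_{timestamp}` directory naming described under Output Directory Structure; `latest_questions_file` is a hypothetical helper, not part of the tool.

```python
import re
from pathlib import Path
from typing import Optional


def latest_questions_file(output_dir: str, config_name: str) -> Optional[Path]:
    """Return test_questions.json from the newest run of config_name, or None."""
    base = Path(output_dir)
    if not base.is_dir():
        return None
    # Only accept directories named exactly {config_name}_YYYYMMDD_HHMMSS.
    pattern = re.compile(rf"{re.escape(config_name)}_\d{{8}}_\d{{6}}")
    # Timestamps in this format sort chronologically as plain strings.
    runs = sorted(p for p in base.iterdir() if p.is_dir() and pattern.fullmatch(p.name))
    for run in reversed(runs):
        candidate = run / "test_questions.json"
        if candidate.is_file():
            return candidate
    return None
```

The returned path can then be passed to `--questions-file` in Step 5.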

Example 2: Testing Multiple Prompt Types

# Create base config
cat > base_config.json << EOF
{
  "assistant_id": "asst_abc123",
  "documents": ["docs/guidelines.docx"],
  "num_questions": 30,
  "iterations": 5
}
EOF

# Test with task-based prompts
python cli.py --config base_config.json --prompt-type task-based

# Test with content-based prompts
python cli.py --config base_config.json --prompt-type content-based

# Test with scenario-based prompts
python cli.py --config base_config.json --prompt-type scenario-based

# Compare the three result directories!
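The three runs above can also be prepared as a batch suite. The following sketch writes one config per prompt type into a directory that can then be passed to `--config-dir`. The function name and the file names are hypothetical; only the `prompt_type` values come from this guide.

```python
import json
from pathlib import Path

PROMPT_TYPES = ("task-based", "content-based", "scenario-based")


def write_prompt_type_configs(base_config: dict, suite_dir: str) -> list[Path]:
    """Write one config per prompt type into suite_dir and return the file paths."""
    suite = Path(suite_dir)
    suite.mkdir(parents=True, exist_ok=True)
    written = []
    for prompt_type in PROMPT_TYPES:
        config = {**base_config, "prompt_type": prompt_type}
        # File names are illustrative; result directories are named after the config file.
        path = suite / f"base_{prompt_type.replace('-', '_')}.json"
        path.write_text(json.dumps(config, indent=2), encoding="utf-8")
        written.append(path)
    return written
```

After calling it with the contents of base_config.json, `python cli.py --config-dir <suite_dir>` runs all three variants in one batch, and each variant gets its own result directory.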

Example 3: High-Volume Testing

# For testing with many questions and high parallelization
python cli.py --config my_config.json \
  --num-questions 100 \
  --iterations 10 \
  --parallel 20 \
  --batch-size 60 \
  --verbose

Example 4: Continuous Integration

#!/bin/bash
# run_tests.sh - Automated testing script

# Set environment
export OPENAI_API_KEY="your-key"

# Run test suite
python cli.py --config-dir test_configs/

# Check exit code
if [ $? -eq 0 ]; then
  echo "All tests passed!"
else
  echo "Some tests failed!"
  exit 1
fi
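If the pipeline is driven from Python rather than bash, the same idea might look like the sketch below. It runs each config as a separate `cli.py --config` invocation so that failures can be counted per config. It assumes that cli.py exits with a non-zero status when a run fails, which this guide does not state explicitly; `build_commands` and `run_all` are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(config_dir: str, extra_args: tuple[str, ...] = ()) -> list[list[str]]:
    """Return one cli.py invocation per JSON config in config_dir, in name order."""
    configs = sorted(Path(config_dir).glob("*.json"))
    return [[sys.executable, "cli.py", "--config", str(c), *extra_args] for c in configs]


def run_all(config_dir: str) -> int:
    """Run every config and return the number of invocations that exited non-zero."""
    failures = 0
    for command in build_commands(config_dir):
        # Assumption: cli.py signals a failed run through its exit status.
        if subprocess.run(command).returncode != 0:
            failures += 1
    return failures
```

A CI job could call `sys.exit(1 if run_all("test_configs") else 0)` to fail the build when any config fails.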

Example 5: A/B Testing Different Assistants

config_assistant_v1.json:
{
  "assistant_id": "asst_v1_abc123",
  "documents": ["docs/guidelines.docx"],
  "questions_file": "shared_questions.json",
  "num_questions": 50
}

config_assistant_v2.json:
{
  "assistant_id": "asst_v2_def456",
  "documents": ["docs/guidelines.docx"],
  "questions_file": "shared_questions.json",
  "num_questions": 50
}

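# Generate questions once. If --generate-only writes test_questions.json into a new
# timestamped run directory (see Output Directory Structure), copy it to the shared
# path that both configs reference.
python cli.py --config config_assistant_v1.json --generate-only
cp results/config_assistant_v1_*/test_questions.json shared_questions.json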

# Test both assistants with same questions
python cli.py --config config_assistant_v1.json
python cli.py --config config_assistant_v2.json

# Compare the results!
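One simple way to compare the two runs is to subtract the summary averages. The sketch assumes each run's evaluation.json has a `summary` object with the keys shown under JSON Output Files; `compare_summaries` is a hypothetical helper.

```python
METRICS = (
    "average_quality",
    "average_consistency",
    "average_accuracy",
    "average_completeness",
    "average_response_time",
)


def compare_summaries(summary_a: dict, summary_b: dict) -> dict[str, float]:
    """Return summary_b minus summary_a for every metric present in both."""
    return {
        key: round(summary_b[key] - summary_a[key], 3)
        for key in METRICS
        if key in summary_a and key in summary_b
    }
```

Positive values mean the second assistant scored higher. For `average_response_time`, a negative value means the second assistant was faster.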

Example 6: Multi-Document Testing

{
  "assistant_id": "asst_abc123",
  "documents": [
    "/path/to/consumer_duty.docx",
    "/path/to/fca_guidelines.docx",
    "/path/to/brand_guidelines.txt",
    "/path/to/product_specs.docx"
  ],
  "num_questions": 40,
  "prompt_type": "scenario-based"
}

Understanding the Results

HTML Report

After tests complete, open report.html to see:

1. Summary Metrics

Overall quality score
Average consistency score
Average accuracy score
Average completeness score
Average response time
Total tests run

2. Performance Charts

Scores by Question: Bar chart showing all metric scores for each question
Response Times: How fast the assistant responds
Score Distribution: Histogram of score ranges
Radar Chart: Visual comparison of quality, consistency, accuracy, and completeness

3. Question-by-Question Analysis

For each test prompt:

Question text
Individual scores (quality, consistency, accuracy, completeness)
Evaluation notes and feedback
All response iterations (collapsible)

Evaluation Metrics

Each response is scored 1-10 on four dimensions:

Quality Score (1-10)

Clarity and coherence
Professional tone
No hallucinations
Grammar and readability

Consistency Score (1-10)

Similar answers across iterations
Consistent facts and details
No contradictions
Stable level of detail

Accuracy Score (1-10)

Information matches documents
Correct facts and numbers
No misrepresentations
Proper context interpretation

Completeness Score (1-10)

Addresses all aspects of the question
Includes necessary context
Sufficient detail
No significant omissions

JSON Output Files

`test_questions.json`

{
  "questions": [
    "Create a banner for...",
    "Write copy for...",
    ...
  ]
}

`test_results.json`

{
  "results": [
    {
      "question_id": 0,
      "question": "Create a banner...",
      "iteration": 0,
      "response": "Here's your banner: ...",
      "response_time": 2.34,
      "status": "completed"
    },
    ...
  ]
}
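Because every iteration is stored as its own entry, per-question figures have to be aggregated. A minimal sketch, assuming the entry fields shown above (`question_id`, `response_time`, `status`); the function name is hypothetical, and any status other than "completed" is simply skipped.

```python
from collections import defaultdict
from statistics import mean


def response_time_by_question(results: list[dict]) -> dict[int, float]:
    """Average response_time per question_id over completed iterations."""
    times: dict[int, list[float]] = defaultdict(list)
    for entry in results:
        # Only "completed" is documented; other statuses are ignored here.
        if entry.get("status") == "completed":
            times[entry["question_id"]].append(entry["response_time"])
    return {qid: mean(values) for qid, values in times.items()}
```

Pass it the list stored under the `results` key after loading the file with `json.load`.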

`evaluation.json`

{
  "summary": {
    "total_questions": 20,
    "total_iterations": 60,
    "average_quality": 8.5,
    "average_consistency": 9.2,
    "average_accuracy": 8.8,
    "average_completeness": 8.6,
    "average_response_time": 2.1
  },
  "by_question": [...]
}
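The summary block lends itself to a simple quality gate. The sketch below lists the score averages that fall below a chosen minimum. The threshold and the helper name `metrics_below` are assumptions for illustration; a missing key is treated as 0.0 and therefore flagged.

```python
def metrics_below(summary: dict, minimum: float) -> list[str]:
    """Names of the four score averages in summary that fall below minimum."""
    score_keys = (
        "average_quality",
        "average_consistency",
        "average_accuracy",
        "average_completeness",
    )
    # A missing average counts as 0.0 so that incomplete summaries are flagged.
    return [key for key in score_keys if summary.get(key, 0.0) < minimum]
```

With the example summary above, a minimum of 8.7 flags `average_quality` and `average_completeness`, while a minimum of 8.0 flags nothing.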

Troubleshooting

Common Issues

1. Assistant Not Found Error

ERROR: No assistant found with id 'asst_...'

Solution: Check your assistant ID on https://platform.openai.com/assistants

2. API Rate Limits

Error: Rate limit exceeded

Solution: Reduce parallel workers:

python cli.py --config my_config.json --parallel 3

3. Document Loading Errors

Warning: No content loaded from documents

Solutions:

Check file paths are correct
For .docx files: pip install docx2txt
Verify files are readable (not corrupted)
Supported formats: .txt, .docx only

4. Memory Issues

MemoryError: ...

Solution: Reduce batch size:

python cli.py --config my_config.json --batch-size 10

5. Missing API Key

Error: No OpenAI API key provided

Solutions:

# Option 1: Environment variable
export OPENAI_API_KEY="your-key"

# Option 2: In config file
{
  "api_key": "your-key"
}

# Option 3: Command line
python cli.py --api-key "your-key" ...

Debug Mode

Enable verbose output for detailed logging:

python cli.py --config my_config.json --verbose

Or in config:

{
  "verbose": true
}

Performance Tips

Optimize Parallelization
- Start with parallel: 5
- Increase gradually as long as you see no rate-limit errors
- Set batch_size to 2-3x parallel

Balance Speed vs. Cost
- More parallel workers = faster runs, but a higher chance of hitting rate limits
- More iterations = better consistency data, but more API calls and higher cost

Question Generation
- Generate questions once, then reuse them with questions_file
- This saves API calls on repeated tests
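The tips above come down to simple arithmetic, sketched below. It assumes that each question is sent to the assistant once per iteration and that batches are counted in questions, as the `batch_size` description suggests. Calls made for question generation and evaluation are not included. `plan_run` is a hypothetical helper.

```python
import math


def plan_run(num_questions: int, iterations: int, parallel: int) -> dict[str, int]:
    """Estimate the size of a run from the three main settings."""
    batch_size = 3 * parallel  # upper end of the 2-3x guidance
    return {
        "assistant_calls": num_questions * iterations,
        "batch_size": batch_size,
        "batches": math.ceil(num_questions / batch_size),
    }
```

With the defaults (20 questions, 3 iterations, 5 workers) this gives 60 assistant calls, a batch size of 15, and 2 batches.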

Supported File Types

✅ Text files (.txt): Plain text with UTF-8 encoding
✅ Word documents (.docx): Microsoft Word files (requires docx2txt)
❌ PDF files: Not currently supported
❌ Excel/PowerPoint: Not currently supported
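A quick pre-flight check can catch unsupported or missing files before any API calls are made. This is a sketch based only on the list above. `document_problems` is a hypothetical helper and handles a list of file paths, not the directory form accepted by `document`.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".txt", ".docx"}


def document_problems(paths: list[str]) -> list[str]:
    """Return a readable problem for each path that would likely fail to load."""
    problems = []
    for raw in paths:
        path = Path(raw)
        if not path.is_file():
            problems.append(f"{raw}: file not found")
        elif path.suffix.lower() not in SUPPORTED_SUFFIXES:
            problems.append(f"{raw}: unsupported type {path.suffix or '(none)'}")
    return problems
```

An empty list means every path exists and has a supported extension. It does not verify that a .docx file is readable or that docx2txt is installed.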

Best Practices

1. Start Small

# Test with few questions first
python cli.py --config my_config.json --num-questions 5 --iterations 2

2. Use Configuration Files

Easier to track and version
Reusable across tests
Less prone to typos

3. Organize Your Tests

my_project/
├── configs/
│   ├── test_suite_1/
│   │   ├── credit_cards.json
│   │   └── loans.json
│   └── test_suite_2/
│       └── mobile_banking.json
├── results/
└── docs/

4. Version Control Your Configs

git add configs/
git commit -m "Add test configurations"

5. Archive Important Results

# Save important test results
cp -r results/important_test_20251112_143022 archived_results/

License

MIT

Need Help?

📖 Documentation: You're reading it!
🐛 Issues: Report bugs on GitHub
💡 Feature Requests: Open an issue with your idea
📧 Contact: [your-email@example.com]

Happy Testing! 🚀
