barclays-rag-test/rag_test_app
Vadym Samoilenko ed040ea497 init: add RAG test app with Excel/CSV export
- rag_test_app: OpenAI Assistants benchmark tool
- TEST_TO_RUN: Barclays test configs (Internal Banners, Social Posts, Display Banners, PPC)
- Added report.xlsx + report.csv export alongside HTML report

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 13:29:14 +01:00

RAG Testing Application

A comprehensive Python application for automatically testing and evaluating OpenAI assistants with Retrieval-Augmented Generation (RAG) capabilities.

Overview

This tool helps you evaluate and benchmark OpenAI assistants by:

  1. Generating test prompts from your source documents (with multiple prompt styles)
  2. Running prompts against your assistant multiple times to test consistency
  3. Evaluating responses for quality, consistency, accuracy, and completeness
  4. Generating detailed reports with visualizations and metrics

Perfect for: Testing assistants that create content (banners, copy, documents) using RAG knowledge bases.

Features

  • Multi-document support: Test with individual documents, directories, or specified sets of documents
  • Multiple prompt types: Generate realistic user tasks, knowledge questions, or business scenarios
  • Batch processing: Run multiple test configurations in sequence automatically
  • Timestamped results: Each test run creates a unique timestamped directory - no more overwriting!
  • Support for DOCX files: Works with both text and Microsoft Word files
  • Optimized performance: Parallel processing and batch execution for significantly faster testing
  • Comprehensive evaluation: Assesses responses for quality, accuracy, consistency, and completeness
  • Interactive reporting: Generates professional HTML reports with detailed visualizations
  • Performance tracking: Measures and analyzes response times and other key metrics
  • Data export: Saves all results as JSON for further analysis, with report.xlsx and report.csv written alongside the HTML report
  • Config-based workflow: Easy to set up and customize via configuration files

Installation

  1. Clone this repository:
git clone https://github.com/yourusername/rag-test-app.git
cd rag-test-app
  2. Install the required packages:
pip install -r requirements.txt
  3. Set up your OpenAI API key:
export OPENAI_API_KEY="your-api-key-here"

Quick Start Guide

1. Create a Configuration File

The easiest way to get started:

python cli.py --create-config my_test_config.json

This creates a template configuration file.

2. Edit Your Configuration

Open my_test_config.json and update:

  • assistant_id: Your OpenAI Assistant ID (e.g., "asst_abc123...")
  • documents: Paths to your RAG documents
  • api_key: Your OpenAI API key (or use environment variable)
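
Before running a test, it can help to sanity-check the edited file. The sketch below is not part of the tool: check_config is a hypothetical helper that only checks what this guide states about the format (assistant IDs start with "asst_", and documents are given via documents or document).

```python
import json
from pathlib import Path


def check_config(path):
    """Return a list of human-readable problems found in a config file."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = []

    # The guide says assistant IDs start with "asst_".
    assistant_id = str(config.get("assistant_id", ""))
    if not assistant_id.startswith("asst_"):
        problems.append("assistant_id should start with 'asst_'")

    # Documents may be given as a list ("documents") or a single path ("document").
    paths = list(config.get("documents") or [])
    if config.get("document"):
        paths.append(config["document"])
    if not paths:
        problems.append("no documents listed")
    for doc in paths:
        if not Path(doc).exists():
            problems.append(f"document not found: {doc}")

    return problems
```

An empty list means none of these basic checks failed. It does not confirm that the assistant ID exists on the OpenAI side.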

3. Run Your First Test

python cli.py --config my_test_config.json

4. View Results

Open the generated results/my_test_config_YYYYMMDD_HHMMSS/report.html in your browser!

Complete User Guide

Configuration File Reference

Here's a configuration file showing every available option except document, the single-path alternative to documents:

{
  "assistant_id": "asst_YourAssistantIdHere",
  "documents": [
    "/path/to/your/document1.txt",
    "/path/to/your/document2.docx"
  ],
  "api_key": "YOUR_OPENAI_API_KEY",
  "output_dir": "results",
  "num_questions": 20,
  "iterations": 3,
  "questions_file": "",
  "generate_only": false,
  "verbose": true,
  "model": "gpt-4o",
  "prompt_type": "task-based",
  "parallel": 10,
  "batch_size": 30
}

Configuration Options Explained

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| assistant_id | string | (required) | Your OpenAI Assistant ID starting with "asst_" |
| documents | array | null | List of document file paths (preferred method) |
| document | string | null | Single document or directory path (alternative to documents) |
| api_key | string | env var | OpenAI API key (can also use OPENAI_API_KEY environment variable) |
| output_dir | string | "results" | Base directory for saving results |
| num_questions | integer | 20 | Number of test prompts to generate |
| iterations | integer | 3 | How many times to test each prompt (for consistency checking) |
| questions_file | string | null | Path to pre-generated questions JSON file |
| generate_only | boolean | false | Only generate questions, don't run tests |
| verbose | boolean | false | Enable detailed logging for debugging |
| model | string | "gpt-4o" | GPT model for question generation and evaluation |
| prompt_type | string | "task-based" | Type of prompts to generate (see below) |
| parallel | integer | 5 | Number of parallel workers for running tests |
| batch_size | integer | same as parallel | Questions per batch (set to 2-3x parallel for best performance) |
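
To make the defaults concrete, here is a small sketch of how a partial config could be filled in from the values in the table above. How cli.py merges settings is not shown in this guide, so with_defaults and DEFAULTS are hypothetical names used for illustration only.

```python
# Defaults copied from the options table; the merging logic is an assumption.
DEFAULTS = {
    "output_dir": "results",
    "num_questions": 20,
    "iterations": 3,
    "generate_only": False,
    "verbose": False,
    "model": "gpt-4o",
    "prompt_type": "task-based",
    "parallel": 5,
}


def with_defaults(config):
    """Return a copy of config with documented defaults filled in."""
    merged = {**DEFAULTS, **config}
    # The table lists batch_size as "same as parallel" when it is not set.
    merged.setdefault("batch_size", merged["parallel"])
    return merged
```

With this sketch, a config that sets only assistant_id and "parallel": 10 ends up with batch_size 10, while the table's own recommendation is to set batch_size explicitly to 2-3x parallel.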

Prompt Types

NEW FEATURE: Choose how test prompts are generated to match your testing needs.

1. task-based (Default)

Generates realistic user task requests that emulate how real users interact with your assistant.

Best for: Testing assistants that create content (banners, copy, ads, documents)

Example prompts:

  • "Create a banner for our new credit card offer with 0% APR"
  • "Write copy for a savings account promotion targeting young professionals"
  • "Generate headlines for our mobile banking app launch"
  • "Design promotional text for a balance transfer campaign"

When to use:

  • Testing content creation assistants
  • Simulating real user interactions
  • Evaluating practical usability

Configuration:

{
  "prompt_type": "task-based"
}

Command line:

python cli.py --config myconfig.json --prompt-type task-based

2. content-based (Original)

Generates knowledge questions about the content in your RAG documents.

Best for: Testing document understanding and knowledge retrieval

Example prompts:

  • "What is the FCA Consumer Duty requirement?"
  • "Explain the principles of clear customer communication"
  • "What are the considerations for vulnerable customers?"
  • "List the regulatory guidelines for financial advertising"

When to use:

  • Verifying RAG knowledge accuracy
  • Testing document comprehension
  • Auditing information retrieval

Configuration:

{
  "prompt_type": "content-based"
}

Command line:

python cli.py --config myconfig.json --prompt-type content-based

3. scenario-based

Generates realistic business scenarios that combine tasks with context and requirements.

Best for: Testing complex real-world use cases with constraints

Example prompts:

  • "We're launching a new credit card for students. Create FCA-compliant banner copy that's clear and accessible"
  • "Our vulnerable customer initiative needs promotional materials. Write banner text that follows Consumer Duty guidelines"
  • "Create an internal banner for our mobile banking upgrade targeting existing customers aged 50+"
  • "We have a new savings product for first-time buyers. Generate compliant promotional copy"

When to use:

  • Testing compliance and constraints
  • Simulating real business workflows
  • Evaluating context handling

Configuration:

{
  "prompt_type": "scenario-based"
}

Command line:

python cli.py --config myconfig.json --prompt-type scenario-based

Comparing Prompt Types

| Prompt Type | Use Case | Complexity | Best For |
| --- | --- | --- | --- |
| task-based | Simple user requests | Low | Daily user interactions |
| content-based | Knowledge questions | Medium | RAG accuracy testing |
| scenario-based | Business scenarios | High | Real-world workflows |

Batch Processing

NEW FEATURE: Run multiple test configurations automatically in sequence.

Instead of running tests one at a time, point to a directory containing multiple config files and run them all at once!

Setting Up Batch Tests

  1. Create a directory with multiple configs:
mkdir my_test_suite
  2. Add multiple configuration files:
my_test_suite/
├── test_credit_cards.json
├── test_savings.json
├── test_loans.json
└── test_mobile_banking.json
  3. Run all tests:
python cli.py --config-dir my_test_suite
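
A suite directory can also be generated by a script. The sketch below writes one config per prompt type from a shared base config, so a single --config-dir run covers all three. The helper write_suite and the file naming are illustrative choices, not something the tool provides.

```python
import json
from pathlib import Path

# The three values accepted by --prompt-type according to this guide.
PROMPT_TYPES = ("task-based", "content-based", "scenario-based")


def write_suite(base_config, suite_dir):
    """Write one config file per prompt type and return the created paths."""
    suite = Path(suite_dir)
    suite.mkdir(parents=True, exist_ok=True)
    written = []
    for prompt_type in PROMPT_TYPES:
        config = {**base_config, "prompt_type": prompt_type}
        # File names are an arbitrary choice, e.g. test_task_based.json
        target = suite / f"test_{prompt_type.replace('-', '_')}.json"
        target.write_text(json.dumps(config, indent=2), encoding="utf-8")
        written.append(target)
    return written
```

Since results are named after the config file, each prompt type should then get its own result directory.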

What Happens

============================================================
BATCH PROCESSING MODE
Found 4 configuration file(s)
============================================================
  • test_credit_cards.json
  • test_savings.json
  • test_loans.json
  • test_mobile_banking.json

>>> Processing 1/4
============================================================
Processing config: test_credit_cards.json
============================================================
[Running tests...]

>>> Processing 2/4
============================================================
Processing config: test_savings.json
============================================================
[Running tests...]

... and so on ...

============================================================
BATCH PROCESSING COMPLETE
============================================================
✓ Successful: 4
Total time: 45.2 minutes

Batch Processing Benefits

  • Run comprehensive test suites overnight
  • Compare results across different assistants
  • Test multiple prompt types automatically
  • Automate CI/CD testing pipelines
  • Track progress and report errors

Command Line Options with Batch

You can override settings for all configs:

# Run all configs but use content-based prompts
python cli.py --config-dir my_test_suite --prompt-type content-based

# Run with higher parallelization
python cli.py --config-dir my_test_suite --parallel 15 --batch-size 45

# Generate questions only (no testing)
python cli.py --config-dir my_test_suite --generate-only

Output Directory Structure

NEW FEATURE: Each test run creates a unique timestamped directory - no more overwriting!

Directory Naming

Results are saved as:

{output_dir}/{config_name}_{timestamp}/

Example:

results/
├── test_credit_cards_20251112_143022/
│   ├── report.html
│   ├── test_results.json
│   ├── evaluation.json
│   ├── test_questions.json
│   └── *.png (charts)
├── test_credit_cards_20251112_153045/
│   ├── report.html
│   └── ...
└── test_savings_20251112_160112/
    ├── report.html
    └── ...
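
The naming pattern can be reproduced in a few lines, which is handy when a script needs to predict or parse result paths. This sketch assumes config_name is the config file name without its .json extension, as the examples above suggest. run_directory is a hypothetical helper, not the tool's own function.

```python
from datetime import datetime
from pathlib import Path


def run_directory(output_dir, config_path, now=None):
    """Build {output_dir}/{config_name}_{timestamp} for a config file."""
    now = now or datetime.now()
    # Assumption: config_name is the file stem, e.g. "test_credit_cards".
    config_name = Path(config_path).stem
    return Path(output_dir) / f"{config_name}_{now.strftime('%Y%m%d_%H%M%S')}"
```

For example, a run of my_test_suite/test_credit_cards.json started on 12 November 2025 at 14:30:22 maps to results/test_credit_cards_20251112_143022.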

Benefits

  • Never lose results - each run is preserved
  • Easy comparison - compare results across test runs
  • Audit trail - complete history of all tests
  • Organized - group results by test name and time
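
Because every run is kept, comparing runs starts with finding them. The following sketch lists the run directories for one config in chronological order by parsing the timestamp suffix. list_runs is a hypothetical helper that relies only on the directory naming described above.

```python
from datetime import datetime
from pathlib import Path


def list_runs(output_dir, config_name):
    """Return run directories for config_name, oldest first."""
    runs = []
    for path in Path(output_dir).glob(f"{config_name}_*"):
        stamp = path.name[len(config_name) + 1:]
        try:
            when = datetime.strptime(stamp, "%Y%m%d_%H%M%S")
        except ValueError:
            # Skip names that merely share the prefix, such as another
            # config called "<config_name>_extra".
            continue
        if path.is_dir():
            runs.append((when, path))
    return [path for _, path in sorted(runs, key=lambda run: run[0])]
```

The last element of the returned list is the most recent run, so list_runs("results", "test_credit_cards")[-1] / "report.html" points at the latest report when at least one run exists.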

Customizing Output Location

In config file:

{
  "output_dir": "my_results"
}

Command line:

python cli.py --config myconfig.json --output-dir my_results

Results will be saved to:

my_results/{config_name}_{timestamp}/

Command Line Reference

Full Command Syntax

usage: cli.py [-h] [--config CONFIG] [--config-dir CONFIG_DIR]
              [--create-config OUTPUT_PATH] [--api-key API_KEY]
              [--assistant-id ASSISTANT_ID] [--document DOCUMENT]
              [--documents DOCUMENTS [DOCUMENTS ...]] [--output-dir OUTPUT_DIR]
              [--num-questions NUM_QUESTIONS] [--iterations ITERATIONS]
              [--questions-file QUESTIONS_FILE] [--generate-only] [--verbose]
              [--model MODEL] [--prompt-type {task-based,content-based,scenario-based}]
              [--parallel PARALLEL] [--batch-size BATCH_SIZE]

Common Commands

# Get help
python cli.py --help

# Create a config template
python cli.py --create-config my_config.json

# Run single test with config file
python cli.py --config my_config.json

# Run batch tests
python cli.py --config-dir my_test_suite/

# Run without config (all command line)
python cli.py --assistant-id asst_abc123 --document myfile.txt

# Generate questions only
python cli.py --config my_config.json --generate-only

# Use pre-generated questions
python cli.py --config my_config.json --questions-file results/my_config_YYYYMMDD_HHMMSS/test_questions.json

# Change prompt type
python cli.py --config my_config.json --prompt-type scenario-based

# High performance mode
python cli.py --config my_config.json --parallel 15 --batch-size 45

Advanced Usage Examples

Example 1: Complete Testing Workflow

# Step 1: Create config
python cli.py --create-config banner_test.json

# Step 2: Edit banner_test.json with your settings

# Step 3: Generate questions first to review
python cli.py --config banner_test.json --generate-only --num-questions 50

# Step 4: Review generated questions in results/*/test_questions.json

# Step 5: Run the full test (the glob must match exactly one run directory)
python cli.py --config banner_test.json --questions-file results/banner_test_*/test_questions.json

# Step 6: Open report.html to view results

Example 2: Testing Multiple Prompt Types

# Create base config
cat > base_config.json << EOF
{
  "assistant_id": "asst_abc123",
  "documents": ["docs/guidelines.docx"],
  "num_questions": 30,
  "iterations": 5
}
EOF

# Test with task-based prompts
python cli.py --config base_config.json --prompt-type task-based

# Test with content-based prompts
python cli.py --config base_config.json --prompt-type content-based

# Test with scenario-based prompts
python cli.py --config base_config.json --prompt-type scenario-based

# Compare the three result directories!

Example 3: High-Volume Testing

# For testing with many questions and high parallelization
python cli.py --config my_config.json \
  --num-questions 100 \
  --iterations 10 \
  --parallel 20 \
  --batch-size 60 \
  --verbose

Example 4: Continuous Integration

#!/bin/bash
# run_tests.sh - Automated testing script

# Set environment
export OPENAI_API_KEY="your-key"

# Run test suite
python cli.py --config-dir test_configs/

# Check exit code
if [ $? -eq 0 ]; then
  echo "All tests passed!"
else
  echo "Some tests failed!"
  exit 1
fi

Example 5: A/B Testing Different Assistants

config_assistant_v1.json:

{
  "assistant_id": "asst_v1_abc123",
  "documents": ["docs/guidelines.docx"],
  "questions_file": "shared_questions.json",
  "num_questions": 50
}

config_assistant_v2.json:

{
  "assistant_id": "asst_v2_def456",
  "documents": ["docs/guidelines.docx"],
  "questions_file": "shared_questions.json",
  "num_questions": 50
}
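
# Generate questions once
python cli.py --config config_assistant_v1.json --generate-only

# Copy them to the shared path that both configs reference
# (use the exact run directory if the glob matches more than one)
cp results/config_assistant_v1_*/test_questions.json shared_questions.json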

# Test both assistants with same questions
python cli.py --config config_assistant_v1.json
python cli.py --config config_assistant_v2.json

# Compare the results!
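
One way to do the comparison is to diff the summary block of each run's evaluation.json (its fields are shown under JSON Output Files). This is a minimal sketch rather than a feature of the tool. compare_runs and load_summary are hypothetical helpers, and the run directory paths are whatever your two runs produced.

```python
import json
from pathlib import Path

# Summary fields as they appear in the evaluation.json example.
METRICS = (
    "average_quality",
    "average_consistency",
    "average_accuracy",
    "average_completeness",
    "average_response_time",
)


def load_summary(run_dir):
    """Read the summary block from a run directory's evaluation.json."""
    text = (Path(run_dir) / "evaluation.json").read_text(encoding="utf-8")
    return json.loads(text)["summary"]


def compare_runs(run_a, run_b):
    """Return run_b minus run_a for each metric present in both summaries."""
    a, b = load_summary(run_a), load_summary(run_b)
    return {m: round(b[m] - a[m], 2) for m in METRICS if m in a and m in b}
```

Positive values mean the second run scored higher. For average_response_time a negative value is the improvement.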

Example 6: Multi-Document Testing

{
  "assistant_id": "asst_abc123",
  "documents": [
    "/path/to/consumer_duty.docx",
    "/path/to/fca_guidelines.docx",
    "/path/to/brand_guidelines.txt",
    "/path/to/product_specs.docx"
  ],
  "num_questions": 40,
  "prompt_type": "scenario-based"
}

Understanding the Results

HTML Report

After tests complete, open report.html to see:

1. Summary Metrics

  • Overall quality score
  • Average consistency score
  • Average accuracy score
  • Average completeness score
  • Average response time
  • Total tests run

2. Performance Charts

  • Scores by Question: Bar chart showing all metric scores for each question
  • Response Times: How fast the assistant responds
  • Score Distribution: Histogram of score ranges
  • Radar Chart: Visual comparison of quality, consistency, accuracy, and completeness

3. Question-by-Question Analysis

For each test prompt:

  • Question text
  • Individual scores (quality, consistency, accuracy, completeness)
  • Evaluation notes and feedback
  • All response iterations (collapsible)

Evaluation Metrics

Each test prompt is scored from 1 to 10 on four dimensions, judged across all of its response iterations:

Quality Score (1-10)

  • Clarity and coherence
  • Professional tone
  • No hallucinations
  • Grammar and readability

Consistency Score (1-10)

  • Similar answers across iterations
  • Consistent facts and details
  • No contradictions
  • Stable level of detail

Accuracy Score (1-10)

  • Information matches documents
  • Correct facts and numbers
  • No misrepresentations
  • Proper context interpretation

Completeness Score (1-10)

  • Addresses all aspects of the question
  • Includes necessary context
  • Sufficient detail
  • No significant omissions

JSON Output Files

test_questions.json

{
  "questions": [
    "Create a banner for...",
    "Write copy for...",
    ...
  ]
}

test_results.json

{
  "results": [
    {
      "question_id": 0,
      "question": "Create a banner...",
      "iteration": 0,
      "response": "Here's your banner: ...",
      "response_time": 2.34,
      "status": "completed"
    },
    ...
  ]
}
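
Since each record carries question_id, response_time, and status, per-question statistics are easy to derive. The sketch below averages response times per question and counts only records whose status is "completed". Other status values are not documented here, so they are simply ignored. response_times_by_question is a hypothetical helper name.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def response_times_by_question(results_path):
    """Map question_id to the mean response time of its completed iterations."""
    data = json.loads(Path(results_path).read_text(encoding="utf-8"))
    times = defaultdict(list)
    for record in data["results"]:
        # Only "completed" appears in the example; skip anything else.
        if record.get("status") == "completed":
            times[record["question_id"]].append(record["response_time"])
    return {qid: round(mean(values), 2) for qid, values in times.items()}
```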

evaluation.json

{
  "summary": {
    "total_questions": 20,
    "total_iterations": 60,
    "average_quality": 8.5,
    "average_consistency": 9.2,
    "average_accuracy": 8.8,
    "average_completeness": 8.6,
    "average_response_time": 2.1
  },
  "by_question": [...]
}
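
The summary block also lends itself to an automated quality gate, for instance in a CI job. The sketch below reports which average scores fall under a minimum. The threshold of 7.0 and the helper name failing_metrics are arbitrary illustrations and do not come from the tool.

```python
import json
from pathlib import Path

# The four 1-10 score averages from the summary block.
SCORE_KEYS = (
    "average_quality",
    "average_consistency",
    "average_accuracy",
    "average_completeness",
)


def failing_metrics(evaluation_path, minimum=7.0):
    """Return the score averages in evaluation.json that are below minimum."""
    text = Path(evaluation_path).read_text(encoding="utf-8")
    summary = json.loads(text)["summary"]
    return {key: summary[key] for key in SCORE_KEYS
            if key in summary and summary[key] < minimum}
```

A wrapper script could call sys.exit(1) when the returned dictionary is non-empty, which makes the check usable as a pipeline step.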

Troubleshooting

Common Issues

1. Assistant Not Found Error

ERROR: No assistant found with id 'asst_...'

Solution: Check your assistant ID on https://platform.openai.com/assistants

2. API Rate Limits

Error: Rate limit exceeded

Solution: Reduce parallel workers:

python cli.py --config my_config.json --parallel 3

3. Document Loading Errors

Warning: No content loaded from documents

Solutions:

  • Check file paths are correct
  • For .docx files: pip install docx2txt
  • Verify files are readable (not corrupted)
  • Supported formats: .txt, .docx only

4. Memory Issues

MemoryError: ...

Solution: Reduce batch size:

python cli.py --config my_config.json --batch-size 10

5. Missing API Key

Error: No OpenAI API key provided

Solutions:

# Option 1: Environment variable
export OPENAI_API_KEY="your-key"

# Option 2: In config file
{
  "api_key": "your-key"
}

# Option 3: Command line
python cli.py --api-key "your-key" ...
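
The three options can be pictured as a lookup chain. The order used below (command line first, then config file, then environment) is an assumption for illustration. This guide does not state which source wins when several are set. resolve_api_key is a hypothetical helper.

```python
import os


def resolve_api_key(cli_key=None, config=None, env=None):
    """Return the first API key found, checking CLI, config, then environment.

    The precedence order is an assumption, not documented tool behaviour.
    """
    env = os.environ if env is None else env
    config = config or {}
    candidates = (cli_key, config.get("api_key"), env.get("OPENAI_API_KEY"))
    for candidate in candidates:
        if candidate:
            return candidate
    raise ValueError("No OpenAI API key provided")
```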

Debug Mode

Enable verbose output for detailed logging:

python cli.py --config my_config.json --verbose

Or in config:

{
  "verbose": true
}

Performance Tips

  1. Optimize Parallelization

    • Start with parallel: 5
    • Increase gradually if no rate limits
    • Set batch_size to 2-3x parallel
  2. Balance Speed vs. Cost

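    • More parallel workers = faster runs, but more requests in flight and a higher chance of hitting rate limits
    • More iterations = better consistency data but more API calls (questions × iterations) and therefore higher cost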
  3. Question Generation

    • Generate questions once, reuse with questions_file
    • Save API calls on repeated tests
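
To illustrate how parallel and batch_size might interact, here is a sketch that splits questions into batches and runs each batch on a fixed-size thread pool. The tool's internal scheduling is not described in this guide, so this is one plausible arrangement rather than its actual implementation. run_in_batches and the ask callable are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_batches(questions, ask, parallel=5, batch_size=None):
    """Call ask(question) for every question, batch by batch.

    At most `parallel` calls run at once; each batch finishes before
    the next one starts.
    """
    batch_size = batch_size or parallel
    answers = []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        for start in range(0, len(questions), batch_size):
            batch = questions[start:start + batch_size]
            # pool.map returns results in the same order as its input.
            answers.extend(pool.map(ask, batch))
    return answers
```

In this arrangement a batch larger than the worker count keeps all workers busy until the batch drains, which is one reason a batch_size of 2-3x parallel could help.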

Supported File Types

  • Text files (.txt): Plain text with UTF-8 encoding
  • Word documents (.docx): Microsoft Word files (requires docx2txt)
  • PDF files: Not currently supported
  • Excel/PowerPoint: Not currently supported
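
A loader covering the two supported formats could look like the sketch below. It is not the tool's own loader. load_document is a hypothetical name, and the only third-party call used is docx2txt.process, which extracts the text of a .docx file.

```python
from pathlib import Path


def load_document(path):
    """Return the text of a .txt or .docx file; reject other formats."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    if suffix == ".docx":
        # Imported lazily so plain-text use works without the package.
        import docx2txt  # third-party: pip install docx2txt
        return docx2txt.process(str(path))
    raise ValueError(f"Unsupported file type: {path.suffix or path.name}")
```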

Best Practices

1. Start Small

# Test with few questions first
python cli.py --config my_config.json --num-questions 5 --iterations 2

2. Use Configuration Files

  • Easier to track and version
  • Reusable across tests
  • Less prone to typos

3. Organize Your Tests

my_project/
├── configs/
│   ├── test_suite_1/
│   │   ├── credit_cards.json
│   │   └── loans.json
│   └── test_suite_2/
│       └── mobile_banking.json
├── results/
└── docs/

4. Version Control Your Configs

git add configs/
git commit -m "Add test configurations"

5. Archive Important Results

# Save important test results
cp -r results/important_test_20251112_143022 archived_results/

License

MIT


Need Help?

  • 📖 Documentation: You're reading it!
  • 🐛 Issues: Report bugs on GitHub
  • 💡 Feature Requests: Open an issue with your idea
  • 📧 Contact: [your-email@example.com]

Happy Testing! 🚀