RAG Testing Application
A comprehensive Python application for automatically testing and evaluating OpenAI assistants with Retrieval-Augmented Generation (RAG) capabilities.
Table of Contents
- Overview
- Features
- Installation
- Quick Start Guide
- Complete User Guide
- Command Line Reference
- Advanced Usage Examples
- Understanding the Results
- Troubleshooting
Overview
This tool helps you evaluate and benchmark OpenAI assistants by:
- Generating test prompts from your source documents (with multiple prompt styles)
- Running prompts against your assistant multiple times to test consistency
- Evaluating responses for quality, consistency, accuracy, and completeness
- Generating detailed reports with visualizations and metrics
Perfect for: Testing assistants that create content (banners, copy, documents) using RAG knowledge bases.
Features
- ✅ Multi-document support: Test with individual documents, directories, or specified sets of documents
- ✅ Multiple prompt types: Generate realistic user tasks, knowledge questions, or business scenarios
- ✅ Batch processing: Run multiple test configurations in sequence automatically
- ✅ Timestamped results: Each test run creates a unique timestamped directory - no more overwriting!
- ✅ Support for DOCX files: Works with both text and Microsoft Word files
- ✅ Optimized performance: Parallel processing and batch execution for significantly faster testing
- ✅ Comprehensive evaluation: Assesses responses for quality, accuracy, consistency, and completeness
- ✅ Interactive reporting: Generates professional HTML reports with detailed visualizations
- ✅ Performance tracking: Measures and analyzes response times and other key metrics
- ✅ Data export: Saves all results as JSON for further analysis
- ✅ Config-based workflow: Easy to set up and customize via configuration files
Installation
- Clone this repository:
git clone https://github.com/yourusername/rag-test-app.git
cd rag-test-app
- Install the required packages:
pip install -r requirements.txt
- Set up your OpenAI API key:
export OPENAI_API_KEY="your-api-key-here"
Quick Start Guide
1. Create a Configuration File
The easiest way to get started:
python cli.py --create-config my_test_config.json
This creates a template configuration file.
2. Edit Your Configuration
Open my_test_config.json and update:
- assistant_id: Your OpenAI Assistant ID (e.g., "asst_abc123...")
- documents: Paths to your RAG documents
- api_key: Your OpenAI API key (or use environment variable)
3. Run Your First Test
python cli.py --config my_test_config.json
4. View Results
Open the generated results/your_test_YYYYMMDD_HHMMSS/report.html in your browser!
Complete User Guide
Configuration File Reference
Here's a complete configuration file with ALL available options:
{
"assistant_id": "asst_YourAssistantIdHere",
"documents": [
"/path/to/your/document1.txt",
"/path/to/your/document2.docx"
],
"api_key": "YOUR_OPENAI_API_KEY",
"output_dir": "results",
"num_questions": 20,
"iterations": 3,
"questions_file": "",
"generate_only": false,
"verbose": true,
"model": "gpt-4o",
"prompt_type": "task-based",
"parallel": 10,
"batch_size": 30
}
Configuration Options Explained
| Option | Type | Default | Description |
|---|---|---|---|
| assistant_id | string | (required) | Your OpenAI Assistant ID starting with "asst_" |
| documents | array | null | List of document file paths (preferred method) |
| document | string | null | Single document or directory path (alternative to documents) |
| api_key | string | env var | OpenAI API key (can also use OPENAI_API_KEY environment variable) |
| output_dir | string | "results" | Base directory for saving results |
| num_questions | integer | 20 | Number of test prompts to generate |
| iterations | integer | 3 | How many times to test each prompt (for consistency checking) |
| questions_file | string | null | Path to pre-generated questions JSON file |
| generate_only | boolean | false | Only generate questions, don't run tests |
| verbose | boolean | false | Enable detailed logging for debugging |
| model | string | "gpt-4o" | GPT model for question generation and evaluation |
| prompt_type | string | "task-based" | Type of prompts to generate (see below) |
| parallel | integer | 5 | Number of parallel workers for running tests |
| batch_size | integer | same as parallel | Questions per batch (set to 2-3x parallel for best performance) |
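To make the defaults in the table concrete, here is a minimal sketch of how a configuration could be resolved before a run. It is not the tool's actual loader: `resolve_config`, `DEFAULTS`, and `PROMPT_TYPES` are hypothetical names, and the checks only mirror what the table states.

```python
import os

# Defaults copied from the options table; the names below are hypothetical.
DEFAULTS = {
    "output_dir": "results",
    "num_questions": 20,
    "iterations": 3,
    "generate_only": False,
    "verbose": False,
    "model": "gpt-4o",
    "prompt_type": "task-based",
    "parallel": 5,
}
PROMPT_TYPES = {"task-based", "content-based", "scenario-based"}


def resolve_config(raw: dict) -> dict:
    """Merge user settings over the documented defaults and run basic checks."""
    config = {**DEFAULTS, **raw}
    if not str(config.get("assistant_id", "")).startswith("asst_"):
        raise ValueError('assistant_id is required and should start with "asst_"')
    if config["prompt_type"] not in PROMPT_TYPES:
        raise ValueError(f"unknown prompt_type: {config['prompt_type']}")
    # batch_size falls back to the value of parallel when it is not set.
    config.setdefault("batch_size", config["parallel"])
    # Fall back to the environment variable when the file carries no key.
    config.setdefault("api_key", os.environ.get("OPENAI_API_KEY"))
    return config


cfg = resolve_config({"assistant_id": "asst_abc123", "documents": ["docs/guidelines.docx"]})
# Each prompt is tested `iterations` times: 20 prompts x 3 iterations.
total_runs = cfg["num_questions"] * cfg["iterations"]
```

With the defaults, a run makes 60 assistant calls in total, which is worth keeping in mind before raising num_questions or iterations.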
Prompt Types
NEW FEATURE: Choose how test prompts are generated to match your testing needs.
1. task-based (Default - Recommended)
Generates realistic user task requests that emulate how real users interact with your assistant.
Best for: Testing assistants that create content (banners, copy, ads, documents)
Example prompts:
- "Create a banner for our new credit card offer with 0% APR"
- "Write copy for a savings account promotion targeting young professionals"
- "Generate headlines for our mobile banking app launch"
- "Design promotional text for a balance transfer campaign"
When to use:
- Testing content creation assistants
- Simulating real user interactions
- Evaluating practical usability
Configuration:
{
"prompt_type": "task-based"
}
Command line:
python cli.py --config myconfig.json --prompt-type task-based
2. content-based (Original)
Generates knowledge questions about the content in your RAG documents.
Best for: Testing document understanding and knowledge retrieval
Example prompts:
- "What is the FCA Consumer Duty requirement?"
- "Explain the principles of clear customer communication"
- "What are the considerations for vulnerable customers?"
- "List the regulatory guidelines for financial advertising"
When to use:
- Verifying RAG knowledge accuracy
- Testing document comprehension
- Auditing information retrieval
Configuration:
{
"prompt_type": "content-based"
}
Command line:
python cli.py --config myconfig.json --prompt-type content-based
3. scenario-based
Generates realistic business scenarios that combine tasks with context and requirements.
Best for: Testing complex real-world use cases with constraints
Example prompts:
- "We're launching a new credit card for students. Create FCA-compliant banner copy that's clear and accessible"
- "Our vulnerable customer initiative needs promotional materials. Write banner text that follows Consumer Duty guidelines"
- "Create an internal banner for our mobile banking upgrade targeting existing customers aged 50+"
- "We have a new savings product for first-time buyers. Generate compliant promotional copy"
When to use:
- Testing compliance and constraints
- Simulating real business workflows
- Evaluating context handling
Configuration:
{
"prompt_type": "scenario-based"
}
Command line:
python cli.py --config myconfig.json --prompt-type scenario-based
Comparing Prompt Types
| Prompt Type | Use Case | Complexity | Best For |
|---|---|---|---|
| task-based | Simple user requests | Low | Daily user interactions |
| content-based | Knowledge questions | Medium | RAG accuracy testing |
| scenario-based | Business scenarios | High | Real-world workflows |
Batch Processing
NEW FEATURE: Run multiple test configurations automatically in sequence.
Instead of running tests one at a time, point to a directory containing multiple config files and run them all at once!
Setting Up Batch Tests
- Create a directory with multiple configs:
mkdir my_test_suite
- Add multiple configuration files:
my_test_suite/
├── test_credit_cards.json
├── test_savings.json
├── test_loans.json
└── test_mobile_banking.json
- Run all tests:
python cli.py --config-dir my_test_suite
What Happens
============================================================
BATCH PROCESSING MODE
Found 4 configuration file(s)
============================================================
• test_credit_cards.json
• test_savings.json
• test_loans.json
• test_mobile_banking.json
>>> Processing 1/4
============================================================
Processing config: test_credit_cards.json
============================================================
[Running tests...]
>>> Processing 2/4
============================================================
Processing config: test_savings.json
============================================================
[Running tests...]
... and so on ...
============================================================
BATCH PROCESSING COMPLETE
============================================================
✓ Successful: 4
Total time: 45.2 minutes
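The loop behind the log above can be approximated in a few lines. This is a sketch under assumptions, not the tool's own code: `find_configs` and `run_batch` are hypothetical helpers, the alphabetical ordering is a choice made here (the order the tool uses is not documented), and each config is simply passed to `cli.py --config` in a separate process.

```python
import subprocess
import sys
from pathlib import Path


def find_configs(config_dir: str) -> list:
    """Return the .json files directly inside config_dir, sorted by name."""
    return sorted(Path(config_dir).glob("*.json"))


def run_batch(config_dir: str) -> dict:
    """Run cli.py once per config file and collect the exit codes."""
    exit_codes = {}
    configs = find_configs(config_dir)
    for index, config in enumerate(configs, start=1):
        print(f">>> Processing {index}/{len(configs)}: {config.name}")
        completed = subprocess.run([sys.executable, "cli.py", "--config", str(config)])
        exit_codes[config.name] = completed.returncode
    return exit_codes
```

Calling `run_batch("my_test_suite")` from the repository root would then return a mapping from config file name to exit code, which a wrapper script can use to decide whether the whole suite succeeded.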
Batch Processing Benefits
- ✅ Run comprehensive test suites overnight
- ✅ Compare results across different assistants
- ✅ Test multiple prompt types automatically
- ✅ Automated CI/CD testing pipelines
- ✅ Progress tracking and error reporting
Command Line Options with Batch
You can override settings for all configs:
# Run all configs but use content-based prompts
python cli.py --config-dir my_test_suite --prompt-type content-based
# Run with higher parallelization
python cli.py --config-dir my_test_suite --parallel 15 --batch-size 45
# Generate questions only (no testing)
python cli.py --config-dir my_test_suite --generate-only
Output Directory Structure
NEW FEATURE: Each test run creates a unique timestamped directory - no more overwriting!
Directory Naming
Results are saved as:
{output_dir}/{config_name}_{timestamp}/
Example:
results/
├── test_credit_cards_20251112_143022/
│ ├── report.html
│ ├── test_results.json
│ ├── evaluation.json
│ ├── test_questions.json
│ └── *.png (charts)
├── test_credit_cards_20251112_153045/
│ ├── report.html
│ └── ...
└── test_savings_20251112_160112/
├── report.html
└── ...
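The naming pattern {output_dir}/{config_name}_{timestamp} can be reproduced as follows, which is handy when a script needs to predict or parse result paths. `run_directory` is a hypothetical helper, and the timestamp format is inferred from the example directory names above (YYYYMMDD_HHMMSS).

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def run_directory(output_dir: str, config_path: str, now: Optional[datetime] = None) -> Path:
    """Build {output_dir}/{config_name}_{timestamp} for one test run."""
    now = now or datetime.now()
    # "my_test_suite/test_credit_cards.json" -> "test_credit_cards"
    config_name = Path(config_path).stem
    # Format inferred from the examples, e.g. 20251112_143022
    timestamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"{config_name}_{timestamp}"
```

Because the timestamp has one-second resolution, two runs of the same config get different directories as long as they start at least a second apart.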
Benefits
- ✅ Never lose results - each run is preserved
- ✅ Easy comparison - compare results across test runs
- ✅ Audit trail - complete history of all tests
- ✅ Organized - group results by test name and time
Customizing Output Location
In config file:
{
"output_dir": "my_results"
}
Command line:
python cli.py --config myconfig.json --output-dir my_results
Results will be saved to:
my_results/{config_name}_{timestamp}/
Command Line Reference
Full Command Syntax
usage: cli.py [-h] [--config CONFIG] [--config-dir CONFIG_DIR]
[--create-config OUTPUT_PATH] [--api-key API_KEY]
[--assistant-id ASSISTANT_ID] [--document DOCUMENT]
[--documents DOCUMENTS [DOCUMENTS ...]] [--output-dir OUTPUT_DIR]
[--num-questions NUM_QUESTIONS] [--iterations ITERATIONS]
[--questions-file QUESTIONS_FILE] [--generate-only] [--verbose]
[--model MODEL] [--prompt-type {task-based,content-based,scenario-based}]
[--parallel PARALLEL] [--batch-size BATCH_SIZE]
Common Commands
# Get help
python cli.py --help
# Create a config template
python cli.py --create-config my_config.json
# Run single test with config file
python cli.py --config my_config.json
# Run batch tests
python cli.py --config-dir my_test_suite/
# Run without config (all command line)
python cli.py --assistant-id asst_abc123 --document myfile.txt
# Generate questions only
python cli.py --config my_config.json --generate-only
# Use pre-generated questions
python cli.py --config my_config.json --questions-file results/test_questions.json
# Change prompt type
python cli.py --config my_config.json --prompt-type scenario-based
# High performance mode
python cli.py --config my_config.json --parallel 15 --batch-size 45
Advanced Usage Examples
Example 1: Complete Testing Workflow
# Step 1: Create config
python cli.py --create-config banner_test.json
# Step 2: Edit banner_test.json with your settings
# Step 3: Generate questions first to review
python cli.py --config banner_test.json --generate-only --num-questions 50
# Step 4: Review generated questions in results/*/test_questions.json
# Step 5: Run the full test
python cli.py --config banner_test.json --questions-file results/banner_test_*/test_questions.json
# Step 6: Open report.html to view results
Example 2: Testing Multiple Prompt Types
# Create base config
cat > base_config.json << EOF
{
"assistant_id": "asst_abc123",
"documents": ["docs/guidelines.docx"],
"num_questions": 30,
"iterations": 5
}
EOF
# Test with task-based prompts
python cli.py --config base_config.json --prompt-type task-based
# Test with content-based prompts
python cli.py --config base_config.json --prompt-type content-based
# Test with scenario-based prompts
python cli.py --config base_config.json --prompt-type scenario-based
# Compare the three result directories!
Example 3: High-Volume Testing
# For testing with many questions and high parallelization
python cli.py --config my_config.json \
--num-questions 100 \
--iterations 10 \
--parallel 20 \
--batch-size 60 \
--verbose
Example 4: Continuous Integration
#!/bin/bash
# run_tests.sh - Automated testing script
# Set environment
export OPENAI_API_KEY="your-key"
# Run test suite
python cli.py --config-dir test_configs/
# Check exit code
if [ $? -eq 0 ]; then
echo "All tests passed!"
else
echo "Some tests failed!"
exit 1
fi
Example 5: A/B Testing Different Assistants
// config_assistant_v1.json
{
"assistant_id": "asst_v1_abc123",
"documents": ["docs/guidelines.docx"],
"questions_file": "shared_questions.json",
"num_questions": 50
}
// config_assistant_v2.json
{
"assistant_id": "asst_v2_def456",
"documents": ["docs/guidelines.docx"],
"questions_file": "shared_questions.json",
"num_questions": 50
}
# Generate questions once
python cli.py --config config_assistant_v1.json --generate-only
# Test both assistants with same questions
python cli.py --config config_assistant_v1.json
python cli.py --config config_assistant_v2.json
# Compare the results!
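One way to compare the two runs is to diff the "summary" objects of their evaluation.json files. The sketch below assumes the key names shown in the evaluation.json example later in this README; `load_summary` and `compare_summaries` are hypothetical helpers, not part of the tool.

```python
import json
from pathlib import Path

# Score keys as they appear in the evaluation.json example in this README.
METRICS = (
    "average_quality",
    "average_consistency",
    "average_accuracy",
    "average_completeness",
)


def load_summary(run_dir: str) -> dict:
    """Read the "summary" object from a run directory's evaluation.json."""
    with open(Path(run_dir) / "evaluation.json", encoding="utf-8") as fh:
        return json.load(fh)["summary"]


def compare_summaries(v1: dict, v2: dict) -> dict:
    """Return v2 minus v1 for each score; a positive value means v2 scored higher."""
    return {name: round(v2[name] - v1[name], 2) for name in METRICS}
```

Pass the two timestamped result directories to `load_summary` and feed both summaries to `compare_summaries` to get a per-metric difference.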
Example 6: Multi-Document Testing
{
"assistant_id": "asst_abc123",
"documents": [
"/path/to/consumer_duty.docx",
"/path/to/fca_guidelines.docx",
"/path/to/brand_guidelines.txt",
"/path/to/product_specs.docx"
],
"num_questions": 40,
"prompt_type": "scenario-based"
}
Understanding the Results
HTML Report
After tests complete, open report.html to see:
1. Summary Metrics
- Overall quality score
- Average consistency score
- Average accuracy score
- Average completeness score
- Average response time
- Total tests run
2. Performance Charts
- Scores by Question: Bar chart showing all metric scores for each question
- Response Times: How fast the assistant responds
- Score Distribution: Histogram of score ranges
- Radar Chart: Visual comparison of quality, consistency, accuracy, and completeness
3. Question-by-Question Analysis
For each test prompt:
- Question text
- Individual scores (quality, consistency, accuracy, completeness)
- Evaluation notes and feedback
- All response iterations (collapsible)
Evaluation Metrics
Each response is scored 1-10 on four dimensions:
Quality Score (1-10)
- Clarity and coherence
- Professional tone
- No hallucinations
- Grammar and readability
Consistency Score (1-10)
- Similar answers across iterations
- Consistent facts and details
- No contradictions
- Stable level of detail
Accuracy Score (1-10)
- Information matches documents
- Correct facts and numbers
- No misrepresentations
- Proper context interpretation
Completeness Score (1-10)
- Addresses all aspects of the question
- Includes necessary context
- Sufficient detail
- No significant omissions
JSON Output Files
test_questions.json
{
"questions": [
"Create a banner for...",
"Write copy for...",
...
]
}
test_results.json
{
"results": [
{
"question_id": 0,
"question": "Create a banner...",
"iteration": 0,
"response": "Here's your banner: ...",
"response_time": 2.34,
"status": "completed"
},
...
]
}
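Since every iteration is stored as its own row, per-question statistics have to be aggregated by question_id. A minimal sketch, assuming the row layout shown above and treating any status other than "completed" as a failed run (the other status values are not documented here); the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_response_time_by_question(results: list) -> dict:
    """Average response_time over the completed iterations of each question."""
    times = defaultdict(list)
    for row in results:
        # Assumption: only rows marked "completed" carry a meaningful timing.
        if row["status"] == "completed":
            times[row["question_id"]].append(row["response_time"])
    return {question_id: mean(values) for question_id, values in times.items()}
```

The input is the list stored under the "results" key of test_results.json; questions with no completed iteration are simply absent from the returned mapping.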
evaluation.json
{
"summary": {
"total_questions": 20,
"total_iterations": 60,
"average_quality": 8.5,
"average_consistency": 9.2,
"average_accuracy": 8.8,
"average_completeness": 8.6,
"average_response_time": 2.1
},
"by_question": [...]
}
Troubleshooting
Common Issues
1. Assistant Not Found Error
ERROR: No assistant found with id 'asst_...'
Solution: Check your assistant ID on https://platform.openai.com/assistants
2. API Rate Limits
Error: Rate limit exceeded
Solution: Reduce parallel workers:
python cli.py --config my_config.json --parallel 3
3. Document Loading Errors
Warning: No content loaded from documents
Solutions:
- Check file paths are correct
- For .docx files: pip install docx2txt
- Verify files are readable (not corrupted)
- Supported formats: .txt, .docx only
4. Memory Issues
MemoryError: ...
Solution: Reduce batch size:
python cli.py --config my_config.json --batch-size 10
5. Missing API Key
Error: No OpenAI API key provided
Solutions:
# Option 1: Environment variable
export OPENAI_API_KEY="your-key"
# Option 2: In config file
{
"api_key": "your-key"
}
# Option 3: Command line
python cli.py --api-key "your-key" ...
Debug Mode
Enable verbose output for detailed logging:
python cli.py --config my_config.json --verbose
Or in config:
{
"verbose": true
}
Performance Tips
- Optimize Parallelization
  - Start with parallel: 5
  - Increase gradually if no rate limits
  - Set batch_size to 2-3x parallel
- Balance Speed vs. Cost
  - More parallel workers = faster but higher API costs
  - More iterations = better consistency data but more tests
- Question Generation
  - Generate questions once, reuse with questions_file
  - Save API calls on repeated tests
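To illustrate how parallel and batch_size might interact, here is a sketch of batched execution on a thread pool. The tool's real scheduling is not described in this README, so treat this as one plausible reading: `run_in_batches` is a hypothetical function and `ask` stands in for whatever sends a prompt to the assistant.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_in_batches(questions: list, ask: Callable[[str], str],
                   parallel: int = 5, batch_size: int = 10) -> list:
    """Run ask() on every question, one batch at a time, using `parallel` threads."""
    answers = []
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        for start in range(0, len(questions), batch_size):
            batch = questions[start:start + batch_size]
            # map() preserves input order, so answers line up with questions.
            answers.extend(pool.map(ask, batch))
    return answers
```

In this reading, a batch_size of 2-3x parallel keeps every worker busy for a few rounds before the next batch starts, while never running more than `parallel` requests at once.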
Supported File Types
- ✅ Text files (.txt): Plain text with UTF-8 encoding
- ✅ Word documents (.docx): Microsoft Word files (requires docx2txt)
- ❌ PDF files: Not currently supported
- ❌ Excel/PowerPoint: Not currently supported
Best Practices
1. Start Small
# Test with few questions first
python cli.py --config my_config.json --num-questions 5 --iterations 2
2. Use Configuration Files
- Easier to track and version
- Reusable across tests
- Less prone to typos
3. Organize Your Tests
my_project/
├── configs/
│ ├── test_suite_1/
│ │ ├── credit_cards.json
│ │ └── loans.json
│ └── test_suite_2/
│ └── mobile_banking.json
├── results/
└── docs/
4. Version Control Your Configs
git add configs/
git commit -m "Add test configurations"
5. Archive Important Results
# Save important test results
cp -r results/important_test_20251112_143022 archived_results/
License
MIT
Need Help?
- 📖 Documentation: You're reading it!
- 🐛 Issues: Report bugs on GitHub
- 💡 Feature Requests: Open an issue with your idea
- 📧 Contact: [your-email@example.com]
Happy Testing! 🚀