Relia is an end-to-end (E2E) testing framework for LLMs, designed to help you build AI benchmarks tailored to your specific use cases. It identifies the most suitable LLM for your needs and, through continuous testing, ensures that model upgrades do not cause performance regressions. It is built specifically for function calling (or "tool use") scenarios, which are at the core of agent-based AI applications.
When selecting a model, identify the best LLM for your specific use case to ensure high performance and cost efficiency.
This test plan compares the success rates of three LLMs, one each from OpenAI, Fireworks, and Groq, to identify the most accurate and efficient model for a specific task.

| Model | Success Rate | Suite | Round | Result |
|---|---|---|---|---|
| openAI gpt-3.5-turbo | 100.00% | 1 | 1 | |
| | | 1 | 2 | |
| | | 1 | 3 | |
| | | 2 | 1 | |
| | | 2 | 2 | |
| | | 2 | 3 | |
| fireworks accounts/fireworks/models/firefunction-v1 | 100.00% | 1 | 1 | |
| | | 1 | 2 | |
| | | 1 | 3 | |
| | | 2 | 1 | |
| | | 2 | 2 | |
| | | 2 | 3 | |
| groq llama3-70b-8192 | 50.00% | 1 | 1 | |
| | | 1 | 2 | |
| | | 1 | 3 | |
| | | 2 | 1 | `[ { "name": "get_stock_price", "arguments": { "symbol": "TSLA" "symbol": "Tesla" } } ]` |
| | | 2 | 2 | `[ { "name": "get_stock_price", "arguments": { "symbol": "TSLA" "symbol": "Tesla" } } ]` |
| | | 2 | 3 | `[ { "name": "get_stock_price", "arguments": { "symbol": "TSLA" "symbol": "Tesla" } } ]` |

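
The success rates in the table above can be reproduced from the per-round outcomes. The following is a minimal sketch, not Relia's own API: `RoundResult`, `success_rates`, and `build` are hypothetical names, and it assumes that a round with an empty Result cell passed while a round showing a function-call diff failed, which is consistent with the reported percentages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundResult:
    """One round of one suite for one model (hypothetical record type)."""
    model: str
    suite: int
    round_no: int
    passed: bool


def success_rates(results):
    """Return {model: percentage of passed rounds} over all suites and rounds."""
    totals, passes = {}, {}
    for r in results:
        totals[r.model] = totals.get(r.model, 0) + 1
        passes[r.model] = passes.get(r.model, 0) + int(r.passed)
    return {model: 100.0 * passes[model] / totals[model] for model in totals}


def build(model, failed_suites):
    """Two suites of three rounds each; every round of a failed suite fails."""
    return [
        RoundResult(model, suite, round_no, suite not in failed_suites)
        for suite in (1, 2)
        for round_no in (1, 2, 3)
    ]


# Outcomes transcribed from the table (assumption: empty Result = pass).
results = (
    build("openAI gpt-3.5-turbo", set())
    + build("fireworks accounts/fireworks/models/firefunction-v1", set())
    + build("groq llama3-70b-8192", {2})
)

rates = success_rates(results)
for model, rate in rates.items():
    print(f"{model}: {rate:.2f}%")
```

Under that assumption the sketch yields 100.00%, 100.00%, and 50.00%, matching the Success Rate column.
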
When developing an application, compare the results of multiple prompt sets on the same model to understand the impact of each prompt and guide optimization.
This test plan aims to compare the effectiveness of different prompt engineering strategies.

| Model | Success Rate | Suite | Round | Result |
|---|---|---|---|---|
| openAI gpt-3.5-turbo | 50.00% | 1 | 1 | `[ { "name": "get_product_price", "arguments": { "product_name": "iPhone" "product_name": "apple phone" } } ]` |
| | | 1 | 2 | `[ { "name": "get_product_price", "arguments": { "product_name": "iPhone" "product_name": "Apple Phone" } } ]` |
| | | 2 | 1 | |
| | | 2 | 2 | |

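
The failing rounds above show two values for the same argument, which reads like a diff between an expected and an actual function call. The sketch below shows one way such a comparison could be computed. It is an illustration only: `diff_values` is a hypothetical helper, not part of Relia, and treating the first value (`"iPhone"`) as the expected one and the second as the model's output is an assumption.

```python
_MISSING = object()  # sentinel for a key present on only one side


def diff_values(expected, actual, path=""):
    """Return (path, expected, actual) tuples for every mismatching leaf."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs = []
        for key in sorted(expected.keys() | actual.keys()):
            child = f"{path}.{key}" if path else key
            diffs.extend(
                diff_values(
                    expected.get(key, _MISSING), actual.get(key, _MISSING), child
                )
            )
        return diffs
    if (
        isinstance(expected, list)
        and isinstance(actual, list)
        and len(expected) == len(actual)
    ):
        diffs = []
        for index, (exp_item, act_item) in enumerate(zip(expected, actual)):
            diffs.extend(diff_values(exp_item, act_item, f"{path}[{index}]"))
        return diffs
    # Scalars, mismatched types, or lists of different length: compare whole values.
    return [] if expected == actual else [(path, expected, actual)]


# Assumed expected call and the call returned in suite 1, round 1.
expected = [{"name": "get_product_price", "arguments": {"product_name": "iPhone"}}]
actual = [{"name": "get_product_price", "arguments": {"product_name": "apple phone"}}]

mismatches = diff_values(expected, actual)
for path, want, got in mismatches:
    print(f"{path}: expected {want!r}, got {got!r}")
# prints: [0].arguments.product_name: expected 'iPhone', got 'apple phone'
```

This comparison is an exact match, so `"Apple Phone"` in round 2 would fail the same way; whether Relia applies any normalization before comparing is not stated here.
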
After the application is released, continuously test different versions of the same model to avoid regressions during upgrades.
This test plan aims to prevent regressions during model upgrades by comparing the performance of multiple versions of OpenAI's GPT-4 model.

| Model | Success Rate | Suite | Round | Result |
|---|---|---|---|---|
| openAI gpt-4-0613 | 37.50% | 1 | 1 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": "", "year": 2018 } ] } } ]` |
| | | 1 | 2 | |
| | | 1 | 3 | |
| | | 1 | 4 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": "", "year": 2018 } ] } } ]` |
| | | 2 | 1 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "UC Berkeley", "degree": "Bachelor's", "field": "Mechanical Engineering", "year": 2010 }, { "institution": "Caltech", "degree": "PhD", "field": "Mechanical Engineering", "field": "", "year": 2015 } ] } } ]` |
| | | 2 | 2 | |
| | | 2 | 3 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "UC Berkeley", "degree": "Bachelor's", "field": "Mechanical Engineering", "year": 2010 }, { "institution": "Caltech", "degree": "PhD", "field": "Mechanical Engineering", "field": "unknown", "year": 2015 } ] } } ]` |
| | | 2 | 4 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "UC Berkeley", "degree": "Bachelor's", "field": "Mechanical Engineering", "year": 2010 }, { "institution": "Caltech", "degree": "PhD", "field": "Mechanical Engineering", "field": "", "year": 2015 } ] } } ]` |
| openAI gpt-4-1106-preview | 50.00% | 1 | 1 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": "", "year": 2018 } ] } } ]` |
| | | 1 | 2 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": "", "year": 2018 } ] } } ]` |
| | | 1 | 3 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": "", "year": 2018 } ] } } ]` |
| | | 1 | 4 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": "", "year": 2018 } ] } } ]` |
| | | 2 | 1 | |
| | | 2 | 2 | |
| | | 2 | 3 | |
| | | 2 | 4 | |
| openAI gpt-4-0125-preview | 50.00% | 1 | 1 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": " ", "year": 2018 } ] } } ]` |
| | | 1 | 2 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": "", "year": 2018 } ] } } ]` |
| | | 1 | 3 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": "", "year": 2018 } ] } } ]` |
| | | 1 | 4 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": "", "year": 2018 } ] } } ]` |
| | | 2 | 1 | |
| | | 2 | 2 | |
| | | 2 | 3 | |
| | | 2 | 4 | |
| openAI gpt-4-turbo-2024-04-09 | 87.50% | 1 | 1 | |
| | | 1 | 2 | |
| | | 1 | 3 | `[ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "Computer Science", "field": "unspecified", "year": 2018 } ] } } ]` |
| | | 1 | 4 | |
| | | 2 | 1 | |
| | | 2 | 2 | |
| | | 2 | 3 | |
| | | 2 | 4 | |
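
A regression check over these results can be as simple as comparing each version's success rate with its predecessor's. The following is a minimal sketch under stated assumptions: `find_regressions` and its `tolerance` parameter are hypothetical and do not describe how Relia itself flags regressions. The rates are taken from the table above, listed in release order.

```python
def find_regressions(rates_by_version, tolerance=0.0):
    """Return (older, newer, drop) for each consecutive pair of versions
    whose success rate falls by more than `tolerance` percentage points.

    `rates_by_version` must be ordered from oldest to newest version.
    """
    versions = list(rates_by_version.items())
    regressions = []
    for (older, older_rate), (newer, newer_rate) in zip(versions, versions[1:]):
        drop = older_rate - newer_rate
        if drop > tolerance:
            regressions.append((older, newer, drop))
    return regressions


# Success rates reported in the table, oldest version first.
reported = {
    "gpt-4-0613": 37.5,
    "gpt-4-1106-preview": 50.0,
    "gpt-4-0125-preview": 50.0,
    "gpt-4-turbo-2024-04-09": 87.5,
}

print(find_regressions(reported))  # prints: []
```

No upgrade in this series lowers the success rate, so the list is empty. In a CI job, a non-empty result could be used to fail the build before a new model version is adopted.
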