Relia:

Project Goals

Relia is an end-to-end (E2E) testing framework for LLMs, designed to help you build AI benchmarks tailored to your specific use cases. It helps you identify the most suitable LLM for your needs and, through continuous testing, ensures that model upgrades do not cause performance regressions. It is built specifically for function calling (or "tool use") scenarios, which are at the core of agent-based AI applications.

Use Cases

Selecting the Most Suitable LLM

When selecting models, identify the best LLM for your specific use case, ensuring high performance and cost efficiency.

This test plan compares the success rates of models from three providers (OpenAI, Fireworks, and Groq) to identify the most accurate and efficient model for a specific task.

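Model	Success Rate	Suite	Round	Result
openAI gpt-3.5-turbo	100.00%	1	1	Pass
openAI gpt-3.5-turbo	100.00%	1	2	Pass
openAI gpt-3.5-turbo	100.00%	1	3	Pass
openAI gpt-3.5-turbo	100.00%	2	1	Pass
openAI gpt-3.5-turbo	100.00%	2	2	Pass
openAI gpt-3.5-turbo	100.00%	2	3	Pass
fireworks accounts/fireworks/models/firefunction-v1	100.00%	1	1	Pass
fireworks accounts/fireworks/models/firefunction-v1	100.00%	1	2	Pass
fireworks accounts/fireworks/models/firefunction-v1	100.00%	1	3	Pass
fireworks accounts/fireworks/models/firefunction-v1	100.00%	2	1	Pass
fireworks accounts/fireworks/models/firefunction-v1	100.00%	2	2	Pass
fireworks accounts/fireworks/models/firefunction-v1	100.00%	2	3	Pass
groq llama3-70b-8192	50.00%	1	1	Pass
groq llama3-70b-8192	50.00%	1	2	Pass
groq llama3-70b-8192	50.00%	1	3	Pass
groq llama3-70b-8192	50.00%	2	1	Fail: [ { "name": "get_stock_price", "arguments": { "symbol": "Tesla" } } ] (expected "symbol": "TSLA")
groq llama3-70b-8192	50.00%	2	2	Fail: [ { "name": "get_stock_price", "arguments": { "symbol": "Tesla" } } ] (expected "symbol": "TSLA")
groq llama3-70b-8192	50.00%	2	3	Fail: [ { "name": "get_stock_price", "arguments": { "symbol": "Tesla" } } ] (expected "symbol": "TSLA")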
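As an illustration of how rounds like those in the table above could be scored, here is a minimal Python sketch. It assumes a round passes only when the model's function call matches the expected call exactly; Relia's actual comparison rule is not described here. `calls_match` and `success_rate` are hypothetical helpers written for this example, not part of Relia's API, and the round data is made up to mirror the Groq row.

```python
from typing import Any

ToolCall = dict[str, Any]


def calls_match(expected: ToolCall, actual: ToolCall) -> bool:
    """Return True when function name and arguments are identical (assumed rule)."""
    return (
        expected["name"] == actual["name"]
        and expected["arguments"] == actual["arguments"]
    )


def success_rate(expected: ToolCall, rounds: list[ToolCall]) -> float:
    """Percentage of rounds whose call matches the expected call."""
    if not rounds:
        raise ValueError("at least one round is required")
    passed = sum(calls_match(expected, actual) for actual in rounds)
    return 100.0 * passed / len(rounds)


expected = {"name": "get_stock_price", "arguments": {"symbol": "TSLA"}}

# Illustrative data: three rounds return the ticker, three return the company name.
rounds = [
    {"name": "get_stock_price", "arguments": {"symbol": "TSLA"}},
] * 3 + [
    {"name": "get_stock_price", "arguments": {"symbol": "Tesla"}},
] * 3

print(f"{success_rate(expected, rounds):.2f}%")  # prints "50.00%"
```

Exact matching is the strictest possible rule; a real harness might normalize values or allow partial credit, which is a separate design decision.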

Optimizing Prompts

When developing an application, compare the results of multiple prompt sets on the same model to understand how each prompt affects the outcome and to guide optimization.

This test plan compares the effectiveness of different prompt engineering strategies.

Model	Success Rate	Suite	Round	Result
openAI gpt-3.5-turbo	50.00%	1	1	Fail: [ { "name": "get_product_price", "arguments": { "product_name": "apple phone" } } ] (expected "product_name": "iPhone")
openAI gpt-3.5-turbo	50.00%	1	2	Fail: [ { "name": "get_product_price", "arguments": { "product_name": "Apple Phone" } } ] (expected "product_name": "iPhone")
openAI gpt-3.5-turbo	50.00%	2	1	Pass
openAI gpt-3.5-turbo	50.00%	2	2	Pass
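To see which prompt set is responsible for the failures, the overall success rate can be broken down by suite. The sketch below assumes each suite corresponds to one prompt set, which the report does not state explicitly. `rate_by_suite` is a hypothetical helper, not a Relia function, and the pass/fail values are transcribed from the table above.

```python
from collections import defaultdict


def rate_by_suite(results):
    """Map each suite number to its success rate in percent.

    `results` is an iterable of (suite, round, passed) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # suite -> [passed, total]
    for suite, _round, passed in results:
        totals[suite][0] += int(passed)
        totals[suite][1] += 1
    return {suite: 100.0 * p / t for suite, (p, t) in totals.items()}


# (suite, round, passed) for openAI gpt-3.5-turbo, taken from the table.
results = [(1, 1, False), (1, 2, False), (2, 1, True), (2, 2, True)]

print(rate_by_suite(results))  # prints {1: 0.0, 2: 100.0}
```

Under that assumption, the 50.00% overall rate hides a clear split: every round in suite 1 fails while every round in suite 2 passes.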

Continuous Testing to Prevent Regressions

After the application is released, continuously test different versions of the same model to avoid regressions during upgrades.

This test plan guards against regressions during model upgrades by comparing the performance of multiple versions of OpenAI's GPT-4 model.

Model	Success Rate	Suite	Round	Result
openAI gpt-4-0613	37.50%	1	1	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-0613	37.50%	1	2	Pass
openAI gpt-4-0613	37.50%	1	3	Pass
openAI gpt-4-0613	37.50%	1	4	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-0613	37.50%	2	1	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "UC Berkeley", "degree": "Bachelor's", "field": "Mechanical Engineering", "year": 2010 }, { "institution": "Caltech", "degree": "PhD", "field": "", "year": 2015 } ] } } ] (expected "field": "Mechanical Engineering" for Caltech)
openAI gpt-4-0613	37.50%	2	2	Pass
openAI gpt-4-0613	37.50%	2	3	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "UC Berkeley", "degree": "Bachelor's", "field": "Mechanical Engineering", "year": 2010 }, { "institution": "Caltech", "degree": "PhD", "field": "unknown", "year": 2015 } ] } } ] (expected "field": "Mechanical Engineering" for Caltech)
openAI gpt-4-0613	37.50%	2	4	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "UC Berkeley", "degree": "Bachelor's", "field": "Mechanical Engineering", "year": 2010 }, { "institution": "Caltech", "degree": "PhD", "field": "", "year": 2015 } ] } } ] (expected "field": "Mechanical Engineering" for Caltech)
openAI gpt-4-1106-preview	50.00%	1	1	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-1106-preview	50.00%	1	2	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-1106-preview	50.00%	1	3	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-1106-preview	50.00%	1	4	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-1106-preview	50.00%	2	1	Pass
openAI gpt-4-1106-preview	50.00%	2	2	Pass
openAI gpt-4-1106-preview	50.00%	2	3	Pass
openAI gpt-4-1106-preview	50.00%	2	4	Pass
openAI gpt-4-0125-preview	50.00%	1	1	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": " ", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-0125-preview	50.00%	1	2	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-0125-preview	50.00%	1	3	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-0125-preview	50.00%	1	4	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-0125-preview	50.00%	2	1	Pass
openAI gpt-4-0125-preview	50.00%	2	2	Pass
openAI gpt-4-0125-preview	50.00%	2	3	Pass
openAI gpt-4-0125-preview	50.00%	2	4	Pass
openAI gpt-4-turbo-2024-04-09	87.50%	1	1	Pass
openAI gpt-4-turbo-2024-04-09	87.50%	1	2	Pass
openAI gpt-4-turbo-2024-04-09	87.50%	1	3	Fail: [ { "name": "extract_education_info", "arguments": { "education": [ { "institution": "Stanford University", "degree": "Bachelor's", "field": "Computer Science", "year": 2015 }, { "institution": "MIT", "degree": "Master's", "field": "unspecified", "year": 2018 } ] } } ] (expected "field": "Computer Science" for MIT)
openAI gpt-4-turbo-2024-04-09	87.50%	1	4	Pass
openAI gpt-4-turbo-2024-04-09	87.50%	2	1	Pass
openAI gpt-4-turbo-2024-04-09	87.50%	2	2	Pass
openAI gpt-4-turbo-2024-04-09	87.50%	2	3	Pass
openAI gpt-4-turbo-2024-04-09	87.50%	2	4	Pass
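A regression check over a report like the one above can be as simple as walking the versions in release order and flagging any drop in success rate. The following Python sketch is illustrative only: `find_regressions` and its `tolerance` parameter are hypothetical names, not part of Relia, and it assumes the table lists versions from oldest to newest.

```python
def find_regressions(history, tolerance=0.0):
    """Return (previous, current, drop) for each upgrade that lowers the rate.

    `history` is a list of (version, success_rate_percent) pairs ordered
    from oldest to newest. Drops no larger than `tolerance` are ignored.
    """
    regressions = []
    for (prev_v, prev_r), (cur_v, cur_r) in zip(history, history[1:]):
        if cur_r < prev_r - tolerance:
            regressions.append((prev_v, cur_v, prev_r - cur_r))
    return regressions


# Success rates taken from the table above.
history = [
    ("gpt-4-0613", 37.5),
    ("gpt-4-1106-preview", 50.0),
    ("gpt-4-0125-preview", 50.0),
    ("gpt-4-turbo-2024-04-09", 87.5),
]

print(find_regressions(history))  # prints []

# A made-up future version, added only to show what a flagged drop looks like.
print(find_regressions(history + [("hypothetical-next", 75.0)]))
# prints [('gpt-4-turbo-2024-04-09', 'hypothetical-next', 12.5)]
```

With the rates in this report, no upgrade lowers the success rate, so the first call returns an empty list.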

Roadmap

Core API: Expand support to include more LLM providers.

UI/UX: Develop a form UI for editing test plans, making it easier and more intuitive to create and manage tests.

Core API: Enable customization of provider titles and suite titles in test reports for better organization and clarity.

Cloud Service: Improve the efficiency and reliability of executing large-scale test plans.

Cloud Service: Implement persistent storage for test plans and reports.

Core API: Allow custom scoring for different suites to better evaluate and compare the performance of test cases.

Feel free to follow our project on GitHub, X, and Bilibili.