Regression testing
Modsys gives you access to a large number of providers, integrations, and libraries to boost the accuracy of your AI models.
With Modsys, you can tune LLM prompts systematically across many relevant test cases. By evaluating and comparing LLM outputs, you can build decision-making workflows, test prompt quality, and catch regressions faster.
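To make the regression idea concrete, here is a minimal, illustrative sketch (plain Python, not the Modsys API; `run_model` and the file name are hypothetical) that records baseline outputs for a set of prompts and flags any prompt whose output changes on a later run:

```python
# Illustrative only (not Modsys code): a bare-bones regression check that
# compares fresh model outputs against previously saved baseline outputs.
import json
from pathlib import Path

def run_model(prompt: str) -> str:
    # Placeholder for a real provider call.
    return f"echo: {prompt}"

prompts = ["What is 2 + 2?", "Name one primary color."]
baseline_path = Path("baseline_outputs.json")

outputs = {p: run_model(p) for p in prompts}

if baseline_path.exists():
    baseline = json.loads(baseline_path.read_text())
    for prompt, new_out in outputs.items():
        if baseline.get(prompt) != new_out:
            print(f"REGRESSION? output changed for prompt: {prompt!r}")
else:
    # First run: record the current outputs as the baseline.
    baseline_path.write_text(json.dumps(outputs, indent=2))
    print("Baseline recorded.")
```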
Subfeatures
prompt testing
: You can compare prompt outputs against your models; using this approach, you can gain insight into model quality.

grading
: You can set up automatic evaluation of your model outputs, extending testing with visible pass or fail states (see the sketch after this list).

retraining (suggestion)
: You can retrain models or update classifiers using suggested responses from the grading and prompt-testing states.
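As a concrete illustration of the prompt-testing and grading ideas above, the following sketch (plain Python; `TestCase`, `grade`, and `call_model` are hypothetical names, not part of Modsys) runs a few test cases and reports a visible pass or fail state for each:

```python
# Hypothetical sketch (not the Modsys API): a minimal pass/fail grader that
# checks a model's output against an expected substring for each test case.
from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str
    expected_substring: str

def grade(output: str, case: TestCase) -> str:
    """Return a visible 'pass' or 'fail' state for one test case."""
    return "pass" if case.expected_substring.lower() in output.lower() else "fail"

def call_model(prompt: str) -> str:
    # Placeholder for a real provider/model call.
    return "Bonjour" if "French" in prompt else "A cat sat on a mat."

cases = [
    TestCase("Summarize: The cat sat on the mat.", "cat"),
    TestCase("Translate to French: Hello", "bonjour"),
]

for case in cases:
    result = grade(call_model(case.prompt), case)
    print(f"{result}: {case.prompt}")
```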
Our shared approach
Our tool is inspired by modsys; we've extended the module beyond its existing base to support our developers.
A blurb from modsys: "Serious LLM development requires a systematic approach to prompt engineering." The core purpose of modsys is to support performance management, monitoring, and evaluation of AI models using a well-structured, repeatable, and customizable process.
The goal: informed, data-driven decisions for prompt tuning and for building custom decision-making workflows. At a high level, here's how to use modsys:
Define your test cases
: Identify the scenarios and inputs that are relevant to your application. Create a set of prompts and test cases that closely represent these scenarios.

Configure your evaluation
: Set up your evaluation by specifying the prompts, test cases, and API providers you want to use. You can customize the evaluation process by configuring the input and output formats, the level of concurrency, and other options.

Run the evaluation
: Execute the evaluation using the command-line tool or library. Modsys will evaluate your prompts against the specified API providers, generating side-by-side comparisons of their outputs.

Analyze the results
: Review results in a structured format, such as CSV, JSON, YAML, or HTML, to make informed decisions about the best model and prompt choices for your application (a minimal end-to-end sketch follows this list).

Report evaluations
: Escalate accuracy issues to an AI Vulnerability Database (AVID) provider. We plan to support future iterations covering ethics, security, and other performance measures.
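As an illustration of this workflow, the sketch below (plain Python; the prompt templates, `call_provider` stub, and output file names are assumptions, not Modsys's actual config or CLI) defines a couple of test cases, compares two prompt variants side by side, and writes the results to CSV and JSON for review:

```python
# Hypothetical end-to-end sketch of the workflow above (not Modsys code):
# define test cases, evaluate two prompt variants, and export a comparison.
import csv
import json

prompts = {
    "v1": "Answer briefly: {question}",
    "v2": "You are a concise assistant. {question}",
}
test_cases = ["What is the capital of France?", "Define overfitting."]

def call_provider(prompt: str) -> str:
    # Placeholder for a real API provider call (OpenAI, Anthropic, etc.).
    return f"[model reply to: {prompt}]"

rows = []
for question in test_cases:
    row = {"question": question}
    for name, template in prompts.items():
        row[name] = call_provider(template.format(question=question))
    rows.append(row)

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "v1", "v2"])
    writer.writeheader()
    writer.writerows(rows)

with open("results.json", "w") as f:
    json.dump(rows, f, indent=2)
```

In practice, Modsys's own configuration, providers, and structured output formats (CSV, JSON, YAML, or HTML) take the place of these stubs.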