


<p align="center"> <img src="https://github.com/user-attachments/assets/ff43ca6a-ffde-4aff-9ff9-eec3897d0d02" alt="Rhesis AI Logo" height="80"> </p>

Rhesis: Collaborative Testing for LLM & Agentic Applications

<p align="center"> <a href="https://github.com/rhesis-ai/rhesis/blob/main/LICENSE"> <img src="https://img.shields.io/badge/license-MIT%20%2B%20Enterprise-blue" alt="License"> </a> <a href="https://pypi.org/project/rhesis-sdk/"> <img src="https://img.shields.io/pypi/v/rhesis-sdk" alt="PyPI Version"> </a> <a href="https://pypi.org/project/rhesis-sdk/"> <img src="https://img.shields.io/pypi/pyversions/rhesis-sdk" alt="Python Versions"> </a> <a href="https://codecov.io/gh/rhesis-ai/rhesis"> <img src="https://codecov.io/gh/rhesis-ai/rhesis/graph/badge.svg?token=1XQV983JEJ" alt="codecov"> </a> <a href="https://discord.rhesis.ai"> <img src="https://img.shields.io/discord/1340989671601209408?color=7289da&label=Discord&logo=discord&logoColor=white" alt="Discord"> </a> <a href="https://www.linkedin.com/company/rhesis-ai"> <img src="https://img.shields.io/badge/LinkedIn-Rhesis_AI-blue?logo=linkedin" alt="LinkedIn"> </a> <a href="https://huggingface.co/rhesis"> <img src="https://img.shields.io/badge/🤗-Rhesis-yellow" alt="Hugging Face"> </a> <a href="https://docs.rhesis.ai"> <img src="https://img.shields.io/badge/docs-rhesis.ai-blue" alt="Documentation"> </a> </p> <p align="center"> <a href="https://rhesis.ai"><strong>Website</strong></a> · <a href="https://docs.rhesis.ai"><strong>Docs</strong></a> · <a href="https://discord.rhesis.ai"><strong>Discord</strong></a> · <a href="https://github.com/rhesis-ai/rhesis/blob/main/CHANGELOG.md"><strong>Changelog</strong></a> </p> <h3 align="center">More than just evals.<br><strong>Collaborative agent testing for teams.</strong></h3> <p align="center"> Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together. </p> <p align="center"> <a href="https://rhesis.ai/?video=open" target="_blank"> <img src=".github/images/GH_Short_Demo.png" loading="lazy" width="1080" alt="Rhesis Platform Overview - Click to watch demo"> </a> </p>

Core features

<p align="center"> <img src=".github/images/GH_Features.png" loading="lazy" width="1080" alt="Rhesis Core Features"> </p>

Test generation

AI-Powered Synthesis - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.

Knowledge-Aware - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.

Single-turn & conversation simulation

Use single-turn tests for Q&A validation and conversation simulation for multi-turn dialogue flows.

Penelope Agent simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.
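To make the idea concrete, here is a hedged sketch of what driving a simulated conversation from code might look like. Everything in it (the `ConversationSimulator` name, its import path, and its parameters) is hypothetical and only illustrates the concept; the actual interface is documented at docs.rhesis.ai.

```python
# HYPOTHETICAL sketch -- `ConversationSimulator` and every parameter below are
# illustrative names, not the documented Rhesis API; see docs.rhesis.ai.
from rhesis.sdk.simulation import ConversationSimulator  # hypothetical import

simulator = ConversationSimulator(
    persona="impatient customer disputing a charge",  # hypothetical parameter
    goal="obtain a refund without an order number",   # hypothetical parameter
    max_turns=8,                                      # hypothetical parameter
)
transcript = simulator.run(endpoint="my-chatbot")     # hypothetical call
```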

Adversarial testing (red-teaming)

Polyphemus Agent proactively finds vulnerabilities:

  • Jailbreak attempts and prompt injection
  • PII leakage and data extraction
  • Harmful content generation
  • Role violation and instruction bypassing

Garak Integration - Built-in support for garak, the LLM vulnerability scanner, for comprehensive security testing.
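garak also ships a standalone CLI, so a one-off scan outside the platform looks roughly like this (the probe choice is just an example; `python -m garak --list_probes` shows the full catalog):

```bash
# Standalone garak scan of an OpenAI-hosted model (needs OPENAI_API_KEY set).
python -m garak --model_type openai --model_name gpt-3.5-turbo --probes promptinject
```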

60+ pre-built metrics

| Framework | Example Metrics |
|-----------|-----------------|
| RAGAS | Context relevance, faithfulness, answer accuracy |
| DeepEval | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
| Garak | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
| Custom | NumericJudge, CategoricalJudge for domain-specific evaluation |

All metrics include LLM-as-Judge reasoning explanations.
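As a concrete illustration of a custom judge, here is a minimal sketch using the `CategoricalJudge` name from the table above. The import path and constructor arguments are assumptions, not the documented signature; consult the metrics docs for real usage.

```python
# Sketch of a domain-specific LLM-as-Judge metric. `CategoricalJudge` is named
# in the metrics table above, but the import path and parameters here are
# ASSUMPTIONS -- check the Rhesis metrics docs for the actual constructor.
from rhesis.sdk.metrics import CategoricalJudge  # assumed import path

tone_judge = CategoricalJudge(
    name="support_tone",                                      # assumed
    prompt="Classify the tone of the assistant's reply.",     # assumed
    categories=["professional", "neutral", "inappropriate"],  # assumed
)
```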

Traces & observability

Monitor your LLM applications with OpenTelemetry-based tracing:

```python
from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    response = ...  # your LLM call here
    return response
```

Track LLM calls, latency, token usage, and link traces to test results for debugging.

Bring your own model

Use any LLM provider for test generation and evaluation:

Cloud: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI

Local/Self-hosted: Ollama, vLLM, LiteLLM

See Model Configuration Docs for setup instructions.
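As an example, here is a hedged sketch of pointing test generation at a locally hosted model; the `model` keyword and the provider/model string format are assumptions, so treat the Model Configuration docs as authoritative.

```python
from rhesis.sdk.synthesizers import PromptSynthesizer

# ASSUMPTION: the synthesizer accepts a `model` argument using a
# provider/model string; the real configuration surface may differ.
synthesizer = PromptSynthesizer(
    prompt="Generate tests for a billing assistant",
    model="ollama/llama3",
)
```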


Why Rhesis?

Platform for teams. SDK for developers.

Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.

The testing lifecycle

Six integrated phases from project setup to team collaboration:

| Phase | What You Do |
|-------|-------------|
| 1. Projects | Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors |
| 2. Requirements | Define expected behaviors (what your app should and shouldn't do), cover all relevant aspects from product, marketing, customer support, legal and compliance teams |
| 3. Metrics | Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met |
| 4. Tests | Generate single-turn and conversation simulation test scenarios. Organize in test sets and understand your test coverage |
| 5. Execution | Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution |
| 6. Collaboration | Review results with your team through comments, tasks, workflows, and side-by-side comparisons |

Rhesis vs...

| Instead of... | Rhesis gives you... |
|---------------|---------------------|
| Manual testing | AI-generated test cases based on your context, hundreds in minutes |
| Traditional test frameworks | Non-deterministic output handling built-in |
| LLM observability tools | Pre-production validation, not post-production monitoring |
| Red-teaming services | Continuous, self-service adversarial testing, not one-time audits |


What you can test

| Use Case | What Rhesis Tests |
|----------|-------------------|
| Conversational AI | Conversation simulation, role adherence, knowledge retention |
| RAG Systems | Context relevance, faithfulness, hallucination detection |
| NL-to-SQL / NL-to-Code | Query accuracy, syntax validation, edge case handling |
| Agentic Systems | Tool selection, goal achievement, multi-agent coordination |


SDK: Code-first testing

Test your Python functions directly with the @endpoint decorator:

```python
from rhesis.sdk.decorators import endpoint

@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    response = ...  # your LLM logic here
    return response
```

Features: Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production).

Generate tests programmatically:

```python
from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)
```
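From there you would typically inspect or persist the generated set; as a hedged sketch (the `tests` and `prompt` attribute names are assumptions, not the confirmed SDK surface):

```python
# ASSUMED attribute names (`tests`, `prompt`), shown for illustration only.
for test in test_set.tests:
    print(test.prompt)
```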

Deployment options

| Option | Best For | Setup Time |
|--------|----------|------------|
| Rhesis Cloud | Teams wanting managed deployment | Instant |
| Docker | Local development and testing | 5 minutes |
| Kubernetes | Production self-hosting | See docs |

Quick Start

Option 1: Cloud (fastest) - app.rhesis.ai - Managed service, just connect your app

Option 2: Self-host with Docker

```bash
git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start
```

Access: Frontend at localhost:3000, API at localhost:8080/docs

Commands: ./rh logs · ./rh stop · ./rh restart · ./rh delete

Note: This setup enables auto-login for local testing. For production, see Self-hosting Documentation.

Option 3: Python SDK

```bash
pip install rhesis-sdk
```

Integrations

Connect Rhesis to your LLM stack:

| Integration | Languages | Description |
|-------------|-----------|-------------|
| Rhesis SDK | Python, JS/TS | Native SDK with decorators for endpoints and observability. Full control over test execution and tracing. |
| OpenAI | Python | Drop-in replacement for OpenAI SDK. Automatic instrumentation with zero code changes. |
| Anthropic | Python | Native support for Claude models with automatic tracing. |
| LangChain | Python | Add Rhesis callback handler to your LangChain app for automatic tracing and test execution. |
| LangGraph | Python | Built-in integration for LangGraph agent workflows with full observability. |
| AutoGen | Python | Automatic instrumentation for Microsoft AutoGen multi-agent conversations. |
| LiteLLM | Python | Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate). |
| Google Gemini | Python | Native integration for Google's Gemini models. |
| Ollama | Python | Local LLM deployment with Ollama integration. |
| OpenRouter | Python | Access to multiple LLM providers through OpenRouter. |
| Vertex AI | Python | Google Cloud Vertex AI model support. |
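For instance, given the "drop-in replacement" description of the OpenAI integration above, usage plausibly looks like the sketch below; the import path is an assumption, so verify it against the integration docs before relying on it.

```python
# ASSUMED import path for the drop-in client; the chat-completions call
# itself is the standard openai-SDK call and is unchanged.
from rhesis.sdk.integrations.openai import OpenAI  # assumption, check docs

client = OpenAI()  # reads OPENAI_API_KEY as usual; calls are traced automatically
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello"}],
)
print(reply.choices[0].message.content)
```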

No findings