Memory layer for your LLM evaluations.
TestRelic's DeepEval pytest plugin captures every eval run — metrics, scores, reasons, prompts — into shared application memory. So your AI coding agent doesn't just have a harness to run evals; it has the team's eval history as context. Drop-in for any DeepEval test suite.
An AI agent harness for testing is the runtime, memory, and tool layer that turns a general-purpose LLM into a senior testing engineer for your specific app. TestRelic provides the memory leg. DeepEval provides the eval runner. Cursor, Claude Code, Copilot, and Codex are the AI coding agents that read from that memory over MCP.
Quickstart
Three lines from pip to first eval upload.
No conftest changes. No SDK init in your test files. Drop the plugin in, log in, and run your existing DeepEval suite.
pip install testrelic-deepeval
testrelic login
deepeval test run tests/
pip install testrelic-deepeval
Pulls the pytest plugin from PyPI. No conftest plumbing required; the entry point auto-attaches when both DeepEval and testrelic-deepeval are present.
testrelic login
Browser-based login flow that writes your API key to ~/.testrelic/credentials. One-time setup; CI uses TESTRELIC_API_KEY instead.
deepeval test run tests/
Your existing DeepEval suite runs as-is. The TestRelic plugin auto-captures the TestRun and uploads metrics, cases, scores, and reasons to TestRelic.
What the memory looks like
Every eval run, captured and queryable.
Metrics, scores, thresholds, reasons, and evaluation models — uploaded automatically, served back to your AI agent on demand.
$ deepeval test run tests/test_chat_assistant.py
Eval run dashboard: 248 cases evaluated, 6 metrics tracked, 0.81 average score.
Eval-run memory queryable from Cursor and Claude Code over MCP.
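To make the captured fields concrete, here is a minimal sketch of what one metric result could look like as a record. The class name, field names, and sample values are hypothetical and only mirror the fields listed above (metric name, score, threshold, reason, evaluation model); this is not TestRelic's actual upload schema.

```python
from dataclasses import dataclass, asdict
import json


# Hypothetical record shape. Field names mirror the fields named in the text,
# not TestRelic's real schema.
@dataclass(frozen=True)
class MetricResult:
    name: str
    score: float
    threshold: float
    reason: str
    evaluation_model: str

    @property
    def passed(self) -> bool:
        # Assumes a higher-is-better metric where meeting the threshold passes.
        return self.score >= self.threshold


# Illustrative values only.
result = MetricResult(
    name="AnswerRelevancy",
    score=0.81,
    threshold=0.7,
    reason="Illustrative reason text.",
    evaluation_model="example-judge-model",
)

# Serialize the stored fields; the derived `passed` property is not included.
payload = json.dumps(asdict(result), sort_keys=True)
```

Keeping the reason and evaluation model next to the score is what would let an agent explain a result rather than only report it.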
Who uses the DeepEval memory layer
Same eval runs, four different audiences reading the memory through their own AI agent.
Prompt
“Why did AnswerRelevancy drop on the customer-support assistant last week?”
TestRelic pulls the eval-run history, surfaces the prompt template change in commit a7f3c12, and shows the 14 failing cases.
Prompt
“Show eval scores across every team that runs DeepEval in our org.”
Org-level eval memory aggregates by team, by metric, by model — no manual rollups.
Prompt
“Which prompts regressed this sprint?”
Cursor reads TestRelic's eval memory and answers without grepping CI logs or eval JSON dumps.
Prompt
“Give me regression history for the checkout-flow eval suite.”
Eval runs stop being throwaway CI artifacts. Every run is queryable history.
Comparisons
Where TestRelic fits in your eval stack.
Memory layer, agent harness, DeepEval, Confident AI — what does what.
Production-ready for pytest integration, CLI, and dataset operations. Tracing is not yet shipped — we'll announce when it is.
If DeepEval supports the metric, the plugin captures it — name, score, threshold, reason, evaluation model.
No conftest plumbing. The plugin auto-attaches when both DeepEval and testrelic-deepeval are installed.
Disabled by default without an API key
If TESTRELIC_API_KEY is unset, the plugin is a no-op — your DeepEval suite never fails because of a missing TestRelic credential. Safe to add to existing suites.
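The no-op behavior can be pictured as a guard in front of the upload step. This is a minimal sketch of that pattern, not the plugin's source: `upload_enabled` and `maybe_upload` are hypothetical names, and treating a blank value the same as an unset one is an assumption.

```python
import os
from typing import Callable, Mapping, Optional


def upload_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """True only when TESTRELIC_API_KEY is set to a non-blank value."""
    env = os.environ if env is None else env
    # Assumption: a blank or whitespace-only key counts as unset.
    return bool(env.get("TESTRELIC_API_KEY", "").strip())


def maybe_upload(
    results: list,
    env: Optional[Mapping[str, str]] = None,
    upload: Callable[[list], None] = print,
) -> bool:
    """Upload when a key is present; otherwise do nothing and never raise."""
    if not upload_enabled(env):
        return False  # no-op path: the test run continues unaffected
    upload(results)
    return True
```

The point of the guard is that a missing credential changes what gets uploaded, never whether the test suite passes.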