DeepEval IntegrationAlpha v0.1.2pip install testrelic-deepeval

Memory layer for your LLM evaluations.

TestRelic's DeepEval pytest plugin captures every eval run — metrics, scores, reasons, prompts — into shared application memory. So your AI coding agent doesn't just have a harness to run evals; it has the team's eval history as context. Drop-in for any DeepEval test suite.

An AI agent harness for testing is the runtime, memory, and tool layer that turns a general-purpose LLM into a senior testing engineer for your specific app. TestRelic provides the memory leg. DeepEval provides the eval runner. Cursor, Claude Code, Copilot, and Codex are the LLMs that read from that memory over MCP.

All metricsAnswerRelevancy · Hallucination · Faithfulness · G-Eval · custom
Pytest-nativeauto-attach, no conftest
Alpha v0.1.2on PyPI today

Quickstart

Three lines from pip to first eval upload.

No conftest changes. No SDK init in your test files. Drop the plugin in, log in, and run your existing DeepEval suite.

~/your-project · zsh
pip install testrelic-deepeval
testrelic login
deepeval test run tests/
1pip install testrelic-deepeval

Pulls the pytest plugin from PyPI. No conftest plumbing required — the entry point auto-attaches when both DeepEval and testrelic-deepeval are present.

2testrelic login

Browser-based login flow that writes your API key to ~/.testrelic/credentials. One-time setup; CI uses TESTRELIC_API_KEY instead.

3deepeval test run tests/

Your existing DeepEval suite runs as-is. The TestRelic plugin auto-captures the TestRun and uploads metrics, cases, scores, and reasons to TestRelic.

What the memory looks like

Every eval run, captured and queryable.

Metrics, scores, thresholds, reasons, and evaluation models — uploaded automatically, served back to your AI agent on demand.

deepeval test run · AnswerRelevancy 0.83 · uploaded

$ deepeval test run tests/test_chat_assistant.py

eval run dashboard

248

Cases evaluated

6

Metrics tracked

0.81

Avg score

AnswerRelevancy ≥ 0.7
Faithfulness ≥ 0.7
Hallucination ≤ 0.3

Eval-run memory queryable from Cursor and Claude Code over MCP.

Who uses the DeepEval memory layer

Same eval runs, four different audiences reading the memory through their own AI agent.

AI Engineer

Prompt

Why did AnswerRelevancy drop on the customer-support assistant last week?

TestRelic pulls the eval-run history, surfaces the prompt template change in commit a7f3c12, and shows the 14 failing cases.

Platform team

Prompt

Show eval scores across every team that runs DeepEval in our org.

Org-level eval memory aggregates by team, by metric, by model — no manual rollups.

Engineering Lead

Prompt

Which prompts regressed this sprint?

Cursor reads TestRelic's eval memory and answers without grepping CI logs or eval JSON dumps.

QA

Prompt

Give me regression history for the checkout-flow eval suite.

Eval runs stop being throwaway CI artifacts. Every run is queryable history.

Comparisons

Where TestRelic fits in your eval stack.

Memory layer, agent harness, DeepEval, Confident AI — what does what.

Alpha
v0.1.2 on PyPI

Production-ready for pytest integration, CLI, and dataset operations. Tracing is not yet shipped — we'll announce when it is.

All DeepEval metrics
AnswerRelevancy · Hallucination · Faithfulness · G-Eval · custom

If DeepEval supports the metric, the plugin captures it — name, score, threshold, reason, evaluation model.

Drop-in
Pytest entry point

No conftest plumbing. The plugin auto-attaches when both DeepEval and testrelic-deepeval are installed.

Disabled by default without an API key

If TESTRELIC_API_KEY is unset, the plugin is a no-op — your DeepEval suite never fails because of a missing TestRelic credential. Safe to add to existing suites.