Pytest For AI Agents: How DeepEval Is Bringing Deterministic Testing To The Nondeterministic World Of LLMs + Video

Introduction:

Large Language Models are inherently probabilistic: the same prompt can yield different outputs each time it’s run. This nondeterminism breaks traditional unit testing, where assertions rely on predictable values. DeepEval, an open-source LLM evaluation framework, addresses this by bringing the familiar Pytest workflow to AI agent testing. Developers write test files, loop through evaluation datasets, run agents, and assert against LLM metrics, with the test runner executing on their own machine.

Learning Objectives:

Understand how to integrate DeepEval with LangChain using the CallbackHandler for automatic trace capture
Learn to write Pytest-based tests for AI agents with parametrize and assert_test
Master end-to-end and component-level evaluation strategies for LLM applications
Implement CI/CD pipelines that block unsafe or low-quality AI outputs before deployment

1. Why Traditional Unit Tests Fail for LLM Applications

When building agents with LangChain, you’re chaining together LLMs, tools, and retrieval steps. Each component can fail differently, and the output can change from run to run. Traditional unit tests don’t work here because there’s no single deterministic value to assert against. DeepEval approaches this problem by treating LLM evaluation as a specialized form of unit testing: similar to Pytest, but with assertions made against metric scores rather than exact outputs.
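
The contrast can be sketched without any framework. The following is a minimal illustration, not DeepEval's implementation: `fake_agent` and `overlap_score` are hypothetical stand-ins for a real agent and a real metric (word overlap is a crude substitute for an LLM-as-a-judge score), and the 0.5 threshold is arbitrary.

```python
import random
import re

# Hypothetical stand-in for an LLM: the same prompt yields varying phrasings.
ANSWERS = [
    "The capital of France is Paris.",
    "Paris is the capital of France.",
    "France's capital city is Paris.",
]
EXPECTED = "Paris is the capital of France."
THRESHOLD = 0.5  # arbitrary pass mark for this illustration


def fake_agent(prompt: str, rng: random.Random) -> str:
    """Return one of several equivalent answers, chosen at random."""
    return rng.choice(ANSWERS)


def overlap_score(output: str, expected: str) -> float:
    """Toy metric: fraction of the expected words found in the output."""
    def words(text: str) -> set:
        return set(re.findall(r"[a-z]+", text.lower()))
    wanted = words(expected)
    return len(wanted & words(output)) / len(wanted)


rng = random.Random(0)
outputs = [fake_agent("What is the capital of France?", rng) for _ in range(20)]

# Exact matching passes only when the phrasing happens to line up.
exact_passes = sum(out == EXPECTED for out in outputs)
# A thresholded score accepts every phrasing that carries the same content.
metric_passes = sum(overlap_score(out, EXPECTED) >= THRESHOLD for out in outputs)

print(f"exact: {exact_passes}/{len(outputs)}, metric: {metric_passes}/{len(outputs)}")
```

The exact-match count depends on which phrasings were sampled, while the thresholded check passes for all three phrasings. That stability is what makes a metric-based assertion usable as a test.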

The framework incorporates recent research to run evaluations via metrics such as G-Eval, task completion, answer relevancy, and hallucination detection, using LLM-as-a-judge and other NLP models. The test harness itself runs locally on your machine, so you can test your AI agents with the same rigor you apply to traditional software. Whether evaluated data stays off third-party services depends on the judge model: an LLM-as-a-judge metric sends the text it scores to whichever model you configure, so use a locally hosted judge for sensitive data.

2. Getting Started: Installation and Basic Setup

DeepEval is 100% open source and runs entirely on your machine. To begin testing your LangChain agents:

Installation:

pip install deepeval

Basic Test Structure:

import pytest
from deepeval import assert_test
from deepeval.metrics import TaskCompletionMetric, AnswerRelevancyMetric
from deepeval.integrations.langchain import CallbackHandler

# `agent` is assumed to be a LangChain agent built elsewhere in your project.

def test_agent_task_completion():
    # Create a callback handler with metrics
    handler = CallbackHandler(metrics=[TaskCompletionMetric()])

    # Run your LangChain agent with the callback
    result = agent.invoke(
        {"input": "What is the capital of France?"},
        config={"callbacks": [handler]}
    )

    # Assert against the metric
    assert_test(handler)

The CallbackHandler captures the full execution trace—inputs, outputs, tool calls, and LLM spans—and maps them to test cases automatically. Every agent run, model call, tool call, and retriever call becomes a span you can inspect without rewriting your LangChain app.

3. Tracing and Instrumentation with LangChain

DeepEval’s LangChain integration uses a CallbackHandler that you pass directly to your LangChain agent’s invoke method. This instrumentation is per-call, meaning you decide which runs are traced.

What Gets Traced:

Each LangChain call that receives a CallbackHandler produces a trace—the end-to-end unit your user observes. Inside that trace are component spans for each callback LangChain emits:

Agent spans — create_agent runs and any nested runnable steps
LLM spans — chat model and completion calls
Tool spans — tool calls and function executions
Retriever spans — retriever calls when your app uses retrieval

The trace and its component spans are independently evaluable, giving you granular visibility into where failures occur.
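
One way to picture this structure is as a tree of typed spans under a single trace. The sketch below is a hypothetical data model for illustration only; the class and field names (`Span`, `Trace`, `kind`, `walk`, `spans_of`) are assumptions, not DeepEval's internal types.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    kind: str  # "agent", "llm", "tool", or "retriever"
    name: str
    children: List["Span"] = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and every nested span, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Trace:
    root: Span  # the end-to-end unit the user observes

    def spans_of(self, kind: str) -> List[Span]:
        """Select spans of one kind so they can be scored on their own."""
        return [span for span in self.root.walk() if span.kind == kind]


# A made-up run: one agent span containing LLM, tool, and retriever steps.
trace = Trace(
    root=Span("agent", "research_agent", [
        Span("llm", "plan"),
        Span("tool", "web_search"),
        Span("retriever", "vector_store"),
        Span("llm", "final_answer"),
    ])
)

llm_spans = trace.spans_of("llm")
print([span.name for span in llm_spans])
```

Filtering by span kind is what makes component-level scoring possible: a metric can be applied to the LLM spans alone while another scores the trace as a whole.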

4. Two Levels of Testing: End-to-End vs. Component-Level

DeepEval supports testing at two distinct levels, each serving different purposes in your quality assurance strategy.

End-to-End Testing:

End-to-end testing evaluates the whole agent on task completion. You pass metrics directly to the CallbackHandler to score the overall LangChain run. This approach answers the question: “Did the agent accomplish its goal?”

Component-Level Testing:

Component-level testing attaches metrics to individual LLMs and tools within your chain, so you know exactly which component failed when a test breaks. DeepEval provides several helpers for staging metrics onto specific spans:

– `next_agent_span(metrics=[…])` — stage metrics for the next agent span
– `next_llm_span(metrics=[…])` — stage metrics for the next LLM span
– `next_retriever_span(metrics=[…])` — stage metrics for the next retriever span

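from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.integrations.langchain import CallbackHandler, next_llm_span

def test_llm_component():
    handler = CallbackHandler()

    # Stage the metric for the next LLM span produced inside this block
    with next_llm_span(metrics=[AnswerRelevancyMetric()]):
        result = agent.invoke(
            {"input": "Explain quantum computing"},
            config={"callbacks": [handler]}
        )

    assert_test(handler)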

Staging is one-shot: only the first matching span in the run picks up the staged metric, and later spans of the same type are left unscored.
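
The one-shot behavior can be illustrated with a small stand-alone recorder. This is a conceptual sketch under stated assumptions, not DeepEval's code: `Recorder`, `stage_for_next`, and `start_span` are hypothetical names, and metrics are represented as plain strings.

```python
from contextlib import contextmanager
from typing import Dict, List, Tuple


class Recorder:
    """Toy span recorder with one-shot metric staging (illustrative only)."""

    def __init__(self) -> None:
        self._staged: Dict[str, List[str]] = {}
        self.spans: List[Tuple[str, str, List[str]]] = []

    @contextmanager
    def stage_for_next(self, kind: str, metrics: List[str]):
        """Stage metrics for the next span of the given kind."""
        self._staged[kind] = list(metrics)
        try:
            yield
        finally:
            # Discard anything never consumed so it cannot leak to later runs.
            self._staged.pop(kind, None)

    def start_span(self, kind: str, name: str) -> None:
        # pop() consumes the staged metrics, so only the first match gets them.
        metrics = self._staged.pop(kind, [])
        self.spans.append((kind, name, metrics))


recorder = Recorder()
with recorder.stage_for_next("llm", ["answer_relevancy"]):
    recorder.start_span("agent", "main")        # different kind: untouched
    recorder.start_span("llm", "plan")          # first LLM span: gets the metric
    recorder.start_span("llm", "final_answer")  # second LLM span: gets nothing

for span in recorder.spans:
    print(span)
```

Consuming the staged metrics on first use keeps the scope of an evaluation explicit: to score a later span, stage again immediately before it.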

5. Available Metrics for Comprehensive Evaluation

DeepEval offers a large variety of ready-to-use LLM evaluation metrics powered by any LLM of your choice, statistical methods, or NLP models that run locally.

Agentic Metrics:

Task Completion — evaluate whether an agent accomplished its goal
Tool Correctness — check if the right tools were called with the right arguments
Goal Accuracy — measure how accurately the agent achieved the intended goal
Step Efficiency — evaluate whether the agent took unnecessary steps
Plan Adherence — check if the agent followed the expected plan
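
As a rough intuition for Tool Correctness, the toy function below scores the fraction of expected tool calls that the agent actually made with matching arguments. `ToolCall` and `tool_correctness` are hypothetical names, and the scoring rule is an assumption chosen for simplicity; the real metric's rules may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolCall:
    name: str
    args: Dict[str, str]


def tool_correctness(expected: List[ToolCall], actual: List[ToolCall]) -> float:
    """Fraction of expected calls found in the actual calls (name and args)."""
    if not expected:
        return 1.0  # nothing was required, so nothing can be wrong
    matched = sum(1 for call in expected if call in actual)
    return matched / len(expected)


expected = [
    ToolCall("search", {"query": "capital of France"}),
    ToolCall("calculator", {"expr": "2+2"}),
]
actual = [
    ToolCall("search", {"query": "capital of France"}),
    ToolCall("calculator", {"expr": "2*2"}),  # right tool, wrong argument
]

score = tool_correctness(expected, actual)
print(score)  # 0.5: one of the two expected calls matched
```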

RAG Metrics:

Answer Relevancy — measure how relevant the output is to the input
Faithfulness — evaluate whether the output factually aligns with the retrieval context
Contextual Recall — measure how well the retrieval context aligns with the expected output
Contextual Precision — evaluate whether relevant nodes are ranked higher
RAGAS — average of answer relevancy, faithfulness, contextual precision, and contextual recall
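
The composite RAGAS score described above is a plain average of the four component scores. A worked example with made-up scores (the numbers are illustrative, not measured results):

```python
from statistics import mean

# Hypothetical per-metric scores for a single test case.
scores = {
    "answer_relevancy": 0.9,
    "faithfulness": 0.8,
    "contextual_precision": 0.7,
    "contextual_recall": 0.6,
}

# (0.9 + 0.8 + 0.7 + 0.6) / 4 = 3.0 / 4, which is approximately 0.75
ragas_score = mean(scores.values())
print(round(ragas_score, 2))
```

Because it is an average, a strong score on one component can mask a weak score on another, so it is worth checking the individual metrics when the composite sits near the threshold.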

Multi-Turn Metrics:

Knowledge Retention — evaluate whether the chatbot retains factual information throughout a conversation
Conversation Completeness — measure whether the chatbot satisfies user needs throughout a conversation
Turn Relevancy — evaluate whether the chatbot generates consistently relevant responses

6. CI/CD Integration: Blocking Bad Outputs Before Deployment

DeepEval plugs into CI/CD with a single command. Add it to your GitHub Actions workflow, and every push triggers your agent test suite before anything ships.

GitHub Actions Workflow Example:

name: LLM Agent Tests

on:
  push:
    branches: [main]  # adjust to your default branch
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install deepeval
          pip install -r requirements.txt
      - name: Run agent tests
        run: deepeval test run test_agent.py

Each parametrized test invocation becomes one LangChain run; failing metrics fail the test, which fails the build. This creates a quality gate that prevents low-performing or hallucinating agents from reaching production.
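
The gating logic itself is simple enough to sketch in plain Python. The example below is an illustration under stated assumptions, not how DeepEval or Pytest is implemented: `run_quality_gate`, `toy_agent`, and `exact_metric` are hypothetical, and the 0.7 threshold is arbitrary. It shows the chain from a failing score to a non-zero exit code, which is what makes a CI job fail.

```python
from typing import Callable, List, Tuple

Golden = Tuple[str, str]  # (input, expected_output)


def run_quality_gate(
    goldens: List[Golden],
    agent: Callable[[str], str],
    metric: Callable[[str, str], float],
    threshold: float = 0.7,
) -> int:
    """Score every golden; return a process exit code (0 = pass, 1 = fail)."""
    failures = []
    for prompt, expected in goldens:
        score = metric(agent(prompt), expected)
        if score < threshold:
            failures.append((prompt, score))
    for prompt, score in failures:
        print(f"FAIL {prompt!r}: score={score:.2f} < {threshold}")
    return 1 if failures else 0


def toy_agent(prompt: str) -> str:
    """Stand-in agent that only knows one answer."""
    return {"capital of France?": "Paris"}.get(prompt, "I don't know")


def exact_metric(output: str, expected: str) -> float:
    """Stand-in metric: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


goldens = [("capital of France?", "Paris"), ("capital of Japan?", "Tokyo")]
exit_code = run_quality_gate(goldens, toy_agent, exact_metric)
# A CI entry point would finish with: raise SystemExit(exit_code)
```

With a test runner, this bookkeeping is handled for you: each parametrized case is reported separately, and any failure produces the non-zero exit status.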

Running Tests in Scripts:

For more flexibility, you can use EvaluationDataset with evals_iterator:

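from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens=[...])

# `agent` and `handler` as in the earlier examples; check this against your DeepEval version
for golden in dataset.evals_iterator():
    agent.invoke(
        {"input": golden.input},
        config={"callbacks": [handler]}
    )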

7. Advanced Patterns: Subagent Evaluation and Customization

DeepEval supports advanced testing patterns for complex agent architectures.

Evaluating Subagents:

When your LangChain app uses nested agents or subagents invoked as tools, you can evaluate them in isolation using next_agent_span:

from deepeval import assert_test
from deepeval.metrics import TaskCompletionMetric
from deepeval.integrations.langchain import CallbackHandler, next_agent_span

def test_subagent():
    handler = CallbackHandler()

    with next_agent_span(metrics=[TaskCompletionMetric()]):
        # This agent span will be evaluated separately
        result = main_agent.invoke(
            {"input": "Research and summarize"},
            config={"callbacks": [handler]}
        )

    assert_test(handler)

Customizing Trace and Span Data:

Customization happens at the callback or span-staging boundary:

# Trace-level defaults
handler = CallbackHandler(
    name="my_agent_run",
    tags=["production", "v2"],
    metadata={"environment": "staging"},
    thread_id="user-123",
    user_id="alice"
)

# Component-level staging
with next_llm_span(
    metrics=[AnswerRelevancyMetric()],
    metadata={"model": "gpt-4o-mini"}
):
    result = agent.invoke(...)

Using the @observe Decorator:

When a LangChain call is part of a larger operation, decorate the outer function with `@observe` to nest LangChain spans under your observed span.

What Undercode Says:

DeepEval turns LLM testing from a black-box guessing game into a structured, repeatable process. The outputs themselves stay nondeterministic, but Pytest-style tests that assert on metric scores rather than exact strings give production AI applications a stable pass/fail signal. By treating LLM evaluation as a first-class testing concern, teams can apply the same quality assurance rigor to AI that they apply to traditional software.

The two-level testing strategy (end-to-end and component-level) is particularly powerful. When a test fails, you can see whether it’s a planning failure (agent level) or a generation failure (LLM level). This granularity reduces debugging time and enables targeted improvements. Because the evaluation tooling runs on your own infrastructure, it also addresses the privacy concerns that often block enterprise AI adoption, as long as the judge model is hosted there too.

The integration with LangChain through the CallbackHandler is elegant—it requires minimal code changes to existing agent implementations. The per-call instrumentation model gives developers fine-grained control over which runs are evaluated, making it suitable for both development iteration and production monitoring. With support for parallel test execution across multiple processes and a results dashboard on Confident AI, DeepEval scales from individual developer machines to enterprise CI/CD pipelines.

Prediction:

+1 DeepEval and similar frameworks will become the industry standard for LLM testing within 18–24 months. As AI agents move from prototypes to production systems, the demand for reliable testing frameworks will skyrocket. The Pytest-like syntax lowers the barrier to entry, accelerating adoption across development teams.
+1 The separation of end-to-end and component-level metrics will enable a new category of AI observability tools. Teams will gradually shift from post-hoc monitoring to pre-deployment validation, catching hallucinations and planning errors before they reach users.
-1 Organizations that fail to implement structured LLM testing will face increasing reputational and financial risks. High-profile AI failures—from chatbots providing false information to agents making incorrect tool calls—will become more costly as AI systems take on more critical responsibilities. The gap between teams that test rigorously and those that don’t will widen significantly.

▶️ Related Video:

https://www.youtube.com/watch?v=3fdggLU66VY

IT/Security Reporter URL:

Reported By: Sumanth077 Pytest – Hackers Feeds