Type-safe LLM agents with PydanticAI

Machine Learning
Python
Author

Paul Simmering

Published

December 16, 2024

PydanticAI is a new agent framework by the company behind Pydantic, the popular data validation library. Pydantic has transformed how I write Python, so I'm excited for their take on agents. In this article I'll walk through an example app and comment on my experience developing with PydanticAI.

PydanticAI is in beta. This article is based on version 0.0.13. Code examples may not work with future versions. Limitations that are mentioned may be lifted in future versions.

The term “agent” in the context of LLMs refers to a while loop that calls an LLM to solve a problem. The LLM may be equipped with tools, meaning functions that it can supply arguments to and receive results from. To cut through the marketing hype, I suggest just reading the code for PydanticAI’s Agent.run() method.
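For intuition, the core of such a loop can be sketched in a few lines. This is my own simplification with a made-up llm callable and tools dict, not PydanticAI's actual implementation:

def run_agent(llm, tools: dict, user_prompt: str) -> str:
    # Simplified agent loop: call the model, run any requested tool,
    # feed the result back, and stop once the model gives a final answer.
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        reply = llm(messages)  # returns either a final text or a tool call request
        if reply.get("tool_call") is None:
            return reply["text"]
        tool = tools[reply["tool_call"]["name"]]
        result = tool(**reply["tool_call"]["args"])
        messages.append({"role": "tool", "content": str(result)})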

As an agent framework, PydanticAI lets developers define workflows wherein an LLM interprets a user’s query and can use tools in multiple steps to answer the question or perform a task. Type safety is a big deal in agent development - the LLM has to call tools with the correct arguments and the tools have to return the correct data type. PydanticAI brings the type safety of Pydantic to this space. This also speeds up development, because type checkers like mypy and pyright can catch errors before the code is run.
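As a toy illustration (a hypothetical snippet, not part of the example app below): because run results carry the declared result type, misuse is flagged by mypy or pyright before anything runs.

from pydantic_ai import Agent

# Hypothetical agent whose result_type is int, so result.data is typed as int.
number_agent = Agent(model="groq:llama-3.3-70b-versatile", result_type=int)


def shout_answer(question: str) -> str:
    result = number_agent.run_sync(question)
    # return result.data.upper()  # flagged by mypy/pyright: "int" has no attribute "upper"
    return str(result.data)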

In addition to type safety, PydanticAI offers:

  • Structured, validated responses defined with Pydantic models
  • A dependency injection system for passing resources such as database connections to tools
  • A single interface to multiple model providers (see the KnownModelName documentation)
  • Streaming responses and async tool calling
  • Utilities for testing and evaluation

Example app: Market research knowledge manager

Large companies conduct market research to understand their customers, competition and market trends. Over time, they amass a library of thousands of reports, tables and transcripts. Knowledge management becomes a challenge, because teams are not aware of existing research.

Let’s build an example agent that answers questions based on information in a database with multiple tables. Our final agentic RAG system will enable an interaction like this:

graph LR
    %% Define styles
    classDef default fill:#ffffff,stroke:#90cdf4,stroke-width:2px
    classDef highlight fill:#fdf2f8,stroke:#ed64a6,stroke-width:3px
    classDef api fill:#ffffff,stroke:#4fd1c5,stroke-width:2px

    User([User]) --> |"What reports do we have about electric vehicles?"| Agent
    Agent --> |"Analyze user query"| Groq[LLM Provider Groq]
    Groq --> |"Tool selection"| Agent
    
    Agent --> |"Search topic='Automotive'"| Tool1[tool: search_reports_by_field]
    Agent --> |"Search 'electric vehicles'"| Tool2[tool: search_reports_by_title_similarity]
    
    Tool1 --> |"Query"| DB[(DuckDB)]
    Tool2 --> |"Vector similarity"| DB
    
    Tool1 --> |"Found 2 reports"| Agent
    Tool2 --> |"Found similar titles"| Agent
    
    Agent --> |"There are 2 reports about EVs:
    1. German EV Market Analysis 2024
    2. EV Adoption in Asia"| User

    %% Apply styles
    class Groq api
    class DB highlight

    %% Links between nodes
    linkStyle default stroke:#64748b,stroke-width:2px

Database

I’m using DuckDB to create an in-memory database which will be made available to the agent.

import duckdb

con = duckdb.connect()
1. Create a local database. In production you'd want to use a persistent database.

I’ll insert a set of reports into the database. The data included is fictional and was generated by an LLM. The data consists of 40 reports like this:

import polars as pl
from great_tables import GT

reports = pl.read_csv("data/reports.csv")
GT(reports.head(5))
| id | year | institute | country | topic | title |
|----|------|-----------|---------|-------|-------|
| 1 | 2018 | Research DNA GmbH | Germany | Automotive | Global Electric Vehicle Market Outlook 2018-2023 |
| 2 | 2018 | Market Insights Inc. | USA | Healthcare | Digital Health Market Size and Growth Analysis |
| 3 | 2018 | Global Trends Research | UK | FMCG | Premium Beauty and Personal Care Market Trends |
| 4 | 2018 | Data Analytics Group | Canada | Electronics | Smartphone Industry Competitive Analysis |
| 5 | 2018 | Innovative Solutions Ltd. | Australia | Insurance | Insurtech Market Landscape and Opportunities |

To make the title searchable, I’ll embed it using an OpenAI embedding endpoint. The result will be stored in a new column with 1536 dimensions.

from openai import OpenAI
from tqdm import tqdm


def embed_text(text: str) -> list[float]:
    client = OpenAI()
    model = "text-embedding-3-small"
    return client.embeddings.create(input=text, model=model).data[0].embedding


title_embeddings = [embed_text(title) for title in tqdm(reports["title"])]

reports = reports.with_columns(
    pl.Series(
        name="title_embedding",
        values=title_embeddings,
        dtype=pl.Array(inner=pl.Float64, shape=1536),
    )
)

Now, I’ll insert the data including the embeddings into the database. The embeddings are stored in a fixed-size ARRAY column. The co-location of the structured data and the embeddings in the same table is convenient for our use case.

con.execute(
    """
    CREATE OR REPLACE TABLE reports AS
    SELECT
        id::integer AS id,
        year::integer AS year,
        institute::varchar AS institute,
        country::varchar AS country,
        topic::varchar AS topic,
        title::varchar AS title,
        title_embedding::float[1536] AS title_embedding
    FROM reports;
    """
)
1. This works because DuckDB can read from a Polars DataFrame.
con.execute("INSTALL vss;")
con.execute("LOAD vss;")

con.execute(
    "CREATE INDEX titles_hnsw_index ON reports USING HNSW(title_embedding) WITH (metric='cosine');"
)

I also create a hierarchical navigable small world (HNSW) index on the title embeddings. This enables approximate nearest neighbor search in O(log n). It’s enabled by the vss extension. Note that persistence to disk is experimental, so I wouldn’t recommend it for production yet.
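As a quick sanity check, the index can be queried directly with an embedded search string, using the same query pattern as the search tool defined later. The search phrase here is arbitrary:

# Embed an arbitrary phrase and fetch the three closest report titles.
query_embedding = embed_text("electric vehicles")
query_embedding_str = "[" + ",".join(map(str, query_embedding)) + "]"
con.execute(
    """
    SELECT title
    FROM reports
    ORDER BY array_distance(title_embedding, ?::FLOAT[1536])
    LIMIT 3;
    """,
    [query_embedding_str],
).fetchall()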

Agent

Let’s set up an agent powered by the Groq inference API. It serves a range of open source models. Specifically, I’ll use the llama-3.3-70b-versatile model released by Meta on December 6th. Artificial Analysis has a detailed report showing that it advanced the speed-accuracy trade-off. The model has tool calling capabilities, which are critical for our use case.

from pydantic_ai import Agent

agent = Agent(
    model="groq:llama-3.3-70b-versatile",
    system_prompt="You are a market research expert and answer questions using a database of reports.",
)

result = agent.run_sync("Who are you?")
print(result.data)
1. See the KnownModelName documentation for a list of supported models.
I am a market research expert, providing insights and analysis based on a vast database of reports and studies. My expertise spans various industries, including consumer goods, technology, healthcare, and finance. I can help answer questions, provide data-driven insights, and offer market trends and analysis to support business decisions.

My database includes reports from reputable sources, such as market research firms, academic institutions, and industry associations. I can access a wide range of topics, including market size and growth, consumer behavior, competitor analysis, and emerging trends.

What specific area of market research would you like to explore?

Tools

The agent’s job will be to answer questions based on the reports in the database. It needs a way to access the database. We can give it a tool, meaning a function that it can call, to query the database. First, it needs a database connection.

from dataclasses import dataclass


@dataclass
class AgentDependencies:
    db: duckdb.DuckDBPyConnection


deps = AgentDependencies(db=con)
1. A dataclass that contains dependencies needed by the agent. Additional dependencies can be added as needed.
2. This creates the dependencies object with the connection to the in-memory DuckDB database.

Next, let’s give the agent a tool to search the database of reports. Based on the user’s question, it can choose which field to search. The result is always a markdown-formatted table with one row per report.

import json
from typing import Literal
from pydantic_ai import RunContext
from pydantic import validate_call, Field


def df_to_str(df: pl.DataFrame) -> str:
    return json.dumps(df.to_dicts())


@agent.tool
@validate_call(config={"arbitrary_types_allowed": True})
def search_reports_by_field(
    ctx: RunContext[AgentDependencies],
    field: Literal["id", "year", "institute", "country", "topic"],
    value: str = Field(
        description="The value to search for in the field. Case insensitive."
    ),
) -> str:
    base_query = """
        SELECT id, year, institute, country, topic, title 
        FROM reports 
        WHERE {}
    """

    if field in ["id", "year"]:
        value = int(value)
        where_clause = f"{field} = ?"
    else:
        where_clause = f"lower({field}) = lower(?)"

    final_query = base_query.format(where_clause)
    df = ctx.deps.db.execute(final_query, [value]).pl()

    if df.shape[0] == 0:
        return "No reports found. Try a different field or value, or use the title similarity tool."
    return df_to_str(df)
1. A record-oriented JSON representation of the data frame is easy for an LLM to understand.
2. Use the @agent.tool decorator to register the function as a tool.
3. Use the @validate_call decorator to enable type checking of the function arguments. This makes sure that only the fields present in the database can be used. arbitrary_types_allowed is required because the RunContext type is not a standard type.
4. The RunContext type hint is required for the tool to access the dependencies.
5. Tell the model about the available fields in the database and validate that only those are selected.
6. The database query returns a Polars DataFrame.
7. Provide a clear message if no reports are found and hint that another function (introduced later) can be used for fuzzy matching.

This lets the agent execute searches based on the exact match of a field.

deps = AgentDependencies(db=con)
result = agent.run_sync("Which reports do we have from Germany?", deps=deps)
print(result.data)
We have four reports from Germany:

1. "Global Electric Vehicle Market Outlook 2018-2023" by Research DNA GmbH (2018) - Automotive topic
2. "Digital Advertising Spend Analysis" by Tech Innovations Ltd. (2020) - Media topic
3. "Beverage Market Competitive Analysis" by Research DNA GmbH (2022) - FMCG topic
4. "Medical Imaging Equipment Market Size" by Tech Innovations Ltd. (2024) - Healthcare topic

Let me know if you'd like more information about any of these reports.

It works: the agent found the four reports from Germany. Let's check the exact tool call:

agent.last_run_messages
[ModelRequest(parts=[SystemPromptPart(content='You are a market research expert and answer questions using a database of reports.', part_kind='system-prompt'), UserPromptPart(content='Which reports do we have from Germany?', timestamp=datetime.datetime(2024, 12, 18, 17, 38, 58, 663721, tzinfo=datetime.timezone.utc), part_kind='user-prompt')], kind='request'),
 ModelResponse(parts=[ToolCallPart(tool_name='search_reports_by_field', args=ArgsJson(args_json='{"field": "country", "value": "Germany"}'), tool_call_id='call_be61', part_kind='tool-call')], timestamp=datetime.datetime(2024, 12, 18, 17, 38, 58, tzinfo=datetime.timezone.utc), kind='response'),
 ModelRequest(parts=[ToolReturnPart(tool_name='search_reports_by_field', content='[{"id": 1, "year": 2018, "institute": "Research DNA GmbH", "country": "Germany", "topic": "Automotive", "title": "Global Electric Vehicle Market Outlook 2018-2023"}, {"id": 12, "year": 2020, "institute": "Tech Innovations Ltd.", "country": "Germany", "topic": "Media", "title": "Digital Advertising Spend Analysis"}, {"id": 21, "year": 2022, "institute": "Research DNA GmbH", "country": "Germany", "topic": "FMCG", "title": "Beverage Market Competitive Analysis"}, {"id": 32, "year": 2024, "institute": "Tech Innovations Ltd.", "country": "Germany", "topic": "Healthcare", "title": "Medical Imaging Equipment Market Size"}]', tool_call_id='call_be61', timestamp=datetime.datetime(2024, 12, 18, 17, 38, 59, 27429, tzinfo=datetime.timezone.utc), part_kind='tool-return')], kind='request'),
 ModelResponse(parts=[TextPart(content='We have four reports from Germany:\n\n1. "Global Electric Vehicle Market Outlook 2018-2023" by Research DNA GmbH (2018) - Automotive topic\n2. "Digital Advertising Spend Analysis" by Tech Innovations Ltd. (2020) - Media topic\n3. "Beverage Market Competitive Analysis" by Research DNA GmbH (2022) - FMCG topic\n4. "Medical Imaging Equipment Market Size" by Tech Innovations Ltd. (2024) - Healthcare topic\n\nLet me know if you\'d like more information about any of these reports.', part_kind='text')], timestamp=datetime.datetime(2024, 12, 18, 17, 38, 59, tzinfo=datetime.timezone.utc), kind='response')]

Here, the model correctly translated the user’s question into the tool call with the arguments {"field": "country", "value": "Germany"}.

To make it easier to evaluate the agent's output and to make its results usable by other tools, we can create a response model that includes the ids of the identified reports.

from pydantic import BaseModel


class AgentResponse(BaseModel):
    text: str = Field(
        description="Answer to the user's question in informal language. Don't include the report ids."
    )
    relevant_report_ids: set[int] = Field(
        description="Set of 'id' integer values of the reports that are relevant to the user's question. Only include ids retrieved by the search tools. Never make up ids. Not all ids returned by the search tools are relevant."
    )


typed_agent = Agent(
    model="groq:llama-3.3-70b-versatile",
    system_prompt="You are a market research expert and answer questions using a database of reports.",
    result_type=AgentResponse,
    result_retries=3,
)
1. This description fixes a common mistake: the LLM would answer with made-up ids like 123, 456 when it didn't find any reports.
2. Give the agent a chance to retry if it doesn't return a valid structured output on the first try.

The AgentResponse model is used to validate the agent’s output. It will always include a set of integer ids. In an app, these could be used to provide links to the reports.

result = typed_agent.run_sync("Which reports do we have from Germany? Tell me their titles and ids", deps=deps)
print(result.data)
text='The reports from Germany are titled Global Electric Vehicle Market Outlook 2018-2023, Digital Advertising Spend Analysis, Beverage Market Competitive Analysis and Medical Imaging Equipment Market Size.' relevant_report_ids={32, 1, 12, 21}

Now we have an agent that returns a type-checked structured response. Note that I've omitted the re-registration of the tool on the new agent instance for brevity; one way to do it is sketched below.
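Since @agent.tool registers the function and (in 0.0.13) returns it unchanged, the same function object can simply be registered on the second agent as well. Treat this as a sketch of that assumption rather than an officially documented pattern:

# Reuse the tool defined above by registering it on the typed agent too
# (assumes @agent.tool returned the original function unchanged).
typed_agent.tool(search_reports_by_field)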

However, requests may not exactly match the fields in the database, so let’s also add the ability to search for similar titles.

@typed_agent.tool
@validate_call(config={"arbitrary_types_allowed": True})
def search_reports_by_title_similarity(
    ctx: RunContext[AgentDependencies],
    title: str = Field(
        description="The title of the report to search for with vector similarity."
    ),
) -> str:
    # Embed the title given by the user
    try:
        title_embedding = embed_text(title)
    except Exception as e:
        return f"Error embedding title: {e}"

    # Search for similar titles
    title_embedding_str = "[" + ",".join(map(str, title_embedding)) + "]"
    query = """
        SELECT id, year, institute, country, topic, title
        FROM reports
        ORDER BY array_distance(title_embedding, ?::FLOAT[1536])  
        LIMIT 5;
    """
    df = ctx.deps.db.execute(query, [title_embedding_str]).pl()

    return (
        df_to_str(df)
        + "\n\n These reports have titles similar to the query, but may not be relevant to the user's question."
    )
1. The title is embedded and formatted as a DuckDB array literal.
2. The array_distance function computes the distance between the query embedding and each title embedding; ordering by it returns the closest titles.
3. The note about relevance is added to make clear that these are merely the most similar titles, not necessarily relevant ones. Otherwise the agent would return all reports with similar titles.

Let’s ask the agent about a topic that is not in the database to see how it uses the title similarity tool.

result = agent.run_sync("Do we have reports about quantum computing?", deps=deps)
print(result.data)
<function=search_reports_by_field {"field": "topic", "value": "quantum computing"}</function>

Not quite what I hoped for: the model returned its raw tool-call syntax as text instead of actually invoking the tool (and this call went to the original agent, which only has the field-search tool registered). Tool calling with open models can still be brittle, which is one more reason to use structured outputs, retries and evals.

Evals

Automated evaluations are necessary to ensure that an agent is working as expected, and to switch out models, prompts and tools without breaking the app. PydanticAI offers tools for testing the code (without running a model) and for evaluations. Let’s set up a simple evaluation that checks whether the agent correctly answers questions about the database. We measure the precision (how many of the results found are relevant) and recall (how many of the relevant results are found).

examples = [
    {
        "question": "How many reports do we have from Germany?",
        "relevant_report_ids": {1, 12, 21, 32},
    },
    {
        "question": "For which countries to we have reports mentioning electric vehicles?",
        "relevant_report_ids": {1, 25},
    },
    {
        "question": "What reports do we have about the gaming industry?",
        "relevant_report_ids": {22, 30},
    },
    {
        "question": "What reports do we have about the pet care industry?",
        "relevant_report_ids": {27},
    },
    {
        "question": "Which reports discuss cyber security insurance?",
        "relevant_report_ids": {29},
    },
    {
        "question": "What healthcare reports were published in 2024?",
        "relevant_report_ids": {32, 38},
    },
    {
        "question": "Which reports are about the smartphone or mobile phone market?",
        "relevant_report_ids": {4, 40},
    },
    {
        "question": "What reports do we have from Market Insights Inc.?",
        "relevant_report_ids": {2, 22},
    },
]
from collections import Counter


def eval_example(
    example: dict[str, str | set[int]], print_errors: bool = False
) -> dict[str, int]:
    result = typed_agent.run_sync(example["question"], deps=deps)
    act, exp = result.data.relevant_report_ids, example["relevant_report_ids"]
    metrics = Counter(
        {
            "tp": len(act & exp),
            "fp": len(act - exp),
            "fn": len(exp - act),
        }
    )

    if print_errors and (metrics["fp"] > 0 or metrics["fn"] > 0):
        print("Error in evaluation:")
        print(f"  Question: {example['question']}")
        print(f"  Found: {act}")
        print(f"  Expected: {exp}")

    return metrics


metric_totals = Counter()

for example in tqdm(examples):
    metrics = eval_example(example)
    metric_totals += metrics

precision = metric_totals["tp"] / (metric_totals["tp"] + metric_totals["fp"])
recall = metric_totals["tp"] / (metric_totals["tp"] + metric_totals["fn"])

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
1. Use set operations to compare the expected and found ids.
2. This should be parallelized if the number of examples is large.
3. Precision and recall could also be combined into the F1 score, which is their harmonic mean.
Precision: 1.00, Recall: 0.62
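As noted in the code annotations, the two metrics can be folded into a single F1 score:

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1: {f1:.2f}")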

This is a joint evaluation of the agent, the tools and the database. What’s missing is an evaluation of the generated text. In a real RAG system, you’d also want separate evaluations of retrieval and result ranking.
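For example, a retrieval-only check could bypass the agent entirely and measure whether vector search over the titles surfaces the relevant ids within the top k. The recall_at_k helper below is a hypothetical sketch built from the pieces defined above, not part of the original code:

def recall_at_k(question: str, relevant_ids: set[int], k: int = 5) -> float:
    # Embed the question and fetch the k nearest report titles straight from DuckDB.
    emb_str = "[" + ",".join(map(str, embed_text(question))) + "]"
    rows = con.execute(
        f"""
        SELECT id
        FROM reports
        ORDER BY array_distance(title_embedding, ?::FLOAT[1536])
        LIMIT {int(k)};
        """,
        [emb_str],
    ).fetchall()
    retrieved = {row[0] for row in rows}
    return len(retrieved & relevant_ids) / len(relevant_ids)


print(recall_at_k("electric vehicles", {1, 25}))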

Discussion

Comparison to other libraries

PydanticAI is a late entrant to the agent framework space. It joins several established libraries including:

| Library | Description | GitHub Stars ⭐ |
|---------|-------------|----------------|
| AutoGPT | AI automation platform with frontend, server and monitoring | 169k |
| LangChain | Package ecosystem for LLM applications | 96k |
| autogen | Multi-agent AI chat framework by Microsoft | 36k |
| crewAI | Framework for orchestrating role-based AI agents | 22k |
| swarm | Educational framework for multi-agent apps by OpenAI | 17k |
| phidata | Multi-agent backend and chat frontend | 16k |

There are dozens of other libraries with fewer stars. In addition, there are libraries specialized for RAG, such as LlamaIndex and Haystack. The competitive landscape shows no signs of consolidation or slowing down.

Development team

Pydantic Services, the company behind Pydantic, raised a $12.5M Series A in October 2024. This is great news for the project: funding pays for full-time developers. It also raises the question of how Pydantic will make money, and the answer is Logfire subscriptions. This is a good model that gives the project long-term stability and follows the lead of LangChain with its commercial product, LangSmith. I just hope that the integration remains optional. While Logfire looks great, my team already uses Weave by Weights & Biases, and having to switch would be a barrier to adopting PydanticAI.

Review

Pros ✅

  • Sensible abstractions that don’t get in the way and enable coding in a Pythonic style.
  • Type safety and integration with Pydantic.
  • Support for streaming responses and async tool calling. This is critical for live chat applications.
  • Pydantic is familiar to many Python developers who will have an easier time learning PydanticAI.
  • High quality documentation and examples that also cover tests and evals.
  • Strong reputation of the Pydantic team and high responsiveness in Github issues.

Cons ❌

  • Launches into a competitive market with many established libraries.
  • Early stage of development, so expect breaking changes.
  • Many concepts to learn, though mild compared to LangChain, which invented its own domain-specific language, LCEL.
  • No support for multimodal (image, audio, video) inputs and outputs yet, but it's planned.
  • Economic incentives to lock users into Logfire. This hasn’t happened but is a risk.

I’m looking forward to an opportunity to build a full-scale application with PydanticAI. The best place to get started is the PydanticAI documentation.

Not every app needs an agent framework

A lot can be accomplished by single API calls or by specifying a fixed sequence of calls. That would also work for the example app shown in this article. Unless you truly need the flexibility of an agent framework, you may be better off with plain Python. If all you need is Pydantic + LLM calls, you can use instructor. OpenAI even supports structured outputs based on Pydantic models without an additional library.
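As a minimal sketch of the latter option (the model name and prompt here are placeholders), the openai package can parse responses directly into a Pydantic model:

from openai import OpenAI
from pydantic import BaseModel


class ReportAnswer(BaseModel):
    text: str
    relevant_report_ids: list[int]


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Which reports do we have from Germany?"}],
    response_format=ReportAnswer,
)
print(completion.choices[0].message.parsed)  # a validated ReportAnswer instance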


Preview photo by MagicPattern on Unsplash