Pydantic AI is a new agent framework by the company behind Pydantic, the popular data validation library. Pydantic has transformed how I write Python, so I’m excited for their take on agents. In this article I’ll walk through an example app and comment on my experience developing with PydanticAI.
PydanticAI is in beta. This article is based on version 0.0.13. Code examples may not work with future versions. Limitations that are mentioned may be lifted in future versions.
The term “agent” in the context of LLMs refers to a while loop that calls an LLM to solve a problem. The LLM may be equipped with tools, meaning functions that it can supply arguments to and receive results from. To cut through the marketing hype, I suggest just reading the code for PydanticAI’s Agent.run()
method.
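To make that concrete, here is a minimal conceptual sketch of such a loop. This is my own illustration, not PydanticAI’s actual implementation; `call_llm` and `tools` are stand-ins for whatever model client and tool registry you use:

```python
from typing import Any, Callable


def run_agent(
    user_prompt: str,
    call_llm: Callable[[list[dict]], dict],
    tools: dict[str, Callable[..., Any]],
) -> str:
    """Conceptual agent loop: call the LLM until it stops requesting tools."""
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        response = call_llm(messages)  # one LLM call per iteration
        if "tool_call" not in response:
            return response["text"]  # final answer for the user
        name = response["tool_call"]["name"]
        args = response["tool_call"]["args"]
        result = tools[name](**args)  # run the requested tool
        messages.append({"role": "tool", "content": str(result)})
```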
As an agent framework, PydanticAI lets developers define workflows wherein an LLM interprets a user’s query and can use tools in multiple steps to answer the question or perform a task. Type safety is a big deal in agent development - the LLM has to call tools with the correct arguments and the tools have to return the correct data type. PydanticAI brings the type safety of Pydantic to this space. This also speeds up development, because type checkers like mypy and pyright can catch errors before the code is run.
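As a small sketch of my own (not from the article) of the kind of mistake a type checker catches: because the agent below declares `result_type=int`, `result.data` is statically typed as an `int`, so misusing it is flagged by mypy or pyright before anything runs. The model name is just an example.

```python
from pydantic_ai import Agent

# Declaring result_type makes the agent generic over its result type.
counter_agent = Agent("openai:gpt-4o-mini", result_type=int)


def ask() -> None:
    result = counter_agent.run_sync("How many continents are there?")
    print(result.data + 1)  # fine: int + int
    # print(result.data.upper())  # pyright/mypy error: "int" has no attribute "upper"
```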
In addition to type safety, PydanticAI offers:
- streaming responses, including structured responses
- support for async tool calling
- support for multiple LLM providers, including OpenAI, Groq, Anthropic, Gemini, Ollama and Mistral, with more to come
- optional integration with Logfire, a commercial service by the Pydantic team for logging LLM calls
Example app: Market research knowledge manager
Large companies conduct market research to understand their customers, competition and market trends. Over time, they amass a library of thousands of reports, tables and transcripts. Knowledge management becomes a challenge, because teams are not aware of existing research.
Let’s build an example agent that answers questions based on information in a database with multiple tables. Our final agentic RAG system will enable interactions like the ones shown in the examples below.
Database
I’m using DuckDB to create an in-memory database which will be made available to the agent.
```python
import duckdb

con = duckdb.connect()  # (1)
```

1. Create a local database. In production you’d want to use a persistent database.
I’ll insert a set of reports into the database. The data included is fictional and was generated by an LLM. The data consists of 40 reports like this:
```python
import polars as pl
from great_tables import GT

reports = pl.read_csv("data/reports.csv")
GT(reports.head(5))
```
id | year | institute | country | topic | title |
---|---|---|---|---|---|
1 | 2018 | Research DNA GmbH | Germany | Automotive | Global Electric Vehicle Market Outlook 2018-2023 |
2 | 2018 | Market Insights Inc. | USA | Healthcare | Digital Health Market Size and Growth Analysis |
3 | 2018 | Global Trends Research | UK | FMCG | Premium Beauty and Personal Care Market Trends |
4 | 2018 | Data Analytics Group | Canada | Electronics | Smartphone Industry Competitive Analysis |
5 | 2018 | Innovative Solutions Ltd. | Australia | Insurance | Insurtech Market Landscape and Opportunities |
To make the title searchable, I’ll embed it using an OpenAI embedding endpoint. The result will be stored in a new column with 1536 dimensions.
```python
from openai import OpenAI
from tqdm import tqdm


def embed_text(text: str) -> list[float]:
    client = OpenAI()
    model = "text-embedding-3-small"
    return client.embeddings.create(input=text, model=model).data[0].embedding


title_embeddings = [embed_text(title) for title in tqdm(reports["title"])]
```
```python
reports = reports.with_columns(
    pl.Series(
        name="title_embedding",
        values=title_embeddings,
        dtype=pl.Array(inner=pl.Float64, shape=1536),
    )
)
```
Now, I’ll insert the data including the embeddings into the database. The embeddings are stored in a fixed-size ARRAY
column. The co-location of the structured data and the embeddings in the same table is convenient for our use case.
```python
con.execute("""
    CREATE OR REPLACE TABLE reports AS
    SELECT
        id::integer AS id,
        year::integer AS year,
        institute::varchar AS institute,
        country::varchar AS country,
        topic::varchar AS topic,
        title::varchar AS title,
        title_embedding::float[1536] AS title_embedding
    FROM reports;  -- (1)
""")
```

1. This works because DuckDB can read directly from a Polars DataFrame.
"INSTALL vss;")
con.execute("LOAD vss;")
con.execute(
con.execute("CREATE INDEX titles_hnsw_index ON reports USING HNSW(title_embedding) WITH (metric='cosine');"
)
I also create a hierarchical navigable small world (HNSW) index on the title embeddings, provided by the vss extension. This enables approximate nearest neighbor search in O(log n). Note that persistence of the index to disk is experimental, so I wouldn’t recommend it for production yet.
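As a quick sanity check (my own addition, not part of the original walkthrough), we can embed an ad-hoc query string and pull the closest titles with the same `array_distance` ordering that the search tool later in this article uses; the query text is just an example:

```python
# Embed an example query and fetch the three closest report titles.
# The cast to FLOAT[1536] matches the column type created above.
query_embedding = embed_text("electric vehicles")
embedding_str = "[" + ",".join(map(str, query_embedding)) + "]"
print(
    con.execute(
        """
        SELECT title
        FROM reports
        ORDER BY array_distance(title_embedding, ?::FLOAT[1536])
        LIMIT 3;
        """,
        [embedding_str],
    ).fetchall()
)
```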
Agent
Let’s set up an agent powered by the Groq inference API. It serves a range of open source models. Specifically, I’ll use the llama-3.3-70b-versatile
model released by Meta on December 6th. Artificial Analysis has a detailed report showing that it advanced the speed-accuracy trade-off. The model has tool calling capabilities, which are critical for our use case.
```python
from pydantic_ai import Agent

agent = Agent(
    model="groq:llama-3.3-70b-versatile",  # (1)
    system_prompt="You are a market research expert and answer questions using a database of reports.",
)

result = agent.run_sync("Who are you?")
print(result.data)
```

1. See the KnownModelName documentation for a list of supported models.
I am a market research expert, providing insights and analysis based on a vast database of reports and studies. My expertise spans various industries, including consumer goods, technology, healthcare, and finance. I can help answer questions, provide data-driven insights, and offer market trends and analysis to support business decisions.
My database includes reports from reputable sources, such as market research firms, academic institutions, and industry associations. I can access a wide range of topics, including market size and growth, consumer behavior, competitor analysis, and emerging trends.
What specific area of market research would you like to explore?
Tools
The agent’s job will be to answer questions based on the reports in the database. It needs a way to access the database. We can give it a tool, meaning a function that it can call, to query the database. First, it needs a database connection.
```python
from dataclasses import dataclass


@dataclass
class AgentDependencies:  # (1)
    db: duckdb.DuckDBPyConnection


deps = AgentDependencies(db=con)  # (2)
```

1. A dataclass that contains the dependencies needed by the agent. Additional dependencies can be added as needed.
2. This holds the connection to the in-memory DuckDB database.
Next, let’s give the agent a tool to search the database of reports. Based on the user’s question, it can choose which field to search. The result is a JSON list of records, one per matching report.
```python
import json
from typing import Literal

from pydantic_ai import RunContext
from pydantic import validate_call, Field


def df_to_str(df: pl.DataFrame) -> str:
    return json.dumps(df.to_dicts())  # (1)


@agent.tool  # (2)
@validate_call(config={"arbitrary_types_allowed": True})  # (3)
def search_reports_by_field(
    ctx: RunContext[AgentDependencies],  # (4)
    field: Literal["id", "year", "institute", "country", "topic"],  # (5)
    value: str = Field(
        description="The value to search for in the field. Case insensitive."
    ),
) -> str:
    base_query = """
        SELECT id, year, institute, country, topic, title
        FROM reports
        WHERE {}
    """
    if field in ["id", "year"]:
        value = int(value)
        where_clause = f"{field} = ?"
    else:
        where_clause = f"lower({field}) = lower(?)"

    final_query = base_query.format(where_clause)
    df = ctx.deps.db.execute(final_query, [value]).pl()  # (6)

    if df.shape[0] == 0:
        return "No reports found. Try a different field or value, or use the title similarity tool."  # (7)
    return df_to_str(df)
```
1. A record-oriented JSON representation of the data frame is easily understood by an LLM.
2. Use the `@agent.tool` decorator to register the function as a tool.
3. Use the `@validate_call` decorator to enable type checking of the function arguments. This makes sure that only the fields present in the database can be used. `arbitrary_types_allowed` is required because the `RunContext` type is not a standard type.
4. The `RunContext` type hint is required for the tool to access the dependencies.
5. Tell the model about the available fields in the database and validate that only those are selected.
6. The database query returns a Polars DataFrame.
7. Provide a clear message if no reports are found and hint that another tool (which will be introduced later) can be used for fuzzy matching.
This lets the agent execute searches based on the exact match of a field.
```python
deps = AgentDependencies(db=con)
result = agent.run_sync("Which reports do we have from Germany?", deps=deps)
print(result.data)
```
We have four reports from Germany:
1. "Global Electric Vehicle Market Outlook 2018-2023" by Research DNA GmbH (2018) - Automotive topic
2. "Digital Advertising Spend Analysis" by Tech Innovations Ltd. (2020) - Media topic
3. "Beverage Market Competitive Analysis" by Research DNA GmbH (2022) - FMCG topic
4. "Medical Imaging Equipment Market Size" by Tech Innovations Ltd. (2024) - Healthcare topic
Let me know if you'd like more information about any of these reports.
It works: the agent found the four reports from Germany. Let’s check the exact tool call:
```python
agent.last_run_messages
```
[ModelRequest(parts=[SystemPromptPart(content='You are a market research expert and answer questions using a database of reports.', part_kind='system-prompt'), UserPromptPart(content='Which reports do we have from Germany?', timestamp=datetime.datetime(2024, 12, 18, 17, 38, 58, 663721, tzinfo=datetime.timezone.utc), part_kind='user-prompt')], kind='request'),
ModelResponse(parts=[ToolCallPart(tool_name='search_reports_by_field', args=ArgsJson(args_json='{"field": "country", "value": "Germany"}'), tool_call_id='call_be61', part_kind='tool-call')], timestamp=datetime.datetime(2024, 12, 18, 17, 38, 58, tzinfo=datetime.timezone.utc), kind='response'),
ModelRequest(parts=[ToolReturnPart(tool_name='search_reports_by_field', content='[{"id": 1, "year": 2018, "institute": "Research DNA GmbH", "country": "Germany", "topic": "Automotive", "title": "Global Electric Vehicle Market Outlook 2018-2023"}, {"id": 12, "year": 2020, "institute": "Tech Innovations Ltd.", "country": "Germany", "topic": "Media", "title": "Digital Advertising Spend Analysis"}, {"id": 21, "year": 2022, "institute": "Research DNA GmbH", "country": "Germany", "topic": "FMCG", "title": "Beverage Market Competitive Analysis"}, {"id": 32, "year": 2024, "institute": "Tech Innovations Ltd.", "country": "Germany", "topic": "Healthcare", "title": "Medical Imaging Equipment Market Size"}]', tool_call_id='call_be61', timestamp=datetime.datetime(2024, 12, 18, 17, 38, 59, 27429, tzinfo=datetime.timezone.utc), part_kind='tool-return')], kind='request'),
ModelResponse(parts=[TextPart(content='We have four reports from Germany:\n\n1. "Global Electric Vehicle Market Outlook 2018-2023" by Research DNA GmbH (2018) - Automotive topic\n2. "Digital Advertising Spend Analysis" by Tech Innovations Ltd. (2020) - Media topic\n3. "Beverage Market Competitive Analysis" by Research DNA GmbH (2022) - FMCG topic\n4. "Medical Imaging Equipment Market Size" by Tech Innovations Ltd. (2024) - Healthcare topic\n\nLet me know if you\'d like more information about any of these reports.', part_kind='text')], timestamp=datetime.datetime(2024, 12, 18, 17, 38, 59, tzinfo=datetime.timezone.utc), kind='response')]
Here, the model correctly translated the user’s question into the tool call with the arguments `{"field": "country", "value": "Germany"}`.
To make it easier to evaluate the agent’s output and also make its results useable by other tools, we can create a response model that includes the ids of the identified reports.
```python
from pydantic import BaseModel


class AgentResponse(BaseModel):
    text: str = Field(
        description="Answer to the user's question in informal language. Don't include the report ids."
    )
    relevant_report_ids: set[int] = Field(
        description="Set of 'id' integer values of the reports that are relevant to the user's question. Only include ids retrieved by the search tools. Never make up ids. Not all ids returned by the search tools are relevant."  # (1)
    )


typed_agent = Agent(
    model="groq:llama-3.3-70b-versatile",
    system_prompt="You are a market research expert and answer questions using a database of reports.",
    result_type=AgentResponse,
    result_retries=3,  # (2)
)
```

1. This description fixes a common mistake: the LLM would answer with made-up ids like 123, 456 when it didn’t find any reports.
2. Give the agent a chance to retry if it doesn’t return a valid structured output on the first try.
The AgentResponse
model is used to validate the agent’s output. It will always include a set of integer ids. In an app, these could be used to provide links to the reports.
```python
result = typed_agent.run_sync("Which reports do we have from Germany? Tell me their titles and ids", deps=deps)
print(result.data)
```
text='The reports from Germany are titled Global Electric Vehicle Market Outlook 2018-2023, Digital Advertising Spend Analysis, Beverage Market Competitive Analysis and Medical Imaging Equipment Market Size.' relevant_report_ids={32, 1, 12, 21}
Now we have an agent that returns a type-checked structured response. Note that I’ve omitted the re-registration of the tool to the new agent instance for brevity.
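For completeness, the omitted step could look roughly like this (my sketch, not code from the article): since `@agent.tool` is a plain decorator, calling it directly with the function should register the existing tool on the new instance as well.

```python
# Hypothetical sketch: register the existing search function on typed_agent too.
typed_agent.tool(search_reports_by_field)
```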
However, requests may not exactly match the fields in the database, so let’s also add the ability to search for similar titles.
```python
@typed_agent.tool
@validate_call(config={"arbitrary_types_allowed": True})
def search_reports_by_title_similarity(
    ctx: RunContext[AgentDependencies],
    title: str = Field(
        description="The title of the report to search for with vector similarity."
    ),
) -> str:
    # Embed the title given by the user
    try:
        title_embedding = embed_text(title)
    except Exception as e:
        return f"Error embedding title: {e}"

    # Search for similar titles
    title_embedding_str = "[" + ",".join(map(str, title_embedding)) + "]"  # (1)
    query = """
        SELECT id, year, institute, country, topic, title
        FROM reports
        ORDER BY array_distance(title_embedding, ?::FLOAT[1536])
        LIMIT 5;
    """
    df = ctx.deps.db.execute(query, [title_embedding_str]).pl()  # (2)

    return (
        df_to_str(df)
        + "\n\n These reports have titles similar to the query, but may not be relevant to the user's question."  # (3)
    )
```

1. The title is embedded and formatted as a DuckDB array.
2. The `array_distance` function orders the reports by the distance between the query embedding and the title embeddings in the database.
3. The note about relevance is added to make it clear that these are just the most similar titles, not necessarily relevant ones. Otherwise the agent would return all reports with similar titles.
Let’s ask the agent about a topic that is not in the database to see how it uses the title similarity tool.
```python
result = agent.run_sync("Do we have reports about quantum computing?", deps=deps)
print(result.data)
```
<function=search_reports_by_field {"field": "topic", "value": "quantum computing"}</function>
That worked as expected.
Evals
Automated evaluations are necessary to ensure that an agent is working as expected, and to switch out models, prompts and tools without breaking the app. PydanticAI offers tools for testing the code (without running a model) and for evaluations. Let’s set up a simple evaluation that checks whether the agent correctly answers questions about the database. We measure the precision (how many of the results found are relevant) and recall (how many of the relevant results are found).
```python
examples = [
    {
        "question": "How many reports do we have from Germany?",
        "relevant_report_ids": {1, 12, 21, 32},
    },
    {
        "question": "For which countries to we have reports mentioning electric vehicles?",
        "relevant_report_ids": {1, 25},
    },
    {
        "question": "What reports do we have about the gaming industry?",
        "relevant_report_ids": {22, 30},
    },
    {
        "question": "What reports do we have about the pet care industry?",
        "relevant_report_ids": {27},
    },
    {
        "question": "Which reports discuss cyber security insurance?",
        "relevant_report_ids": {29},
    },
    {
        "question": "What healthcare reports were published in 2024?",
        "relevant_report_ids": {32, 38},
    },
    {
        "question": "Which reports are about the smartphone or mobile phone market?",
        "relevant_report_ids": {4, 40},
    },
    {
        "question": "What reports do we have from Market Insights Inc.?",
        "relevant_report_ids": {2, 22},
    },
]
```
```python
from collections import Counter


def eval_example(
    example: dict[str, str | set[int]], print_errors: bool = False
) -> dict[str, int]:
    result = typed_agent.run_sync(example["question"], deps=deps)
    act, exp = result.data.relevant_report_ids, example["relevant_report_ids"]
    metrics = Counter(
        {
            "tp": len(act & exp),  # (1)
            "fp": len(act - exp),
            "fn": len(exp - act),
        }
    )
    if print_errors and (metrics["fp"] > 0 or metrics["fn"] > 0):
        print("Error in evaluation:")
        print(f"  Question: {example['question']}")
        print(f"  Found: {act}")
        print(f"  Expected: {exp}")
    return metrics
```
```python
metric_totals = Counter()

for example in tqdm(examples):  # (2)
    metrics = eval_example(example)
    metric_totals += metrics

precision = metric_totals["tp"] / (metric_totals["tp"] + metric_totals["fp"])  # (3)
recall = metric_totals["tp"] / (metric_totals["tp"] + metric_totals["fn"])

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```
1. Use set operations to compare the expected and found ids.
2. This should be parallelized if the number of examples is large.
3. Precision and recall could also be combined into the F1 score, which is their harmonic mean.
Precision: 1.00, Recall: 0.62
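As noted in the annotations above, precision and recall can be combined into the F1 score. This small addition (mine, not part of the original run) computes it from the values just printed:

```python
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1: {f1:.2f}")  # roughly 0.77 given precision 1.00 and recall 0.62
```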
This is a joint evaluation of the agent, the tools and the database. What’s missing is an evaluation of the generated text. In a real RAG system, you’d also want separate evaluations of retrieval and result ranking.
Discussion
Comparison to other libraries
PydanticAI is a late entrant to the agent framework space. It joins several established libraries including:
Library | Description | Github Stars ⭐ |
---|---|---|
AutoGPT | AI automation platform with frontend, server and monitoring | 169k |
LangChain | Package ecosystem for LLM applications | 96k |
autogen | Multi-agent AI chat framework by Microsoft | 36k |
crewAI | Framework for orchestrating role-based AI agents | 22k |
swarm | Educational framework for multi-agent apps by OpenAI | 17k |
phidata | Multi-agent backend and chat frontend | 16k |
There are dozens of other libraries with fewer stars. In addition, there are libraries specialized for RAG, such as LlamaIndex and Haystack. The competitive landscape shows no signs of consolidation or slowing down.
Development team
Pydantic Services, the company behind Pydantic, raised a $12.5m Series A in October 2024. This is great news for the project: funding pays for full-time developers. It also raises the question of how Pydantic will make money, and the answer is Logfire subscriptions. This is a good model that gives the project long-term stability and follows the lead of LangChain with its commercial product, LangSmith. I just hope that the integration remains optional. While Logfire looks great, my team already uses Weave by Weights & Biases, and having to switch would be a barrier to adopting PydanticAI.
Review
Pros ✅
- Sensible abstractions that don’t get in the way and enable coding in a Pythonic style.
- Type safety and integration with Pydantic.
- Support for streaming responses and async tool calling. This is critical for live chat applications.
- Pydantic is familiar to many Python developers who will have an easier time learning PydanticAI.
- High quality documentation and examples that also cover tests and evals.
- Strong reputation of the Pydantic team and high responsiveness in GitHub issues.
Cons ❌
- Launches into a competitive market with many established libraries.
- Early stage of development, so expect breaking changes.
- Many concepts to learn, though mild compared to LangChain, which invented its own domain-specific language, LCEL.
- No support for multimodal (image, audio, video) inputs and outputs yet, but it’s planned.
- Economic incentives to lock users into Logfire. This hasn’t happened but is a risk.
I’m looking forward to an opportunity to build a full-scale application with PydanticAI. The best place to get started is the PydanticAI documentation.
A lot can be accomplished by single API calls or by specifying a fixed sequence of calls. That would also work for the example app shown in this article. Unless you truly need the flexibility of an agent framework, you may be better off with plain Python. If all you need is Pydantic + LLM calls, you can use instructor. OpenAI even supports structured outputs based on Pydantic models without an additional library.
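As an illustration of that last point (my sketch, not from the article; the model name and schema are made up), the OpenAI SDK can parse a completion directly into a Pydantic model:

```python
from openai import OpenAI
from pydantic import BaseModel


class ReportSummary(BaseModel):
    title: str
    year: int


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Extract the title and year: 'Global EV Market Outlook', published 2018."}
    ],
    response_format=ReportSummary,  # the SDK validates the response against this model
)
print(completion.choices[0].message.parsed)
```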
Preview photo by MagicPattern on Unsplash