Data frame wars: Choosing a Python dataframe library as a dplyr user

I’m a long time R user and lately I’ve seen more and more signals that it’s worth investing into Python. I use it for NLP with spaCy and to build functions on AWS Lambda. Further, there are many more data API libraries and machine learning libraries for Python than for R.

Note

This article was written at the end of 2021 with the latest versions of the libraries and the number of Github stars at that time.

Adopting Python means making choices on which libraries to invest time into learning. Manipulating data frames is one of the most common data science activities, so choosing the right library for it is key.

Michael Chow, developer of siuba, a Python port of dplyr on top of pandas wrote describes the situation well:

It seems like there’s been a lot of frustration surfacing on twitter lately from people coming from R—especially if they’ve used dplyr and ggplot—towards pandas and matplotlib. I can relate. I’m developing a port of dplyr to python. But in the end, it’s probably helpful to view these libraries as foundational to a lot of other, higher-level libraries (some of which will hopefully get things right for you!).

The higher-level libraries he mentions come with a problem : There’s no universal standard.

In a discussion of the polars library on Hacker News the user “civilized” put the dplyr user perspective more bluntly:

In my world, anything that isn’t “identical to R’s dplyr API but faster” just isn’t quite worth switching for. There’s absolutely no contest: dplyr has the most productive API and that matters to me more than anything else.

I’m more willing to compromise though, so here’s a comparison of the strongest contenders.

The contenders

The database-like ops benchmark on H2Oai is a helpful performance comparison.

I’m considering these libraries:

Pandas: The most commonly used library and the one with the most tutorials and Stack Overflow answers available.
siuba: A port of dplyr to Python, built on top of pandas. Not in the benchmark. Performance probably similar to pandas or worse due to translation.
Polars: The fastest library available. According to the benchmark, it runs 3-10x faster than Pandas.
Duckdb: Use an in-memory OLAP database instead of a dataframe and write SQL. In R, this can also be queried via dbplyr.
ibis. Backend-agnostic wrapper for pandas and SQL engines.

There are more options. I excluded the others for these reasons:

Slower than polars and not with a readability focus (dask, Arrow, Modin, pydatatable)
Requires or is optmized for running on a remote server (Spark, ClickHouse and most other SQL databases).
Not meant for OLAP (sqlite)
Not in Python (DataFrames.jl)
Meant for GPU (cuDF)

Github stars as a proxy for popularity

The benchmark provides a comparison of performance, but another important factor is popularity and maturity. A more mature library has a more stable API, better test coverage and there is more help available online, such as on StackOverflow. One way to measure popularity is the number of stars that the package repository has on Github.

library(ggplot2)
libs <- data.frame(
    library = c("pandas", "siuba", "polars", "duckdb", "dplyr", "data.table", "pydatatable", "dtplyr", "tidytable", "ibis"),
    language = c("Python", "Python", "Python", "SQL", "R", "R", "Python", "R", "R", "Python"),
    stars = c(32100, 732, 3900, 4100, 3900, 2900, 1400, 542, 285, 1600)
)

ggplot(libs, aes(x = reorder(library, -stars), y = stars, fill = language)) +
    geom_col() +
    labs(
        title = "Pandas is by far the most popular choice",
        subtitle = "Comparison of Github stars on 2021-12-25",
        fill = "Language",
        x = "Library",
        y = "Github stars"
    )

Github stars are not a perfect proxy. For instance, dplyr is more mature than its star count suggests. Comparing the completeness of the documentation and tutorials for dplyr and polars reveals that it’s a day and night difference.

With the quantitative comparison out of the way, here’s a qualitative comparison of the Python packages. I’m speaking of my personal opinion of these packages - not a general comparison. My reference is my current use of dplyr in R. When I need more performance, I use tidytable to get most of the speed of data.table with the grammar of dplyr and eager evaluation. Another alternative is dtplyr, which translates dplyr to data.table with lazy evaluation. I also use dbplyr, which translates dplyr to SQL.

I’ll compare the libraries by running a data transformation pipeline involving import from CSV, mutate, filter, sort, join, group by and summarize. I’ll use the nycflights13 dataset, which is featured in Hadley Wickham’s R for Data Science.

dplyr: Reference in R

Let’s start with a reference implementation in dplyr. The dataset is available as a package, so I skip the CSV import.

library(dplyr, warn.conflicts = FALSE)
library(nycflights13)
library(reactable)

# Take a look at the tables
reactable(head(flights, 10))

reactable(head(airlines, 10))

The flights tables has 336776 rows, one for each flight of an airplane. The airlines table has 16 rows, one for each airline mapping the full name of the company to a code.

Let’s find the airline with the highest arrival delays in January 2013.

flights |>
    filter(year == 2013, month == 1, !is.na(arr_delay)) |>
    mutate(arr_delay = replace(arr_delay, arr_delay < 0, 0)) |>
    left_join(airlines, by = "carrier") |>
    group_by(airline = name) |>
    summarise(flights = n(), mean_delay = mean(arr_delay)) |>
    arrange(desc(mean_delay))

# A tibble: 16 × 3
   airline                     flights mean_delay
   <chr>                         <int>      <dbl>
 1 SkyWest Airlines Inc.             1     107   
 2 Hawaiian Airlines Inc.           31      48.8 
 3 ExpressJet Airlines Inc.       3964      29.6 
 4 Frontier Airlines Inc.           59      23.9 
 5 Mesa Airlines Inc.               39      20.4 
 6 Endeavor Air Inc.              1480      19.3 
 7 Alaska Airlines Inc.             62      17.6 
 8 Envoy Air                      2203      14.3 
 9 Southwest Airlines Co.          985      13.0 
10 JetBlue Airways                4413      12.9 
11 United Air Lines Inc.          4590      11.9 
12 American Airlines Inc.         2724      11.0 
13 AirTran Airways Corporation     324       9.95
14 US Airways Inc.                1554       9.11
15 Delta Air Lines Inc.           3655       8.07
16 Virgin America                  314       3.17

Some values in arr_delay are negative, indicating that the flight was faster than expected. I replaced these values with 0 because I don’t want them to cancel out delays of other flights. I joined to the airlines table to get the full names of the airlines.

I export the flights and airlines tables to CSV to hand them over to Python.

# Write to temporary files
flights_path <- tempfile(fileext = ".csv")
airlines_path <- tempfile(fileext = ".csv")

data.table::fwrite(flights, flights_path, row.names = FALSE)
data.table::fwrite(airlines, airlines_path, row.names = FALSE)

To access the file from Python, the path is handed over:

# Hand over the path from R
flights_path = r["flights_path"]
airlines_path = r["airlines_path"]

For more details on how this works with the reticulate package, check this documentation.

Pandas: Most popular

The following sections follow a pattern: read in from CSV, then build a query.

import pandas as pd

# Import from CSV
flights_pd = pd.read_csv(flights_path)
airlines_pd = pd.read_csv(airlines_path)

pandas.read_csv reads the header and conveniently infers the column types.

(
    flights_pd.query("year == 2013 & month == 1 & arr_delay.notnull()")
    .assign(arr_delay=flights_pd.arr_delay.clip(lower=0))
    .merge(airlines_pd, how="left", on="carrier")
    .rename(columns={"name": "airline"})
    .groupby("airline")
    .agg(flights=("airline", "count"), mean_delay=("arr_delay", "mean"))
    .sort_values(by="mean_delay", ascending=False)
)

                             flights  mean_delay
airline                                         
SkyWest Airlines Inc.              1  107.000000
Hawaiian Airlines Inc.            31   48.774194
ExpressJet Airlines Inc.        3964   29.642785
Frontier Airlines Inc.            59   23.881356
Mesa Airlines Inc.                39   20.410256
Endeavor Air Inc.               1480   19.321622
Alaska Airlines Inc.              62   17.645161
Envoy Air                       2203   14.303677
Southwest Airlines Co.           985   12.964467
JetBlue Airways                 4413   12.919329
United Air Lines Inc.           4590   11.851852
American Airlines Inc.          2724   10.953377
AirTran Airways Corporation      324    9.953704
US Airways Inc.                 1554    9.111326
Delta Air Lines Inc.            3655    8.070315
Virgin America                   314    3.165605

I chose to use the pipeline syntax from pandas - another option is to modify the dataset in place. That has a lower memory footprint, but can’t be run repeatedly for the same result, such as in interactive use in a notebook.

Here, the query() function is slightly awkward with the long string argument. The groupby doesn’t allow renaming on the fly like dplyr, though I don’t consider that a real drawback. Perhaps it’s clearer to rename explicitly anyway.

Pandas has the widest API, offering hundreds of functions for every conceivable manipulation. The clip function used here is one such example. One difference to dplyr is that pandas uses its own methods .mean(), rather than using external ones such as base::mean(). That means using custom functions instead carries a performance penalty.

As we’ll see later, pandas is the backend for siuba and ibis, which boil down to pandas code.

One difference to all other discussed solutions is that pandas uses a row index. Base R also has this with row names, but the tidyverse and tibbles have largely removed them from common use. I never missed row names. At the times I had to work with them in pandas they were more confusing than helpful. The documentation of polars puts it more bluntly:

No index. They are not needed. Not having them makes things easier. Convince me otherwise

That’s quite passive aggressive, but I do agree and wish pandas didn’t have it.

siuba: dplyr in Python

import siuba as si

# Import from CSV
flights_si = pd.read_csv(r["flights_path"])
airlines_si = pd.read_csv(r["airlines_path"])

As siuba is just an alternative way of writing some pandas commands, we read the data just like in the pandas implementation.

(
    flights_si
    >> si.filter(si._.year == 2013, si._.month == 1, si._.arr_delay.notnull())
    >> si.mutate(arr_delay=si._.arr_delay.clip(lower=0))
    >> si.left_join(si._, airlines_si, on="carrier")
    >> si.rename(airline=si._.name)
    >> si.group_by(si._.airline)
    >> si.summarize(flights=si._.airline.count(), mean_delay=si._.arr_delay.mean())
    >> si.arrange(-si._.mean_delay)
)

                        airline  flights  mean_delay
11        SkyWest Airlines Inc.        1  107.000000
8        Hawaiian Airlines Inc.       31   48.774194
6      ExpressJet Airlines Inc.     3964   29.642785
7        Frontier Airlines Inc.       59   23.881356
10           Mesa Airlines Inc.       39   20.410256
4             Endeavor Air Inc.     1480   19.321622
1          Alaska Airlines Inc.       62   17.645161
5                     Envoy Air     2203   14.303677
12       Southwest Airlines Co.      985   12.964467
9               JetBlue Airways     4413   12.919329
14        United Air Lines Inc.     4590   11.851852
2        American Airlines Inc.     2724   10.953377
0   AirTran Airways Corporation      324    9.953704
13              US Airways Inc.     1554    9.111326
3          Delta Air Lines Inc.     3655    8.070315
15               Virgin America      314    3.165605

I found siuba the easiest to work with. Once I understood the _ placeholder for a table of data, I could write it almost as fast as dplyr. Out of all the ways to refer to a column in a data frame, I found it to be the most convenient, because it doesn’t require me to spell out the name of the data frame over and over. While not as elegant as dplyr’s tidy evaluation (discussed at the end of the article), it avoids the ambivalence in dplyr where it can be unclear whether a name refers to a column or an outside object.

It’s always possible to drop into pandas, such as for the aggregation functions which use the mean() and count() methods of the pandas series. The >> is an easy replacement for the %>% magrittr pipe or |> base pipe in R.

The author advertises siuba like this (from the docs):

Siuba is a library for quick, scrappy data analysis in Python. It is a port of dplyr, tidyr, and other R Tidyverse libraries.

A way for dplyr users to quickly hack away at data analysis in Python, but not meant for unsupervised production use.

Polars: Fastest

Polars is written in Rust and also offers a Python API. It comes in two flavors: eager and lazy. Lazy evaluation is similar to how dbplyr and dtplyr work: until asked, nothing is evaluated. This enables performance gains by reordering the commands being executed. But it’s a little less convenient for interactive analysis. I’ll use the eager API here.

import polars as pl

# Import from CSV
flights_pl = pl.read_csv(flights_path)
airlines_pl = pl.read_csv(airlines_path)

(
    flights_pl.filter((pl.col("year") == 2013) & (pl.col("month") == 1))
    .drop_nulls("arr_delay")
    .join(airlines_pl, on="carrier", how="left")
    .with_columns(
        [
            pl.when(pl.col("arr_delay") > 0)
            .then(pl.col("arr_delay"))
            .otherwise(0)
            .alias("arr_delay"),
            pl.col("name").alias("airline"),
        ]
    )
    .groupby("airline")
    .agg(
        [pl.count("airline").alias("flights"), pl.mean("arr_delay").alias("mean_delay")]
    )
    .sort("mean_delay", descending=True)
)

shape: (16, 3)

airline	flights	mean_delay
str	u32	f64
"SkyWest Airlin…	1	107.0
"Hawaiian Airli…	31	48.774194
"ExpressJet Air…	3964	29.642785
"Frontier Airli…	59	23.881356
"Mesa Airlines …	39	20.410256
"Endeavor Air I…	1480	19.321622
"Alaska Airline…	62	17.645161
"Envoy Air"	2203	14.303677
"Southwest Airl…	985	12.964467
"JetBlue Airway…	4413	12.919329
"United Air Lin…	4590	11.851852
"American Airli…	2724	10.953377
"AirTran Airway…	324	9.953704
"US Airways Inc…	1554	9.111326
"Delta Air Line…	3655	8.070315
"Virgin America…	314	3.165605

The API is leaner than pandas, requiring to memorize fewer functions and patterns. Though this can also be seen as less feature-complete. Pandas, for example has a dedicated clip function.

There isn’t nearly as much help available for problems with polars as for with pandas. While the documentation is good, it can’t answer every question and lots of trial and error is needed.

A comparison of polars and pandas is available in the polars documentation.

DuckDB: Highly compatible and easy for SQL users

import duckdb

con_duckdb = duckdb.connect(database=":memory:")

# Import from CSV
con_duckdb.execute(
    "CREATE TABLE 'flights' AS "
    f"SELECT * FROM read_csv_auto('{flights_path}', header = True);"
    "CREATE TABLE 'airlines' AS "
    f"SELECT * FROM read_csv_auto('{airlines_path}', header = True);"
)

<duckdb.duckdb.DuckDBPyConnection object at 0x285aba870>

DuckDB’s read_csv_auto() works just like the csv readers in Python.

con_duckdb.execute(
    "WITH flights_clipped AS ( "
    "SELECT carrier, CASE WHEN arr_delay > 0 THEN arr_delay ELSE 0 END AS arr_delay "
    "FROM flights "
    "WHERE year = 2013 AND month = 1 AND arr_delay IS NOT NULL"
    ")"
    "SELECT name AS airline, COUNT(*) AS flights, AVG(arr_delay) AS mean_delay "
    "FROM flights_clipped "
    "LEFT JOIN airlines ON flights_clipped.carrier = airlines.carrier "
    "GROUP BY name "
    "ORDER BY mean_delay DESC "
).fetchdf()

                        airline  flights  mean_delay
0         SkyWest Airlines Inc.        1  107.000000
1        Hawaiian Airlines Inc.       31   48.774194
2      ExpressJet Airlines Inc.     3964   29.642785
3        Frontier Airlines Inc.       59   23.881356
4            Mesa Airlines Inc.       39   20.410256
5             Endeavor Air Inc.     1480   19.321622
6          Alaska Airlines Inc.       62   17.645161
7                     Envoy Air     2203   14.303677
8        Southwest Airlines Co.      985   12.964467
9               JetBlue Airways     4413   12.919329
10        United Air Lines Inc.     4590   11.851852
11       American Airlines Inc.     2724   10.953377
12  AirTran Airways Corporation      324    9.953704
13              US Airways Inc.     1554    9.111326
14         Delta Air Lines Inc.     3655    8.070315
15               Virgin America      314    3.165605

The performance is closer to polars than to pandas. A big plus is the ability to handle larger than memory data.

DuckDB can also operate directly on a pandas dataframe. The SQL code is portable to R, C, C++, Java and other programming languages the duckdb has APIs. It’s also portable when the logic is taken to a DB like Postgres, or Clickhouse, or is ported to an ETL framework like DBT.

This stands in contrast to polars and pandas code, which has to be rewritten from scratch. It also means that the skill gained in manipulating SQL translates well to other situations. SQL has been around for more than 50 years - learning SQL is future-proofing a career.

While these are big plusses, duckdb isn’t so convenient for interactive data exploration. SQL isn’t as composeable. Composing SQL queries requires many common table expressions (CTEs, WITH x AS (SELECT ...)). Reusing them for other queries is not as easy as with Python. SQL is typically less expressive than Python. It lacks shorthands and it’s awkward when there are many columns. It’s also harder to write custom functions in SQL than in R or Python. This is the motivation for using libraries like pandas and dplyr. But SQL can actually do a surprising amount of things, as database expert Haki Benita explained in a detailed article.

Or in short, from the documentation of ibis:

SQL is widely used and very convenient when writing simple queries. But as the complexity of operations grow, SQL can become very difficult to deal with.

Then, there’s the issue of how to actually write the SQL code. Writing strings rather than actual Python is awkward and many editors don’t provide syntax highlighting within the strings (Jetbrains editors like PyCharm and DataSpell do). The other option is writing .sql that have placeholders for parameters. That’s cleaner and allows using a linter, but is inconvenient for interactive use.

SQL is inherently lazily executed, because the query planner needs to take the whole query into account before starting computation. This enables performance gains. For interactive use, lazy evaluation is less convenient, because one can’t see the intermediate results at each step. Speed of iteration is critical: the faster one can iterate, the more hypotheses about the data can be tested.

There is a programmatic way to construct queries for duckdb, designed to provide a dbplyr alternative in Python. Unfortunately its documentation is sparse.

Using duckdb without pandas doesn’t seem feasible for exploratory data analysis, because graphing packages like seaborn and plotly expect a pandas data frame or similar as an input.

ibis: Lingua franca in Python

The goal of ibis is to provide a universal language for working with data frames in Python, regardless of the backend that is used. It’s tagline is: Write your analytics code once, run in everywhere. This is similar to how dplyr can use SQL as a backend with dbplyr and data.table with dtplyr.

Among others, Ibis supports pandas, PostgreSQL and SQLite as backends. Unfortunately duckdb is not an available backend, because the authors of duckdb have decided against building on ibis.

The ibis project aims to bridge the gap between the needs of interactive data analysis and the capabilities of SQL, which I have detailed in the previous section on duckdb.

Note

UPDATE October 2023

Duckdb is now a supported backend (along with many more). So performance is going to be very similar to duckdb.
Directly load/save data
join(), clip(), and case() are well-supported
Ibis is much more popular and now very actively maintained. There are more examples, better documentation, and community. Still definitely less than pandas, but perhaps comparable to polars.

Thanks to NickCrews for providing this update, including the following code example.

For the test drive, I’ll use the duckdb backend, meaning that the ibis code is translated to duckdb operations, similar to how siuba is translated to pandas. This gives ibis the blazing speed of duckdb.

import ibis
from ibis import _

flights_ib_csv = pd.read_csv(flights_path)
airlines_ib_csv = pd.read_csv(airlines_path)

ibis.options.interactive = True

flights_ib = ibis.read_csv(flights_path)
airlines_ib = ibis.read_csv(airlines_path)
flights_ib

┏━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━┓
┃ year  ┃ month ┃ day   ┃ dep_time ┃ sched_dep_time ┃ dep_delay ┃ arr_time ┃ … ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━┩
│ int64 │ int64 │ int64 │ int64    │ int64          │ int64     │ int64    │ … │
├───────┼───────┼───────┼──────────┼────────────────┼───────────┼──────────┼───┤
│  2013 │     1 │     1 │      517 │            515 │         2 │      830 │ … │
│  2013 │     1 │     1 │      533 │            529 │         4 │      850 │ … │
│  2013 │     1 │     1 │      542 │            540 │         2 │      923 │ … │
│  2013 │     1 │     1 │      544 │            545 │        -1 │     1004 │ … │
│  2013 │     1 │     1 │      554 │            600 │        -6 │      812 │ … │
│  2013 │     1 │     1 │      554 │            558 │        -4 │      740 │ … │
│  2013 │     1 │     1 │      555 │            600 │        -5 │      913 │ … │
│  2013 │     1 │     1 │      557 │            600 │        -3 │      709 │ … │
│  2013 │     1 │     1 │      557 │            600 │        -3 │      838 │ … │
│  2013 │     1 │     1 │      558 │            600 │        -2 │      753 │ … │
│     … │     … │     … │        … │              … │         … │        … │ … │
└───────┴───────┴───────┴──────────┴────────────────┴───────────┴──────────┴───┘

Non-interactive ibis means that queries are evaluated lazily.

(
    flights_ib.filter(
        [
            _.year == 2013,
            _.month == 1,
            _.arr_delay.notnull(),
        ]
    )
    .join(airlines_ib, "carrier", how="left")
    .select(arr_delay=_.arr_delay.clip(lower=0), airline=_.name)
    .group_by("airline")
    .agg(flights=_.count(), mean_delay=_.arr_delay.mean())
    .order_by(_.mean_delay.desc())
)

┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ airline                  ┃ flights ┃ mean_delay ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ string                   │ int64   │ float64    │
├──────────────────────────┼─────────┼────────────┤
│ SkyWest Airlines Inc.    │       1 │ 107.000000 │
│ Hawaiian Airlines Inc.   │      31 │  48.774194 │
│ ExpressJet Airlines Inc. │    3964 │  29.642785 │
│ Frontier Airlines Inc.   │      59 │  23.881356 │
│ Mesa Airlines Inc.       │      39 │  20.410256 │
│ Endeavor Air Inc.        │    1480 │  19.321622 │
│ Alaska Airlines Inc.     │      62 │  17.645161 │
│ Envoy Air                │    2203 │  14.303677 │
│ Southwest Airlines Co.   │     985 │  12.964467 │
│ JetBlue Airways          │    4413 │  12.919329 │
│ …                        │       … │          … │
└──────────────────────────┴─────────┴────────────┘

The syntax looks quite similar to dplyr and the versatility of interchangeable backends is remarkable. In the first version of this article, ibis was lacking in documentation and had some rough edges in the API, but these were improved in the meantime.

Conclusion

It’s not a clear-cut choice. None of the options offer a syntax that is as convenient for interactive analysis as dplyr. siuba is the closest to it, but dplyr still has an edge with tidy evaluation, letting users refer to columns in a data frame by their names (colname) directly, without any wrappers. But I’ve also seen it be confusing for newbies to R that mix it up with base R’s syntax. It’s also harder to program with, where it’s necessary to use operators like { } and :=.

My appreciation for dplyr (and the closely associated tidyr) grew during this research. Not only is it a widely accepted standard like pandas, it can also be used as a translation layer for backends like SQL databases (including duckdb), data.table, and Spark. All while having the most elegant and flexible syntax available.

Personally, I’ll primarily leverage SQL and a OLAP database (such as Clickhouse or Snowflake) running on a server to do the heavy lifting. For steps that are better done locally, I’ll use pandas for maximum compatibility. I find the use of an index inconvenient, but there’s so much online help available on StackOverflow. Github Copilot also deserves a mention for making it easier to pick up. Other use cases can be very different, so I don’t mean to say that my way is the best. For instance, if the data is not already on a server, fast local processing with polars may be best.

Most data science work happens in a team. Choosing a library that all team members are familiar with is critical for collaboration. That is typically SQL, pandas or dplyr. The performance gains from using a less common library like polars have to be weighed against the effort spent learning the syntax as well as the increased likelihood of bugs, when beginners write in a new syntax.

Photo by Hunter Harritt on Unsplash