I am currently building a piece of software that will need a lot of bog-standard SQL queries against an in-memory database. I was (and am) going with SQLite, but a good friend of mine who Knows About Databases™ recently told me that DuckDB is currently the hot stuff among in-memory databases - but mainly for complex (read: OLAP) queries.
In this article I explore whether DuckDB or SQLite is the better (read: faster) alternative for my simple queries.
Disclaimer: I’m not saying that the results in this article mean that “X is better than Y” or that “X is bad”. I have a special use case - I only use relatively simple queries, and my data set is comparatively small. I am fully aware that my use case is not what DuckDB is made for. I think both DuckDB and SQLite are really great products.
Reproducing the Results: I publish my code and some additional data so that you can reproduce my results, see below.
Since I had that data lying around anyway, I am using the GTFS data for Germany as test data. The data is compiled by DELFI e.V. and can be downloaded for free (after registration) at www.opendata-oepnv.de¹. It is not provided under an open license, so I cannot redistribute it. However, it is the largest GTFS data set that I could find.
I only use the `trips` and `stop_times` data from the GTFS data set. The `trips` table contains all public transit trips in Germany, and the `stop_times` table contains all the calls that vehicles make during those trips. The data set contains about 1.7 million trips and about 35 million stop_times entries.
I use a small Python script (which is also available, see below) that drives SQLite v. 3.41.2 and DuckDB v. 0.9.1.
For the SQLite database, I prepare the database outside of the benchmark script using an SQL script. The SQL script creates the `trips` and `stop_times` tables with appropriate primary and foreign keys set (thus creating the respective indices).
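The schema script itself is not reproduced in this post; a minimal sketch of what such a schema could look like (column lists abbreviated, exact column names assumed from the GTFS reference) is:

```python
import sqlite3

# Hypothetical, abbreviated schema - the real GTFS tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trips (
    trip_id      TEXT PRIMARY KEY,  -- the primary key creates an implicit index
    route_id     TEXT,
    direction_id INTEGER
);
CREATE TABLE stop_times (
    trip_id       TEXT REFERENCES trips(trip_id),  -- foreign key to trips
    stop_sequence INTEGER,
    stop_id       TEXT,
    PRIMARY KEY (trip_id, stop_sequence)  -- composite primary key
);
""")

# The composite primary key gives stop_times an index on (trip_id, stop_sequence).
indices = conn.execute("PRAGMA index_list('stop_times')").fetchall()
print(indices)
```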
The DuckDB database is initialized directly in the benchmark script using DuckDB’s CSV import, and the respective indices (see below) are created afterwards.
I want the benchmarks to run deterministically, so I create lists of values to query for a priori and store them in two files. When I run n queries for some value in the benchmarks, I use the first n values from the appropriate file.
For each of the benchmarks mentioned below, I create a list of statements to be run (and timed). I run each such list of statements twice: The first run warms up the caches, and the second run is timed. The benchmarks are run on my 2016 Thinkpad T460s laptop, which has an i5-6300U CPU and 20 GB of RAM.
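The warm-up-then-time pattern can be sketched roughly like this (with a toy table and made-up statements, not the actual benchmark code):

```python
import sqlite3
import time

# Toy setup standing in for the real GTFS database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO trips VALUES (?)", [(i,) for i in range(1000)])

# The list of statements to be run (and timed).
statements = [("SELECT * FROM trips WHERE trip_id = ?", (i,)) for i in range(500)]

def run_all(stmts):
    for sql, params in stmts:
        conn.execute(sql, params).fetchall()

run_all(statements)          # first run: warms up the caches
start = time.perf_counter()  # second run: this one is timed
run_all(statements)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{elapsed_ms / len(statements):.4f} ms/query")
```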
I run a total of six benchmarks, which I split into two groups: Queries which can be served by looking at the built indices, and queries which can not (i.e., where the DB systems need to iterate through the tables).
The simplest benchmark is the `indexed` benchmark, where each query simply retrieves a trip by its primary key, i.e., a query of the form `SELECT * FROM trips WHERE trip_id = <ID>`. The second benchmark, `composite-indexed`, does the same for the `stop_times` table, which has a composite primary key (the primary key consists of the `trip_id` of the respective trip and a `stop_sequence` integer, which just orders the calls within a trip). The third benchmark in this group is the `join` benchmark, which retrieves `stop_times` joined with the data from their respective trip, i.e., queries of the form
```sql
SELECT *
FROM stop_times JOIN trips ON trips.trip_id = stop_times.trip_id
WHERE stop_times.trip_id = <Trip ID> AND stop_sequence = <Sequence Number>;
```
The last benchmark in this group, called `indexed-multiple`, does the same as `indexed`, but queries 50 rows per query instead of one.
For each of these benchmarks, I run 500 queries and average the query times.
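One way to fetch 50 rows in a single statement is an `IN` list over 50 predetermined IDs; whether the benchmark script uses `IN` or another formulation is not shown in this post, so treat this as one plausible sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, route_id TEXT)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [(i, f"r{i % 10}") for i in range(1000)])

ids = list(range(50))  # 50 predetermined IDs, as read from the value files
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT * FROM trips WHERE trip_id IN ({placeholders})", ids
).fetchall()
print(len(rows))  # 50
```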
The first benchmark in this group retrieves `trips` by their `route_id`. In GTFS, every trip is associated with a route. There are about 27000 routes in the GTFS data set, so we expect on average about 63 trips per route. The `route_id` field is not indexed. The next benchmark, `many`, does the same but with the `direction_id` field. There are only two direction IDs, 0 and 1, so we expect each query to return half of the trips, about 850000.
For these benchmarks, I only run 100 queries each (because they take a lot longer…).
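The queries in this group have the form `SELECT * FROM trips WHERE route_id = <ID>`. Since `route_id` carries no index, the database has to walk the whole table; for SQLite, `EXPLAIN QUERY PLAN` makes this visible (toy data, assumed column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, route_id TEXT, direction_id INTEGER)"
)
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)",
                 [(i, f"route{i % 27}", i % 2) for i in range(1000)])

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM trips WHERE route_id = ?", ("route3",)
).fetchall()
print(plan)  # the plan detail contains "SCAN" - a full-table scan
```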
Let’s look at the results from the “with-index” benchmarks first, which are best looked at in a table.² The reported values are milliseconds per query:
Or, if you like plots more than tables:
We see that in this group, SQLite consistently outperforms DuckDB by one to two orders of magnitude. Keep in mind that the `composite-indexed` and `join` benchmarks both query the larger `stop_times` table, and not the smaller `trips` table. It looks to me like DuckDB somehow does not profit from the indexing.
For the “without-index” benchmarks, we get a different picture (note the inverse factor):
Again as the obligatory plot:
Here, DuckDB clearly outperforms SQLite, by up to three orders of magnitude. I also think it’s very interesting that DuckDB actually gets faster in the `many` case, where the output size is larger than in the first benchmark of this group.
The interpretation seems to be clear: DuckDB does not seem to work with indices as well as SQLite does (see also the section below) but is blazingly fast when scanning large parts of the table. SQLite is great at using indices but takes its time when it has to look at the whole table.
Since my workload mainly consists of simple queries which should be well answerable from an index (and it is feasible to just create as many indices as it takes…), I will stay with SQLite for the project.
The results make me very sceptical whether I did the correct things to create indices in DuckDB. I only found a short page about indices in the DuckDB documentation, but if I read that correctly, it states that I should just use the usual `CREATE INDEX` SQL command. The benchmark script therefore runs:
```python
duckdb.sql("CREATE UNIQUE INDEX trips_id_idx ON trips(trip_id)")
duckdb.sql("CREATE UNIQUE INDEX stop_times_id_idx ON stop_times(trip_id,stop_sequence)")
```
To test my hypothesis, I removed these two lines (i.e., did not create any indices in DuckDB) and re-ran the benchmarks. The query time of the `indexed` benchmark went up from 0.9 ms/query to 2 ms/query, so those `CREATE INDEX` lines are not without effect. However, the `composite-indexed` and `indexed-multiple` benchmarks remain largely unaffected.
The documentation does say:

> Unidimensional indexes are supported, while multidimensional indexes are not yet supported.
Thus, `composite-indexed` being unaffected is not a big surprise. However, the index only giving a speedup of about 2 seems weird, and I would expect `indexed-multiple` to also show at least a small speedup.
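For comparison, SQLite lets you verify directly that such an index is picked up, via `EXPLAIN QUERY PLAN` (DuckDB has its own `EXPLAIN`, which one could presumably use analogously to diagnose the DuckDB side):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id TEXT, route_id TEXT)")
# The same index as created for DuckDB in the benchmark script.
conn.execute("CREATE UNIQUE INDEX trips_id_idx ON trips(trip_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM trips WHERE trip_id = ?", ("t1",)
).fetchall()
print(plan[0][-1])  # e.g. "SEARCH trips USING INDEX trips_id_idx (trip_id=?)"
```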
I publish my benchmark code, which you can download as a tar archive here. The archive contains:
Reproducing my results should be as easy as this:
1. Download the GTFS data and put it into the `data/` subfolder. It should at least contain the `trips.txt` and `stop_times.txt` files.
2. `cd data; sqlite3 gtfs.sqlite3 < gtfs_to_sqlite.sql` to create the SQLite database.
3. `bash ./prepare.sh` to create two random lists of IDs.
4. `cd ..; python3 ./bench.py`
It should output a table with the measured values and write the plots to disk.
Honestly: The plots are only there so that this post has a nice image that can appear when this article is shared… The differences between DuckDB and SQLite are just always so large that plotting them together completely hides all other detail. ↩︎
Note that the script is not necessarily a shining example of clean, well-documented Python code. However, it is short, does nothing fancy, and should be manageable. ↩︎