DuckDB and MotherDuck ecosystem

25 Oct, 2024

DuckDB

ddb-0 A fast in-process analytical database.
How fast you ask? Take a look at this massive benchmark on the 1.1 billion Taxi rides in New York City. It's amazing to have DuckDB sits there within the Top 15 among giants such as Google Bigquery, Amazon Athena, Spark and plenty of ultra-performant GPU-based SQL database (e.g: Brytlyt DB, OmniSci / MapD).

Several other benchmarks:

From DuckDB: groupby and join.

Groupby

ddb-perf-1

Join

ddb-perf-2

Processing Brazilian voters ~ 450Gb data

A compact overview of DuckDB: ddb

One key point that made me interested to DuckDB was the potential of speeding up my data processing flow from currently using Pandas (Python) to DuckDB, which has the benefit I can avoid rewriting my Pandas code to Polars. Those using Pandas heavily may consider this DuckDB+Pandas combo instead of rewriting everything to Polars.

Rewriting sucks! Especially if you got tens of thousands lines of code...there got to be better use of time rather than doing a massive overhaul.

MotherDuck

md-0 A collaborative data warehouse that extends the power of DuckDB to the cloud.

First I tried MotherDuck was in Q4 2023, utilizing both the UI dashboard and via Python script. Back then I was unimpressed with the response time especially in the dashboard while via script was nothing impressive too. Actually that was the reason why I didn't choose to use DuckDB and Motherduck coz it took 1-2 seconds only for the database response.

Just now I revisited the combo and I am happy to see the performance via UI has dramatically improved to sub 500ms and via script sub 300ms. These numbers fall between the "acceptable" territory and I can only see improvements going forward since the technology is relatively young.

Benefits

One of the main benefits of DuckDB / MotherDuck: pure raw SQL syntax ~ no ORM bullshit!
Look at this beauty: rawSQL_1

Can you imagine what kind of bullshit ORM syntax I need to write if using ORM? Moreover I deal with a lot of different languages and there's no uniform ORM syntax (SQLAlchemy, Django ORM, Drizzle, Prisma, etc). So each ORM requires their own syntax --> primary reason I don't like using ORM, raw SQL is the way!

Result you get in your terminal: rawSQL_2

So I get to write in raw SQL and I don't need to rewrite any of my Pandas code to Polars, only parts of the slow Pandas code will need to be refactored to use DuckDB / MotherDuck. That's a no brainer in my opinion!

Other than that, DuckDB is extra strong in OLAP (for analytical purposes -- requires columnar based structure), as we can see from the myriads of benchmark results provided earlier.