Whoever Integrates DuckDB Best Wins the OLAP World

By Ruohang Feng (@Vonng) | WeChat Column | 2024-08-13

In the post “PostgreSQL is Eating the World”, I posed a question: Who will ultimately unify the database world? My take is that it’ll be the PostgreSQL ecosystem coupled with a rich variety of extension plugins. And I believe that to conquer OLAP, arguably the biggest and most distinct kingdom in the database domain, the decisive “analytics extension” will almost certainly involve DuckDB.

I’ve long been a huge fan of PostgreSQL. But interestingly, my second favorite database over the past two years has shifted from Redis to DuckDB. DuckDB is a very compact yet powerful embedded OLAP database, achieving top-tier performance and usability in analytics. It also ranks just behind PostgreSQL in terms of extensibility.

Much like the vector database extension race two years back, the new frontier in the PG ecosystem is a competition centered around DuckDB—“Whoever integrates DuckDB into PG more elegantly will be the future champion of the OLAP world.” Although many participants are sharpening their swords for this battle, DuckDB’s official entry into the race leaves no doubt that the competition is about to ignite.

DuckDB: A Rising Challenger in the OLAP Space

DuckDB was created by database researchers Mark Raasveldt and Hannes Mühleisen at the Centrum Wiskunde & Informatica (CWI) in Amsterdam. CWI is not just a research institute—it’s arguably the hidden powerhouse behind the development of analytical databases, pioneering columnar storage engines and vectorized query execution. Products like ClickHouse, Snowflake, and Databricks all carry CWI’s influence. Fun fact: Guido van Rossum (a.k.a. the father of Python) also created the Python language while at CWI.

Now these pioneers of analytical database research are bringing their expertise directly to an OLAP database of their own, and with DuckDB they have picked both their timing and their niche well.

DuckDB was born from observing database users’ pain points: data scientists mostly rely on tools like Python and Pandas and are less familiar with traditional databases. They’re often bogged down by hassles like connectivity, authentication, data import/export, and so on. So why not build a simple, embedded analytical database for them—kinda like SQLite for analytics?

DuckDB can be shipped as essentially one header file and one C++ source file (its amalgamated build), which compile into a standalone binary. The database itself is just a single file. It uses a PostgreSQL-compatible parser and syntax, making it almost frictionless for newcomers. Though DuckDB seems refreshingly simple, its most impressive feature is “simplicity without compromise”: it boasts world-class analytical performance. For instance, on ClickBench, the benchmark site maintained by ClickHouse, DuckDB can beat the home-team champion on its own turf.
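To make the “SQLite for analytics” idea concrete, here is a minimal sketch using DuckDB’s Python package. It assumes the third-party `duckdb` package is installed (`pip install duckdb`); the function name, the file name `demo.duckdb`, and the `orders` table are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def embedded_duckdb_demo():
    """Create a single-file DuckDB database, reopen it, and run an aggregate."""
    # Third-party package (pip install duckdb); imported lazily so that this
    # module still loads on machines where it is not installed.
    import duckdb

    with tempfile.TemporaryDirectory() as workdir:
        db_path = os.path.join(workdir, "demo.duckdb")  # hypothetical file name

        # No server and no credentials: connecting creates the database file.
        con = duckdb.connect(db_path)
        con.execute("CREATE TABLE orders (customer_id INTEGER, amount DOUBLE)")
        con.execute("INSERT INTO orders VALUES (1, 10.0), (1, 5.5), (2, 7.0)")
        con.close()

        # Reopen the same file: the data lives in that one file.
        con = duckdb.connect(db_path)
        rows = con.execute(
            # The `::DOUBLE` cast is PostgreSQL-style syntax.
            "SELECT customer_id, sum(amount)::DOUBLE AS total "
            "FROM orders GROUP BY customer_id ORDER BY customer_id"
        ).fetchall()
        con.close()
    return rows
```

Calling `embedded_duckdb_demo()` returns one `(customer_id, total)` tuple per customer. Nothing is started or configured beforehand, which is the whole appeal for someone who lives in Python and Pandas.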

Another highlight: because DuckDB’s creators are government-funded researchers, they consider offering their work to everyone for free a social responsibility. Thus, DuckDB is released under the very permissive MIT License.

I believe DuckDB’s rise is inevitable: a database that is blazing fast, requires virtually zero setup, and is open-source and free can hardly fail to become popular. In Stack Overflow’s 2023 Developer Survey, DuckDB made the “Most Popular Databases” list for the first time with a 0.61% usage rate (29th place, fourth from the bottom). Just one year later, in the 2024 survey, its usage had grown 2.3x to 1.4%, nearly catching up to ClickHouse (1.7%).
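The growth figure follows directly from the two survey percentages quoted above. A quick check, using only the numbers in this post (the variable names are mine):

```python
# Usage rates quoted in this post, in percent of survey respondents.
duckdb_2023 = 0.61
duckdb_2024 = 1.4
clickhouse_2024 = 1.7

# Year-over-year growth factor for DuckDB.
growth_factor = duckdb_2024 / duckdb_2023

# How close DuckDB's 2024 usage is to ClickHouse's, as a fraction.
share_of_clickhouse = duckdb_2024 / clickhouse_2024

print(f"growth: {growth_factor:.1f}x, vs ClickHouse: {share_of_clickhouse:.0%}")
```

In other words, DuckDB’s usage rate more than doubled in a year and now sits at roughly four fifths of ClickHouse’s.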

At the same time, DuckDB has garnered an excellent reputation among its users. In terms of developer appreciation and satisfaction (69.2%), it’s second only to PostgreSQL (74.5%) among major databases. If we look at the DB-Engines popularity trend, it’s clear that since 2022 DuckDB has been on a meteoric rise. It is still nowhere near PostgreSQL levels, but it has already surpassed every NewSQL product in popularity score.

DuckDB’s Weaknesses—and the Opportunity They Present

DuckDB can be used as a standalone database, but it truly shines as an embedded analytical engine. Being “embedded” is both a strength and a weakness. While DuckDB boasts top-notch analytics performance, its biggest shortcoming is its rather minimal data-management capabilities—the stuff data scientists hate dealing with: ACID, concurrent access, access control, data persistence, HA, database import/export… Ironically, these are precisely the strong suits of classic databases and the core pain points for enterprise analytics systems.

We can expect a wave of DuckDB “sidecar” products to address these gaps. It’s reminiscent of what happened when Facebook open-sourced RocksDB (a KV store): countless “new database” projects merely slapped a thin SQL layer on top of RocksDB and sold themselves as the next big thing, each one yet another SQL sidecar for RocksDB. The same phenomenon happened with the vector search library hnswlib: numerous “specialized vector databases” sprang up, all just wrapping hnswlib. And with Lucene or its next-gen replacement Tantivy, we’ve seen a flurry of “full-text search databases” that are basically wrapped versions of those engines.

In fact, this is already happening within the PostgreSQL ecosystem. Before other database companies realized what was going on, five PG players had jumped into the race: ParadeDB with pg_lakehouse, independent developer Li Hongyan with duckdb_fdw, CrunchyData with crunchy_bridge, Hydra with pg_quack, and now the official DuckDB team with its own PG extension, pg_duckdb.

The Second PG Extension Grand Prix

It reminds me of the vector database extension frenzy in the PG community over the past year. As AI went mainstream, the PG world saw at least six vector extensions (pgvector, pgvecto.rs, pg_embedding, lantern, pase, pgvectorscale) racing to outdo each other. Eventually, pgvector, boosted heavily by AWS and others, steamrolled the specialized vector-database market before Oracle/MySQL/MariaDB even rolled out their half-baked offerings.

So, who will become the “pgvector” of the PG OLAP ecosystem? Personally, I’d bet on the official extension overshadowing the community ones. Although pg_duckdb has only just arrived (it hasn’t even reached v0.0.1 yet), its architectural design suggests it’s the likely winner. This extension arms race has only just started, but it’s already converging fast:

Hydra (YC W22), which originally forked Citus’ column store extension to create pg_quack, was so impressed by DuckDB that they abandoned their own engine and teamed up with MotherDuck to build pg_duckdb. This extension, blending Hydra’s PG know-how with DuckDB’s native expertise, can seamlessly read PG tables inside your database, use DuckDB for computation, and directly read Parquet/Iceberg data from the filesystem or S3, which amounts to a “data lakehouse” setup.

Similarly, ParadeDB (YC S23), another YC-backed startup, originally built pg_analytics in Rust for OLAP capabilities and achieved decent traction. They, too, switched gears to build a DuckDB-based pg_lakehouse. Right after the pg_duckdb announcement, ParadeDB founder Philippe essentially waved the white flag and said they’d develop on top of pg_duckdb rather than compete against it.

Meanwhile, Chinese independent developer Li Hongyan created duckdb_fdw as a different approach altogether—using PostgreSQL’s foreign-data-wrapper infrastructure to connect PG and DuckDB. The official DuckDB folks publicly critiqued this, highlighting it as a “bad example,” possibly motivating the birth of “pg_duckdb”: “We have grand visions for uniting PG and Duck, but you moved too fast—here’s the official shock and awe.”

As for CrunchyData’s crunchy_bridge or any other closed-source wrappers, I suspect they’ll struggle to gain broader adoption.

Of course, as the author of the PostgreSQL distribution Pigsty, my position is simply—let them race. I’ll bundle all these extensions and distribute them to end users, so they can pick whatever suits them best. Just like when vector databases were on the rise, I bundled pgvector, pg_embedding, pase, pg_sparse, etc.—the most promising candidates. It doesn’t matter who ultimately wins; PG and Pigsty will always be the ones reaping the spoils.

Speed trumps all, so in Pigsty v3 I’ve already integrated the three most promising extensions: pg_duckdb, pg_lakehouse, and duckdb_fdw, plus the main duckdb binary—all ready to go out of the box. Users can experience a one-stop PostgreSQL solution that handles both OLTP and OLAP—truly an all-conquering HTAP dream come true.

Last modified 2025-03-22: add postgres blogs (117ac1d)