The Astonishing Scalability of PostgreSQL

This article outlines how Cloudflare uses 15 PostgreSQL clusters to scale up to 55 million requests per second.

How Cloudflare Used 15 PG Clusters to Support 55M QPS

In July 2009, a startup team in California, USA created a Content Delivery Network (CDN) called Cloudflare to accelerate internet requests and make web access more stable and faster. Despite facing various challenges in its early days, the company grew at an astounding rate.

Global Overview of Internet Traffic

Today, they handle 20% of all internet traffic, processing 55 million HTTP requests per second. And they accomplish this with just 15 PostgreSQL clusters.

Cloudflare uses PostgreSQL to store service metadata and handle OLTP workloads. However, supporting tenants with different types of workloads on the same cluster presents a challenge. A cluster is a group of database servers, while a tenant is an isolated data space dedicated to a specific user or group of users.


PostgreSQL Scalability

Here’s how they pushed PostgreSQL’s scalability to its limits.

1. Contention

Most clients compete for Postgres connections. But Postgres connections are expensive because each connection is a separate OS-level process. And since each tenant has a unique workload type, it’s difficult to create a global threshold for rate limiting.

Moreover, manually restricting misbehaving tenants is a massive undertaking. A tenant might initiate an extremely expensive query, blocking queries from neighboring tenants and starving them. Once a query reaches the database server, isolating it becomes challenging.

Connection Pooling With PgBouncer

Therefore, they use PgBouncer as a connection pool in front of Postgres. PgBouncer acts as a TCP proxy, pooling Postgres connections. Tenants connect to PgBouncer rather than directly to Postgres. This limits the number of Postgres connections and prevents connection starvation.

Additionally, PgBouncer avoids the high overhead of repeatedly creating and destroying database connections by keeping server connections persistent, and it can throttle tenants that issue high-cost queries at runtime.
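
To make the setup concrete, here is a minimal sketch of what connecting through PgBouncer looks like from the application side; the host name, credentials, and database name below are hypothetical placeholders, with 6432 being PgBouncer’s default listen port.

```python
# Minimal sketch: a tenant connects to PgBouncer instead of Postgres directly.
# Host name, credentials, and database name are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="pgbouncer.internal.example",  # PgBouncer, not the Postgres server
    port=6432,                          # PgBouncer's default listen port
    dbname="tenant_db",
    user="tenant_user",
    password="secret",
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")             # proxied over a pooled server connection
    print(cur.fetchone())
conn.close()                            # frees the client slot; the server
                                        # connection stays in PgBouncer's pool
```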

2. Thundering Herd

The Thundering Herd problem occurs when many clients query the server simultaneously, leading to database performance degradation.

Thundering Herd Problem

When applications are redeployed, their state is re-initialized and they open many database connections at once. Tenants then compete for Postgres connections, triggering the thundering herd phenomenon. Cloudflare uses PgBouncer to limit the number of Postgres connections that a specific tenant can create.
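
The mechanism can be illustrated with a toy sketch (this is not Cloudflare’s or PgBouncer’s code): capping how many connections a single tenant may hold turns a redeploy-time connection storm into an orderly queue.

```python
# Illustrative only: bound concurrent connection attempts per tenant with a
# semaphore, so a redeploy-time burst queues instead of stampeding the database.
import threading
import time

MAX_CONNS_PER_TENANT = 5          # hypothetical per-tenant limit
tenant_slots = threading.BoundedSemaphore(MAX_CONNS_PER_TENANT)

def open_connection(i: int) -> None:
    with tenant_slots:            # excess callers wait here, not inside Postgres
        time.sleep(0.1)           # stand-in for real connection setup + query
        print(f"request {i} served")

# Simulate 50 workers reconnecting at once after a redeploy.
threads = [threading.Thread(target=open_connection, args=(i,)) for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```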

3. Performance

Cloudflare doesn’t run PostgreSQL in the cloud; it uses bare-metal machines with no virtualization overhead to get the best performance.

Load Balancing Traffic Between Database Instances

Cloudflare uses HAProxy as a layer-four load balancer. PgBouncer forwards queries to HAProxy, and the HAProxy load balancer balances traffic between the cluster’s primary instance and read-only replicas.
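
As a rough illustration of what layer-four balancing means (not HAProxy’s actual implementation, and simplified here to the read-only replica service), the balancer picks a healthy backend for each new TCP connection without looking at the SQL inside; the replica addresses below are hypothetical.

```python
# Toy round-robin backend selection, the essence of L4 balancing.
# Backend addresses are hypothetical.
import itertools

read_backends = [
    ("replica-1.internal.example", 5432),
    ("replica-2.internal.example", 5432),
]
rr = itertools.cycle(read_backends)

def pick_backend() -> tuple[str, int]:
    """Return the next backend for a new client connection."""
    return next(rr)

for _ in range(4):
    print(pick_backend())
```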

4. Concurrency

Performance decreases when many tenants make concurrent queries.

Throttling Tenants With a Congestion Avoidance Algorithm

Therefore, Cloudflare uses the TCP Vegas congestion control algorithm to throttle tenants. This algorithm works by first sampling each tenant’s transaction round-trip time (RTT) to Postgres, then continuously adjusting the connection pool size as long as the RTT doesn’t degrade, enabling throttling before resources are exhausted.
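
The sketch below is a simplified, illustrative adaptation of the Vegas idea to pool sizing, not Cloudflare’s actual code: remember the best transaction RTT seen so far, grow the tenant’s pool while measured RTT stays near that baseline, and shrink it once RTT inflates. The thresholds are arbitrary.

```python
# Illustrative Vegas-style controller for a tenant's connection pool size.
# Thresholds and the RTT source are assumptions, not Cloudflare's values.

class VegasPoolController:
    def __init__(self, min_size: int = 1, max_size: int = 20) -> None:
        self.pool_size = min_size
        self.min_size = min_size
        self.max_size = max_size
        self.base_rtt = float("inf")   # best transaction RTT seen so far

    def on_sample(self, rtt: float) -> int:
        """Feed one measured transaction RTT (seconds); return the new pool size."""
        self.base_rtt = min(self.base_rtt, rtt)
        # Estimated number of transactions queuing because RTT exceeds baseline,
        # analogous to Vegas' expected-vs-actual throughput difference.
        queued = self.pool_size * (1 - self.base_rtt / rtt)
        if queued < 1:        # RTT still near baseline: room to grow
            self.pool_size = min(self.pool_size + 1, self.max_size)
        elif queued > 2:      # RTT degrading: back off before saturation
            self.pool_size = max(self.pool_size - 1, self.min_size)
        return self.pool_size

ctl = VegasPoolController()
for rtt in (0.010, 0.010, 0.012, 0.025, 0.040):  # made-up RTT samples
    print(ctl.on_sample(rtt))
```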

5. Queuing

Cloudflare queues queries at the PgBouncer level. The order of queries in the queue depends on their historical resource usage—in other words, queries requiring more resources are placed at the end of the queue.

Ordering Queries in a Priority Queue

Cloudflare enables priority queuing only during peak traffic to prevent resource starvation; during normal traffic, queries are served in arrival order, so high-cost queries don’t always end up at the back of the queue.

This approach improves latency for the vast majority of queries, though tenants initiating high-cost queries during peak traffic will observe higher latency.
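
A minimal sketch of the ordering idea follows; the tenant names, historical cost numbers, and the way cost is attributed are made-up assumptions, not Cloudflare’s accounting.

```python
# Toy priority queue: queries from tenants with higher historical cost are
# served later. Cost values and tenant names are made up for illustration.
import heapq
import itertools

historical_cost = {"tenant_a": 1.0, "tenant_b": 12.5, "tenant_c": 3.2}
counter = itertools.count()     # tie-breaker keeps FIFO order within equal cost
queue: list[tuple[float, int, str, str]] = []

def enqueue(tenant: str, sql: str) -> None:
    heapq.heappush(queue, (historical_cost[tenant], next(counter), tenant, sql))

def dequeue() -> tuple[str, str]:
    _, _, tenant, sql = heapq.heappop(queue)
    return tenant, sql

enqueue("tenant_b", "SELECT ... /* expensive analytical query */")
enqueue("tenant_a", "SELECT 1")
enqueue("tenant_c", "UPDATE ...")

while queue:
    print(dequeue())   # tenant_a first, tenant_b last
```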

6. High Availability

Cloudflare uses Stolon cluster management for Postgres high availability.

High Availability of the Data Layer With Stolon

Stolon can be used to set up Postgres master-slave replication and is responsible for electing a Postgres cluster leader (primary) and handling failover when problems arise.

Each database cluster here replicates to two regions, with three instances in each region.

Write requests are routed to the primary in the main region, and the changes are then asynchronously replicated to the secondary region, while read requests are routed to the secondary region for processing.
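
A toy router expressing that rule might look like the following; the endpoints and region layout are hypothetical, and in Cloudflare’s setup the actual routing is handled by PgBouncer and HAProxy as described earlier.

```python
# Toy read/write router mirroring the described topology: writes go to the
# primary in the main region, reads to the secondary region. Endpoints are hypothetical.
PRIMARY = ("pg-main-1.example", 5432)            # primary, main region
SECONDARY_READERS = [
    ("pg-secondary-1.example", 5432),
    ("pg-secondary-2.example", 5432),
    ("pg-secondary-3.example", 5432),
]

def route(query: str) -> tuple[str, int]:
    """Send writes to the primary; spread reads over the secondary region."""
    if query.lstrip().upper().startswith(("SELECT", "SHOW")):
        return SECONDARY_READERS[hash(query) % len(SECONDARY_READERS)]
    return PRIMARY

print(route("SELECT * FROM zones"))
print(route("UPDATE zones SET plan = 'pro' WHERE id = 1"))
```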

Cloudflare conducts component interconnectivity tests to actively detect network partition issues, performs chaos testing to optimize system resilience, and configures redundant network switches and routers to avoid network partitioning.

When failover completes and the old primary instance comes back online, they use the pg_rewind tool to resynchronize it with the new timeline so it can rejoin the cluster as a replica.

Cloudflare has over 100 Postgres primary and replica instances combined. They use a combination of OS resource management, queuing theory, congestion control algorithms, and even PostgreSQL metrics to achieve PostgreSQL scalability.


Evaluation and Discussion

This is a valuable experience-sharing article, mainly introducing how to use PgBouncer to solve PostgreSQL’s scalability issues. Fifty-five million QPS and 20% of internet traffic sound like a significant scale. Although from a PostgreSQL expert’s perspective the practices described here may seem somewhat rudimentary, the article raises a meaningful question: PostgreSQL’s scalability.

Current State of PostgreSQL Scalability

PostgreSQL is well regarded for both vertical and horizontal scalability. For read requests, PostgreSQL has no real scalability problem: since reads and writes don’t block each other, the throughput ceiling for read-only queries grows almost linearly with the resources (CPU) invested, whether by scaling up CPU/memory or by scaling out with replica instances.

PostgreSQL’s write scalability isn’t as strong as its read capabilities. Single-machine WAL write/replay speed reaches a software bottleneck at 100 MB/s to 300 MB/s—but for regular production OLTP loads, this is already a substantial value. As a reference, an application like Tantan, with 200 million users and 10 million daily active users, has a structured data rate for all database writes of around 120 MB/s. The PostgreSQL community is also discussing ways to expand this bottleneck through DIO/AIO and parallel WAL replay. Users can also consider using Citus or other sharding middleware to achieve write scalability.

In terms of capacity, PostgreSQL’s scalability primarily depends on disk space and doesn’t have inherent bottlenecks. With today’s NVMe SSDs offering 64TB per card, supporting hundreds of terabytes of data capacity is no problem when combined with compression cards. Even larger capacities can be supported using RAID or multiple tablespaces. The community has reported numerous OLTP instances in the hundreds of terabytes range, with occasional instances at the petabyte level. The challenges with large instances are mainly in backup management and space maintenance, not performance.

In the past, a notable criticism of PostgreSQL’s scalability was its support for massive connections (significantly improved after PostgreSQL 14). PostgreSQL uses a multi-process architecture like Oracle by default. This design provides better reliability but can be a bottleneck when facing massive high-concurrency scenarios.

In internet scenarios, database access patterns primarily involve massive numbers of short connections: a connection is created for a single query and destroyed after execution. PHP used to work this way, a pattern that suited MySQL’s thread-per-connection model. But for PostgreSQL, huge numbers of backend processes and frequent process creation/destruction waste considerable hardware and software resources, making its performance somewhat inadequate in these scenarios.

Connection Pooling — Solving High Concurrency Issues

PostgreSQL recommends a default connection count of about twice the number of CPU cores, typically appropriate in the range of a few dozen to a few hundred. In internet scenarios with thousands or tens of thousands of client connections directly connecting to PostgreSQL, there would be significant additional overhead. Connection pooling emerged to solve this problem—it can be said that connection pooling is a must-have for using PostgreSQL in internet scenarios, capable of transforming the ordinary into the extraordinary.

Note that PostgreSQL is not incapable of high throughput; the key issue is the number of concurrent connections. In “How Strong is PG Performance,” we achieved a sysbench point-query throughput peak of 2.33 million on a 92 vCPU server using about 96 connections. Once concurrency grows beyond what the available resources can handle, this maximum throughput slowly declines as concurrency increases further.

Using connection pooling has several significant benefits: First, tens of thousands of client connections can be pooled and buffered down to a few active server connections (using transaction-level connection pooling), greatly reducing the number of processes and overhead on the operating system, and avoiding the overhead of process creation and destruction. Second, concurrent contention is greatly reduced due to the reduction in active connections, further optimizing performance. Third, sudden load peaks will queue at the connection pool instead of overwhelming the database, reducing the probability of cascading failures and improving system stability.
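
The first benefit is easiest to see with a small sketch of transaction-level pooling (illustrative only, not PgBouncer’s implementation): a server connection is borrowed for exactly one transaction and returned immediately, so a handful of server connections can serve far more client sessions.

```python
# Illustrative transaction-level pooling: many client sessions share a tiny
# pool of "server connections", each borrowed only for one transaction.
import queue
import threading

SERVER_POOL_SIZE = 4                       # hypothetical active server connections
server_pool: "queue.Queue[int]" = queue.Queue()
for conn_id in range(SERVER_POOL_SIZE):
    server_pool.put(conn_id)

def run_transaction(client_id: int) -> None:
    conn_id = server_pool.get()            # borrow for exactly one transaction
    try:
        pass                               # BEGIN ... COMMIT would happen here
    finally:
        server_pool.put(conn_id)           # returned immediately after COMMIT

# 200 client sessions are served by only 4 server connections.
clients = [threading.Thread(target=run_transaction, args=(i,)) for i in range(200)]
for c in clients:
    c.start()
for c in clients:
    c.join()
```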

Performance and Bottlenecks

I accumulated many best practices with PgBouncer at Tantan. We had a core database cluster with 500,000 QPS, 20,000 client connections on the primary, and a write TPS of about 50,000. Such a load would immediately overwhelm Postgres if directed straight at it. Therefore, between the application and the database sat a PgBouncer connection-pooling middleware. All twenty thousand client connections, after transaction pooling, needed only 5-8 active server connections to serve every request, with CPU utilization of about 20%: a tremendous performance improvement.

PgBouncer is a lightweight connection pool that can be deployed on either the application side or the database side. Because PgBouncer runs in single-process mode, a single instance has a QPS/TPS ceiling of roughly 30,000-50,000. Therefore, to avoid a PgBouncer single point of failure and bottleneck, we ran 4 interchangeable PgBouncer instances in front of the core primary and distributed traffic evenly across them with HAProxy before it reached the database primary. For most scenarios, however, a single PgBouncer process handling around 30,000 QPS is more than sufficient.

Management Flexibility

A huge advantage of PgBouncer is that it can provide query response time (RT) metrics at the user, database, and instance level. RT is a core metric for performance measurement, and for older versions of PostgreSQL, PgBouncer’s statistics were the only way to obtain such data. Users can now get per-query-group RT through the pg_stat_statements extension, PostgreSQL 14 and later expose database-level session active time from which transaction RT can be derived, and the newly emerged eBPF tooling can also do the job; even so, the monitoring data provided by PgBouncer remains a very important reference for database management.
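
For example, these statistics can be read from PgBouncer’s admin console with the SHOW STATS command; in the sketch below the host, port, and credentials are hypothetical, and the connecting user must be listed in PgBouncer’s stats_users or admin_users setting.

```python
# Sketch: read per-database query statistics from PgBouncer's admin console.
# Host, port, and credentials are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="pgbouncer.internal.example",
    port=6432,
    dbname="pgbouncer",        # the special admin "database"
    user="stats_user",
    password="secret",
)
conn.autocommit = True          # the admin console does not support transactions
with conn.cursor() as cur:
    cur.execute("SHOW STATS")
    columns = [desc[0] for desc in cur.description]
    for row in cur.fetchall():
        print(dict(zip(columns, row)))
conn.close()
```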

The PgBouncer connection pool not only provides performance improvements but also offers handles for fine-grained management. For example, in online database migration without downtime, if online traffic completely accesses through the connection pool, you can simply redirect read/write traffic from the old cluster to the new one by modifying the PgBouncer configuration file, without even requiring immediate participation from the business side to change configurations and restart services. You can also, like in Cloudflare’s example above, modify Database/User parameters in the connection pool to implement throttling capabilities. If a database tenant behaves poorly, affecting the entire shared cluster, administrators can easily implement throttling and blocking capabilities in PgBouncer.
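
As a sketch of what such a redirect can look like (host, credentials, and the database alias are hypothetical, and the exact procedure depends on the deployment): after editing pgbouncer.ini to point the alias at the new cluster, the admin console’s PAUSE, RELOAD, and RESUME commands apply the change without restarting client applications.

```python
# Sketch of a zero-downtime redirect through PgBouncer's admin console.
# Host, credentials, and the "mydb" alias are hypothetical.
import psycopg2

admin = psycopg2.connect(
    host="pgbouncer.internal.example",
    port=6432,
    dbname="pgbouncer",
    user="admin_user",
    password="secret",
)
admin.autocommit = True
with admin.cursor() as cur:
    cur.execute("PAUSE mydb")     # wait for mydb's in-flight transactions to finish
    cur.execute("RELOAD")         # re-read pgbouncer.ini (now pointing at the new host)
    cur.execute("RESUME mydb")    # release queued clients to the new backend
admin.close()
```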

Other Alternatives

There are other connection pool products in the PostgreSQL ecosystem. PGPool-II, which emerged around the same time as PgBouncer, was once a strong competitor: it provided more powerful load balancing and read/write splitting capabilities and could fully utilize multiple cores, but it was more invasive to PostgreSQL itself (an extension had to be installed to use it fully) and once carried a significant performance penalty (around 30%). So in the connection pool battle, the simple and lightweight PgBouncer won out and now occupies the mainstream niche among PG connection pools.

Besides PgBouncer, new PostgreSQL connection pool projects keep emerging, such as Odyssey, pgcat, pgagroal, and ZQPool. I very much look forward to a higher-performance, more user-friendly drop-in replacement that is fully compatible with PgBouncer.

Additionally, many programming languages’ standard database drivers now have built-in connection pools, and PostgreSQL 14’s improvements have reduced the overhead of the multi-process model. With the exponential growth of hardware performance (servers with 512 vCPUs now exist, and memory is no longer a scarce resource), sometimes skipping the connection pool and directly handling a few thousand connections is also a viable option.

Can I Use Cloudflare’s Practices?

With the continuous improvement of hardware performance, the ongoing optimization of software architecture, and the gradual popularization of management best practices: high availability, high concurrency, and high performance (scalability) have become old topics for internet companies and are basically no longer new technology.

For example, nowadays even a junior DBA or ops engineer can easily achieve all of this by deploying a PostgreSQL cluster with Pigsty: the PgBouncer connection pool mentioned by Cloudflare, as well as Patroni (which replaces the high-availability component Stolon), are ready to use out of the box. As long as the hardware is sufficient, handling millions of concurrent requests is not a dream.

At the beginning of this century, an Apache server could only handle a miserable one or two hundred concurrent requests. Even the most excellent software could hardly handle tens of thousands of concurrent connections; the industry had the famous C10K problem, and anyone who could sustain thousands of concurrent connections was considered an expert. But with the successive emergence of epoll and Nginx in 2003/2004, “high concurrency” was no longer a hard problem: any novice who learned to configure Nginx could achieve what masters a few years earlier could not even dream of, as Swedish Marcus notes in “Cloud Providers’ View of Customers: Poor, Idle, and Lacking Love.”

In just the same way that any novice can now use Nginx to serve the massive web traffic and high concurrency that masters using httpd once could not even dream of, PostgreSQL’s scalability has reached ordinary users with the popularization of PgBouncer.

For example, in Pigsty, a PgBouncer instance is deployed 1:1 with every PostgreSQL instance by default, running in transaction pooling mode and included in monitoring. The default Primary and Replica services also access the Postgres database through PgBouncer. Users don’t need to worry much about PgBouncer-related details: for instance, PgBouncer’s databases and users are maintained automatically when Postgres databases/users are created through the provided scripts. Common configuration considerations and pitfalls are also handled in the preset configuration templates, aiming for out-of-the-box usability.

Of course, for non-internet scenario applications, PgBouncer is not a must-have. And the default Transaction Pooling, although excellent in performance, comes at the cost of sacrificing some session-level functionality. So you can also configure Primary/Replica services to connect directly to Postgres, bypassing PgBouncer; or use the Session Pooling mode with the best compatibility.

Overall, PgBouncer is indeed a very practical tool in the PostgreSQL ecosystem. If your system has high requirements for PostgreSQL client concurrent connections, be sure to try this middleware when testing performance.

Original article: How Cloudflare Used 15 PG Clusters to Support 55M QPS
