High Availability

Pigsty uses Patroni to implement PostgreSQL high availability, ensuring automatic failover when the primary becomes unavailable.

Overview

Pigsty’s PostgreSQL clusters come with out-of-the-box high availability, powered by Patroni, Etcd, and HAProxy.

When your PostgreSQL cluster has two or more instances, you automatically have self-healing database high availability without any additional configuration — as long as any instance in the cluster survives, the cluster can provide complete service. Clients only need to connect to any node in the cluster to get full service without worrying about primary-replica topology changes.

With the default configuration, the Recovery Time Objective (RTO) for primary failure is ≈ 30s and the Recovery Point Objective (RPO) is < 1MB; for replica failures, RPO = 0 and RTO ≈ 0 (a brief interruption). In consistency-first mode, failover guarantees zero data loss: RPO = 0. All of these metrics can be tuned to your actual hardware conditions and reliability requirements.

Pigsty includes built-in HAProxy load balancers for automatic traffic switching, and offers DNS/VIP/LVS and other access methods to clients. Failover and switchover are almost transparent to the business side apart from a brief interruption: applications do not need to modify connection strings or restart. The minimal maintenance-window requirement brings great flexibility and convenience: you can perform rolling maintenance and upgrades on the entire cluster without coordinating with applications. And because hardware failures can wait until the next day to be handled, developers, operations engineers, and DBAs can sleep soundly through incidents.

[Figure: Pigsty high availability architecture]

Many large organizations and core institutions have been running Pigsty in production for extended periods. The largest deployment has 25K CPU cores and 220+ ultra-large PostgreSQL instances (64c / 512g / 3TB NVMe SSD). In that deployment, dozens of hardware failures and other incidents occurred over five years, yet overall availability remained above 99.999%.


What problems does High Availability solve?

  • Elevates the availability aspect of data security (the "A" in the C/I/A triad) to a new level: RPO ≈ 0, RTO < 30s.
  • Gains seamless rolling maintenance capability, minimizing maintenance window requirements and bringing great convenience.
  • Hardware failures can self-heal immediately without human intervention, allowing operations and DBAs to sleep well.
  • Replicas can handle read-only requests, offloading primary load and fully utilizing resources.

What are the costs of High Availability?

  • Infrastructure dependency: HA requires DCS (etcd/zk/consul) for consensus.
  • Higher starting threshold: A meaningful HA deployment requires at least three nodes.
  • Extra resource consumption: Each new replica consumes additional resources, though this is usually not a major concern.
  • Significantly increased complexity: operating and backing up a multi-node cluster costs considerably more, and tooling is required to manage this complexity.

Limitations of High Availability

Since replication happens in real time, all changes are immediately applied to replicas. Therefore, streaming-replication-based HA solutions cannot guard against data deletion or modification caused by human error or software defects (e.g., an accidental DROP TABLE or DELETE). Such failures require a delayed cluster, or point-in-time recovery from a previous base backup and the WAL archive.
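
Recovering from such a mistake means rolling the cluster back to a moment before the error. A minimal point-in-time recovery sketch using pgBackRest (the backup tool Pigsty ships); the stanza name and target timestamp are placeholders:

pgbackrest --stanza=pg-meta --type=time \
    --target='2025-01-01 10:00:00+00' \
    --target-action=promote restore   # restore to the given instant, then promote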

| Configuration Strategy | RTO | RPO |
|------------------------|-----|-----|
| Standalone + Nothing | Data permanently lost, unrecoverable | All data lost |
| Standalone + Base Backup | Depends on backup size and bandwidth (hours) | Lose data since last backup (hours to days) |
| Standalone + Base Backup + WAL Archive | Depends on backup size and bandwidth (hours) | Lose unarchived data (tens of MB) |
| Primary-Replica + Manual Failover | ~10 minutes | Lose data in replication lag (~100KB) |
| Primary-Replica + Auto Failover | Within 1 minute | Lose data in replication lag (~100KB) |
| Primary-Replica + Auto Failover + Sync Commit | Within 1 minute | No data loss |

How It Works

In Pigsty, the high availability architecture works as follows:

  • PostgreSQL uses standard streaming replication to build physical replicas; a replica takes over when the primary fails.
  • Patroni manages PostgreSQL server processes and handles the high availability choreography.
  • Etcd provides distributed configuration storage (DCS) and is used for leader election after failures.
  • Patroni relies on Etcd to reach consensus on cluster leadership and exposes health check interfaces externally.
  • HAProxy exposes cluster services externally and uses Patroni's health check interfaces to automatically route traffic to healthy nodes (see the probe sketch after this list).
  • vip-manager provides an optional Layer 2 VIP, retrieves leader information from Etcd, and binds the VIP to the node hosting the cluster primary.
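
You can probe these health check interfaces by hand. A minimal sketch, assuming Patroni's REST API listens on its default port 8008 on node 10.10.10.11:

curl -s -o /dev/null -w '%{http_code}\n' http://10.10.10.11:8008/primary   # 200 if this node is the primary, 503 otherwise
curl -s -o /dev/null -w '%{http_code}\n' http://10.10.10.11:8008/replica   # 200 if this node is a healthy replica
curl -s http://10.10.10.11:8008/patroni                                    # full node status as JSON

HAProxy marks backends up or down based on exactly these status codes, which is how traffic follows the leader after a failover.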

When the primary fails, a new round of leader election is triggered. The healthiest replica in the cluster (the one with the highest LSN position, i.e., the least data loss) wins and is promoted to the new primary. Once the winner is promoted, read-write traffic is immediately routed to it. The impact of a primary failure is a brief write-service outage: write requests are blocked or fail outright from the moment the primary fails until the new primary is promoted, typically for 15 to 30 seconds and usually not exceeding 1 minute.

When a replica fails, read-only traffic is routed to the other replicas; only when all replicas fail does read-only traffic fall back to the primary. The impact of a replica failure is a partial interruption of read-only queries: queries running on that replica are aborted by the connection reset, and reconnecting clients are immediately picked up by the remaining available replicas.

Failure detection is performed jointly by Patroni and Etcd. The cluster leader holds a lease; if the leader fails to renew it in time (10s) due to a failure, the lease is released, triggering a failover and a new leader election.
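
You can observe the lease directly in Etcd. A sketch, assuming Pigsty's default /pg namespace and an etcd endpoint at 10.10.10.10:2379 (both are assumptions; adjust for your environment and any etcd authentication):

etcdctl --endpoints=http://10.10.10.10:2379 get --prefix --keys-only /pg/pg-test   # leader, members, config, ...
etcdctl --endpoints=http://10.10.10.10:2379 get /pg/pg-test/leader                 # current holder of the leader lease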

Even without any failures, you can proactively change the cluster primary through Switchover. In this case, write queries on the primary will experience a brief interruption and be immediately routed to the new primary. This operation is typically used for rolling maintenance/upgrades of database servers.
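
A switchover is typically driven through patronictl (exact flag names vary slightly across Patroni versions; the config path and instance names below are assumptions):

patronictl -c /etc/patroni/patroni.yml list pg-test               # inspect members and current roles
patronictl -c /etc/patroni/patroni.yml switchover pg-test \
    --leader pg-test-1 --candidate pg-test-2 --force              # demote pg-test-1, promote pg-test-2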


Tradeoffs

Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are two parameters that require careful tradeoffs when designing high availability clusters.

The default RTO and RPO values used by Pigsty meet reliability requirements for most scenarios. You can adjust them based on your hardware level, network quality, and business requirements.

The upper limit on unavailability during failover is controlled by the pg_rto parameter, which defaults to 30s. Increasing it lengthens the write-unavailability window after a primary failure, while decreasing it raises the rate of false-positive failovers (e.g., repeated switching due to brief network jitter).

The upper limit of potential data loss is controlled by the pg_rpo parameter, defaulting to 1MB. Reducing this value can lower the data loss ceiling during failover but also increases the probability of refusing automatic failover when replicas are not healthy enough (lagging too far behind).

Pigsty uses availability-first mode by default, meaning it will failover as quickly as possible when the primary fails, and data not yet replicated to replicas may be lost (under typical 10GbE networks, replication lag is usually a few KB to 100KB).

If you need zero data loss during failover, use the crit.yml template, which guarantees RPO = 0 at the cost of some performance.


pg_rto

Parameter name: pg_rto, Type: int, Level: C

Recovery Time Objective (RTO) in seconds. This is used to calculate Patroni’s TTL value, defaulting to 30 seconds.

If the primary instance is missing for this long, a new leader election is triggered. Lower is not always better; this value involves a tradeoff: reducing it shortens the write-unavailability window during cluster failover, but makes the cluster more sensitive to short-lived network jitter, increasing the probability of false-positive failovers. Configure it based on your network conditions and business constraints, balancing failure probability against failure impact.
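
To see the timeouts actually in effect, read Patroni's dynamic configuration through its REST API. A sketch assuming the REST port 8008 and jq installed:

curl -s http://10.10.10.11:8008/config | jq '.ttl, .loop_wait, .retry_timeout'   # TTL and related timeouts (ttl is calculated from pg_rto)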

pg_rpo

Parameter name: pg_rpo, Type: int, Level: C

Recovery Point Objective (RPO) in bytes. Defaults to 1048576 (1MiB), meaning up to 1MiB of data loss can be tolerated during failover.

When the primary goes down and all replicas are lagging, you must make a difficult choice:

  • Promote a replica immediately, accept a bounded data loss (e.g., less than 1MB), and restore service as quickly as possible.
  • Wait for the old primary to come back online (which may never happen) to avoid any data loss, or abandon automatic failover and leave the final decision to a human.

You need to configure this value based on business preference, making a tradeoff between availability and consistency.
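
Under the hood this threshold corresponds to Patroni's maximum_lag_on_failover, whose default is likewise 1048576 bytes (this mapping is an assumption based on the matching defaults). A sketch for inspecting it, again assuming the REST API on port 8008:

curl -s http://10.10.10.11:8008/config | jq '.maximum_lag_on_failover'   # max lag (bytes) a candidate may have and still be auto-promoted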

Additionally, you can always ensure RPO = 0 by enabling synchronous commit (e.g., using the crit.yml template), sacrificing some cluster latency/throughput performance to guarantee data consistency.
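
To verify that synchronous commit is actually in force after applying such a template, query the primary. A sketch assuming the default admin user and a primary at 10.10.10.11:

psql -h 10.10.10.11 -U dbuser_dba -d postgres -c 'SHOW synchronous_standby_names;'                                 # non-empty when sync replication is on
psql -h 10.10.10.11 -U dbuser_dba -d postgres -c 'SELECT application_name, sync_state FROM pg_stat_replication;'   # look for sync_state = sync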

Service Access

Pigsty uses HAProxy to provide service access, with optional pgBouncer for connection pooling, and optional L2 VIP and DNS access.

Split read and write operations, route traffic correctly, and deliver PostgreSQL cluster capabilities reliably.

Service is an abstraction: it represents the form in which database clusters expose their capabilities externally, encapsulating underlying cluster details.

Services are crucial for stable access in production environments, showing their value during automatic failover in high availability clusters. Personal users typically don’t need to worry about this concept.


Personal Users

The concept of “service” is for production environments. Personal users with single-node clusters can skip the complexity and directly use instance names or IP addresses to access the database.

For example, Pigsty’s default single-node pg-meta.meta database can be connected directly using three different users:

psql postgres://dbuser_dba:DBUser.DBA@10.10.10.10/meta     # Connect directly with DBA superuser
psql postgres://dbuser_meta:DBUser.Meta@10.10.10.10/meta   # Connect with default business admin user
psql postgres://dbuser_view:DBUser.View@pg-meta/meta       # Connect with default read-only user via instance domain name

Service Overview

In real-world production environments, we use primary-replica database clusters based on replication. Within a cluster, one and only one instance serves as the leader (primary) that can accept writes. Other instances (replicas) continuously fetch change logs from the cluster leader to stay synchronized. Replicas can also handle read-only requests, significantly offloading the primary in read-heavy, write-light scenarios. Therefore, distinguishing write requests from read-only requests is a common practice.

Additionally, for production environments with high-frequency, short-lived connections, we pool requests through connection pool middleware (Pgbouncer) to reduce connection and backend process creation overhead. However, for scenarios like ETL and change execution, we need to bypass the connection pool and directly access the database. Meanwhile, high-availability clusters may undergo failover during failures, causing cluster leadership changes. Therefore, high-availability database solutions require write traffic to automatically adapt to cluster leadership changes. These varying access needs (read-write separation, pooled vs. direct connections, failover auto-adaptation) ultimately lead to the abstraction of the Service concept.

At a minimum, a database cluster must provide this basic service:

  • Read-write service (primary): Can read from and write to the database

For production database clusters, at least these two services should be provided:

  • Read-write service (primary): writes data; can only be served by the primary.
  • Read-only service (replica): reads data; can be served by replicas, falling back to the primary when no replicas are available.

Additionally, depending on specific business scenarios, there may be other services, such as:

  • Default direct service (default): Allows (admin) users to bypass the connection pool and directly access the database
  • Offline replica service (offline): Dedicated replica not serving online read traffic, used for ETL and analytical queries
  • Sync replica service (standby): Read-only service with no replication delay, handled by synchronous standby/primary for read queries
  • Delayed replica service (delayed): Access data from the same cluster as it was some time ago, handled by delayed replicas

Access Services

Pigsty’s service delivery boundary stops at the cluster’s HAProxy. Users can access these load balancers through various means.

The typical approach is to use DNS or VIP access, binding them to all or any number of load balancers in the cluster.
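
For example, to confirm where the cluster domain name currently points, query the infra node's dnsmasq directly. A sketch assuming 10.10.10.10 is an infra node:

dig +short @10.10.10.10 pg-test   # returns the load balancer IP(s) the cluster name resolves to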

[Figure: Pigsty service access architecture]

You can use different host & port combinations, which provide PostgreSQL service in different ways.

Host

| Type | Sample | Description |
|------|--------|-------------|
| Cluster Domain Name | pg-test | Access via cluster domain name (resolved by dnsmasq @ infra nodes) |
| Cluster VIP Address | 10.10.10.3 | Access via L2 VIP address managed by vip-manager, bound to the primary node |
| Instance Hostname | pg-test-1 | Access via any instance hostname (resolved by dnsmasq @ infra nodes) |
| Instance IP Address | 10.10.10.11 | Access any instance's IP address |

Port

Pigsty uses different ports to distinguish PostgreSQL services:

| Port | Service | Type | Description |
|------|---------|------|-------------|
| 5432 | postgres | Database | Direct access to the postgres server |
| 6432 | pgbouncer | Middleware | Access postgres through the connection pool middleware |
| 5433 | primary | Service | Access the primary via pgbouncer (or postgres) |
| 5434 | replica | Service | Access a replica via pgbouncer (or postgres) |
| 5436 | default | Service | Access the primary postgres directly |
| 5438 | offline | Service | Access the offline postgres directly |

Combinations

# Access via cluster domain
postgres://test@pg-test:5432/test # DNS -> L2 VIP -> primary direct connection
postgres://test@pg-test:6432/test # DNS -> L2 VIP -> primary connection pool -> primary
postgres://test@pg-test:5433/test # DNS -> L2 VIP -> HAProxy -> primary connection pool -> primary
postgres://test@pg-test:5434/test # DNS -> L2 VIP -> HAProxy -> replica connection pool -> replica
postgres://dbuser_dba@pg-test:5436/test # DNS -> L2 VIP -> HAProxy -> primary direct connection (for admin)
postgres://dbuser_stats@pg-test:5438/test # DNS -> L2 VIP -> HAProxy -> offline direct connection (for ETL/personal queries)

# Access via cluster VIP directly
postgres://test@10.10.10.3:5432/test         # L2 VIP -> primary direct access
postgres://test@10.10.10.3:6432/test         # L2 VIP -> primary connection pool -> primary
postgres://test@10.10.10.3:5433/test         # L2 VIP -> HAProxy -> primary connection pool -> primary
postgres://test@10.10.10.3:5434/test         # L2 VIP -> HAProxy -> replica connection pool -> replica
postgres://dbuser_dba@10.10.10.3:5436/test   # L2 VIP -> HAProxy -> primary direct connection (for admin)
postgres://dbuser_stats@10.10.10.3:5438/test # L2 VIP -> HAProxy -> offline direct connection (for ETL/personal queries)

# Directly specify any cluster instance name
postgres://test@pg-test-1:5432/test # DNS -> database instance direct connection (singleton access)
postgres://test@pg-test-1:6432/test # DNS -> connection pool -> database
postgres://test@pg-test-1:5433/test # DNS -> HAProxy -> connection pool -> database read/write
postgres://test@pg-test-1:5434/test # DNS -> HAProxy -> connection pool -> database read-only
postgres://dbuser_dba@pg-test-1:5436/test # DNS -> HAProxy -> database direct connection
postgres://dbuser_stats@pg-test-1:5438/test # DNS -> HAProxy -> database offline read/write

# Directly specify any cluster instance IP access
postgres://test@10.10.10.11:5432/test         # Database instance direct connection (directly specify instance, no automatic traffic distribution)
postgres://test@10.10.10.11:6432/test         # Connection pool -> database
postgres://test@10.10.10.11:5433/test         # HAProxy -> connection pool -> database read/write
postgres://test@10.10.10.11:5434/test         # HAProxy -> connection pool -> database read-only
postgres://dbuser_dba@10.10.10.11:5436/test   # HAProxy -> database direct connection
postgres://dbuser_stats@10.10.10.11:5438/test # HAProxy -> database offline read-write

# Smart client: read/write separation via URL
postgres://test@10.10.10.11:6432,10.10.10.12:6432,10.10.10.13:6432/test?target_session_attrs=primary
postgres://test@10.10.10.11:6432,10.10.10.12:6432,10.10.10.13:6432/test?target_session_attrs=prefer-standby