High Availability

Pigsty uses Patroni to achieve high availability for PostgreSQL, ensuring automatic failover.

Overview

Primary Failure RTO ≈ 30s, RPO < 1MB, Replica Failure RTO≈0 (reset current conn)

Pigsty’s PostgreSQL cluster has battery-included high-availability powered by Patroni, Etcd, and HAProxy

When your have two or more instances in the PostgreSQL cluster, you have the ability to self-healing from hardware failures without any further configuration — as long as any instance within the cluster survives, the cluster can serve its services. Clients simply need to connect to any node in the cluster to obtain full services without worrying about replication topology changes.

By default, the recovery time objective (RTO) for primary failure is approximately 30s ~ 60s, and the data recovery point objective (RPO) is < 1MB; for standby failure, RPO = 0, RTO ≈ 0 (instantaneous). In consistency-first mode, zero data loss during failover is guaranteed: RPO = 0. These metrics can be configured as needed based on your actual hardware conditions and reliability requirements.

Pigsty incorporates an HAProxy load balancer for automatic traffic switching, offering multiple access methods for clients such as DNS/VIP/LVS. Failovers and switchover are almost imperceptible to the business side except for sporadic interruptions, meaning applications do not need connection string modifications or restarts.

pigsty-ha

What problems does High-Availability solve?

  • Elevates the availability aspect of data safety C/IA to a new height: RPO ≈ 0, RTO < 30s.
  • Enables seamless rolling maintenance capabilities, minimizing maintenance window requirements for great convenience.
  • Hardware failures can self-heal immediately without human intervention, allowing operations DBAs to sleep soundly.
  • Standbys can carry read-only requests, sharing the load with the primary to make full use of resources.

What are the costs of High Availability?

  • Infrastructure dependency: High availability relies on DCS (etcd/zk/consul) for consensus.
  • Increased entry barrier: A meaningful high-availability deployment environment requires at least three nodes.
  • Additional resource consumption: Each new standby consumes additional resources, which isn’t a major issue.
  • Significantly higher complexity costs: Backup costs significantly increase, requiring tools to manage complexity.

Limitations of High Availability

Since replication is real-time, all changes are immediately applied to the standby. Thus, high-availability solutions based on streaming replication cannot address human errors and software defects that cause data deletions or modifications. (e.g., DROP TABLE, or DELETE data) Such failures require the use of Delayed Clusters or Point-In-Time Recovery using previous base backups and WAL archives.

Strategy RTO (Time to Recover) RPO (Max Data Loss)
Standalone + Do Nothing Permanent data loss, irrecoverable Total data loss
Standalone + Basic Backup Depends on backup size and bandwidth (hours) Loss of data since last backup (hours to days)
Standalone + Basic Backup +
WAL Archiving
Depends on backup size and bandwidth (hours) Loss of last unarchived data (tens of MB)
Primary-Replica + Manual Failover Dozens of minutes Replication Lag (about 100KB)
Primary-Replica + Auto Failover Within a minute Replication Lag (about 100KB)
Primary-Replica + Auto Failover +
Synchronous Commit
Within a minute No data loss

Implementation

In Pigsty, the high-availability architecture works as follows:

  • PostgreSQL uses standard streaming replication to set up physical standby databases. In case of a primary database failure, the standby takes over.
  • Patroni is responsible for managing PostgreSQL server processes and handles high-availability-related matters.
  • Etcd provides Distributed Configuration Store (DCS) capabilities and is used for leader election after a failure.
  • Patroni relies on Etcd to reach a consensus on cluster leadership and offers a health check interface to the outside.
  • HAProxy exposes cluster services externally and utilizes the Patroni health check interface to automatically route traffic to healthy nodes.
  • vip-manager offers an optional layer 2 VIP, retrieves leader information from Etcd, and binds the VIP to the node hosting the primary database.

Upon primary database failure, a new round of leader election is triggered. The healthiest standby in the cluster (with the highest LSN and least data loss) wins and is promoted to the new primary. After the promotion of the winning standby, read-write traffic is immediately routed to the new primary. The impact of a primary failure is temporary unavailability of write services: from the primary’s failure to the promotion of a new primary, write requests will be blocked or directly fail, typically lasting 15 to 30 seconds, usually not exceeding 1 minute.

When a standby fails, read-only traffic is routed to other standbys. If all standbys fail, the primary will eventually carry the read-only traffic. The impact of a standby failure is partial read-only query interruption: queries currently running on the failed standby will be aborted due to connection reset and immediately taken over by another available standby.

Failure detection is jointly completed by Patroni and Etcd. The cluster leader holds a lease, if the cluster leader fails to renew the lease in time (10s) due to a failure, the lease will be released, triggering a failover and a new round of cluster elections.

Even without any failures, you can still proactively perform a Switchover to change the primary of the cluster. In this case, write queries on the primary will be interrupted and immediately routed to the new primary for execution. This operation can typically be used for rolling maintenance/upgrades of the database server.


Trade Offs

The ttl can be tuned with pg_rto, which is 30s by default, increasing it will cause longer failover wait time, while decreasing it will increase the false-positive failover rate (e.g. network jitter).

Pigsty will use availability first mode by default, which means when primary fails, it will try to failover ASAP, data not replicated to the replica may be lost (usually 100KB), and the max potential data loss is controlled by pg_rpo, which is 1MB by default.

Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are two parameters that need careful consideration when designing a high-availability cluster.

The default values of RTO and RPO used by Pigsty meet the reliability requirements for most scenarios. You can adjust them based on your hardware level, network quality, and business needs.

The maximum duration of unavailability during a failover is controlled by the pg_rto parameter, with a default value of 30s. Increasing it will lead to a longer duration of unavailability for write operations during primary failover, while decreasing it will increase the rate of false failovers (e.g., due to brief network jitters).

The upper limit of potential data loss is controlled by the pg_rpo parameter, defaulting to 1MB. Lowering this value can reduce the upper limit of data loss during failovers but also increases the likelihood of refusing automatic failovers due to insufficiently healthy standbys (too far behind).

Pigsty defaults to an availability-first mode, meaning that it will proceed with a failover as quickly as possible when the primary fails, and data not yet replicated to the standby might be lost (under regular ten-gigabit networks, replication delay is usually between a few KB to 100KB).

If you need to ensure no data loss during failovers, you can use the crit.yml template to ensure no data loss during failovers, but this will come at the cost of some performance.


Parameters

pg_rto

name: pg_rto, type: int, level: C

recovery time objective in seconds, This will be used as Patroni TTL value, 30s by default.

If a primary instance is missing for such a long time, a new leader election will be triggered.

Decrease the value can reduce the unavailable time (unable to write) of the cluster during failover, but it will make the cluster more sensitive to network jitter, thus increase the chance of false-positive failover.

Config this according to your network condition and expectation to trade-off between chance and impact, the default value is 30s, and it will be populated to the following patroni parameters:

# the TTL to acquire the leader lock (in seconds). Think of it as the length of time before initiation of the automatic failover process. Default value: 30
ttl: {{ pg_rto }}

# the number of seconds the loop will sleep. Default value: 10 , this is patroni check loop interval
loop_wait: {{ (pg_rto / 3)|round(0, 'ceil')|int }}

# timeout for DCS and PostgreSQL operation retries (in seconds). DCS or network issues shorter than this will not cause Patroni to demote the leader. Default value: 10
retry_timeout: {{ (pg_rto / 3)|round(0, 'ceil')|int }}

# the amount of time a primary is allowed to recover from failures before failover is triggered (in seconds), Max RTO: 2 loop wait + primary_start_timeout
primary_start_timeout: {{ (pg_rto / 3)|round(0, 'ceil')|int }}

pg_rpo

name: pg_rpo, type: int, level: C

recovery point objective in bytes, 1MiB at most by default

default values: 1048576, which will tolerate at most 1MiB data loss during failover.

when the primary is down and all replicas are lagged, you have to make a tough choice to trade off between Availability and Consistency:

  • Promote a replica to be the new primary and bring system back online ASAP, with the price of an acceptable data loss (e.g. less than 1MB).
  • Wait for the primary to come back (which may never be) or human intervention to avoid any data loss.

You can use crit.yml crit.yml template to ensure no data loss during failover, but it will sacrifice some performance.


Last modified 2024-03-07: add ca / iac in en/docs (4720967)