High Availability

Pigsty uses Patroni to implement PostgreSQL high availability, ensuring automatic failover when the primary becomes unavailable.

Overview

Pigsty’s PostgreSQL clusters come with out-of-the-box high availability, powered by Patroni, Etcd, and HAProxy.

When your PostgreSQL cluster has two or more instances, you automatically have self-healing database high availability without any additional configuration — as long as any instance in the cluster survives, the cluster can provide complete service. Clients only need to connect to any node in the cluster to get full service without worrying about primary-replica topology changes.

With default configuration, the primary failure Recovery Time Objective (RTO) ≈ 45s, and Recovery Point Objective (RPO) < 1MB; for replica failures, RPO = 0 and RTO ≈ 0 (brief interruption). In consistency-first mode, failover can guarantee zero data loss: RPO = 0. All these metrics can be configured as needed based on your actual hardware conditions and reliability requirements.

Pigsty includes built-in HAProxy load balancers for automatic traffic switching, along with DNS/VIP/LVS and other client access methods. Apart from a brief interruption, failover and switchover are nearly transparent to applications: no connection-string changes or restarts are required. Minimal maintenance-window requirements bring great flexibility and convenience, since you can perform rolling maintenance and upgrades on the entire cluster without coordinating with applications. And because hardware failures can wait until the next business day to be handled, developers, operations engineers, and DBAs can sleep soundly during incidents.


Many large organizations and core institutions have been using Pigsty in production for extended periods. The largest deployment has 25K CPU cores and 220+ PostgreSQL ultra-large instances (64c / 512g / 3TB NVMe SSD). In this deployment case, dozens of hardware failures and various incidents occurred over five years, yet overall availability of over 99.999% was maintained.


What problems does High Availability solve?

  • Elevates the availability aspect of data security (the "A" in the C/I/A triad) to a new level: RPO ≈ 0, RTO < 45s.
  • Gains seamless rolling maintenance capability, minimizing maintenance window requirements and bringing great convenience.
  • Hardware failures can self-heal immediately without human intervention, allowing operations and DBAs to sleep well.
  • Replicas can handle read-only requests, offloading primary load and fully utilizing resources.

What are the costs of High Availability?

  • Infrastructure dependency: HA requires DCS (etcd/zk/consul) for consensus.
  • Higher starting threshold: A meaningful HA deployment requires at least three nodes.
  • Extra resource consumption: Each new replica consumes additional resources, though this is usually not a major concern.
  • Significantly increased complexity: operational and backup costs rise considerably, requiring tooling to manage the added complexity.

Limitations of High Availability

Since replication happens in real time, all changes are immediately applied to replicas. Streaming-replication-based HA solutions therefore cannot guard against data deletion or corruption caused by human error or software defects (e.g., DROP TABLE or a bad DELETE). Such failures require a delayed replica cluster, or point-in-time recovery from a previous base backup and WAL archive.

| Configuration Strategy | RTO | RPO |
|---|---|---|
| Standalone + Nothing | Permanently lost, unrecoverable | All data lost |
| Standalone + Base Backup | Depends on backup size and bandwidth (hours) | Lose data since last backup (hours to days) |
| Standalone + Base Backup + WAL Archive | Depends on backup size and bandwidth (hours) | Lose unarchived data (tens of MB) |
| Primary-Replica + Manual Failover | ~10 minutes | Lose data in replication lag (~100KB) |
| Primary-Replica + Auto Failover | Within 1 minute | Lose data in replication lag (~100KB) |
| Primary-Replica + Auto Failover + Sync Commit | Within 1 minute | No data loss |

How It Works

In Pigsty, the high availability architecture works as follows:

  • PostgreSQL uses standard streaming replication to build physical replicas; replicas take over when the primary fails.
  • Patroni manages PostgreSQL server processes and handles high availability matters.
  • Etcd provides distributed configuration storage (DCS) capability and is used for leader election after failures.
  • Patroni relies on Etcd to reach cluster leader consensus and provides health check interfaces externally.
  • HAProxy exposes cluster services externally and uses Patroni health check interfaces to automatically distribute traffic to healthy nodes.
  • vip-manager provides an optional Layer 2 VIP, retrieves leader information from Etcd, and binds the VIP to the node where the cluster primary resides.
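
The health-check wiring above can be sketched from the client side: Patroni answers GET /primary with HTTP 200 only on the leader, and GET /replica with 200 only on a running replica, which is what HAProxy polls. A minimal probe in Python (node addresses and port 8008 are illustrative defaults, not part of any Pigsty API):

```python
from urllib import request, error

def http_status(url, timeout=1.0):
    """Return the HTTP status code of a GET, or 0 if the node is unreachable."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as e:
        return e.code          # Patroni answers 503 for "not this role"
    except OSError:
        return 0               # connection refused / node down

def classify(primary_code, replica_code):
    """Map the two endpoint results to a node role, the way HAProxy routes traffic."""
    if primary_code == 200:
        return "primary"
    if replica_code == 200:
        return "replica"
    return "down"

# Example against a hypothetical node:
#   classify(http_status("http://10.10.10.11:8008/primary"),
#            http_status("http://10.10.10.11:8008/replica"))
```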

When the primary fails, a new round of leader election is triggered. The healthiest replica in the cluster (the one with the highest LSN position and therefore the least data loss) wins and is promoted to the new primary. Once the winning replica is promoted, read-write traffic is immediately routed to it. The impact of a primary failure is a brief write-service outage: write requests are blocked or fail outright from the moment the primary fails until the new primary is promoted, typically for 15 to 30 seconds and usually no more than 1 minute.

When a replica fails, read-only traffic is routed to other replicas. Only when all replicas fail will read-only traffic ultimately be handled by the primary. The impact of replica failure is partial read-only query interruption: queries currently running on that replica will abort due to connection reset and be immediately taken over by other available replicas.

Failure detection is performed jointly by Patroni and Etcd. The cluster leader holds a lease; if the cluster leader fails to renew the lease in time (10s) due to failure, the lease is released, triggering a Failover and new cluster election.

Even without any failures, you can proactively change the cluster primary through Switchover. In this case, write queries on the primary will experience a brief interruption and be immediately routed to the new primary. This operation is typically used for rolling maintenance/upgrades of database servers.
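
A planned switchover can be driven through Patroni's REST API (POST /switchover with a JSON body naming the current leader and, optionally, a candidate). The sketch below builds such a request; the member names and endpoint address are illustrative:

```python
import json
from urllib import request

def switchover_request(api, leader, candidate=None):
    """Build (but do not send) a Patroni POST /switchover request."""
    body = {"leader": leader}
    if candidate:
        body["candidate"] = candidate
    return request.Request(f"{api}/switchover",
                           data=json.dumps(body).encode(),
                           headers={"Content-Type": "application/json"},
                           method="POST")

req = switchover_request("http://10.10.10.11:8008", "pg-test-1", "pg-test-2")
print(req.full_url, req.data.decode())
# sending it would be: urllib.request.urlopen(req)
```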

1 - RPO Trade-offs

Trade-off analysis for RPO (Recovery Point Objective), finding the optimal balance between availability and data loss.

RPO (Recovery Point Objective) defines the maximum amount of data loss allowed when the primary fails.

For scenarios where data integrity is critical, such as financial transactions, RPO = 0 is typically required, meaning no data loss is allowed.

However, stricter RPO targets come at a cost: higher write latency, reduced system throughput, and the risk that replica failures may cause primary unavailability. For typical scenarios, some data loss is acceptable (e.g., up to 1MB) in exchange for higher availability and performance.


Trade-offs

In asynchronous replication scenarios, there is typically some replication lag between replicas and the primary (depending on network and throughput, normally in the range of 10KB-100KB / 100µs-10ms). This means when the primary fails, replicas may not have fully synchronized with the latest data. If a failover occurs, the new primary may lose some unreplicated data.

The upper limit of potential data loss is controlled by the pg_rpo parameter, which defaults to 1048576 (1MB), meaning up to 1MiB of data loss can be tolerated during failover.

When the cluster primary fails, if any replica has replication lag within this threshold, Pigsty will automatically promote that replica to be the new primary. However, when all replicas exceed this threshold, Pigsty will refuse automatic failover to prevent data loss. Manual intervention is then required to decide whether to wait for the primary to recover (which may never happen) or accept the data loss and force-promote a replica.

You need to configure this value based on your business requirements, making a trade-off between availability and consistency. Increasing this value improves the success rate of automatic failover but also increases the upper limit of potential data loss.

When you set pg_rpo = 0, Pigsty enables synchronous replication, ensuring the primary only returns write success after at least one replica has persisted the data. This configuration ensures zero replication lag but introduces significant write latency and reduces overall throughput.

flowchart LR
    A([Primary Failure]) --> B{Synchronous<br/>Replication?}

    B -->|No| C{Lag < RPO?}
    B -->|Yes| D{Sync Replica<br/>Available?}

    C -->|Yes| E[Lossy Auto Failover<br/>RPO < 1MB]
    C -->|No| F[Refuse Auto Failover<br/>Wait for Primary Recovery<br/>or Manual Intervention]

    D -->|Yes| G[Lossless Auto Failover<br/>RPO = 0]
    D -->|No| H{Strict Mode?}

    H -->|No| C
    H -->|Yes| F

    style A fill:#dc3545,stroke:#b02a37,color:#fff
    style E fill:#F0AD4E,stroke:#146c43,color:#fff
    style G fill:#198754,stroke:#146c43,color:#fff
    style F fill:#BE002F,stroke:#565e64,color:#fff
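
The decision tree in the flowchart can be written down as a small pure function. This is a sketch: the field names and return labels are illustrative, and the real logic lives inside Patroni, not in a function like this.

```python
def failover_decision(sync_mode, strict, sync_replica_ok, lag_bytes,
                      rpo_bytes=1024 * 1024):
    """Return the action taken when the primary fails, per the flowchart."""
    if sync_mode:
        if sync_replica_ok:
            return "lossless-failover"   # RPO = 0
        if strict:
            return "refuse"              # maximum protection: no degraded promote
        # degraded sync mode: fall back to the asynchronous lag check below
    if lag_bytes < rpo_bytes:
        return "lossy-failover"          # bounded loss, RPO < pg_rpo
    return "refuse"                      # lag exceeds pg_rpo: manual intervention

# A replica lagging 200 KB under async replication is promoted automatically:
print(failover_decision(False, False, False, 200_000))   # lossy-failover
```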

Protection Modes

Pigsty provides three protection modes to help users make trade-offs under different RPO requirements, similar to Oracle Data Guard protection modes.

| Name | Maximum Performance | Maximum Availability | Maximum Protection |
|---|---|---|---|
| Replication | Asynchronous | Synchronous | Strict Synchronous |
| Data Loss | Possible (replication lag) | Zero normally, minor when degraded | Zero |
| Write Latency | Lowest | Medium (+1 network RTT) | Medium (+1 network RTT) |
| Throughput | Highest | Reduced | Reduced |
| Replica Failure Impact | None | Auto degrade, service continues | Primary stops writes |
| RPO | < 1MB | = 0 (normal) / < 1MB (degraded) | = 0 |
| Use Case | Typical business, performance first | Critical business, safety first | Financial core, compliance first |
| Configuration | Default config | pg_rpo = 0 | pg_conf: crit.yml |

Implementation

The three protection modes differ in how two core Patroni parameters are configured: synchronous_mode and synchronous_mode_strict:

  • synchronous_mode: Whether Patroni enables synchronous replication. When enabled, synchronous_mode_strict decides whether strict synchronous mode applies.
  • synchronous_mode_strict = false: The default; allows degradation to async mode when replicas fail, so the primary keeps serving (Maximum Availability).
  • synchronous_mode_strict = true: Degradation is forbidden; the primary stops accepting writes until a sync replica recovers (Maximum Protection).

| Mode | synchronous_mode | synchronous_mode_strict | Replication Mode | Replica Failure Behavior |
|---|---|---|---|---|
| Max Performance | false | - | Async | No impact |
| Max Availability | true | false | Synchronous | Auto degrade to async |
| Max Protection | true | true | Strict Synchronous | Primary refuses writes |
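
The mapping above can be expressed as a tiny helper: the two Patroni switches jointly select one of the three protection modes. A sketch (the mode names are the ones used in this document, not identifiers from any API):

```python
def protection_mode(synchronous_mode, synchronous_mode_strict):
    """Derive the protection mode from the two Patroni switches."""
    if not synchronous_mode:
        return "maximum-performance"     # async replication (default)
    if synchronous_mode_strict:
        return "maximum-protection"      # primary refuses writes if sync replica lost
    return "maximum-availability"        # auto degrade to async on replica failure
```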

Typically, you only need to set the pg_rpo parameter to 0 to turn on the synchronous_mode switch, activating Maximum Availability mode. The pg_conf = crit.yml template additionally enables the synchronous_mode_strict switch, activating Maximum Protection mode. You can also enable the watchdog to fence the primary outright when the node or Patroni freezes, rather than degrading, achieving behavior equivalent to Oracle's Maximum Protection mode.

You can also directly configure these Patroni parameters as needed. Refer to Patroni and PostgreSQL documentation to achieve stronger data protection, such as:

  • Specify the synchronous replica list, configure more sync replicas to improve disaster tolerance, use quorum synchronous commit, or even require all replicas to perform synchronous commit.
  • Configure synchronous_commit: 'remote_apply' to strictly ensure primary-replica read-write consistency. (Oracle Maximum Protection mode is equivalent to remote_write)

Recommendations

Maximum Performance mode (asynchronous replication) is the default mode used by Pigsty and is sufficient for the vast majority of workloads. Tolerating minor data loss during failures (typically in the range of a few KB to hundreds of KB) in exchange for higher throughput and availability is the recommended configuration for typical business scenarios. In this case, you can adjust the maximum allowed data loss through the pg_rpo parameter to suit different business needs.

Maximum Availability mode (synchronous replication) is suitable for scenarios with high data integrity requirements that cannot tolerate data loss. In this mode, a PostgreSQL cluster of at least two nodes (one primary, one replica) is required. Set pg_rpo to 0 to enable this mode.

Maximum Protection mode (strict synchronous replication) is suitable for financial transactions, medical records, and other scenarios with extremely high data integrity requirements. We recommend using at least a three-node cluster (one primary, two replicas), because with only two nodes, if the replica fails, the primary will stop writes, causing service unavailability, which reduces overall system reliability. With three nodes, if only one replica fails, the primary can continue to serve.

2 - Failure Model

Detailed analysis of worst-case, best-case, and average RTO calculation logic and results across the two classic failure detection/recovery paths

Patroni failures can be classified into 10 categories by failure target, and further consolidated into five categories based on detection path, which are detailed in this section.

| # | Failure Scenario | Description | Final Path |
|---|---|---|---|
| 1 | PG process crash | Crash, OOM killed | Active Detection |
| 2 | PG connection refused | max_connections exhausted | Active Detection |
| 3 | PG zombie | Process alive but unresponsive | Active Detection (timeout) |
| 4 | Patroni process crash | kill -9, OOM | Passive Detection |
| 5 | Patroni zombie | Process alive but stuck | Watchdog |
| 6 | Node down | Power outage, hardware failure | Passive Detection |
| 7 | Node zombie | IO hang, CPU starvation | Watchdog |
| 8 | Primary ↔ DCS network failure | Firewall, switch failure | Network Partition |
| 9 | Storage failure | Disk failure, disk full, mount failure | Active Detection or Watchdog |
| 10 | Manual switchover | Switchover/Failover | Manual Trigger |
However, for RTO calculation purposes, all failures ultimately converge to two paths. This section explores the upper bound, lower bound, and average RTO for these two scenarios.

flowchart LR
    A([Primary Failure]) --> B{Patroni<br/>Detected?}

    B -->|PG Crash| C[Attempt Local Restart]
    B -->|Node Down| D[Wait TTL Expiration]

    C -->|Success| E([Local Recovery])
    C -->|Fail/Timeout| F[Release Leader Lock]

    D --> F
    F --> G[Replica Election]
    G --> H[Execute Promote]
    H --> I[HAProxy Detects]
    I --> J([Service Restored])

    style A fill:#dc3545,stroke:#b02a37,color:#fff
    style E fill:#198754,stroke:#146c43,color:#fff
    style J fill:#198754,stroke:#146c43,color:#fff

2.1 - Model of Patroni Passive Failure

Failover path triggered by node crash causing leader lease expiration and cluster election

RTO Timeline


Failure Model

| Phase | Best | Worst | Average | Description |
|---|---|---|---|---|
| Lease Expiration | ttl - loop | ttl | ttl - loop/2 | Best: crash just before refresh; Worst: crash right after refresh |
| Replica Detect | 0 | loop | loop/2 | Best: exactly at check point; Worst: just missed check point |
| Election Promote | 0 | 2 | 1 | Best: direct lock and promote; Worst: API timeout + Promote |
| HAProxy Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: state change before check; Worst: state change right after check |

Key Difference Between Passive and Active Failover:

| Scenario | Patroni Status | Lease Handling | Primary Wait Time |
|---|---|---|---|
| Active Failover (PG crash) | Alive, healthy | Actively tries to restart PG, releases lease on timeout | primary_start_timeout |
| Passive Failover (Node crash) | Dies with node | Cannot actively release, must wait for TTL expiration | ttl |

In passive failover scenarios, Patroni dies along with the node and cannot actively release the Leader Key. The lease in DCS can only trigger cluster election after TTL naturally expires.


Timeline Analysis

Phase 1: Lease Expiration

The Patroni primary refreshes the Leader Key every loop_wait cycle, resetting TTL to the configured value.

Timeline:
     t-loop        t          t+ttl-loop    t+ttl
       |           |              |           |
    Last Refresh  Failure      Best Case   Worst Case
       |←── loop ──→|              |           |
       |←──────────── ttl ─────────────────────→|
  • Best case: Failure occurs just before lease refresh (elapsed loop since last refresh), remaining TTL = ttl - loop
  • Worst case: Failure occurs right after lease refresh, must wait full ttl
  • Average case: ttl - loop/2
$$
T_{expire} = \begin{cases} ttl - loop & \text{Best} \\ ttl - loop/2 & \text{Average} \\ ttl & \text{Worst} \end{cases}
$$

Phase 2: Replica Detection

Replicas wake up on loop_wait cycles and check the Leader Key status in DCS.

Timeline:
    Lease Expired   Replica Wakes
       |            |
       |←── 0~loop ─→|
  • Best case: Replica happens to wake when lease expires, wait 0
  • Worst case: Replica just entered sleep when lease expires, wait loop
  • Average case: loop/2
$$
T_{detect} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}
$$

Phase 3: Lock Contest & Promote

When replicas detect Leader Key expiration, they start the election process. The replica that acquires the Leader Key executes pg_ctl promote to become the new primary.

  1. Patroni queries each replica's replication position in parallel via the REST API (typically ~10ms, with a hardcoded 2s timeout).
  2. WAL positions are compared to determine the best candidate; replicas attempt to create the Leader Key (a CAS atomic operation).
  3. The winner executes pg_ctl promote to become primary (very fast, typically negligible).
Election Flow:
  ReplicaA ──→ Query replication position ──→ Compare ──→ Contest lock ──→ Success
  ReplicaB ──→ Query replication position ──→ Compare ──→ Contest lock ──→ Fail
  • Best case: Single replica or immediate lock acquisition and promotion, constant overhead 0.1s
  • Worst case: DCS API call timeout: 2s
  • Average case: 1s constant overhead
$$
T_{elect} = \begin{cases} 0.1 & \text{Best} \\ 1 & \text{Average} \\ 2 & \text{Worst} \end{cases}
$$

Phase 4: Health Check

HAProxy detects the new primary online, requiring rise consecutive successful health checks.

Detection Timeline:
  New Primary    First Check   Second Check  Third Check (UP)
     |          |           |           |
     |←─ 0~inter ─→|←─ fast ─→|←─ fast ─→|
  • Best case: New primary promoted just before check, (rise-1) × fastinter
  • Worst case: New primary promoted right after check, (rise-1) × fastinter + inter
  • Average case: (rise-1) × fastinter + inter/2
$$
T_{haproxy} = \begin{cases} (rise-1) \times fastinter & \text{Best} \\ (rise-1) \times fastinter + inter/2 & \text{Average} \\ (rise-1) \times fastinter + inter & \text{Worst} \end{cases}
$$

RTO Formula

Sum all phase times to get total RTO:

Best Case

$$
RTO_{min} = ttl - loop + 0.1 + (rise-1) \times fastinter
$$

Average Case

$$
RTO_{avg} = ttl + 1 + inter/2 + (rise-1) \times fastinter
$$

Worst Case

$$
RTO_{max} = ttl + loop + 2 + inter + (rise-1) \times fastinter
$$

Model Calculation

Substitute the four RTO model parameters into the formulas above:

pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s

Four Mode Calculation Results (unit: seconds, format: min / avg / max)

| Phase | fast | norm | safe | wide |
|---|---|---|---|---|
| Lease Expiration | 15 / 17 / 20 | 25 / 27 / 30 | 50 / 55 / 60 | 100 / 110 / 120 |
| Replica Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock Contest & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| **Total** | 16 / 23 / 29 | 27 / 34 / 41 | 53 / 66 / 78 | 104 / 127 / 150 |
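
The numbers above can be reproduced directly from the phase formulas. A sketch for the `norm` plan (ttl=30, loop=5, inter=2s, fastinter=1s, rise=3), using the 0.1s/1s/2s election constants from Phase 3:

```python
def passive_rto(ttl, loop, inter, fastinter, rise):
    """Best / average / worst RTO for the passive (node-down) failover path."""
    hc = (rise - 1) * fastinter                              # HAProxy rise delay
    best = (ttl - loop) + 0 + 0.1 + hc                       # lease + detect + elect + check
    avg = (ttl - loop / 2) + loop / 2 + 1 + (hc + inter / 2)
    worst = ttl + loop + 2 + (hc + inter)
    return best, avg, worst

best, avg, worst = passive_rto(ttl=30, loop=5, inter=2, fastinter=1, rise=3)
print(best, avg, worst)   # ≈ 27.1 / 34.0 / 41.0, matching the norm row above
```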

2.2 - Model of Patroni Active Failure

PostgreSQL primary process crashes while Patroni stays alive and attempts restart, triggering failover after timeout

RTO Timeline


Failure Model

| Item | Best | Worst | Average | Description |
|---|---|---|---|---|
| Crash Found | 0 | loop | loop/2 | Best: PG crashes right before check; Worst: PG crashes right after check |
| Restart Timeout | 0 | start | start | Best: PG recovers instantly; Worst: wait full start timeout before releasing lease |
| Replica Detect | 0 | loop | loop/2 | Best: right at check point; Worst: just missed check point |
| Elect Promote | 0 | 2 | 1 | Best: acquire lock and promote directly; Worst: API timeout + Promote |
| HAProxy Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: state changes before check; Worst: state changes right after check |

Key Difference Between Active and Passive Failure:

| Scenario | Patroni Status | Lease Handling | Main Wait Time |
|---|---|---|---|
| Active Failure (PG crash) | Alive, healthy | Actively tries to restart PG, releases lease after timeout | primary_start_timeout |
| Passive Failure (node down) | Dies with node | Cannot actively release, must wait for TTL expiry | ttl |

In active failure scenarios, Patroni remains alive and can actively detect PG crash and attempt restart. If restart succeeds, service self-heals; if timeout expires without recovery, Patroni actively releases the Leader Key, triggering cluster election.


Timing Analysis

Phase 1: Failure Detection

Patroni checks PostgreSQL status every loop_wait cycle (via pg_isready or process check).

Timeline:
    Last check      PG crash      Next check
       |              |              |
       |←── 0~loop ──→|              |
  • Best case: PG crashes right before Patroni check, detected immediately, wait 0
  • Worst case: PG crashes right after check, wait for next cycle, wait loop
  • Average case: loop/2
$$
T_{detect} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}
$$

Phase 2: Restart Timeout

After Patroni detects PG crash, it attempts to restart PostgreSQL. This phase has two possible outcomes:

Timeline:
  Crash detected     Restart attempt     Success/Timeout
      |                  |                    |
      |←──── 0 ~ start ─────────────────────→|

Path A: Self-healing Success (Best case)

  • PG restarts successfully, service recovers
  • No failover triggered, extremely short RTO
  • Wait time: 0 (relative to Failover path)

Path B: Failover Required (Average/Worst case)

  • PG still not recovered after primary_start_timeout
  • Patroni actively releases Leader Key
  • Wait time: start
$$
T_{restart} = \begin{cases} 0 & \text{Best (self-healing success)} \\ start & \text{Average (failover required)} \\ start & \text{Worst} \end{cases}
$$

Note: Average case assumes failover is required. If PG can quickly self-heal, overall RTO will be significantly lower.

Phase 3: Standby Detection

Standbys wake up on loop_wait cycle and check Leader Key status in DCS. When primary Patroni releases the Leader Key, standbys discover this and begin election.

Timeline:
    Lease released    Standby wakes
       |                  |
       |←── 0~loop ──────→|
  • Best case: Standby wakes right when lease is released, wait 0
  • Worst case: Standby just went to sleep when lease released, wait loop
  • Average case: loop/2
$$
T_{standby} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}
$$

Phase 4: Lock & Promote

After standbys discover Leader Key vacancy, election begins. The standby that acquires the Leader Key executes pg_ctl promote to become the new primary.

  1. Patroni queries each standby's replication position in parallel via the REST API (typically ~10ms, with a hardcoded 2s timeout).
  2. WAL positions are compared to determine the best candidate; standbys attempt to create the Leader Key (a CAS atomic operation).
  3. The winner executes pg_ctl promote to become primary (very fast, typically negligible).
Election process:
  StandbyA ──→ Query replication position ──→ Compare ──→ Try lock ──→ Success
  StandbyB ──→ Query replication position ──→ Compare ──→ Try lock ──→ Fail
  • Best case: Single standby or direct lock acquisition and promote, constant overhead 0.1s
  • Worst case: DCS API call timeout: 2s
  • Average case: 1s constant overhead
$$
T_{elect} = \begin{cases} 0.1 & \text{Best} \\ 1 & \text{Average} \\ 2 & \text{Worst} \end{cases}
$$

Phase 5: Health Check

HAProxy detects new primary online, requires rise consecutive successful health checks.

Check timeline:
  New primary    First check    Second check   Third check (UP)
     |              |               |               |
     |←─ 0~inter ──→|←─── fast ────→|←─── fast ────→|
  • Best case: New primary comes up right at check time, (rise-1) × fastinter
  • Worst case: New primary comes up right after check, (rise-1) × fastinter + inter
  • Average case: (rise-1) × fastinter + inter/2
$$
T_{haproxy} = \begin{cases} (rise-1) \times fastinter & \text{Best} \\ (rise-1) \times fastinter + inter/2 & \text{Average} \\ (rise-1) \times fastinter + inter & \text{Worst} \end{cases}
$$

RTO Formula

Sum all phase times to get total RTO:

Best Case (PG instant self-healing)

$$
RTO_{min} = 0 + 0 + 0 + 0.1 + (rise-1) \times fastinter \approx (rise-1) \times fastinter
$$

Average Case (Failover required)

$$
RTO_{avg} = loop + start + 1 + inter/2 + (rise-1) \times fastinter
$$

Worst Case

$$
RTO_{max} = loop \times 2 + start + 2 + inter + (rise-1) \times fastinter
$$

Model Calculation

Substituting the four RTO model parameters into the formulas above:

pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s

Calculation Results for Four Modes (unit: seconds, format: min / avg / max)

| Phase | fast | norm | safe | wide |
|---|---|---|---|---|
| Failure Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Restart Timeout | 0 / 15 / 15 | 0 / 25 / 25 | 0 / 45 / 45 | 0 / 95 / 95 |
| Standby Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| **Total** | 1 / 24 / 29 | 2 / 35 / 41 | 3 / 61 / 73 | 4 / 122 / 145 |
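
The worst-case column above can be cross-checked against the RTO_max formula (loop × 2 + start + 2 + inter + (rise-1) × fastinter). A sketch, with per-mode parameters taken from the pg_rto_plan block:

```python
# mode: (loop, start, inter, fastinter, rise) from pg_rto_plan
plans = {
    "fast": (5, 15, 1, 0.5, 3),
    "norm": (5, 25, 2, 1.0, 3),
    "safe": (10, 45, 3, 1.5, 3),
    "wide": (20, 95, 4, 2.0, 3),
}

def active_rto_max(loop, start, inter, fastinter, rise):
    """Worst-case RTO for the active (PG-crash) failure path."""
    return 2 * loop + start + 2 + inter + (rise - 1) * fastinter

for mode, p in plans.items():
    print(mode, active_rto_max(*p))   # 29 / 41 / 73 / 145
```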

Comparison with Passive Failure

| Aspect | Active Failure (PG crash) | Passive Failure (node down) | Description |
|---|---|---|---|
| Detection Mechanism | Patroni active detection | TTL passive expiry | Active detection discovers failure faster |
| Core Wait | start | ttl | start is usually less than ttl, but requires additional failure detection time |
| Lease Handling | Active release | Passive expiry | Active release is more timely |
| Self-healing Possible | Yes | No | Active detection can attempt local recovery |

RTO Comparison (Average case):

| Mode | Active Failure (PG crash) | Passive Failure (node down) | Difference |
|---|---|---|---|
| fast | 24s | 23s | +1s |
| norm | 35s | 34s | +1s |
| safe | 61s | 66s | -5s |
| wide | 122s | 127s | -5s |

Analysis: In fast and norm modes, active failure RTO is slightly higher than passive failure because it waits for primary_start_timeout (start); but in safe and wide modes, since start < ttl - loop, active failure is actually faster. However, active failure has the possibility of self-healing, with potentially extremely short RTO in best case scenarios.

3 - RTO Trade-offs

Trade-off analysis for RTO (Recovery Time Objective), finding the optimal balance between recovery speed and false failover risk.

RTO (Recovery Time Objective) defines the maximum time required for the system to restore write capability when the primary fails.

For critical transaction systems where availability is paramount, the shortest possible RTO is typically required, such as under one minute.

However, shorter RTO comes at a cost: increased false failover risk. Network jitter may be misinterpreted as a failure, leading to unnecessary failovers. For cross-datacenter/cross-region deployments, RTO requirements are typically relaxed (e.g., 1-2 minutes) to reduce false failover risk.


Trade-offs

The upper limit of unavailability during failover is controlled by the pg_rto parameter. Pigsty provides four preset RTO modes: fast, norm, safe, wide, each optimized for different network conditions and deployment scenarios. The default is norm mode (~45 seconds). You can also specify the RTO upper limit directly in seconds, and the system will automatically map to the closest mode.

When the primary fails, the entire recovery process involves multiple phases: Patroni detects the failure, DCS lock expires, new primary election, promote execution, HAProxy detects the new primary. Reducing RTO means shortening the timeout for each phase, which makes the cluster more sensitive to network jitter, thereby increasing false failover risk.

You need to choose the appropriate mode based on actual network conditions, balancing recovery speed and false failover risk. The worse the network quality, the more conservative mode you should choose; the better the network quality, the more aggressive mode you can choose.
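
The text above says a numeric pg_rto value is mapped to the closest preset. The exact mapping rule is not specified here, so this sketch assumes "the smallest preset whose target RTO covers the requested bound":

```python
# Target RTO bounds per preset, from the Four Modes table (assumption: the
# system picks the tightest preset whose bound covers the requested seconds).
PRESETS = [("fast", 30), ("norm", 45), ("safe", 90), ("wide", 150)]

def rto_mode(seconds):
    """Map a requested RTO upper bound in seconds to a preset mode name."""
    for name, target in PRESETS:
        if seconds <= target:
            return name
    return "wide"   # anything beyond 150s still gets the widest preset

print(rto_mode(45), rto_mode(60))   # norm safe
```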

flowchart LR
    A([Primary Failure]) --> B{Patroni<br/>Detected?}

    B -->|PG Crash| C[Attempt Local Restart]
    B -->|Node Down| D[Wait TTL Expiration]

    C -->|Success| E([Local Recovery])
    C -->|Fail/Timeout| F[Release Leader Lock]

    D --> F
    F --> G[Replica Election]
    G --> H[Execute Promote]
    H --> I[HAProxy Detects]
    I --> J([Service Restored])

    style A fill:#dc3545,stroke:#b02a37,color:#fff
    style E fill:#198754,stroke:#146c43,color:#fff
    style J fill:#198754,stroke:#146c43,color:#fff

Four Modes

Pigsty provides four RTO modes to help users make trade-offs under different network conditions.

| Name | fast | norm | safe | wide |
|---|---|---|---|---|
| Use Case | Same rack | Same datacenter (default) | Same region, cross-DC | Cross-region/continent |
| Network | < 1ms, very stable | 1-5ms, normal | 10-50ms, cross-DC | 100-200ms, public network |
| Target RTO | 30s | 45s | 90s | 150s |
| False Failover Risk | Higher | Medium | Lower | Very Low |
| Configuration | pg_rto: fast | pg_rto: norm | pg_rto: safe | pg_rto: wide |

RTO Timeline

Patroni / PG HA has two key failure paths: active failure detection (Patroni detects a PG crash and attempts restart) and passive lease expiration (node down waits for TTL expiration to trigger election).


Implementation

The four RTO modes differ in how the following 10 Patroni and HAProxy HA-related parameters are configured.

| Component | Parameter | fast | norm | safe | wide | Description |
|---|---|---|---|---|---|---|
| patroni | ttl | 20 | 30 | 60 | 120 | Leader lock TTL (seconds) |
| | loop_wait | 5 | 5 | 10 | 20 | HA loop check interval (seconds) |
| | retry_timeout | 5 | 10 | 20 | 30 | DCS operation retry timeout (seconds) |
| | primary_start_timeout | 15 | 25 | 45 | 95 | Primary restart wait time (seconds) |
| | safety_margin | 5 | 5 | 10 | 15 | Watchdog safety margin (seconds) |
| haproxy | inter | 1s | 2s | 3s | 4s | Normal state check interval |
| | fastinter | 0.5s | 1s | 1.5s | 2s | State transition check interval |
| | downinter | 1s | 2s | 3s | 4s | DOWN state check interval |
| | rise | 3 | 3 | 3 | 3 | Consecutive successes to mark UP |
| | fall | 3 | 3 | 3 | 3 | Consecutive failures to mark DOWN |

Patroni Parameters

  • ttl: Leader lock TTL. Primary must renew within this time, otherwise lock expires and triggers election. Directly determines passive failure detection delay.
  • loop_wait: Patroni main loop interval. Each loop performs one health check and state sync, affects failure discovery timeliness.
  • retry_timeout: DCS operation retry timeout. During network partition, Patroni retries continuously within this period; after timeout, primary actively demotes to prevent split-brain.
  • primary_start_timeout: Wait time for Patroni to attempt local restart after PG crash. After timeout, releases Leader lock and triggers failover.
  • safety_margin: Watchdog safety margin. Ensures sufficient time to trigger system restart during failures, avoiding split-brain.
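As a sketch, the norm-mode values above map onto Patroni's DCS dynamic configuration plus the local watchdog section roughly as follows; the key names are Patroni's documented parameters, but treat the layout as illustrative rather than a literal Pigsty template:

```yaml
# Patroni DCS dynamic configuration (norm-mode values from the table above)
ttl: 30                     # leader lock TTL; expiry triggers a new election
loop_wait: 5                # seconds between Patroni HA loop iterations
retry_timeout: 10           # DCS retry window before the primary self-demotes
primary_start_timeout: 25   # wait for a local PG restart before failing over

# Local patroni.yml watchdog section
watchdog:
  mode: automatic
  safety_margin: 5          # time reserved for the watchdog to fire before TTL expiry
```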

HAProxy Parameters

  • inter: Health check interval in normal state, used when service status is stable.
  • fastinter: Check interval during state transition, uses shorter interval to accelerate confirmation when state change detected.
  • downinter: Check interval in DOWN state, uses this interval to probe recovery after service marked DOWN.
  • rise: Consecutive successes required to mark UP. After new primary comes online, must pass rise consecutive checks before receiving traffic.
  • fall: Consecutive failures required to mark DOWN. Service must fail fall consecutive times before being marked DOWN.
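For illustration, a HAProxy primary-service backend wired with the norm-mode check parameters might look like the sketch below; the backend name, server addresses, and the health check against Patroni's REST API port (8008) are assumptions for this example:

```
# haproxy.cfg sketch: primary service backend using norm-mode check values
backend pg-test-primary
    option httpchk OPTIONS /primary          # ask Patroni's REST API: "are you the leader?"
    default-server inter 2s fastinter 1s downinter 2s rise 3 fall 3
    server pg-test-1 10.10.10.11:6432 check port 8008
    server pg-test-2 10.10.10.12:6432 check port 8008
```

Only the server answering `/primary` with 200 passes the check, so write traffic follows leadership changes automatically after `rise` consecutive successes.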

Key Constraint

The core Patroni constraint ensures the primary can complete demotion before its leader lock TTL expires, preventing split-brain:

loop_wait + 2 × retry_timeout ≤ ttl

Data Summary


Recommendations

fast mode is suitable for scenarios with extremely high RTO requirements, but requires sufficiently good network quality (latency < 1ms, very low packet loss). Recommended only for same-rack or same-switch deployments, and should be thoroughly tested in production before enabling.

norm mode is Pigsty's default configuration, sufficient for the vast majority of same-datacenter deployments. An average recovery time of around 21 seconds is within the acceptable range, while a reasonable tolerance window avoids false failovers caused by network jitter.

safe mode is suitable for same-city cross-datacenter deployments with higher network latency or occasional jitter. The longer tolerance window effectively prevents false failovers from network jitter, making it the recommended configuration for cross-datacenter disaster recovery.

wide mode is suitable for cross-region or even cross-continent deployments with high network latency and possible public-network-level packet loss. In such scenarios, stability is more important than recovery speed, so an extremely wide tolerance window ensures very low false failover rate.

RTO figures below are in seconds:

| Mode | Target RTO | Passive RTO | Active RTO | Scenario |
|------|------------|-------------|------------|----------|
| fast | 30 | 16 / 23 / 29 | 1 / 24 / 29 | Same switch, high-quality network |
| norm | 45 | 27 / 34 / 41 | 2 / 35 / 41 | Default, same DC, standard network |
| safe | 90 | 53 / 66 / 78 | 3 / 61 / 73 | Same-city active-active / cross-DC DR |
| wide | 150 | 104 / 127 / 150 | 4 / 122 / 145 | Geo-DR / cross-country |
| default | 326 | 22 / 34 / 46 | 2 / 314 / 326 | Patroni default params |

Typically you only need to set pg_rto to the mode name, and Pigsty will automatically configure Patroni and HAProxy parameters. For backward compatibility, Pigsty still supports configuring RTO directly in seconds, but the effect is equivalent to specifying norm mode.

The mode configuration actually loads the corresponding parameter set from pg_rto_plan. You can modify or override this configuration to implement custom RTO strategies.
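For example, selecting a mode is a single cluster parameter; the inventory fragment below is an illustrative sketch (cluster name and layout are assumptions), not a complete pigsty.yml:

```yaml
pg-test:                  # hypothetical cluster group in the inventory
  vars:
    pg_cluster: pg-test
    pg_rto: safe          # load the 'safe' parameter set for cross-DC deployment
```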

pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s

4 - Service Access

Pigsty uses HAProxy to provide service access, with optional pgBouncer for connection pooling, and optional L2 VIP and DNS access.

Split read and write operations, route traffic correctly, and deliver PostgreSQL cluster capabilities reliably.

Service is an abstraction: it represents the form in which database clusters expose their capabilities externally, encapsulating underlying cluster details.

Services are crucial for stable access in production environments, showing their value during automatic failover in high availability clusters. Personal users typically don’t need to worry about this concept.


Personal Users

The concept of “service” is for production environments. Personal users with single-node clusters can skip the complexity and directly use instance names or IP addresses to access the database.

For example, Pigsty’s default single-node pg-meta.meta database can be connected directly using three different users:

psql postgres://dbuser_dba:DBUser.DBA@10.10.10.10/meta     # Connect directly with DBA superuser
psql postgres://dbuser_meta:DBUser.Meta@10.10.10.10/meta   # Connect with default business admin user
psql postgres://dbuser_view:DBUser.View@pg-meta/meta       # Connect with default read-only user via instance domain name

Service Overview

In real-world production environments, we use primary-replica database clusters based on replication. Within a cluster, one and only one instance serves as the leader (primary) that can accept writes. Other instances (replicas) continuously fetch change logs from the cluster leader to stay synchronized. Replicas can also handle read-only requests, significantly offloading the primary in read-heavy, write-light scenarios. Therefore, distinguishing write requests from read-only requests is a common practice.

Additionally, for production environments with high-frequency, short-lived connections, we pool requests through connection pool middleware (Pgbouncer) to reduce connection and backend process creation overhead. However, for scenarios like ETL and change execution, we need to bypass the connection pool and directly access the database. Meanwhile, high-availability clusters may undergo failover during failures, causing cluster leadership changes. Therefore, high-availability database solutions require write traffic to automatically adapt to cluster leadership changes. These varying access needs (read-write separation, pooled vs. direct connections, failover auto-adaptation) ultimately lead to the abstraction of the Service concept.

At minimum, a database cluster must provide this basic service:

  • Read-write service (primary): read from and write to the database

For production database clusters, at least these two services should be provided:

  • Read-write service (primary): writes data; can only be served by the primary
  • Read-only service (replica): reads data; can be served by replicas, falling back to the primary when no replicas are available

Additionally, depending on specific business scenarios, there may be other services, such as:

  • Default direct service (default): Allows (admin) users to bypass the connection pool and directly access the database
  • Offline replica service (offline): Dedicated replica not serving online read traffic, used for ETL and analytical queries
  • Sync replica service (standby): Read-only service with no replication delay, handled by synchronous standby/primary for read queries
  • Delayed replica service (delayed): Access data from the same cluster as it was some time ago, handled by delayed replicas

Access Services

Pigsty’s service delivery boundary stops at the cluster’s HAProxy. Users can access these load balancers through various means.

The typical approach is to use DNS or VIP access, binding them to all or any number of load balancers in the cluster.

pigsty-access.jpg

You can use different host & port combinations, which provide PostgreSQL service in different ways.

Host

| Type | Sample | Description |
|------|--------|-------------|
| Cluster Domain Name | pg-test | Access via cluster domain name (resolved by dnsmasq @ infra nodes) |
| Cluster VIP Address | 10.10.10.3 | Access via L2 VIP address managed by vip-manager, bound to primary node |
| Instance Hostname | pg-test-1 | Access via any instance hostname (resolved by dnsmasq @ infra nodes) |
| Instance IP Address | 10.10.10.11 | Access any instance's IP address |

Port

Pigsty uses different ports to distinguish between PostgreSQL services:

| Port | Service | Type | Description |
|------|---------|------|-------------|
| 5432 | postgres | Database | Direct access to postgres server |
| 6432 | pgbouncer | Middleware | Access postgres through connection pool middleware |
| 5433 | primary | Service | Access primary pgbouncer (or postgres) |
| 5434 | replica | Service | Access replica pgbouncer (or postgres) |
| 5436 | default | Service | Access primary postgres |
| 5438 | offline | Service | Access offline postgres |

Combinations

# Access via cluster domain
postgres://test@pg-test:5432/test # DNS -> L2 VIP -> primary direct connection
postgres://test@pg-test:6432/test # DNS -> L2 VIP -> primary connection pool -> primary
postgres://test@pg-test:5433/test # DNS -> L2 VIP -> HAProxy -> primary connection pool -> primary
postgres://test@pg-test:5434/test # DNS -> L2 VIP -> HAProxy -> replica connection pool -> replica
postgres://dbuser_dba@pg-test:5436/test # DNS -> L2 VIP -> HAProxy -> primary direct connection (for admin)
postgres://dbuser_stats@pg-test:5438/test # DNS -> L2 VIP -> HAProxy -> offline direct connection (for ETL/personal queries)

# Access via cluster VIP directly
postgres://test@10.10.10.3:5432/test # L2 VIP -> primary direct access
postgres://test@10.10.10.3:6432/test # L2 VIP -> primary connection pool -> primary
postgres://test@10.10.10.3:5433/test # L2 VIP -> HAProxy -> primary connection pool -> primary
postgres://test@10.10.10.3:5434/test # L2 VIP -> HAProxy -> replica connection pool -> replica
postgres://dbuser_dba@10.10.10.3:5436/test # L2 VIP -> HAProxy -> primary direct connection (for admin)
postgres://dbuser_stats@10.10.10.3:5438/test # L2 VIP -> HAProxy -> offline direct connection (for ETL/personal queries)

# Directly specify any cluster instance name
postgres://test@pg-test-1:5432/test # DNS -> database instance direct connection (singleton access)
postgres://test@pg-test-1:6432/test # DNS -> connection pool -> database
postgres://test@pg-test-1:5433/test # DNS -> HAProxy -> connection pool -> database read/write
postgres://test@pg-test-1:5434/test # DNS -> HAProxy -> connection pool -> database read-only
postgres://dbuser_dba@pg-test-1:5436/test # DNS -> HAProxy -> database direct connection
postgres://dbuser_stats@pg-test-1:5438/test # DNS -> HAProxy -> database offline read/write

# Directly specify any cluster instance IP access
postgres://test@10.10.10.11:5432/test # Database instance direct connection (directly specify instance, no automatic traffic distribution)
postgres://test@10.10.10.11:6432/test # Connection pool -> database
postgres://test@10.10.10.11:5433/test # HAProxy -> connection pool -> database read/write
postgres://test@10.10.10.11:5434/test # HAProxy -> connection pool -> database read-only
postgres://dbuser_dba@10.10.10.11:5436/test # HAProxy -> database direct connection
postgres://dbuser_stats@10.10.10.11:5438/test # HAProxy -> database offline read-write

# Smart client: read/write separation via URL
postgres://test@10.10.10.11:6432,10.10.10.12:6432,10.10.10.13:6432/test?target_session_attrs=primary
postgres://test@10.10.10.11:6432,10.10.10.12:6432,10.10.10.13:6432/test?target_session_attrs=prefer-standby