Model of Patroni Passive Failure

Failover path triggered by node crash causing leader lease expiration and cluster election

RTO Timeline


Failure Model

| Phase | Best | Worst | Average | Description |
|---|---|---|---|---|
| Lease Expiration | ttl - loop | ttl | ttl - loop/2 | Best: crash just before refresh. Worst: crash right after refresh |
| Replica Detection | 0 | loop | loop/2 | Best: exactly at check point. Worst: just missed check point |
| Lock Contest & Promote | 0.1 | 2 | 1 | Best: direct lock and promote. Worst: API timeout + promote |
| Health Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: state change just before check. Worst: state change right after check |

Key Difference Between Passive and Active Failover:

| Scenario | Patroni Status | Lease Handling | Primary Wait Time |
|---|---|---|---|
| Active Failover (PG crash) | Alive, healthy | Actively tries to restart PG, releases lease on timeout | primary_start_timeout |
| Passive Failover (Node crash) | Dies with node | Cannot actively release, must wait for TTL expiration | ttl |

In passive failover scenarios, Patroni dies along with the node and cannot actively release the Leader Key. The lease in DCS can only trigger cluster election after TTL naturally expires.


Timeline Analysis

Phase 1: Lease Expiration

The Patroni primary refreshes the Leader Key every loop_wait cycle, resetting TTL to the configured value.

Timeline:
     t-loop        t          t+ttl-loop    t+ttl
       |           |              |           |
    Last Refresh  Failure      Best Case   Worst Case
       |←── loop ─→|              |           |
       |←────────── ttl ─────────→|           |
                   |←────────── ttl ─────────→|
  • Best case: the failure occurs just before the next lease refresh (a full loop has elapsed since the last refresh), so the remaining TTL is ttl - loop
  • Worst case: the failure occurs right after a lease refresh, so the full ttl must elapse
  • Average case: ttl - loop/2
$$T_{expire} = \begin{cases} ttl - loop & \text{Best} \\ ttl - loop/2 & \text{Average} \\ ttl & \text{Worst} \end{cases}$$
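The three cases can be checked numerically. The sketch below is only an illustration and is not part of Patroni: the function names are hypothetical, and the average assumes the crash instant is uniformly distributed within one refresh cycle.

```python
import random


def lease_wait(ttl: float, loop: float, elapsed: float) -> float:
    """Time from the crash until the Leader Key expires.

    `elapsed` is the time since the last refresh (0 <= elapsed <= loop).
    The key was reset to `ttl` at that refresh, so `ttl - elapsed` remains.
    """
    if not 0 <= elapsed <= loop:
        raise ValueError("elapsed must lie within one loop_wait cycle")
    return ttl - elapsed


def simulate_average(ttl: float, loop: float,
                     samples: int = 100_000, seed: int = 0) -> float:
    """Mean wait, assuming the crash lands uniformly within the cycle."""
    rng = random.Random(seed)
    total = sum(lease_wait(ttl, loop, rng.uniform(0, loop))
                for _ in range(samples))
    return total / samples


if __name__ == "__main__":
    ttl, loop = 30, 5  # ttl and loop of the `norm` plan
    print("best :", lease_wait(ttl, loop, loop))  # crash just before refresh
    print("worst:", lease_wait(ttl, loop, 0))     # crash right after refresh
    print("avg  :", round(simulate_average(ttl, loop), 2))  # near ttl - loop/2
```

With ttl = 30 and loop = 5 the best and worst cases are 25 and 30 seconds, and the simulated mean settles close to 27.5.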

Phase 2: Replica Detection

Replicas wake up on loop_wait cycles and check the Leader Key status in DCS.

Timeline:
    Lease Expired   Replica Wakes
       |            |
       |←── 0~loop ─→|
  • Best case: Replica happens to wake when lease expires, wait 0
  • Worst case: Replica just entered sleep when lease expires, wait loop
  • Average case: loop/2
$$T_{detect} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$
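The detection delay is simply the gap between the expiry instant and the replica's next wake-up. A minimal sketch, assuming the replica wakes on a fixed grid (wake_offset, wake_offset + loop, ...); the function names and the fixed-grid assumption are illustrative and not taken from Patroni:

```python
import math


def detection_wait(expire_at: float, loop: float,
                   wake_offset: float = 0.0) -> float:
    """Delay between lease expiry and the replica's next wake-up."""
    if loop <= 0:
        raise ValueError("loop must be positive")
    # Number of whole cycles until the first wake-up at or after expiry.
    cycles = math.ceil((expire_at - wake_offset) / loop)
    next_wake = wake_offset + cycles * loop
    return next_wake - expire_at


def average_wait(loop: float, steps: int = 1000) -> float:
    """Mean delay over expiry instants spread evenly across one cycle."""
    waits = [detection_wait(loop * i / steps, loop) for i in range(steps)]
    return sum(waits) / steps


if __name__ == "__main__":
    loop = 5.0
    print(detection_wait(10.0, loop))  # expiry coincides with a wake-up
    print(detection_wait(10.5, loop))  # wake-up missed by 0.5s
    print(average_wait(loop))          # approaches loop / 2
```

An expiry that coincides with a wake-up costs nothing, one that is just missed costs almost a full loop, and the mean over the cycle approaches loop/2.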

Phase 3: Lock Contest & Promote

When replicas detect Leader Key expiration, they start the election process. The replica that acquires the Leader Key executes pg_ctl promote to become the new primary.

  1. Query each replica's replication position in parallel via the REST API. This typically takes about 10ms and has a hardcoded 2s timeout.
  2. Compare WAL positions to determine the best candidate. Eligible replicas then attempt to create the Leader Key (an atomic CAS operation).
  3. The winner executes pg_ctl promote to become primary (very fast, typically negligible).
Election Flow:
  ReplicaA ──→ Query replication position ──→ Compare ──→ Contest lock ──→ Success
  ReplicaB ──→ Query replication position ──→ Compare ──→ Contest lock ──→ Fail
  • Best case: a single replica, or the lock is acquired and promotion happens immediately; constant overhead of about 0.1s
  • Worst case: the REST API query runs into its 2s timeout
  • Average case: about 1s of constant overhead
$$T_{elect} = \begin{cases} 0.1 & \text{Best} \\ 1 & \text{Average} \\ 2 & \text{Worst} \end{cases}$$
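The compare-then-contest flow can be sketched with an in-memory stand-in for the DCS. This is a simplified illustration, not Patroni's implementation: `InMemoryDCS`, `create_leader_key` and `run_election` are hypothetical names, WAL positions are plain integers, and the real election applies further eligibility checks.

```python
import threading
from typing import Dict, Optional


class InMemoryDCS:
    """Stand-in for a DCS: the leader key can only be created if absent."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._leader: Optional[str] = None

    def create_leader_key(self, name: str) -> bool:
        """Atomic create-if-absent (the CAS step); True means the lock is won."""
        with self._lock:
            if self._leader is None:
                self._leader = name
                return True
            return False

    @property
    def leader(self) -> Optional[str]:
        return self._leader


def run_election(dcs: InMemoryDCS,
                 wal_positions: Dict[str, int]) -> Optional[str]:
    """Replicas that are not behind the most advanced peer contest the lock."""
    if not wal_positions:
        return None
    best = max(wal_positions.values())
    for name, lsn in wal_positions.items():
        if lsn < best:
            continue  # a lagging replica stays out of the race
        if dcs.create_leader_key(name):
            return name  # the winner would now run `pg_ctl promote`
    return None


if __name__ == "__main__":
    dcs = InMemoryDCS()
    print(run_election(dcs, {"replica-a": 200, "replica-b": 100}))
    print(dcs.create_leader_key("replica-b"))  # the key is already held
```

Because the key is created atomically, at most one contender can succeed even when several replicas share the highest WAL position.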

Phase 4: Health Check

HAProxy must detect that the new primary is online, which requires rise consecutive successful health checks before the server is marked UP.

Detection Timeline:
  New Primary    First Check   Second Check  Third Check (UP)
     |          |           |           |
     |←─ 0~inter ─→|←─ fast ─→|←─ fast ─→|
  • Best case: New primary promoted just before check, (rise-1) × fastinter
  • Worst case: New primary promoted right after check, (rise-1) × fastinter + inter
  • Average case: (rise-1) × fastinter + inter/2
$$T_{haproxy} = \begin{cases} (rise-1) \times fastinter & \text{Best} \\ (rise-1) \times fastinter + inter/2 & \text{Average} \\ (rise-1) \times fastinter + inter & \text{Worst} \end{cases}$$
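The same timeline can be written as a small function of where the promotion falls within the regular check interval. A minimal sketch following the model above, assuming the first successful check is a regular check spaced inter apart and the remaining rise - 1 checks run at fastinter; `time_to_up` is a hypothetical helper, not an HAProxy API:

```python
def time_to_up(since_last_check: float, inter: float,
               fastinter: float, rise: int) -> float:
    """Time from promotion until the server is marked UP.

    `since_last_check` is how long ago the last regular check ran when the
    promotion completed (0 <= since_last_check <= inter).
    """
    if not 0 <= since_last_check <= inter:
        raise ValueError("since_last_check must lie within one inter")
    if rise < 1:
        raise ValueError("rise must be at least 1")
    first_success = inter - since_last_check   # wait for the next check
    confirmations = (rise - 1) * fastinter     # remaining fast checks
    return first_success + confirmations


if __name__ == "__main__":
    inter, fastinter, rise = 1.0, 0.5, 3  # values of the `fast` plan
    print(time_to_up(inter, inter, fastinter, rise))      # promoted just before a check
    print(time_to_up(inter / 2, inter, fastinter, rise))  # promoted mid-interval
    print(time_to_up(0.0, inter, fastinter, rise))        # promoted right after a check
```

For the `fast` plan this gives 1.0, 1.5 and 2.0 seconds for the best, average and worst cases.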

RTO Formula

Sum all phase times to get total RTO:

Best Case

$$RTO_{min} = ttl - loop + 0.1 + (rise-1) \times fastinter$$

Average Case

$$RTO_{avg} = ttl + 1 + inter/2 + (rise-1) \times fastinter$$
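The average-case total has no loop term because the two loop/2 contributions cancel when the per-phase averages are added. Written out phase by phase:

```latex
\begin{aligned}
RTO_{avg} &= \underbrace{(ttl - loop/2)}_{T_{expire}}
           + \underbrace{loop/2}_{T_{detect}}
           + \underbrace{1}_{T_{elect}}
           + \underbrace{(rise-1) \times fastinter + inter/2}_{T_{haproxy}} \\
          &= ttl + 1 + inter/2 + (rise-1) \times fastinter
\end{aligned}
```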

Worst Case

$$RTO_{max} = ttl + loop + 2 + inter + (rise-1) \times fastinter$$
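The three closed-form totals translate directly into code. The sketch below is illustrative only: `RTO` and `rto_bounds` are hypothetical names, and the constants 0.1, 1 and 2 are the election overheads from Phase 3.

```python
from typing import NamedTuple


class RTO(NamedTuple):
    best: float
    avg: float
    worst: float


def rto_bounds(ttl: float, loop: float, inter: float,
               fastinter: float, rise: int) -> RTO:
    """Best / average / worst RTO in seconds for a passive failover."""
    confirm = (rise - 1) * fastinter  # fast checks after the first success
    return RTO(
        best=ttl - loop + 0.1 + confirm,
        avg=ttl + 1 + inter / 2 + confirm,
        worst=ttl + loop + 2 + inter + confirm,
    )


if __name__ == "__main__":
    # ttl=30, loop=5, inter=2s, fastinter=1s, rise=3 (the `norm` plan)
    result = rto_bounds(ttl=30, loop=5, inter=2.0, fastinter=1.0, rise=3)
    print(f"{result.best:.1f} / {result.avg:.1f} / {result.worst:.1f}")
```

For the `norm` plan this evaluates to roughly 27.1 / 34.0 / 41.0 seconds.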

Model Calculation

Substitute the parameters of the four RTO modes into the formulas above:

pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s

Four Mode Calculation Results (unit: seconds, format: min / avg / max)

| Phase | fast | norm | safe | wide |
|---|---|---|---|---|
| Lease Expiration | 15 / 17 / 20 | 25 / 27 / 30 | 50 / 55 / 60 | 100 / 110 / 120 |
| Replica Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock Contest & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| Total | 16 / 23 / 29 | 27 / 34 / 41 | 53 / 66 / 78 | 104 / 127 / 150 |
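The per-phase figures can be recomputed from the plan parameters. The sketch below is an illustration rather than the tool that produced the table: the `PLANS` dictionary copies only ttl, loop, inter, fastinter and rise from pg_rto_plan, and `phases` and `totals` are hypothetical helpers. It prints unrounded values, so some entries are fractional (for example 17.5 for the fast-mode average lease expiration) where the table shows whole seconds.

```python
PLANS = {
    # name: (ttl, loop, inter, fastinter, rise), copied from pg_rto_plan
    "fast": (20, 5, 1.0, 0.5, 3),
    "norm": (30, 5, 2.0, 1.0, 3),
    "safe": (60, 10, 3.0, 1.5, 3),
    "wide": (120, 20, 4.0, 2.0, 3),
}


def phases(ttl, loop, inter, fastinter, rise):
    """Per-phase (min, avg, max) durations in seconds."""
    confirm = (rise - 1) * fastinter
    return {
        "Lease Expiration": (ttl - loop, ttl - loop / 2, ttl),
        "Replica Detection": (0, loop / 2, loop),
        "Lock Contest & Promote": (0.1, 1, 2),
        "Health Check": (confirm, confirm + inter / 2, confirm + inter),
    }


def totals(plan):
    """Column-wise sum of the four phases: (min, avg, max)."""
    rows = phases(*plan).values()
    return tuple(sum(col) for col in zip(*rows))


if __name__ == "__main__":
    for name, plan in PLANS.items():
        for phase, (lo, mid, hi) in phases(*plan).items():
            print(f"{name:5} {phase:23} {lo:6.1f} / {mid:6.1f} / {hi:6.1f}")
        lo, mid, hi = totals(plan)
        print(f"{name:5} {'Total':23} {lo:6.1f} / {mid:6.1f} / {hi:6.1f}")
```

Summing the phases per column gives the same totals as the closed-form RTO formulas, which is a quick consistency check on both.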

Last Modified 2026-01-15: fix some legacy commands (5535c22)