Model of Patroni Passive Failure

This is the failover path triggered when a node crash causes the leader lease to expire and the cluster to hold an election.

RTO Timeline

Failure Model

Phase	Best	Worst	Average	Description
Lease Expiration	`ttl - loop`	`ttl`	`ttl - loop/2`	Best: crash just before refresh; Worst: crash right after refresh
Replica Detect	`0`	`loop`	`loop / 2`	Best: exactly at check point; Worst: just missed check point
Election Promote	`0.1`	`2`	`1`	Best: direct lock and promote; Worst: API timeout + promote
HAProxy Check	`(rise-1) × fastinter`	`(rise-1) × fastinter + inter`	`(rise-1) × fastinter + inter/2`	Best: state change just before check; Worst: state change right after check

Key Difference Between Passive and Active Failover:

Scenario	Patroni Status	Lease Handling	Primary Wait Time
Active Failover (PG crash)	Alive, healthy	Actively tries to restart PG, releases lease on timeout	`primary_start_timeout`
Passive Failover (Node crash)	Dies with node	Cannot actively release, must wait for TTL expiration	`ttl`

In passive failover scenarios, Patroni dies along with the node and cannot actively release the Leader Key. A cluster election can begin only after the lease in DCS expires naturally at the end of its TTL.

Timeline Analysis

Phase 1: Lease Expiration

The Patroni primary refreshes the Leader Key every loop_wait cycle, resetting TTL to the configured value.

Timeline:
     t-loop        t          t+ttl-loop    t+ttl
       |           |              |           |
    Last Refresh  Failure      Best Case   Worst Case
       |←── loop ──→|              |           |
       |←──────────── ttl ─────────────────────→|

Best case: Failure occurs just before the next lease refresh (a full loop has elapsed since the last refresh), so the remaining TTL is ttl - loop
Worst case: Failure occurs right after a lease refresh, so the full ttl must elapse
Average case: ttl - loop/2

T_{expire} = \begin{cases} ttl - loop & \text{Best} \\ ttl - loop/2 & \text{Average} \\ ttl & \text{Worst} \end{cases}

Phase 2: Replica Detection

Replicas wake up on loop_wait cycles and check the Leader Key status in DCS.

Timeline:
    Lease Expired   Replica Wakes
       |            |
       |←── 0~loop ─→|

Best case: The replica wakes just as the lease expires, so the wait is 0
Worst case: The replica has just gone back to sleep when the lease expires, so the wait is a full loop
Average case: loop/2

T_{detect} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}
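
Phases 1 and 2 can be sanity-checked together with a small Monte Carlo run. The sketch below is illustrative only: it assumes the crash lands uniformly within the primary's refresh cycle and that the replica's wake-up phase is independent and uniform within its own cycle. The function name and default trial count are hypothetical.

```python
import random

def simulate_detection_delay(ttl, loop, trials=100_000, seed=42):
    """Estimate (min, mean, max) time from crash to replica detection.

    Assumptions of this sketch: crash time is uniform within the refresh
    cycle, and the replica's wake-up offset is independent and uniform.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        since_refresh = rng.uniform(0, loop)  # time since the last Leader Key refresh
        t_expire = ttl - since_refresh        # phase 1: remaining lease time
        t_detect = rng.uniform(0, loop)       # phase 2: wait for the replica to wake
        samples.append(t_expire + t_detect)
    return min(samples), sum(samples) / trials, max(samples)

lo, avg, hi = simulate_detection_delay(ttl=30, loop=5)
print(f"min={lo:.1f}s avg={avg:.1f}s max={hi:.1f}s")
```

Under these assumptions the combined delay stays within ttl - loop and ttl + loop and averages close to ttl, which matches the sum of the two phase formulas.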

Phase 3: Lock Contest & Promote

When replicas detect Leader Key expiration, they start the election process. The replica that acquires the Leader Key executes pg_ctl promote to become the new primary.

Query each replica's replication position in parallel via the REST API; this typically takes about 10ms, with a hardcoded 2s timeout.
Compare WAL positions to determine the best candidate; eligible replicas then attempt to create the Leader Key (an atomic CAS operation).
The replica that wins the lock executes pg_ctl promote to become primary (very fast, typically negligible).

Election Flow:
  ReplicaA ──→ Query replication position ──→ Compare ──→ Contest lock ──→ Success
  ReplicaB ──→ Query replication position ──→ Compare ──→ Contest lock ──→ Fail

Best case: A single replica, or immediate lock acquisition and promotion, with a constant overhead of 0.1s
Worst case: A REST API query runs into the hardcoded 2s timeout, 2s
Average case: A constant overhead of 1s

T_{elect} = \begin{cases} 0.1 & \text{Best} \\ 1 & \text{Average} \\ 2 & \text{Worst} \end{cases}
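
The election flow above can be sketched with a toy in-memory lock. This is not Patroni's code or any DCS client API: `ToyDCS`, `create_leader_key`, and `run_election` are hypothetical names, and the real system queries peers over REST and runs pg_ctl promote where the sketch only records an outcome.

```python
import threading

class ToyDCS:
    """In-memory stand-in for a DCS (hypothetical, for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.leader = None

    def create_leader_key(self, name):
        """Atomic create-if-absent (the CAS step): only the first caller succeeds."""
        with self._lock:
            if self.leader is None:
                self.leader = name
                return True
            return False

def run_election(dcs, wal_positions):
    """Return {replica: outcome} for one election round.

    wal_positions maps replica name -> replayed WAL position in bytes.
    Replicas behind the most advanced position do not contest the lock.
    """
    best = max(wal_positions.values())
    outcome = {}
    for name, position in wal_positions.items():
        if position < best:
            outcome[name] = "skipped"
        elif dcs.create_leader_key(name):
            outcome[name] = "promoted"  # a real node would run pg_ctl promote here
        else:
            outcome[name] = "lost"
    return outcome

dcs = ToyDCS()
result = run_election(dcs, {"replica-a": 1000, "replica-b": 1000, "replica-c": 900})
print(result, "leader:", dcs.leader)
```

Two replicas tie on WAL position here, so both contest the lock and the atomic create guarantees that exactly one of them is promoted.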

Phase 4: Health Check

HAProxy must detect that the new primary is online, which requires rise consecutive successful health checks.

Detection Timeline:
  New Primary    First Check   Second Check  Third Check (UP)
     |          |           |           |
     |←─ 0~inter ─→|←─ fast ─→|←─ fast ─→|

Best case: New primary promoted just before check, (rise-1) × fastinter
Worst case: New primary promoted right after check, (rise-1) × fastinter + inter
Average case: (rise-1) × fastinter + inter/2

T_{haproxy} = \begin{cases} (rise-1) \times fastinter & \text{Best} \\ (rise-1) \times fastinter + inter/2 & \text{Average} \\ (rise-1) \times fastinter + inter & \text{Worst} \end{cases}
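
As a quick check of this phase, the formula can be written as a small helper. This is a minimal sketch of the model above, not of HAProxy's internals; the function name `haproxy_up_delay` is hypothetical and all values are in seconds.

```python
def haproxy_up_delay(inter, fastinter, rise):
    """Return (best, average, worst) seconds until the new primary is marked up.

    Model: wait between 0 and inter for the first successful check,
    then (rise - 1) further checks spaced fastinter apart.
    """
    confirm = (rise - 1) * fastinter
    return confirm, confirm + inter / 2, confirm + inter

print(haproxy_up_delay(inter=1.0, fastinter=0.5, rise=3))  # (1.0, 1.5, 2.0)
```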

RTO Formula

Sum all phase times to get total RTO:

Best Case

RTO_{min} = ttl - loop + 0.1 + (rise-1) \times fastinter

Average Case

RTO_{avg} = ttl + 1 + inter/2 + (rise-1) \times fastinter

Worst Case

RTO_{max} = ttl + loop + 2 + inter + (rise-1) \times fastinter
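
The average case is worth expanding once, because the loop/2 terms from the first two phases cancel. The derivation below only sums the four average-case phase values defined earlier:

```latex
\begin{aligned}
RTO_{avg} &= T_{expire} + T_{detect} + T_{elect} + T_{haproxy} \\
          &= (ttl - loop/2) + loop/2 + 1 + \big((rise-1) \times fastinter + inter/2\big) \\
          &= ttl + 1 + inter/2 + (rise-1) \times fastinter
\end{aligned}
```

For example, with ttl = 30, inter = 2, fastinter = 1, and rise = 3, this gives 30 + 1 + 1 + 2 = 34 seconds.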

Model Calculation

Substitute the parameters of the four RTO modes into the formulas above:

pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s

Four Mode Calculation Results (unit: seconds, rounded to whole seconds; format: min / avg / max)

Phase	fast	norm	safe	wide
Lease Expiration	`15` / `17` / `20`	`25` / `27` / `30`	`50` / `55` / `60`	`100` / `110` / `120`
Replica Detection	`0` / `3` / `5`	`0` / `3` / `5`	`0` / `5` / `10`	`0` / `10` / `20`
Lock Contest & Promote	`0` / `1` / `2`	`0` / `1` / `2`	`0` / `1` / `2`	`0` / `1` / `2`
Health Check	`1` / `2` / `2`	`2` / `3` / `4`	`3` / `5` / `6`	`4` / `6` / `8`
Total	`16` / `23` / `29`	`27` / `34` / `41`	`53` / `66` / `78`	`104` / `127` / `150`
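
The Total row can be reproduced with a few lines of Python. This is a sketch that hardcodes the ttl, loop, inter, fastinter, and rise values from pg_rto_plan; the `PLANS` dictionary and the `rto` function are hypothetical helpers, not part of any tool.

```python
# (ttl, loop, inter, fastinter, rise) per mode, in seconds, copied from pg_rto_plan.
PLANS = {
    "fast": (20, 5, 1.0, 0.5, 3),
    "norm": (30, 5, 2.0, 1.0, 3),
    "safe": (60, 10, 3.0, 1.5, 3),
    "wide": (120, 20, 4.0, 2.0, 3),
}

def rto(ttl, loop, inter, fastinter, rise):
    """Return (min, avg, max) RTO in seconds using the three formulas above."""
    confirm = (rise - 1) * fastinter
    rto_min = ttl - loop + 0.1 + confirm
    rto_avg = ttl + 1 + inter / 2 + confirm
    rto_max = ttl + loop + 2 + inter + confirm
    return rto_min, rto_avg, rto_max

for mode, params in PLANS.items():
    low, avg, high = rto(*params)
    print(f"{mode}: {low:.1f} / {avg:.1f} / {high:.1f}")
```

The unrounded results (for example 16.1 / 22.5 / 29.0 for fast) agree with the Total row once rounded to whole seconds, with halves rounded up.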


Last Modified 2026-02-28: add kernel desc (d5d938e)