Model of Patroni Active Failure

The PostgreSQL primary process crashes while Patroni stays alive; Patroni attempts a restart and triggers failover after the restart timeout expires.

RTO Timeline


Failure Model

| Item | Best | Worst | Average | Description |
|------|------|-------|---------|-------------|
| Failure Detection | 0 | loop | loop/2 | Best: PG crashes right before a check; Worst: PG crashes right after a check |
| Restart Timeout | 0 | start | start | Best: PG recovers instantly; Worst: wait the full start timeout before releasing the lease |
| Standby Detection | 0 | loop | loop/2 | Best: right at a check point; Worst: just missed a check point |
| Lock & Promote | 0 | 2 | 1 | Best: acquire the lock and promote directly; Worst: API timeout + promote |
| Health Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: state changes right before a check; Worst: state changes right after a check |

Key Difference Between Active and Passive Failure:

| Scenario | Patroni Status | Lease Handling | Main Wait Time |
|----------|----------------|----------------|----------------|
| Active Failure (PG crash) | Alive, healthy | Actively tries to restart PG, releases the lease after timeout | primary_start_timeout |
| Passive Failure (node down) | Dies with the node | Cannot actively release; must wait for TTL expiry | ttl |

In active failure scenarios, Patroni remains alive, so it can actively detect the PG crash and attempt a restart. If the restart succeeds, the service self-heals; if the timeout expires without recovery, Patroni actively releases the Leader Key, triggering a cluster election.
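The control flow on the primary node can be captured in a minimal sketch (plain Python with hypothetical `postgres` and `dcs` helper objects, not Patroni's actual implementation; the parameter values are taken from the safe plan below):

```python
import time

LOOP_WAIT = 10              # seconds between Patroni HA-loop runs (safe plan)
PRIMARY_START_TIMEOUT = 45  # how long to wait for PG to come back (safe plan)

def primary_cycle(postgres, dcs):
    """One HA-loop iteration on the primary, active-failure path only.

    `postgres` and `dcs` are hypothetical helper objects; real Patroni is wired
    differently, but the ordering of steps matches the description above.
    """
    if postgres.is_running():            # Phase 1: PG is healthy, keep the lease
        dcs.update_leader_lease()
        return "healthy"

    # Phase 2: PG crashed while Patroni is still alive -> try to restart it
    postgres.restart()
    deadline = time.monotonic() + PRIMARY_START_TIMEOUT
    while time.monotonic() < deadline:
        if postgres.is_running():
            return "self_healed"         # Path A: no failover, RTO stays tiny
        time.sleep(1)

    # Path B: restart did not finish in time -> give up leadership explicitly
    dcs.release_leader_key()             # standbys notice and elect (Phases 3/4)
    return "failover_triggered"
```

The explicit release of the Leader Key is what separates this path from the passive case, where standbys must wait for the TTL to expire.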


Timing Analysis

Phase 1: Failure Detection

Patroni checks PostgreSQL status every loop_wait cycle (via pg_isready or process check).

Timeline:

```
    Last check      PG crash      Next check
       |              |              |
       |←── 0~loop ──→|              |
```
  • Best case: PG crashes right before Patroni check, detected immediately, wait 0
  • Worst case: PG crashes right after check, wait for next cycle, wait loop
  • Average case: loop/2
$$T_{detect} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$
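For example, with the safe plan's loop_wait of 10 seconds this gives:

$$T_{detect} = \begin{cases} 0\,\text{s} & \text{Best} \\ 5\,\text{s} & \text{Average} \\ 10\,\text{s} & \text{Worst} \end{cases} \qquad (loop = 10\,\text{s})$$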

Phase 2: Restart Timeout

After Patroni detects PG crash, it attempts to restart PostgreSQL. This phase has two possible outcomes:

Timeline:

```
  Crash detected     Restart attempt     Success/Timeout
      |                  |                    |
      |←──── 0 ~ start ─────────────────────→|
```

Path A: Self-healing Success (Best case)

  • PG restarts successfully, service recovers
  • No failover triggered, extremely short RTO
  • Wait time: 0 (relative to Failover path)

Path B: Failover Required (Average/Worst case)

  • PG still not recovered after primary_start_timeout
  • Patroni actively releases Leader Key
  • Wait time: start
$$T_{restart} = \begin{cases} 0 & \text{Best (self-healing success)} \\ start & \text{Average (failover required)} \\ start & \text{Worst} \end{cases}$$

Note: Average case assumes failover is required. If PG can quickly self-heal, overall RTO will be significantly lower.

Phase 3: Standby Detection

Standbys wake up every loop_wait cycle and check the Leader Key status in the DCS. When the primary's Patroni releases the Leader Key, the standbys discover the vacancy and begin an election.

Timeline:

```
    Lease released    Standby wakes
       |                  |
       |←── 0~loop ──────→|
```
  • Best case: Standby wakes right when lease is released, wait 0
  • Worst case: Standby just went to sleep when lease released, wait loop
  • Average case: loop/2
$$T_{standby} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$
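The standby side of this phase is just a polling loop against the DCS; a minimal sketch, again with a hypothetical `dcs` client and an illustrative key path rather than Patroni's real DCS abstraction:

```python
import time

LOOP_WAIT = 10  # seconds between standby wake-ups (safe plan)

def standby_watch(dcs):
    """Wake up every loop_wait seconds and check whether the leader key still exists.

    `dcs.get(key)` is assumed to return None when the key is absent.
    """
    while True:
        if dcs.get("/service/pg-test/leader") is None:  # illustrative key path
            return "leader_vacant"      # proceed to election (Phase 4)
        time.sleep(LOOP_WAIT)           # worst case: key released right after this check
```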

Phase 4: Lock & Promote

After standbys discover Leader Key vacancy, election begins. The standby that acquires the Leader Key executes pg_ctl promote to become the new primary.

  1. Query each standby's replication position in parallel via the REST API (typically ~10ms, with a hardcoded 2s timeout).
  2. Compare WAL positions to determine the best candidate; eligible standbys attempt to create the Leader Key (an atomic CAS operation).
  3. Execute pg_ctl promote to become the new primary (very fast, typically negligible).
Election process:

```
  StandbyA ──→ Query replication position ──→ Compare ──→ Try lock ──→ Success
  StandbyB ──→ Query replication position ──→ Compare ──→ Try lock ──→ Fail
```
  • Best case: Single standby or direct lock acquisition and promote, constant overhead 0.1s
  • Worst case: DCS API call timeout: 2s
  • Average case: 1s constant overhead
$$T_{elect} = \begin{cases} 0.1 & \text{Best} \\ 1 & \text{Average} \\ 2 & \text{Worst} \end{cases}$$
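Step 2's race is an atomic create-if-absent on the Leader Key; the sketch below shows the shape of that CAS, with `dcs.create_if_absent()` standing in as a hypothetical primitive for the DCS-specific operation (etcd transaction, Consul CAS, and so on):

```python
def try_promote(me, candidates, dcs, postgres):
    """Election sketch: only the best-positioned standby should attempt the CAS.

    `me` and `candidates` carry the WAL positions gathered via the REST API in
    step 1; `postgres.promote()` stands in for `pg_ctl promote`.
    """
    best_lsn = max(c.wal_lsn for c in candidates)
    if me.wal_lsn < best_lsn:
        return "deferred"          # a healthier candidate exists, let it win

    if dcs.create_if_absent("/service/pg-test/leader", me.name, ttl=30):
        postgres.promote()         # usually negligible compared to the waits above
        return "promoted"
    return "lost_race"             # another standby created the key first
```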

Phase 5: Health Check

HAProxy detects the new primary coming online; marking it UP requires rise consecutive successful health checks.

Check timeline:

```
  New primary    First check    Second check   Third check (UP)
     |              |               |               |
     |←─ 0~inter ──→|←─── fast ────→|←─── fast ────→|
```
  • Best case: New primary comes up right at check time, (rise-1) × fastinter
  • Worst case: New primary comes up right after check, (rise-1) × fastinter + inter
  • Average case: (rise-1) × fastinter + inter/2
$$T_{haproxy} = \begin{cases} (rise-1) \times fastinter & \text{Best} \\ (rise-1) \times fastinter + inter/2 & \text{Average} \\ (rise-1) \times fastinter + inter & \text{Worst} \end{cases}$$
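These bounds follow directly from the HAProxy parameters; a small helper, here evaluated with the fast plan's values (times in seconds):

```python
def haproxy_detect_bounds(inter, fastinter, rise):
    """Time for HAProxy to mark the new primary UP after `rise` consecutive successes."""
    best = (rise - 1) * fastinter             # first check lands right as the primary comes up
    avg = (rise - 1) * fastinter + inter / 2
    worst = (rise - 1) * fastinter + inter    # the primary came up just after a regular check
    return best, avg, worst

print(haproxy_detect_bounds(inter=1.0, fastinter=0.5, rise=3))  # fast plan -> (1.0, 1.5, 2.0)
```

The summary table below rounds the fractional averages up to whole seconds.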

RTO Formula

Sum all phase times to get total RTO:

Best Case (PG instant self-healing)

$$RTO_{min} = 0 + 0 + 0 + 0.1 + (rise-1) \times fastinter \approx (rise-1) \times fastinter$$

Average Case (Failover required)

$$RTO_{avg} = loop + start + 1 + inter/2 + (rise-1) \times fastinter$$

Worst Case

$$RTO_{max} = loop \times 2 + start + 2 + inter + (rise-1) \times fastinter$$

Model Calculation

Substituting the parameters of the four pg_rto_plan presets into the formulas above:

```yaml
pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s
```

Calculation Results for Four Modes (unit: seconds, format: min / avg / max)

| Phase | fast | norm | safe | wide |
|-------|------|------|------|------|
| Failure Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Restart Timeout | 0 / 15 / 15 | 0 / 25 / 25 | 0 / 45 / 45 | 0 / 95 / 95 |
| Standby Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| Total | 1 / 24 / 29 | 2 / 35 / 41 | 3 / 61 / 73 | 4 / 122 / 145 |
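The totals can be reproduced with a short script over the same plan parameters; this is only a sketch of the arithmetic (the table above rounds each phase up to whole seconds before summing, so the averages can differ by a second or two):

```python
# Active-failure RTO bounds per the formulas above; only the parameters that
# enter those formulas are listed (ttl/retry/margin/downinter/fall do not).
PLANS = {
    "fast": dict(loop=5,  start=15, inter=1.0, fastinter=0.5, rise=3),
    "norm": dict(loop=5,  start=25, inter=2.0, fastinter=1.0, rise=3),
    "safe": dict(loop=10, start=45, inter=3.0, fastinter=1.5, rise=3),
    "wide": dict(loop=20, start=95, inter=4.0, fastinter=2.0, rise=3),
}

def rto(loop, start, inter, fastinter, rise):
    hc = (rise - 1) * fastinter                   # health-check floor
    best  = 0.1 + hc                              # instant self-healing
    avg   = loop + start + 1 + inter / 2 + hc     # failover required
    worst = 2 * loop + start + 2 + inter + hc
    return best, avg, worst

for name, p in PLANS.items():
    best, avg, worst = rto(**p)
    print(f"{name}: {best:.1f} / {avg:.1f} / {worst:.1f}")
# fast: 1.1 / 22.5 / 29.0   (table: 1 / 24 / 29, after per-phase rounding)
```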

Comparison with Passive Failure

| Aspect | Active Failure (PG crash) | Passive Failure (node down) | Description |
|--------|---------------------------|-----------------------------|-------------|
| Detection Mechanism | Patroni active detection | TTL passive expiry | Active detection discovers the failure faster |
| Core Wait | start | ttl | start is usually less than ttl, but requires additional failure-detection time |
| Lease Handling | Active release | Passive expiry | Active release is more timely |
| Self-healing Possible | Yes | No | Active detection can attempt local recovery |

RTO Comparison (Average case):

| Mode | Active Failure (PG crash) | Passive Failure (node down) | Difference |
|------|---------------------------|-----------------------------|------------|
| fast | 24s | 23s | +1s |
| norm | 35s | 34s | +1s |
| safe | 61s | 66s | -5s |
| wide | 122s | 127s | -5s |

Analysis: In fast and norm modes, active failure RTO is slightly higher than passive failure: there start equals ttl - loop, so waiting the full primary_start_timeout (start) saves nothing, while the extra failure-detection phase adds a little time. In safe and wide modes, since start < ttl - loop, active failure is actually faster. In either case, active failure also offers the possibility of self-healing, with a potentially extremely short RTO in the best case.
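Substituting the plan parameters makes the crossover explicit:

$$\begin{aligned}
\text{fast:}\quad & start = 15 = ttl - loop = 20 - 5 \\
\text{norm:}\quad & start = 25 = ttl - loop = 30 - 5 \\
\text{safe:}\quad & start = 45 < ttl - loop = 60 - 10 = 50 \\
\text{wide:}\quad & start = 95 < ttl - loop = 120 - 20 = 100
\end{aligned}$$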

