Model of Patroni Active Failure

The PostgreSQL primary process crashes while Patroni stays alive; Patroni attempts a restart and triggers failover after the restart timeout expires.

RTO Timeline


Failure Model

| Item | Best | Worst | Average | Description |
|------|------|-------|---------|-------------|
| Failure Detection | 0 | loop | loop/2 | Best: PG crashes right before a check; Worst: PG crashes right after a check |
| Restart Timeout | 0 | start | start | Best: PG recovers instantly; Worst: wait the full start timeout before releasing the lease |
| Standby Detection | 0 | loop | loop/2 | Best: right at a check point; Worst: just missed a check point |
| Lock & Promote | 0 | 2 | 1 | Best: acquire the lock and promote directly; Worst: API timeout + promote |
| Health Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: state changes right before a check; Worst: state changes right after a check |

Key Difference Between Active and Passive Failure:

| Scenario | Patroni Status | Lease Handling | Main Wait Time |
|----------|----------------|----------------|----------------|
| Active Failure (PG crash) | Alive, healthy | Actively tries to restart PG, releases the lease after timeout | primary_start_timeout |
| Passive Failure (node down) | Dies with the node | Cannot actively release; must wait for TTL expiry | ttl |

In active failure scenarios, Patroni remains alive, so it can actively detect the PG crash and attempt a restart. If the restart succeeds, the service self-heals; if the timeout expires without recovery, Patroni actively releases the Leader Key, triggering a cluster election.
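The control flow on the primary node can be captured in a minimal sketch (plain Python with hypothetical `postgres` and `dcs` helper objects, not Patroni's actual implementation; the parameter values are taken from the safe plan below):

```python
import time

LOOP_WAIT = 10              # seconds between Patroni HA-loop runs (safe plan)
PRIMARY_START_TIMEOUT = 45  # how long to wait for PG to come back (safe plan)

def primary_cycle(postgres, dcs):
    """One HA-loop iteration on the primary, active-failure path only.

    `postgres` and `dcs` are hypothetical helper objects; real Patroni is wired
    differently, but the ordering of steps matches the description above.
    """
    if postgres.is_running():            # Phase 1: PG is healthy, keep the lease
        dcs.update_leader_lease()
        return "healthy"

    # Phase 2: PG crashed while Patroni is still alive -> try to restart it
    postgres.restart()
    deadline = time.monotonic() + PRIMARY_START_TIMEOUT
    while time.monotonic() < deadline:
        if postgres.is_running():
            return "self_healed"         # Path A: no failover, RTO stays tiny
        time.sleep(1)

    # Path B: restart did not finish in time -> give up leadership explicitly
    dcs.release_leader_key()             # standbys notice and elect (Phases 3/4)
    return "failover_triggered"
```

The explicit release of the Leader Key is what separates this path from the passive case, where standbys must wait for the TTL to expire.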


Timing Analysis

Phase 1: Failure Detection

Patroni checks PostgreSQL status every loop_wait cycle (via pg_isready or process check).

Timeline:

```
    Last check      PG crash      Next check
       |              |              |
       |←── 0~loop ──→|              |
```
  • Best case: PG crashes right before Patroni check, detected immediately, wait 0
  • Worst case: PG crashes right after check, wait for next cycle, wait loop
  • Average case: loop/2
$$T_{detect} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$
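For example, with the safe plan's loop_wait of 10 seconds this gives:

$$T_{detect} = \begin{cases} 0\,\text{s} & \text{Best} \\ 5\,\text{s} & \text{Average} \\ 10\,\text{s} & \text{Worst} \end{cases} \qquad (loop = 10\,\text{s})$$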

Phase 2: Restart Timeout

After Patroni detects PG crash, it attempts to restart PostgreSQL. This phase has two possible outcomes:

Timeline:

```
  Crash detected     Restart attempt     Success/Timeout
      |                  |                    |
      |←──── 0 ~ start ─────────────────────→|
```

Path A: Self-healing Success (Best case)

  • PG restarts successfully, service recovers
  • No failover triggered, extremely short RTO
  • Wait time: 0 (relative to Failover path)

Path B: Failover Required (Average/Worst case)

  • PG still not recovered after primary_start_timeout
  • Patroni actively releases Leader Key
  • Wait time: start
$$T_{restart} = \begin{cases} 0 & \text{Best (self-healing success)} \\ start & \text{Average (failover required)} \\ start & \text{Worst} \end{cases}$$

Note: Average case assumes failover is required. If PG can quickly self-heal, overall RTO will be significantly lower.

Phase 3: Standby Detection

Standbys wake up every loop_wait cycle and check the Leader Key status in the DCS. When the primary's Patroni releases the Leader Key, the standbys discover the vacancy and begin an election.

Timeline:

```
    Lease released    Standby wakes
       |                  |
       |←── 0~loop ──────→|
```
  • Best case: Standby wakes right when lease is released, wait 0
  • Worst case: Standby just went to sleep when lease released, wait loop
  • Average case: loop/2
$$T_{standby} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$
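The standby side of this phase is just a polling loop against the DCS; a minimal sketch, again with a hypothetical `dcs` client and an illustrative key path rather than Patroni's real DCS abstraction:

```python
import time

LOOP_WAIT = 10  # seconds between standby wake-ups (safe plan)

def standby_watch(dcs):
    """Wake up every loop_wait seconds and check whether the leader key still exists.

    `dcs.get(key)` is assumed to return None when the key is absent.
    """
    while True:
        if dcs.get("/service/pg-test/leader") is None:  # illustrative key path
            return "leader_vacant"      # proceed to election (Phase 4)
        time.sleep(LOOP_WAIT)           # worst case: key released right after this check
```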

Phase 4: Lock & Promote

After standbys discover Leader Key vacancy, election begins. The standby that acquires the Leader Key executes pg_ctl promote to become the new primary.

  1. Query each standby's replication position in parallel via the REST API (typically ~10ms, with a hardcoded 2s timeout).
  2. Compare WAL positions to determine the best candidate; eligible standbys attempt to create the Leader Key (an atomic CAS operation).
  3. Execute pg_ctl promote to become the new primary (very fast, typically negligible).
Election process:

```
  StandbyA ──→ Query replication position ──→ Compare ──→ Try lock ──→ Success
  StandbyB ──→ Query replication position ──→ Compare ──→ Try lock ──→ Fail
```
  • Best case: Single standby or direct lock acquisition and promote, constant overhead 0.1s
  • Worst case: DCS API call timeout: 2s
  • Average case: 1s constant overhead
$$T_{elect} = \begin{cases} 0.1 & \text{Best} \\ 1 & \text{Average} \\ 2 & \text{Worst} \end{cases}$$
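Step 2's race is an atomic create-if-absent on the Leader Key; the sketch below shows the shape of that CAS, with `dcs.create_if_absent()` standing in as a hypothetical primitive for the DCS-specific operation (etcd transaction, Consul CAS, and so on):

```python
def try_promote(me, candidates, dcs, postgres):
    """Election sketch: only the best-positioned standby should attempt the CAS.

    `me` and `candidates` carry the WAL positions gathered via the REST API in
    step 1; `postgres.promote()` stands in for `pg_ctl promote`.
    """
    best_lsn = max(c.wal_lsn for c in candidates)
    if me.wal_lsn < best_lsn:
        return "deferred"          # a healthier candidate exists, let it win

    if dcs.create_if_absent("/service/pg-test/leader", me.name, ttl=30):
        postgres.promote()         # usually negligible compared to the waits above
        return "promoted"
    return "lost_race"             # another standby created the key first
```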

Phase 5: Health Check

HAProxy detects the new primary coming online; marking it UP requires rise consecutive successful health checks.

Check timeline:

```
  New primary    First check    Second check   Third check (UP)
     |              |               |               |
     |←─ 0~inter ──→|←─── fast ────→|←─── fast ────→|
```
  • Best case: New primary comes up right at check time, (rise-1) × fastinter
  • Worst case: New primary comes up right after check, (rise-1) × fastinter + inter
  • Average case: (rise-1) × fastinter + inter/2
$$T_{haproxy} = \begin{cases} (rise-1) \times fastinter & \text{Best} \\ (rise-1) \times fastinter + inter/2 & \text{Average} \\ (rise-1) \times fastinter + inter & \text{Worst} \end{cases}$$
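These bounds follow directly from the HAProxy parameters; a small helper, here evaluated with the fast plan's values (times in seconds):

```python
def haproxy_detect_bounds(inter, fastinter, rise):
    """Time for HAProxy to mark the new primary UP after `rise` consecutive successes."""
    best = (rise - 1) * fastinter             # first check lands right as the primary comes up
    avg = (rise - 1) * fastinter + inter / 2
    worst = (rise - 1) * fastinter + inter    # the primary came up just after a regular check
    return best, avg, worst

print(haproxy_detect_bounds(inter=1.0, fastinter=0.5, rise=3))  # fast plan -> (1.0, 1.5, 2.0)
```

The summary table below rounds the fractional averages up to whole seconds.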

RTO Formula

Sum all phase times to get total RTO:

Best Case (PG instant self-healing)

$$RTO_{min} = 0 + 0 + 0 + 0.1 + (rise-1) \times fastinter \approx (rise-1) \times fastinter$$

Average Case (Failover required)

$$RTO_{avg} = loop + start + 1 + inter/2 + (rise-1) \times fastinter$$

Worst Case

$$RTO_{max} = loop \times 2 + start + 2 + inter + (rise-1) \times fastinter$$

Model Calculation

Substituting the parameters of the four pg_rto_plan presets into the formulas above:

```yaml
pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s
```

Calculation Results for Four Modes (unit: seconds, format: min / avg / max)

| Phase | fast | norm | safe | wide |
|-------|------|------|------|------|
| Failure Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Restart Timeout | 0 / 15 / 15 | 0 / 25 / 25 | 0 / 45 / 45 | 0 / 95 / 95 |
| Standby Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| Total | 1 / 24 / 29 | 2 / 35 / 41 | 3 / 61 / 73 | 4 / 122 / 145 |
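The totals can be reproduced with a short script over the same plan parameters; this is only a sketch of the arithmetic (the table above rounds each phase up to whole seconds before summing, so the averages can differ by a second or two):

```python
# Active-failure RTO bounds per the formulas above; only the parameters that
# enter those formulas are listed (ttl/retry/margin/downinter/fall do not).
PLANS = {
    "fast": dict(loop=5,  start=15, inter=1.0, fastinter=0.5, rise=3),
    "norm": dict(loop=5,  start=25, inter=2.0, fastinter=1.0, rise=3),
    "safe": dict(loop=10, start=45, inter=3.0, fastinter=1.5, rise=3),
    "wide": dict(loop=20, start=95, inter=4.0, fastinter=2.0, rise=3),
}

def rto(loop, start, inter, fastinter, rise):
    hc = (rise - 1) * fastinter                   # health-check floor
    best  = 0.1 + hc                              # instant self-healing
    avg   = loop + start + 1 + inter / 2 + hc     # failover required
    worst = 2 * loop + start + 2 + inter + hc
    return best, avg, worst

for name, p in PLANS.items():
    best, avg, worst = rto(**p)
    print(f"{name}: {best:.1f} / {avg:.1f} / {worst:.1f}")
# fast: 1.1 / 22.5 / 29.0   (table: 1 / 24 / 29, after per-phase rounding)
```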

Comparison with Passive Failure

| Aspect | Active Failure (PG crash) | Passive Failure (node down) | Description |
|--------|---------------------------|-----------------------------|-------------|
| Detection Mechanism | Patroni active detection | TTL passive expiry | Active detection discovers the failure faster |
| Core Wait | start | ttl | start is usually less than ttl, but requires additional failure-detection time |
| Lease Handling | Active release | Passive expiry | Active release is more timely |
| Self-healing Possible | Yes | No | Active detection can attempt local recovery |

RTO Comparison (Average case):

| Mode | Active Failure (PG crash) | Passive Failure (node down) | Difference |
|------|---------------------------|-----------------------------|------------|
| fast | 24s | 23s | +1s |
| norm | 35s | 34s | +1s |
| safe | 61s | 66s | -5s |
| wide | 122s | 127s | -5s |

Analysis: In fast and norm modes, active failure RTO is slightly higher than passive failure: there start equals ttl - loop, so waiting the full primary_start_timeout (start) saves nothing, while the extra failure-detection phase adds a little time. In safe and wide modes, since start < ttl - loop, active failure is actually faster. In either case, active failure also offers the possibility of self-healing, with a potentially extremely short RTO in the best case.
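Substituting the plan parameters makes the crossover explicit:

$$\begin{aligned}
\text{fast:}\quad & start = 15 = ttl - loop = 20 - 5 \\
\text{norm:}\quad & start = 25 = ttl - loop = 30 - 5 \\
\text{safe:}\quad & start = 45 < ttl - loop = 60 - 10 = 50 \\
\text{wide:}\quad & start = 95 < ttl - loop = 120 - 20 = 100
\end{aligned}$$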

