Failure Model

Detailed analysis of the worst-case, best-case, and average RTO calculation logic and results across the classic failure detection and recovery paths

Patroni failures can be classified into 10 categories by failure target, and further consolidated into five categories based on detection path, which are detailed in this section.

| # | Failure Scenario | Description | Final Path |
|---|------------------|-------------|------------|
| 1 | PG process crash | crash, OOM killed | Active Detection |
| 2 | PG connection refused | max_connections | Active Detection |
| 3 | PG zombie | Process alive but unresponsive | Active Detection (timeout) |
| 4 | Patroni process crash | kill -9, OOM | Passive Detection |
| 5 | Patroni zombie | Process alive but stuck | Watchdog |
| 6 | Node down | Power outage, hardware failure | Passive Detection |
| 7 | Node zombie | IO hang, CPU starvation | Watchdog |
| 8 | Primary ↔ DCS network failure | Firewall, switch failure | Network Partition |
| 9 | Storage failure | Disk failure, disk full, mount failure | Active Detection or Watchdog |
| 10 | Manual switchover | Switchover/Failover | Manual Trigger |

However, for RTO calculation purposes, all failures ultimately converge to two paths. This section explores the upper bound, lower bound, and average RTO for these two scenarios.

flowchart LR
    A([Primary Failure]) --> B{Patroni<br/>Detected?}

    B -->|PG Crash| C[Attempt Local Restart]
    B -->|Node Down| D[Wait TTL Expiration]

    C -->|Success| E([Local Recovery])
    C -->|Fail/Timeout| F[Release Leader Lock]

    D --> F
    F --> G[Replica Election]
    G --> H[Execute Promote]
    H --> I[HAProxy Detects]
    I --> J([Service Restored])

    style A fill:#dc3545,stroke:#b02a37,color:#fff
    style E fill:#198754,stroke:#146c43,color:#fff
    style J fill:#198754,stroke:#146c43,color:#fff

1 - Model of Patroni Passive Failure

Failover path triggered by node crash causing leader lease expiration and cluster election

RTO Timeline


Failure Model

| Phase | Best | Worst | Average | Description |
|-------|------|-------|---------|-------------|
| Lease Expiration | ttl - loop | ttl | ttl - loop/2 | Best: crash just before refresh; Worst: crash right after refresh |
| Replica Detection | 0 | loop | loop / 2 | Best: exactly at check point; Worst: just missed check point |
| Lock Contest & Promote | 0 | 2 | 1 | Best: direct lock and promote; Worst: API timeout + Promote |
| Health Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: state change before check; Worst: state change right after check |

Key Difference Between Passive and Active Failover:

| Scenario | Patroni Status | Lease Handling | Primary Wait Time |
|----------|----------------|----------------|-------------------|
| Active Failover (PG crash) | Alive, healthy | Actively tries to restart PG, releases lease on timeout | primary_start_timeout |
| Passive Failover (Node crash) | Dies with node | Cannot actively release, must wait for TTL expiration | ttl |

In passive failover scenarios, Patroni dies along with the node and cannot actively release the Leader Key. The lease in DCS can only trigger cluster election after TTL naturally expires.


Timeline Analysis

Phase 1: Lease Expiration

The Patroni primary refreshes the Leader Key every loop_wait cycle, resetting TTL to the configured value.

Timeline:
     t-loop        t          t+ttl-loop    t+ttl
       |           |              |           |
    Last Refresh  Failure      Best Case   Worst Case
       |←── loop ──→|              |           |
       |←──────────── ttl ─────────────────────→|
  • Best case: Failure occurs just before the next lease refresh (nearly loop has elapsed since the last refresh), so remaining TTL = ttl - loop
  • Worst case: Failure occurs right after lease refresh, must wait full ttl
  • Average case: ttl - loop/2

$$T_{expire} = \begin{cases} ttl - loop & \text{Best} \\ ttl - loop/2 & \text{Average} \\ ttl & \text{Worst} \end{cases}$$
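
For example, plugging the norm preset (ttl = 30, loop = 5) into this formula gives:

$$T_{expire} = \begin{cases} 30 - 5 = 25 & \text{Best} \\ 30 - 5/2 = 27.5 & \text{Average} \\ 30 & \text{Worst} \end{cases}$$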

Phase 2: Replica Detection

Replicas wake up on loop_wait cycles and check the Leader Key status in DCS.

Timeline:
    Lease Expired   Replica Wakes
       |            |
       |←── 0~loop ─→|
  • Best case: Replica happens to wake when lease expires, wait 0
  • Worst case: Replica just entered sleep when lease expires, wait loop
  • Average case: loop/2

$$T_{detect} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$

Phase 3: Lock Contest & Promote

When replicas detect Leader Key expiration, they start the election process. The replica that acquires the Leader Key executes pg_ctl promote to become the new primary.

  1. Each replica queries the other members' replication positions in parallel via the Patroni REST API (typically ~10ms, with a hardcoded 2s timeout).
  2. WAL positions are compared to pick the best candidate; eligible replicas then attempt to create the Leader Key via an atomic CAS operation (see the sketch after this phase's formula).
  3. Execute pg_ctl promote to become primary (very fast, typically negligible)
Election Flow:
  ReplicaA ──→ Query replication position ──→ Compare ──→ Contest lock ──→ Success
  ReplicaB ──→ Query replication position ──→ Compare ──→ Contest lock ──→ Fail
  • Best case: Single replica or immediate lock acquisition and promotion, constant overhead 0.1s
  • Worst case: DCS API call timeout: 2s
  • Average case: 1s constant overhead

$$T_{elect} = \begin{cases} 0.1 & \text{Best} \\ 1 & \text{Average} \\ 2 & \text{Worst} \end{cases}$$
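
The lock contest in step 2 boils down to an atomic "create the Leader Key only if it does not exist" request against the DCS. Below is a minimal sketch of that CAS pattern, assuming an etcd DCS and the python-etcd3 client; the key path and candidate name are illustrative, and Patroni's real election adds WAL-position checks, retries, and lease keepalive handling.

```python
# Minimal CAS leader-lock sketch (illustrative, not Patroni's actual code).
# Assumes an etcd DCS and the python-etcd3 client; key/value names are made up.
import etcd3

def try_acquire_leader(client: etcd3.Etcd3Client, candidate: str, ttl: int) -> bool:
    lease = client.lease(ttl)                       # lease that expires after `ttl` seconds
    key = "/service/pg-test/leader"                 # illustrative leader-key path
    ok, _ = client.transaction(
        # succeed only if the key does not exist yet (create_revision == 0)
        compare=[client.transactions.create(key) == 0],
        success=[client.transactions.put(key, candidate, lease=lease)],
        failure=[],
    )
    if not ok:
        lease.revoke()                              # lost the contest: give back the lease
    return ok                                       # the winner goes on to run pg_ctl promote
```

Whichever replica's transaction succeeds owns the key until it stops refreshing the lease; every other transaction fails atomically, which is what keeps split-brain out of the election step.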

Phase 4: Health Check

HAProxy detects the new primary online, requiring rise consecutive successful health checks.

Detection Timeline:
  New Primary    First Check   Second Check  Third Check (UP)
     |          |           |           |
     |←─ 0~inter ─→|←─ fast ─→|←─ fast ─→|
  • Best case: New primary promoted just before check, (rise-1) × fastinter
  • Worst case: New primary promoted right after check, (rise-1) × fastinter + inter
  • Average case: (rise-1) × fastinter + inter/2

$$T_{haproxy} = \begin{cases} (rise-1) \times fastinter & \text{Best} \\ (rise-1) \times fastinter + inter/2 & \text{Average} \\ (rise-1) \times fastinter + inter & \text{Worst} \end{cases}$$
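
For example, with the norm preset (inter = 2s, fastinter = 1s, rise = 3):

$$T_{haproxy} = \begin{cases} (3-1) \times 1 = 2 & \text{Best} \\ 2 + 2/2 = 3 & \text{Average} \\ 2 + 2 = 4 & \text{Worst} \end{cases}$$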

RTO Formula

Sum all phase times to get total RTO:

Best Case

$$RTO_{min} = ttl - loop + 0.1 + (rise-1) \times fastinter$$

Average Case

$$RTO_{avg} = ttl + 1 + inter/2 + (rise-1) \times fastinter$$

Worst Case

$$RTO_{max} = ttl + loop + 2 + inter + (rise-1) \times fastinter$$

Model Calculation

Substitute the parameters of the four RTO presets into the formulas above:

pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s

Four Mode Calculation Results (unit: seconds, format: min / avg / max)

| Phase | fast | norm | safe | wide |
|-------|------|------|------|------|
| Lease Expiration | 15 / 17 / 20 | 25 / 27 / 30 | 50 / 55 / 60 | 100 / 110 / 120 |
| Replica Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock Contest & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| Total | 16 / 23 / 29 | 27 / 34 / 41 | 53 / 66 / 78 | 104 / 127 / 150 |
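
These totals follow mechanically from the formulas above. The short Python sketch below (an illustrative helper, not part of Pigsty) recomputes them from the preset parameters; it keeps one decimal place instead of rounding each phase, so individual cells may differ from the table by about a second.

```python
# Passive-failover RTO per the formulas above (illustrative helper, not part of Pigsty).
PRESETS = {  # ttl, loop, start, inter(s), fastinter(s), rise -- taken from pg_rto_plan
    "fast": (20, 5, 15, 1.0, 0.5, 3),
    "norm": (30, 5, 25, 2.0, 1.0, 3),
    "safe": (60, 10, 45, 3.0, 1.5, 3),
    "wide": (120, 20, 95, 4.0, 2.0, 3),
}

def passive_rto(ttl, loop, start, inter, fastinter, rise):
    # `start` is unused here: passive failover waits for TTL, not primary_start_timeout
    hc = (rise - 1) * fastinter                    # HAProxy needs `rise` successful checks
    best = ttl - loop + 0.1 + hc
    avg = ttl + 1 + inter / 2 + hc
    worst = ttl + loop + 2 + inter + hc
    return best, avg, worst

for name, params in PRESETS.items():
    print(name, ["%.1f" % t for t in passive_rto(*params)])
# output: fast ['16.1', '22.5', '29.0']   norm ['27.1', '34.0', '41.0']
#         safe ['53.1', '65.5', '78.0']   wide ['104.1', '127.0', '150.0']
```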

2 - Model of Patroni Active Failure

PostgreSQL primary process crashes while Patroni stays alive and attempts restart, triggering failover after timeout

RTO Timeline


Failure Model

| Item | Best | Worst | Average | Description |
|------|------|-------|---------|-------------|
| Failure Detection | 0 | loop | loop/2 | Best: PG crashes right before check; Worst: PG crashes right after check |
| Restart Timeout | 0 | start | start | Best: PG recovers instantly; Worst: Wait full start timeout before releasing lease |
| Standby Detection | 0 | loop | loop/2 | Best: Right at check point; Worst: Just missed check point |
| Lock & Promote | 0 | 2 | 1 | Best: Acquire lock and promote directly; Worst: API timeout + Promote |
| Health Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: State changes before check; Worst: State changes right after check |

Key Difference Between Active and Passive Failure:

| Scenario | Patroni Status | Lease Handling | Main Wait Time |
|----------|----------------|----------------|----------------|
| Active Failure (PG crash) | Alive, healthy | Actively tries to restart PG, releases lease after timeout | primary_start_timeout |
| Passive Failure (node down) | Dies with node | Cannot actively release, must wait for TTL expiry | ttl |

In active failure scenarios, Patroni remains alive and can actively detect PG crash and attempt restart. If restart succeeds, service self-heals; if timeout expires without recovery, Patroni actively releases the Leader Key, triggering cluster election.


Timing Analysis

Phase 1: Failure Detection

Patroni checks PostgreSQL status every loop_wait cycle (via pg_isready or process check).

Timeline:
    Last check      PG crash      Next check
       |              |              |
       |←── 0~loop ──→|              |
  • Best case: PG crashes right before Patroni check, detected immediately, wait 0
  • Worst case: PG crashes right after check, wait for next cycle, wait loop
  • Average case: loop/2

$$T_{detect} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$

Phase 2: Restart Timeout

After Patroni detects PG crash, it attempts to restart PostgreSQL. This phase has two possible outcomes:

Timeline:
  Crash detected     Restart attempt     Success/Timeout
      |                  |                    |
      |←──── 0 ~ start ─────────────────────→|

Path A: Self-healing Success (Best case)

  • PG restarts successfully, service recovers
  • No failover triggered, extremely short RTO
  • Wait time: 0 (relative to Failover path)

Path B: Failover Required (Average/Worst case)

  • PG still not recovered after primary_start_timeout
  • Patroni actively releases Leader Key
  • Wait time: start

$$T_{restart} = \begin{cases} 0 & \text{Best (self-healing success)} \\ start & \text{Average (failover required)} \\ start & \text{Worst} \end{cases}$$

Note: Average case assumes failover is required. If PG can quickly self-heal, overall RTO will be significantly lower.
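
The two paths can be read as a simple deadline loop. The sketch below is a minimal illustration of that decision, assuming hypothetical try_restart and release_leader_key callables supplied by the caller; it is not Patroni's actual implementation.

```python
import time
from typing import Callable

def handle_primary_crash(try_restart: Callable[[], bool],
                         release_leader_key: Callable[[], None],
                         start_timeout: float, loop_wait: float) -> str:
    """Path A: PG comes back within primary_start_timeout -> local self-healing.
    Path B: the timeout expires -> release the Leader Key and let standbys elect."""
    deadline = time.monotonic() + start_timeout
    while time.monotonic() < deadline:
        if try_restart():              # caller-supplied, e.g. pg_ctl start + pg_isready
            return "self-healed"       # Path A: no failover, RTO stays tiny
        time.sleep(loop_wait)          # try again on the next loop_wait cycle
    release_leader_key()               # Path B: give up the lease, triggering election
    return "failover"
```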

Phase 3: Standby Detection

Standbys wake up on loop_wait cycle and check Leader Key status in DCS. When primary Patroni releases the Leader Key, standbys discover this and begin election.

Timeline:
    Lease released    Standby wakes
       |                  |
       |←── 0~loop ──────→|
  • Best case: Standby wakes right when lease is released, wait 0
  • Worst case: Standby just went to sleep when lease released, wait loop
  • Average case: loop/2

$$T_{standby} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$

Phase 4: Lock & Promote

After standbys discover Leader Key vacancy, election begins. The standby that acquires the Leader Key executes pg_ctl promote to become the new primary.

  1. Each standby queries the other members' replication positions in parallel via the Patroni REST API (typically ~10ms, with a hardcoded 2s timeout).
  2. WAL positions are compared to pick the best candidate; eligible standbys then attempt to create the Leader Key via an atomic CAS operation (the same pattern sketched in the passive-failure model).
  3. Execute pg_ctl promote to become primary (very fast, typically negligible)
Election process:
  StandbyA ──→ Query replication position ──→ Compare ──→ Try lock ──→ Success
  StandbyB ──→ Query replication position ──→ Compare ──→ Try lock ──→ Fail
  • Best case: Single standby or direct lock acquisition and promote, constant overhead 0.1s
  • Worst case: DCS API call timeout: 2s
  • Average case: 1s constant overhead

$$T_{elect} = \begin{cases} 0.1 & \text{Best} \\ 1 & \text{Average} \\ 2 & \text{Worst} \end{cases}$$

Phase 5: Health Check

HAProxy detects new primary online, requires rise consecutive successful health checks.

Check timeline:
  New primary    First check    Second check   Third check (UP)
     |              |               |               |
     |←─ 0~inter ──→|←─── fast ────→|←─── fast ────→|
  • Best case: New primary comes up right at check time, (rise-1) × fastinter
  • Worst case: New primary comes up right after check, (rise-1) × fastinter + inter
  • Average case: (rise-1) × fastinter + inter/2

$$T_{haproxy} = \begin{cases} (rise-1) \times fastinter & \text{Best} \\ (rise-1) \times fastinter + inter/2 & \text{Average} \\ (rise-1) \times fastinter + inter & \text{Worst} \end{cases}$$

RTO Formula

Sum all phase times to get total RTO:

Best Case (PG instant self-healing)

$$RTO_{min} = 0 + 0 + 0 + 0.1 + (rise-1) \times fastinter \approx (rise-1) \times fastinter$$

Average Case (Failover required)

$$RTO_{avg} = loop + start + 1 + inter/2 + (rise-1) \times fastinter$$

Worst Case

$$RTO_{max} = loop \times 2 + start + 2 + inter + (rise-1) \times fastinter$$

Model Calculation

Substituting the parameters of the four RTO presets into the formulas above:

pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s

Calculation Results for Four Modes (unit: seconds, format: min / avg / max)

| Phase | fast | norm | safe | wide |
|-------|------|------|------|------|
| Failure Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Restart Timeout | 0 / 15 / 15 | 0 / 25 / 25 | 0 / 45 / 45 | 0 / 95 / 95 |
| Standby Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| Total | 1 / 24 / 29 | 2 / 35 / 41 | 3 / 61 / 73 | 4 / 122 / 145 |
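
As with the passive model, these totals can be reproduced from the formulas; the helper below is again illustrative (not part of Pigsty) and uses its own rounding, so cells may drift from the table by about a second.

```python
# Active-failure RTO per the formulas above (illustrative helper, not part of Pigsty).
def active_rto(loop, start, inter, fastinter, rise):
    hc = (rise - 1) * fastinter                    # HAProxy needs `rise` successful checks
    best = 0.1 + hc                                # PG self-heals instantly, no failover
    avg = loop + start + 1 + inter / 2 + hc        # failover required
    worst = 2 * loop + start + 2 + inter + hc
    return best, avg, worst

PRESETS = {  # loop, start, inter(s), fastinter(s), rise -- taken from pg_rto_plan
    "fast": (5, 15, 1.0, 0.5, 3), "norm": (5, 25, 2.0, 1.0, 3),
    "safe": (10, 45, 3.0, 1.5, 3), "wide": (20, 95, 4.0, 2.0, 3),
}
for name, params in PRESETS.items():
    print(name, active_rto(*params))
# output: fast (1.1, 22.5, 29.0)   norm (2.1, 34.0, 41.0)
#         safe (3.1, 60.5, 73.0)   wide (4.1, 122.0, 145.0)
```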

Comparison with Passive Failure

| Aspect | Active Failure (PG crash) | Passive Failure (node down) | Description |
|--------|---------------------------|-----------------------------|-------------|
| Detection Mechanism | Patroni active detection | TTL passive expiry | Active detection discovers failure faster |
| Core Wait | start | ttl | start is usually less than ttl, but requires additional failure detection time |
| Lease Handling | Active release | Passive expiry | Active release is more timely |
| Self-healing Possible | Yes | No | Active detection can attempt local recovery |

RTO Comparison (Average case):

| Mode | Active Failure (PG crash) | Passive Failure (node down) | Difference |
|------|---------------------------|-----------------------------|------------|
| fast | 24s | 23s | +1s |
| norm | 35s | 34s | +1s |
| safe | 61s | 66s | -5s |
| wide | 122s | 127s | -5s |

Analysis: In fast and norm modes, the average RTO of an active failure is roughly on par with (about one second above) a passive failure, because the detection wait plus primary_start_timeout (loop + start) is close to ttl; in safe and wide modes, start < ttl - loop, so active failure is actually faster. In addition, an active failure may self-heal locally, in which case the RTO can be extremely short, as the best-case column shows.
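
Subtracting the two average-case formulas makes the comparison explicit:

$$RTO_{avg}^{active} - RTO_{avg}^{passive} = (loop + start) - ttl = \begin{cases} 0 & \text{fast, norm} \\ -5 & \text{safe, wide} \end{cases}$$

(The +1s shown for fast and norm in the table comes from rounding each phase to whole seconds before summing.)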