Failure Model

Detailed analysis of the worst-case, best-case, and average RTO calculation logic and results across the classic failure detection and recovery paths

Patroni failures can be classified into 10 categories by failure target, and further consolidated into five categories based on detection path, which are detailed in this section.

| # | Failure Scenario | Description | Final Path |
|---|------------------|-------------|------------|
| 1 | PG process crash | crash, OOM killed | Active Detection |
| 2 | PG connection refused | max_connections | Active Detection |
| 3 | PG zombie | Process alive but unresponsive | Active Detection (timeout) |
| 4 | Patroni process crash | kill -9, OOM | Passive Detection |
| 5 | Patroni zombie | Process alive but stuck | Watchdog |
| 6 | Node down | Power outage, hardware failure | Passive Detection |
| 7 | Node zombie | IO hang, CPU starvation | Watchdog |
| 8 | Primary ↔ DCS network failure | Firewall, switch failure | Network Partition |
| 9 | Storage failure | Disk failure, disk full, mount failure | Active Detection or Watchdog |
| 10 | Manual switchover | Switchover/Failover | Manual Trigger |

However, for RTO calculation purposes, all failures ultimately converge to two paths. This section explores the upper bound, lower bound, and average RTO for these two scenarios.

flowchart LR
    A([Primary Failure]) --> B{Patroni<br/>Detected?}

    B -->|PG Crash| C[Attempt Local Restart]
    B -->|Node Down| D[Wait TTL Expiration]

    C -->|Success| E([Local Recovery])
    C -->|Fail/Timeout| F[Release Leader Lock]

    D --> F
    F --> G[Replica Election]
    G --> H[Execute Promote]
    H --> I[HAProxy Detects]
    I --> J([Service Restored])

    style A fill:#dc3545,stroke:#b02a37,color:#fff
    style E fill:#198754,stroke:#146c43,color:#fff
    style J fill:#198754,stroke:#146c43,color:#fff

1 - Model of Patroni Passive Failure

Failover path triggered by node crash causing leader lease expiration and cluster election

RTO Timeline


Failure Model

| Phase | Best | Worst | Average | Description |
|-------|------|-------|---------|-------------|
| Lease Expiration | ttl - loop | ttl | ttl - loop/2 | Best: crash just before refresh; Worst: crash right after refresh |
| Replica Detection | 0 | loop | loop / 2 | Best: exactly at check point; Worst: just missed check point |
| Lock Contest & Promote | 0 | 2 | 1 | Best: direct lock and promote; Worst: API timeout + Promote |
| Health Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: state change before check; Worst: state change right after check |

Key Difference Between Passive and Active Failover:

| Scenario | Patroni Status | Lease Handling | Primary Wait Time |
|----------|----------------|----------------|-------------------|
| Active Failover (PG crash) | Alive, healthy | Actively tries to restart PG, releases lease on timeout | primary_start_timeout |
| Passive Failover (Node crash) | Dies with node | Cannot actively release, must wait for TTL expiration | ttl |

In passive failover scenarios, Patroni dies along with the node and cannot actively release the Leader Key. The lease in DCS can only trigger cluster election after TTL naturally expires.


Timeline Analysis

Phase 1: Lease Expiration

The Patroni primary refreshes the Leader Key every loop_wait cycle, resetting TTL to the configured value.

Timeline:
     t-loop        t          t+ttl-loop    t+ttl
       |           |              |           |
    Last Refresh  Failure      Best Case   Worst Case
       |←── loop ──→|              |           |
       |←──────────── ttl ─────────────────────→|
  • Best case: Failure occurs just before the next lease refresh (nearly loop has elapsed since the last refresh), so remaining TTL = ttl - loop
  • Worst case: Failure occurs right after lease refresh, must wait full ttl
  • Average case: ttl - loop/2

$$T_{expire} = \begin{cases} ttl - loop & \text{Best} \\ ttl - loop/2 & \text{Average} \\ ttl & \text{Worst} \end{cases}$$
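
For example, plugging the norm preset (ttl = 30, loop = 5) into this formula gives:

$$T_{expire} = \begin{cases} 30 - 5 = 25 & \text{Best} \\ 30 - 5/2 = 27.5 & \text{Average} \\ 30 & \text{Worst} \end{cases}$$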

Phase 2: Replica Detection

Replicas wake up on loop_wait cycles and check the Leader Key status in DCS.

Timeline:
    Lease Expired   Replica Wakes
       |            |
       |←── 0~loop ─→|
  • Best case: Replica happens to wake when lease expires, wait 0
  • Worst case: Replica just entered sleep when lease expires, wait loop
  • Average case: loop/2

$$T_{detect} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$

Phase 3: Lock Contest & Promote

When replicas detect Leader Key expiration, they start the election process. The replica that acquires the Leader Key executes pg_ctl promote to become the new primary.

  1. Each replica queries the other members' replication positions in parallel via the Patroni REST API (typically ~10ms, with a hardcoded 2s timeout).
  2. WAL positions are compared to pick the best candidate; eligible replicas then attempt to create the Leader Key via an atomic CAS operation (see the sketch after this phase's formula).
  3. Execute pg_ctl promote to become primary (very fast, typically negligible)
Election Flow:
  ReplicaA ──→ Query replication position ──→ Compare ──→ Contest lock ──→ Success
  ReplicaB ──→ Query replication position ──→ Compare ──→ Contest lock ──→ Fail
  • Best case: Single replica or immediate lock acquisition and promotion, constant overhead 0.1s
  • Worst case: DCS API call timeout: 2s
  • Average case: 1s constant overhead

$$T_{elect} = \begin{cases} 0.1 & \text{Best} \\ 1 & \text{Average} \\ 2 & \text{Worst} \end{cases}$$
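
The lock contest in step 2 boils down to an atomic "create the Leader Key only if it does not exist" request against the DCS. Below is a minimal sketch of that CAS pattern, assuming an etcd DCS and the python-etcd3 client; the key path and candidate name are illustrative, and Patroni's real election adds WAL-position checks, retries, and lease keepalive handling.

```python
# Minimal CAS leader-lock sketch (illustrative, not Patroni's actual code).
# Assumes an etcd DCS and the python-etcd3 client; key/value names are made up.
import etcd3

def try_acquire_leader(client: etcd3.Etcd3Client, candidate: str, ttl: int) -> bool:
    lease = client.lease(ttl)                       # lease that expires after `ttl` seconds
    key = "/service/pg-test/leader"                 # illustrative leader-key path
    ok, _ = client.transaction(
        # succeed only if the key does not exist yet (create_revision == 0)
        compare=[client.transactions.create(key) == 0],
        success=[client.transactions.put(key, candidate, lease=lease)],
        failure=[],
    )
    if not ok:
        lease.revoke()                              # lost the contest: give back the lease
    return ok                                       # the winner goes on to run pg_ctl promote
```

Whichever replica's transaction succeeds owns the key until it stops refreshing the lease; every other transaction fails atomically, which is what keeps split-brain out of the election step.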

Phase 4: Health Check

HAProxy detects the new primary online, requiring rise consecutive successful health checks.

Detection Timeline:
  New Primary    First Check   Second Check  Third Check (UP)
     |          |           |           |
     |←─ 0~inter ─→|←─ fast ─→|←─ fast ─→|
  • Best case: New primary promoted just before check, (rise-1) × fastinter
  • Worst case: New primary promoted right after check, (rise-1) × fastinter + inter
  • Average case: (rise-1) × fastinter + inter/2

$$T_{haproxy} = \begin{cases} (rise-1) \times fastinter & \text{Best} \\ (rise-1) \times fastinter + inter/2 & \text{Average} \\ (rise-1) \times fastinter + inter & \text{Worst} \end{cases}$$
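
For example, with the norm preset (inter = 2s, fastinter = 1s, rise = 3):

$$T_{haproxy} = \begin{cases} (3-1) \times 1 = 2 & \text{Best} \\ 2 + 2/2 = 3 & \text{Average} \\ 2 + 2 = 4 & \text{Worst} \end{cases}$$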

RTO Formula

Sum all phase times to get total RTO:

Best Case

$$RTO_{min} = ttl - loop + 0.1 + (rise-1) \times fastinter$$

Average Case

$$RTO_{avg} = ttl + 1 + inter/2 + (rise-1) \times fastinter$$

Worst Case

$$RTO_{max} = ttl + loop + 2 + inter + (rise-1) \times fastinter$$

Model Calculation

Substitute the parameters of the four RTO presets into the formulas above:

pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s

Four Mode Calculation Results (unit: seconds, format: min / avg / max)

| Phase | fast | norm | safe | wide |
|-------|------|------|------|------|
| Lease Expiration | 15 / 17 / 20 | 25 / 27 / 30 | 50 / 55 / 60 | 100 / 110 / 120 |
| Replica Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock Contest & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| Total | 16 / 23 / 29 | 27 / 34 / 41 | 53 / 66 / 78 | 104 / 127 / 150 |
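
These totals follow mechanically from the formulas above. The short Python sketch below (an illustrative helper, not part of Pigsty) recomputes them from the preset parameters; it keeps one decimal place instead of rounding each phase, so individual cells may differ from the table by about a second.

```python
# Passive-failover RTO per the formulas above (illustrative helper, not part of Pigsty).
PRESETS = {  # ttl, loop, start, inter(s), fastinter(s), rise -- taken from pg_rto_plan
    "fast": (20, 5, 15, 1.0, 0.5, 3),
    "norm": (30, 5, 25, 2.0, 1.0, 3),
    "safe": (60, 10, 45, 3.0, 1.5, 3),
    "wide": (120, 20, 95, 4.0, 2.0, 3),
}

def passive_rto(ttl, loop, start, inter, fastinter, rise):
    # `start` is unused here: passive failover waits for TTL, not primary_start_timeout
    hc = (rise - 1) * fastinter                    # HAProxy needs `rise` successful checks
    best = ttl - loop + 0.1 + hc
    avg = ttl + 1 + inter / 2 + hc
    worst = ttl + loop + 2 + inter + hc
    return best, avg, worst

for name, params in PRESETS.items():
    print(name, ["%.1f" % t for t in passive_rto(*params)])
# output: fast ['16.1', '22.5', '29.0']   norm ['27.1', '34.0', '41.0']
#         safe ['53.1', '65.5', '78.0']   wide ['104.1', '127.0', '150.0']
```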

2 - Model of Patroni Active Failure

PostgreSQL primary process crashes while Patroni stays alive and attempts restart, triggering failover after timeout

RTO Timeline


Failure Model

| Item | Best | Worst | Average | Description |
|------|------|-------|---------|-------------|
| Failure Detection | 0 | loop | loop/2 | Best: PG crashes right before check; Worst: PG crashes right after check |
| Restart Timeout | 0 | start | start | Best: PG recovers instantly; Worst: Wait full start timeout before releasing lease |
| Standby Detection | 0 | loop | loop/2 | Best: Right at check point; Worst: Just missed check point |
| Lock & Promote | 0 | 2 | 1 | Best: Acquire lock and promote directly; Worst: API timeout + Promote |
| Health Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: State changes before check; Worst: State changes right after check |

Key Difference Between Active and Passive Failure:

| Scenario | Patroni Status | Lease Handling | Main Wait Time |
|----------|----------------|----------------|----------------|
| Active Failure (PG crash) | Alive, healthy | Actively tries to restart PG, releases lease after timeout | primary_start_timeout |
| Passive Failure (node down) | Dies with node | Cannot actively release, must wait for TTL expiry | ttl |

In active failure scenarios, Patroni remains alive and can actively detect PG crash and attempt restart. If restart succeeds, service self-heals; if timeout expires without recovery, Patroni actively releases the Leader Key, triggering cluster election.


Timing Analysis

Phase 1: Failure Detection

Patroni checks PostgreSQL status every loop_wait cycle (via pg_isready or process check).

Timeline:
    Last check      PG crash      Next check
       |              |              |
       |←── 0~loop ──→|              |
  • Best case: PG crashes right before Patroni check, detected immediately, wait 0
  • Worst case: PG crashes right after check, wait for next cycle, wait loop
  • Average case: loop/2

$$T_{detect} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$

Phase 2: Restart Timeout

After Patroni detects PG crash, it attempts to restart PostgreSQL. This phase has two possible outcomes:

Timeline:
  Crash detected     Restart attempt     Success/Timeout
      |                  |                    |
      |←──── 0 ~ start ─────────────────────→|

Path A: Self-healing Success (Best case)

  • PG restarts successfully, service recovers
  • No failover triggered, extremely short RTO
  • Wait time: 0 (relative to Failover path)

Path B: Failover Required (Average/Worst case)

  • PG still not recovered after primary_start_timeout
  • Patroni actively releases Leader Key
  • Wait time: start

$$T_{restart} = \begin{cases} 0 & \text{Best (self-healing success)} \\ start & \text{Average (failover required)} \\ start & \text{Worst} \end{cases}$$

Note: Average case assumes failover is required. If PG can quickly self-heal, overall RTO will be significantly lower.
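
The two paths can be read as a simple deadline loop. The sketch below is a minimal illustration of that decision, assuming hypothetical try_restart and release_leader_key callables supplied by the caller; it is not Patroni's actual implementation.

```python
import time
from typing import Callable

def handle_primary_crash(try_restart: Callable[[], bool],
                         release_leader_key: Callable[[], None],
                         start_timeout: float, loop_wait: float) -> str:
    """Path A: PG comes back within primary_start_timeout -> local self-healing.
    Path B: the timeout expires -> release the Leader Key and let standbys elect."""
    deadline = time.monotonic() + start_timeout
    while time.monotonic() < deadline:
        if try_restart():              # caller-supplied, e.g. pg_ctl start + pg_isready
            return "self-healed"       # Path A: no failover, RTO stays tiny
        time.sleep(loop_wait)          # try again on the next loop_wait cycle
    release_leader_key()               # Path B: give up the lease, triggering election
    return "failover"
```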

Phase 3: Standby Detection

Standbys wake up on loop_wait cycle and check Leader Key status in DCS. When primary Patroni releases the Leader Key, standbys discover this and begin election.

Timeline:
    Lease released    Standby wakes
       |                  |
       |←── 0~loop ──────→|
  • Best case: Standby wakes right when lease is released, wait 0
  • Worst case: Standby just went to sleep when lease released, wait loop
  • Average case: loop/2

$$T_{standby} = \begin{cases} 0 & \text{Best} \\ loop/2 & \text{Average} \\ loop & \text{Worst} \end{cases}$$

Phase 4: Lock & Promote

After standbys discover Leader Key vacancy, election begins. The standby that acquires the Leader Key executes pg_ctl promote to become the new primary.

  1. Each standby queries the other members' replication positions in parallel via the Patroni REST API (typically ~10ms, with a hardcoded 2s timeout).
  2. WAL positions are compared to pick the best candidate; eligible standbys then attempt to create the Leader Key via an atomic CAS operation (the same pattern sketched in the passive-failure model).
  3. Execute pg_ctl promote to become primary (very fast, typically negligible)
Election process:
  StandbyA ──→ Query replication position ──→ Compare ──→ Try lock ──→ Success
  StandbyB ──→ Query replication position ──→ Compare ──→ Try lock ──→ Fail
  • Best case: Single standby or direct lock acquisition and promote, constant overhead 0.1s
  • Worst case: DCS API call timeout: 2s
  • Average case: 1s constant overhead

$$T_{elect} = \begin{cases} 0.1 & \text{Best} \\ 1 & \text{Average} \\ 2 & \text{Worst} \end{cases}$$

Phase 5: Health Check

HAProxy detects new primary online, requires rise consecutive successful health checks.

Check timeline:
  New primary    First check    Second check   Third check (UP)
     |              |               |               |
     |←─ 0~inter ──→|←─── fast ────→|←─── fast ────→|
  • Best case: New primary comes up right at check time, (rise-1) × fastinter
  • Worst case: New primary comes up right after check, (rise-1) × fastinter + inter
  • Average case: (rise-1) × fastinter + inter/2

$$T_{haproxy} = \begin{cases} (rise-1) \times fastinter & \text{Best} \\ (rise-1) \times fastinter + inter/2 & \text{Average} \\ (rise-1) \times fastinter + inter & \text{Worst} \end{cases}$$

RTO Formula

Sum all phase times to get total RTO:

Best Case (PG instant self-healing)

$$RTO_{min} = 0 + 0 + 0 + 0.1 + (rise-1) \times fastinter \approx (rise-1) \times fastinter$$

Average Case (Failover required)

$$RTO_{avg} = loop + start + 1 + inter/2 + (rise-1) \times fastinter$$

Worst Case

$$RTO_{max} = loop \times 2 + start + 2 + inter + (rise-1) \times fastinter$$

Model Calculation

Substituting the parameters of the four RTO presets into the formulas above:

pg_rto_plan:  # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [ 20  ,5  ,5  ,15 ,5  ,'1s' ,'0.5s' ,'1s' ,3 ,3 ]  # rto < 30s
  norm: [ 30  ,5  ,10 ,25 ,5  ,'2s' ,'1s'   ,'2s' ,3 ,3 ]  # rto < 45s
  safe: [ 60  ,10 ,20 ,45 ,10 ,'3s' ,'1.5s' ,'3s' ,3 ,3 ]  # rto < 90s
  wide: [ 120 ,20 ,30 ,95 ,15 ,'4s' ,'2s'   ,'4s' ,3 ,3 ]  # rto < 150s

Calculation Results for Four Modes (unit: seconds, format: min / avg / max)

| Phase | fast | norm | safe | wide |
|-------|------|------|------|------|
| Failure Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Restart Timeout | 0 / 15 / 15 | 0 / 25 / 25 | 0 / 45 / 45 | 0 / 95 / 95 |
| Standby Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| Total | 1 / 24 / 29 | 2 / 35 / 41 | 3 / 61 / 73 | 4 / 122 / 145 |
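
As with the passive model, these totals can be reproduced from the formulas; the helper below is again illustrative (not part of Pigsty) and uses its own rounding, so cells may drift from the table by about a second.

```python
# Active-failure RTO per the formulas above (illustrative helper, not part of Pigsty).
def active_rto(loop, start, inter, fastinter, rise):
    hc = (rise - 1) * fastinter                    # HAProxy needs `rise` successful checks
    best = 0.1 + hc                                # PG self-heals instantly, no failover
    avg = loop + start + 1 + inter / 2 + hc        # failover required
    worst = 2 * loop + start + 2 + inter + hc
    return best, avg, worst

PRESETS = {  # loop, start, inter(s), fastinter(s), rise -- taken from pg_rto_plan
    "fast": (5, 15, 1.0, 0.5, 3), "norm": (5, 25, 2.0, 1.0, 3),
    "safe": (10, 45, 3.0, 1.5, 3), "wide": (20, 95, 4.0, 2.0, 3),
}
for name, params in PRESETS.items():
    print(name, active_rto(*params))
# output: fast (1.1, 22.5, 29.0)   norm (2.1, 34.0, 41.0)
#         safe (3.1, 60.5, 73.0)   wide (4.1, 122.0, 145.0)
```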

Comparison with Passive Failure

| Aspect | Active Failure (PG crash) | Passive Failure (node down) | Description |
|--------|---------------------------|-----------------------------|-------------|
| Detection Mechanism | Patroni active detection | TTL passive expiry | Active detection discovers failure faster |
| Core Wait | start | ttl | start is usually less than ttl, but requires additional failure detection time |
| Lease Handling | Active release | Passive expiry | Active release is more timely |
| Self-healing Possible | Yes | No | Active detection can attempt local recovery |

RTO Comparison (Average case):

| Mode | Active Failure (PG crash) | Passive Failure (node down) | Difference |
|------|---------------------------|-----------------------------|------------|
| fast | 24s | 23s | +1s |
| norm | 35s | 34s | +1s |
| safe | 61s | 66s | -5s |
| wide | 122s | 127s | -5s |

Analysis: In fast and norm modes, the average RTO of an active failure is roughly on par with (about one second above) a passive failure, because the detection wait plus primary_start_timeout (loop + start) is close to ttl; in safe and wide modes, start < ttl - loop, so active failure is actually faster. In addition, an active failure may self-heal locally, in which case the RTO can be extremely short, as the best-case column shows.
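
Subtracting the two average-case formulas makes the comparison explicit:

$$RTO_{avg}^{active} - RTO_{avg}^{passive} = (loop + start) - ttl = \begin{cases} 0 & \text{fast, norm} \\ -5 & \text{safe, wide} \end{cases}$$

(The +1s shown for fast and norm in the table comes from rounding each phase to whole seconds before summing.)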