Patroni failures can be classified into ten categories by failure target; grouped by detection path, they consolidate into five final paths, as detailed in this section.
| # | Failure Scenario | Description | Final Path |
|---|------------------|-------------|------------|
| 1 | PG process crash | Crash, OOM killed | Active Detection |
| 2 | PG connection refused | max_connections | Active Detection |
| 3 | PG zombie | Process alive but unresponsive | Active Detection (timeout) |
| 4 | Patroni process crash | kill -9, OOM | Passive Detection |
| 5 | Patroni zombie | Process alive but stuck | Watchdog |
| 6 | Node down | Power outage, hardware failure | Passive Detection |
| 7 | Node zombie | IO hang, CPU starvation | Watchdog |
| 8 | Primary ↔ DCS network failure | Firewall, switch failure | Network Partition |
| 9 | Storage failure | Disk failure, disk full, mount failure | Active Detection or Watchdog |
| 10 | Manual switchover | Switchover/Failover | Manual Trigger |
However, for RTO calculation purposes, all failures ultimately converge to two paths. This section explores the upper bound, lower bound, and average RTO for these two scenarios.
```mermaid
flowchart LR
    A([Primary Failure]) --> B{Patroni<br/>Detected?}
    B -->|PG Crash| C[Attempt Local Restart]
    B -->|Node Down| D[Wait TTL Expiration]
    C -->|Success| E([Local Recovery])
    C -->|Fail/Timeout| F[Release Leader Lock]
    D --> F
    F --> G[Replica Election]
    G --> H[Execute Promote]
    H --> I[HAProxy Detects]
    I --> J([Service Restored])
    style A fill:#dc3545,stroke:#b02a37,color:#fff
    style E fill:#198754,stroke:#146c43,color:#fff
    style J fill:#198754,stroke:#146c43,color:#fff
```
1 - Model of Patroni Passive Failure
Failover path triggered by node crash causing leader lease expiration and cluster election
RTO Timeline
Failure Model
| Phase | Best | Worst | Average | Description |
|-------|------|-------|---------|-------------|
| Lease Expiration | ttl - loop | ttl | ttl - loop/2 | Best: crash just before refresh; Worst: crash right after refresh |
| Replica Detection | 0 | loop | loop / 2 | Best: exactly at check point; Worst: just missed check point |
| Lock Contest & Promote | 0 | 2 | 1 | Best: direct lock and promote; Worst: API timeout + Promote |
| Health Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: state change before check; Worst: state change right after check |
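Summing the four phases (with loop = loop_wait, ttl the Leader Key TTL, and rise / fastinter / inter the HAProxy health-check parameters) gives the overall bounds for the passive path:

$$
T_{\text{passive}}=
\begin{cases}
(ttl - loop) + (rise-1)\cdot fastinter & \text{Best}\\[2pt]
\big(ttl - \tfrac{loop}{2}\big) + \tfrac{loop}{2} + 1 + (rise-1)\cdot fastinter + \tfrac{inter}{2} & \text{Average}\\[2pt]
ttl + loop + 2 + (rise-1)\cdot fastinter + inter & \text{Worst}
\end{cases}
$$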
Key Difference Between Passive and Active Failover:
| Scenario | Patroni Status | Lease Handling | Main Wait Time |
|----------|----------------|----------------|----------------|
| Active Failover (PG crash) | Alive, healthy | Actively tries to restart PG, releases lease on timeout | primary_start_timeout |
| Passive Failover (Node crash) | Dies with node | Cannot actively release, must wait for TTL expiration | ttl |
In passive failover scenarios, Patroni dies along with the node and cannot actively release the Leader Key.
A cluster election can only be triggered after the lease in the DCS expires naturally at the end of its TTL.
Timeline Analysis
Phase 1: Lease Expiration
The Patroni primary refreshes the Leader Key every loop_wait cycle, resetting TTL to the configured value.
Timeline:
```
 t-loop           t                      t+ttl-loop      t+ttl
   |              |                           |             |
 Last Refresh   Failure                   Best Case     Worst Case
   |←── loop ───→|                           |             |
   |←──────────────────── ttl ──────────────→|
```
Best case: failure occurs just before the lease refresh (loop has elapsed since the last refresh), so the remaining TTL is ttl - loop
Worst case: failure occurs right after the lease refresh, so the full ttl must elapse
Average case: ttl - loop/2
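Denoting this phase's wait as $T_{\text{lease}}$, consistent with the table above:

$$
T_{\text{lease}}=
\begin{cases}
ttl - loop & \text{Best}\\
ttl - loop/2 & \text{Average}\\
ttl & \text{Worst}
\end{cases}
$$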
Phase 2: Replica Detection
Replicas wake up every loop_wait cycle and check the Leader Key status in DCS.
Best case: a replica happens to wake just as the lease expires, wait 0
Worst case: a replica has just entered its sleep when the lease expires, wait loop
Average case: loop/2

$$
T_{\text{detect}}=
\begin{cases}
0 & \text{Best}\\
loop/2 & \text{Average}\\
loop & \text{Worst}
\end{cases}
$$
Phase 3: Lock Contest & Promote
When replicas detect Leader Key expiration, they start the election process. The replica that acquires the Leader Key executes pg_ctl promote to become the new primary.
1. Query each replica's replication position in parallel via the REST API (typically ~10 ms, with a hardcoded 2 s timeout).
2. Compare WAL positions to determine the best candidate; eligible replicas attempt to create the Leader Key (an atomic CAS operation).
3. Execute pg_ctl promote to become the new primary (very fast, typically negligible).
Four Mode Calculation Results (unit: seconds, format: min / avg / max)
| Phase | fast | norm | safe | wide |
|-------|------|------|------|------|
| Lease Expiration | 15 / 17 / 20 | 25 / 27 / 30 | 50 / 55 / 60 | 100 / 110 / 120 |
| Replica Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock Contest & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| Total | 16 / 23 / 29 | 27 / 34 / 41 | 53 / 66 / 78 | 104 / 127 / 150 |
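As a concrete check, the following minimal Python sketch sums the per-phase formulas of the passive path. The per-mode parameter values are assumptions inferred from the result table (they are illustrative, not the authoritative tuning), and the averages can differ from the table by a fraction of a second due to per-phase rounding.

```python
def passive_rto(ttl, loop, rise, fastinter, inter, api_timeout=2):
    """Sum the four passive-failover phases from the failure model above.

    Returns (best, avg, worst) in seconds; api_timeout is the hardcoded
    2s REST API timeout of the lock-contest phase.
    """
    hc_best = (rise - 1) * fastinter          # HAProxy: state change just before a check
    best  = (ttl - loop)     + 0        + 0               + hc_best
    avg   = (ttl - loop / 2) + loop / 2 + api_timeout / 2 + hc_best + inter / 2
    worst = ttl              + loop     + api_timeout     + hc_best + inter
    return best, avg, worst

# Assumed per-mode parameters, inferred from the result table above (illustrative only).
modes = {
    "fast": dict(ttl=20,  loop=5,  rise=2, fastinter=1, inter=1),
    "norm": dict(ttl=30,  loop=5,  rise=3, fastinter=1, inter=2),
    "safe": dict(ttl=60,  loop=10, rise=4, fastinter=1, inter=3),
    "wide": dict(ttl=120, loop=20, rise=5, fastinter=1, inter=4),
}

for name, p in modes.items():
    best, avg, worst = passive_rto(**p)
    print(f"{name}: {best:g} / {avg:g} / {worst:g} s (min / avg / max)")
```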
2 - Model of Patroni Active Failure
The PostgreSQL primary process crashes while Patroni stays alive and attempts a restart, triggering failover only after the restart times out
RTO Timeline
Failure Model
| Phase | Best | Worst | Average | Description |
|-------|------|-------|---------|-------------|
| Failure Detection | 0 | loop | loop/2 | Best: PG crashes right before check; Worst: PG crashes right after check |
| Restart Timeout | 0 | start | start | Best: PG recovers instantly; Worst: wait full start timeout before releasing lease |
| Standby Detection | 0 | loop | loop/2 | Best: right at check point; Worst: just missed check point |
| Lock & Promote | 0 | 2 | 1 | Best: acquire lock and promote directly; Worst: API timeout + Promote |
| Health Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: state changes before check; Worst: state changes right after check |
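Summing the five phases (with start = primary_start_timeout) gives the active-failure bounds when failover is actually required:

$$
T_{\text{active}}=
\begin{cases}
(rise-1)\cdot fastinter & \text{Best}\\[2pt]
\tfrac{loop}{2} + start + \tfrac{loop}{2} + 1 + (rise-1)\cdot fastinter + \tfrac{inter}{2} & \text{Average}\\[2pt]
loop + start + loop + 2 + (rise-1)\cdot fastinter + inter & \text{Worst}
\end{cases}
$$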
Key Difference Between Active and Passive Failure:
| Scenario | Patroni Status | Lease Handling | Main Wait Time |
|----------|----------------|----------------|----------------|
| Active Failure (PG crash) | Alive, healthy | Actively tries to restart PG, releases lease after timeout | primary_start_timeout |
| Passive Failure (node down) | Dies with node | Cannot actively release, must wait for TTL expiry | ttl |
In active failure scenarios, Patroni remains alive and can actively detect PG crash and attempt restart.
If restart succeeds, service self-heals; if timeout expires without recovery, Patroni actively releases the Leader Key, triggering cluster election.
Timing Analysis
Phase 1: Failure Detection
Patroni checks PostgreSQL status every loop_wait cycle (via pg_isready or process check).
Timeline:
```
 Last check        PG crash        Next check
     |                 |                |
     |←──── 0~loop ───→|                |
```
Best case: PG crashes right before Patroni check, detected immediately, wait 0
Worst case: PG crashes right after check, wait for next cycle, wait loop
Average case: loop/2
$$
T_{\text{detect}}=
\begin{cases}
0 & \text{Best}\\
loop/2 & \text{Average}\\
loop & \text{Worst}
\end{cases}
$$
Phase 2: Restart Timeout
After Patroni detects the PG crash, it attempts to restart PostgreSQL. This phase has two possible outcomes:
Restart succeeds within primary_start_timeout (start): the service self-heals locally and no failover occurs.
Restart fails or the timeout expires: Patroni releases the Leader Key and the cluster proceeds to failover.
Note: the average case assumes failover is required. If PG can quickly self-heal, the overall RTO will be significantly lower.
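Per the failure model table, this phase therefore contributes (with start = primary_start_timeout):

$$
T_{\text{restart}}=
\begin{cases}
0 & \text{Best (instant local recovery)}\\
start & \text{Average (failover required)}\\
start & \text{Worst}
\end{cases}
$$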
Phase 3: Standby Detection
Standbys wake up on loop_wait cycle and check Leader Key status in DCS. When primary Patroni releases the Leader Key, standbys discover this and begin election.
Timeline:
```
 Lease released        Standby wakes
       |                     |
       |←────── 0~loop ─────→|
```
Best case: Standby wakes right when lease is released, wait 0
Worst case: Standby just went to sleep when lease released, wait loop
Average case: loop/2
$$
T_{\text{standby}}=
\begin{cases}
0 & \text{Best}\\
loop/2 & \text{Average}\\
loop & \text{Worst}
\end{cases}
$$
Phase 4: Lock & Promote
After standbys discover Leader Key vacancy, election begins. The standby that acquires the Leader Key executes pg_ctl promote to become the new primary.
1. Query each standby's replication position in parallel via the REST API (typically ~10 ms, with a hardcoded 2 s timeout).
2. Compare WAL positions to determine the best candidate; eligible standbys attempt to create the Leader Key (an atomic CAS operation).
3. Execute pg_ctl promote to become the new primary (very fast, typically negligible).
Calculation Results for Four Modes (unit: seconds, format: min / avg / max)
| Phase | fast | norm | safe | wide |
|-------|------|------|------|------|
| Failure Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Restart Timeout | 0 / 15 / 15 | 0 / 25 / 25 | 0 / 45 / 45 | 0 / 95 / 95 |
| Standby Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| Total | 1 / 24 / 29 | 2 / 35 / 41 | 3 / 61 / 73 | 4 / 122 / 145 |
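The same kind of sketch works for the active path; together with the passive function above it also reproduces the average-case comparison in the next section. The per-mode values (including start = primary_start_timeout) are again assumptions inferred from the result tables, and averages may differ slightly from the table due to per-phase rounding.

```python
def active_rto(loop, start, rise, fastinter, inter, api_timeout=2):
    """Sum the five active-failure phases, assuming failover is required
    (i.e. PostgreSQL does not recover within primary_start_timeout)."""
    hc_best = (rise - 1) * fastinter          # HAProxy: state change just before a check
    best  = 0        + 0     + 0        + 0               + hc_best
    avg   = loop / 2 + start + loop / 2 + api_timeout / 2 + hc_best + inter / 2
    worst = loop     + start + loop     + api_timeout     + hc_best + inter
    return best, avg, worst

# Assumed per-mode parameters (start = primary_start_timeout), illustrative only.
modes = {
    "fast": dict(loop=5,  start=15, rise=2, fastinter=1, inter=1),
    "norm": dict(loop=5,  start=25, rise=3, fastinter=1, inter=2),
    "safe": dict(loop=10, start=45, rise=4, fastinter=1, inter=3),
    "wide": dict(loop=20, start=95, rise=5, fastinter=1, inter=4),
}

for name, p in modes.items():
    best, avg, worst = active_rto(**p)
    print(f"{name}: {best:g} / {avg:g} / {worst:g} s (min / avg / max)")
```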
Comparison with Passive Failure
| Aspect | Active Failure (PG crash) | Passive Failure (node down) | Description |
|--------|---------------------------|-----------------------------|-------------|
| Detection Mechanism | Patroni active detection | TTL passive expiry | Active detection discovers the failure faster |
| Core Wait | start | ttl | start is usually less than ttl, but requires additional failure detection time |
| Lease Handling | Active release | Passive expiry | Active release is more timely |
| Self-healing Possible | Yes | No | Active detection can attempt local recovery |
RTO Comparison (Average case):
| Mode | Active Failure (PG crash) | Passive Failure (node down) | Difference |
|------|---------------------------|-----------------------------|------------|
| fast | 24s | 23s | +1s |
| norm | 35s | 34s | +1s |
| safe | 61s | 66s | -5s |
| wide | 122s | 127s | -5s |
Analysis: in fast and norm modes, the average active-failure RTO is slightly higher than the passive one, because the active path waits for primary_start_timeout (start). In safe and wide modes, where start < ttl - loop, the active path is actually faster. In addition, active failure leaves room for local self-healing, so its best-case RTO can be extremely short.
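For reference, the crossover condition above (start < ttl - loop) follows directly from the two average-case models: the lock/promote and health-check phases are identical, so the difference reduces to

$$
\Delta T_{\text{avg}}
= \Big(\tfrac{loop}{2} + start + \tfrac{loop}{2}\Big) - \Big(\big(ttl - \tfrac{loop}{2}\big) + \tfrac{loop}{2}\Big)
= start + loop - ttl
$$

so the active path is faster on average exactly when start < ttl - loop, which is the condition stated in the analysis.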