Model of Patroni Active Failure
RTO Timeline
Failure Model
| Item | Best | Worst | Average | Description |
|---|---|---|---|---|
| Failure Detection | 0 | loop | loop/2 | Best: PG crashes right before a check; Worst: PG crashes right after a check |
| Restart Timeout | 0 | start | start | Best: PG recovers instantly; Worst: wait the full start timeout before releasing the lease |
| Standby Detection | 0 | loop | loop/2 | Best: lease released right at a check point; Worst: just missed a check point |
| Lock & Promote | 0 | 2 | 1 | Best: acquire the lock and promote directly; Worst: API timeout + promote |
| Health Check | (rise-1) × fastinter | (rise-1) × fastinter + inter | (rise-1) × fastinter + inter/2 | Best: state changes right before a check; Worst: state changes right after a check |
Key Differences Between Active and Passive Failure:
| Scenario | Patroni Status | Lease Handling | Main Wait Time |
|---|---|---|---|
| Active Failure (PG crash) | Alive, healthy | Actively tries to restart PG, releases lease after timeout | primary_start_timeout |
| Passive Failure (node down) | Dies with node | Cannot actively release, must wait for TTL expiry | ttl |
In active failure scenarios, Patroni remains alive, so it can actively detect the PG crash and attempt a restart. If the restart succeeds, the service self-heals; if the timeout expires without recovery, Patroni actively releases the Leader Key, triggering a cluster election.
Timing Analysis
Phase 1: Failure Detection
Patroni checks PostgreSQL status every `loop_wait` cycle (via `pg_isready` or a process check).
Timeline:
```
Last check            PG crash            Next check
    |                     |                    |
    |←──────────────── 0 ~ loop ─────────────→|
```
- Best case: PG crashes right before the Patroni check and is detected immediately, wait `0`
- Worst case: PG crashes right after a check and waits for the next cycle, wait `loop`
- Average case: `loop/2`
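To make the per-phase arithmetic easy to check, here is a minimal Python sketch of this uniform-wait model; the helper name `detection_wait` is illustrative, not part of Patroni.

```python
# Detection-delay model: the crash can land anywhere inside a loop_wait cycle,
# so the wait until the next check is uniformly distributed over [0, loop].
def detection_wait(loop: float) -> tuple[float, float, float]:
    """Return (best, average, worst) detection wait in seconds."""
    return 0.0, loop / 2, float(loop)

print(detection_wait(5))   # loop_wait=5  (fast/norm): (0.0, 2.5, 5.0)
print(detection_wait(10))  # loop_wait=10 (safe):      (0.0, 5.0, 10.0)
```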
Phase 2: Restart Timeout
After Patroni detects PG crash, it attempts to restart PostgreSQL. This phase has two possible outcomes:
Timeline:
```
Crash detected       Restart attempt       Success / Timeout
      |                     |                      |
      |←───────────────── 0 ~ start ─────────────→|
```
Path A: Self-healing Success (Best case)
- PG restarts successfully and the service recovers
- No failover is triggered, so RTO is extremely short
- Wait time: `0` (relative to the failover path)

Path B: Failover Required (Average/Worst case)
- PG is still not recovered after `primary_start_timeout`
- Patroni actively releases the Leader Key
- Wait time: `start`
Note: Average case assumes failover is required. If PG can quickly self-heal, overall RTO will be significantly lower.
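The two paths can be sketched as a simple retry-then-release loop. This is an illustrative simplification, not Patroni's actual code; `try_restart` and `release_leader_key` are hypothetical callbacks standing in for the real restart and DCS operations.

```python
import time

# Illustrative sketch of the two paths above (NOT Patroni's real implementation):
# keep retrying a local restart until primary_start_timeout, then give up the lease.
def handle_pg_crash(try_restart, release_leader_key, primary_start_timeout: float) -> str:
    deadline = time.monotonic() + primary_start_timeout
    while time.monotonic() < deadline:
        if try_restart():            # Path A: self-healing, no failover needed
            return "self-healed"
        time.sleep(1)                # wait a bit before the next restart attempt
    release_leader_key()             # Path B: give up the lease, trigger election
    return "failover"

# Example: a restart that never succeeds leads to failover after ~3 seconds.
print(handle_pg_crash(lambda: False, lambda: print("lease released"), 3))
```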
Phase 3: Standby Detection
Standbys wake up every `loop_wait` cycle and check the Leader Key status in the DCS. When the primary's Patroni releases the Leader Key, the standbys discover the vacancy and begin election.
Timeline:
```
Lease released         Standby wakes
      |                      |
      |←────── 0 ~ loop ────→|
```
- Best case: a standby wakes right when the lease is released, wait `0`
- Worst case: the standby just went to sleep when the lease was released, wait `loop`
- Average case: `loop/2`
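A quick Monte Carlo check of the `loop/2` average, assuming the lease release lands at a uniformly random point within a standby's wake cycle:

```python
import random

# If the lease is released at a uniformly random point inside a standby's
# loop_wait cycle, the expected wait until its next wake-up is loop / 2.
def avg_standby_wait(loop: float, trials: int = 100_000) -> float:
    return sum(loop - random.uniform(0, loop) for _ in range(trials)) / trials

print(avg_standby_wait(5))   # ≈ 2.5s for loop_wait=5
print(avg_standby_wait(20))  # ≈ 10s  for loop_wait=20
```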
Phase 4: Lock & Promote
After the standbys discover that the Leader Key is vacant, election begins. The standby that acquires the Leader Key executes `pg_ctl promote` to become the new primary.
- Query each standby's replication position in parallel via the REST API: typically ~10ms, with a hardcoded 2s timeout.
- Compare WAL positions to determine the best candidate; eligible standbys attempt to create the Leader Key (an atomic CAS operation).
- Execute `pg_ctl promote` to become the new primary (very fast, typically negligible).
Election process:
```
StandbyA ──→ Query replication position ──→ Compare ──→ Try lock ──→ Success
StandbyB ──→ Query replication position ──→ Compare ──→ Try lock ──→ Fail
```
- Best case: single standby or direct lock acquisition and promote, constant overhead of `0.1s`
- Worst case: DCS API call timeout, `2s`
- Average case: `1s` constant overhead
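The race for the Leader Key boils down to an atomic create-if-absent. The sketch below models that with a dict guarded by a lock; the DCS client is abstracted away, and `try_acquire_leader` is a hypothetical stand-in, not Patroni's API.

```python
import threading

# Simplified model of the CAS-style election: whoever creates the Leader Key
# first wins and promotes; everyone else stays a standby.
dcs: dict[str, str] = {}        # stands in for etcd/consul key space
dcs_lock = threading.Lock()     # models the DCS's atomic create-if-absent

def try_acquire_leader(standby: str) -> bool:
    with dcs_lock:
        if "leader" in dcs:     # someone else already created the key
            return False
        dcs["leader"] = standby # atomic create succeeds -> this node promotes
        return True

for node in ("standby_a", "standby_b"):
    print(node, "promote" if try_acquire_leader(node) else "follow")
```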
Phase 5: Health Check
HAProxy detects the new primary coming online, which requires `rise` consecutive successful health checks.
Check timeline:
```
New primary       First check        Second check       Third check (UP)
     |                 |                   |                   |
     |←── 0 ~ inter ──→|←───── fast ──────→|←───── fast ──────→|
```
- Best case: the new primary comes up right at check time, `(rise-1) × fastinter`
- Worst case: the new primary comes up right after a check, `(rise-1) × fastinter + inter`
- Average case: `(rise-1) × fastinter + inter/2`
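Expressed in code, the health-check bounds follow directly from the HAProxy parameters; a minimal sketch with the check intervals given in seconds:

```python
# Health-check delay model: wait up to `inter` for the first check after the
# new primary comes up, then (rise - 1) more checks spaced `fastinter` apart.
def health_check_wait(inter: float, fastinter: float, rise: int) -> tuple[float, float, float]:
    base = (rise - 1) * fastinter
    return base, base + inter / 2, base + inter   # (best, average, worst)

print(health_check_wait(1, 0.5, 3))  # fast plan: (1.0, 1.5, 2.0)
print(health_check_wait(4, 2.0, 3))  # wide plan: (4.0, 6.0, 8.0)
```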
RTO Formula
Summing all phase times gives the total RTO:

Best Case (PG instant self-healing): `RTO ≈ (rise - 1) × fastinter`

Average Case (Failover required): `RTO ≈ loop/2 + start + loop/2 + 1 + (rise - 1) × fastinter + inter/2`

Worst Case: `RTO ≈ loop + start + loop + 2 + (rise - 1) × fastinter + inter`
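These formulas translate directly into code. Below is a minimal sketch; the constants `1` and `2` are the Lock & Promote overheads from Phase 4, and the table in the next section appears to round each phase up to whole seconds before summing.

```python
# Total RTO for the active-failure model, per the formulas above.
def rto_best(fastinter: float, rise: int) -> float:
    return (rise - 1) * fastinter                    # PG self-heals instantly

def rto_avg(loop: float, start: float, inter: float, fastinter: float, rise: int) -> float:
    return loop / 2 + start + loop / 2 + 1 + (rise - 1) * fastinter + inter / 2

def rto_worst(loop: float, start: float, inter: float, fastinter: float, rise: int) -> float:
    return loop + start + loop + 2 + (rise - 1) * fastinter + inter

# fast plan: loop=5, start=15, inter=1, fastinter=0.5, rise=3
print(rto_best(0.5, 3), rto_avg(5, 15, 1, 0.5, 3), rto_worst(5, 15, 1, 0.5, 3))
# -> 1.0 22.5 29.0  (matching the 1 / 24 / 29 row once each phase is rounded up)
```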
Model Calculation
Substituting the parameters of the four RTO plans into the formulas above:
```yaml
pg_rto_plan: # [ttl, loop, retry, start, margin, inter, fastinter, downinter, rise, fall]
  fast: [  20 ,  5 ,  5 , 15 ,  5 , '1s' , '0.5s' , '1s' , 3 , 3 ]  # rto < 30s
  norm: [  30 ,  5 , 10 , 25 ,  5 , '2s' , '1s'   , '2s' , 3 , 3 ]  # rto < 45s
  safe: [  60 , 10 , 20 , 45 , 10 , '3s' , '1.5s' , '3s' , 3 , 3 ]  # rto < 90s
  wide: [ 120 , 20 , 30 , 95 , 15 , '4s' , '2s'   , '4s' , 3 , 3 ]  # rto < 150s
```
Calculation Results for Four Modes (unit: seconds, format: min / avg / max)
| Phase | fast | norm | safe | wide |
|---|---|---|---|---|
| Failure Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Restart Timeout | 0 / 15 / 15 | 0 / 25 / 25 | 0 / 45 / 45 | 0 / 95 / 95 |
| Standby Detection | 0 / 3 / 5 | 0 / 3 / 5 | 0 / 5 / 10 | 0 / 10 / 20 |
| Lock & Promote | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 | 0 / 1 / 2 |
| Health Check | 1 / 2 / 2 | 2 / 3 / 4 | 3 / 5 / 6 | 4 / 6 / 8 |
| Total | 1 / 24 / 29 | 2 / 35 / 41 | 3 / 61 / 73 | 4 / 122 / 145 |
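The table above can be reproduced from the plan parameters. The sketch below rounds each phase up to whole seconds before summing, which matches the published totals; the dictionary simply restates the `pg_rto_plan` presets.

```python
from math import ceil

# (loop, start, inter, fastinter, rise) per plan; values restate the
# pg_rto_plan presets above (ttl/retry/margin/downinter/fall are unused here).
plans = {
    "fast": (5, 15, 1.0, 0.5, 3),
    "norm": (5, 25, 2.0, 1.0, 3),
    "safe": (10, 45, 3.0, 1.5, 3),
    "wide": (20, 95, 4.0, 2.0, 3),
}

for name, (loop, start, inter, fastinter, rise) in plans.items():
    phases = [  # (min, avg, max) per phase, matching the rows of the table
        (0, loop / 2, loop),                          # Failure Detection
        (0, start, start),                            # Restart Timeout
        (0, loop / 2, loop),                          # Standby Detection
        (0, 1, 2),                                    # Lock & Promote
        ((rise - 1) * fastinter,                      # Health Check
         (rise - 1) * fastinter + inter / 2,
         (rise - 1) * fastinter + inter),
    ]
    total = [sum(ceil(phase[i]) for phase in phases) for i in range(3)]
    print(f"{name}: {total[0]} / {total[1]} / {total[2]}")
# Expected output: fast 1/24/29, norm 2/35/41, safe 3/61/73, wide 4/122/145
```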
Comparison with Passive Failure
| Phase | Active Failure (PG crash) | Passive Failure (node down) | Description |
|---|---|---|---|
| Detection Mechanism | Patroni active detection | TTL passive expiry | Active detection discovers failure faster |
| Core Wait | start | ttl | start is usually less than ttl, but requires additional failure detection time |
| Lease Handling | Active release | Passive expiry | Active release is more timely |
| Self-healing Possible | Yes | No | Active detection can attempt local recovery |
RTO Comparison (Average case):
| Mode | Active Failure (PG crash) | Passive Failure (node down) | Difference |
|---|---|---|---|
| fast | 24s | 23s | +1s |
| norm | 35s | 34s | +1s |
| safe | 61s | 66s | -5s |
| wide | 122s | 127s | -5s |
Analysis: In `fast` and `norm` modes, active failure RTO is slightly higher than passive failure because it waits for `primary_start_timeout` (`start`); but in `safe` and `wide` modes, since `start < ttl - loop`, active failure is actually faster. However, active failure also has the possibility of self-healing, with a potentially extremely short RTO in the best case.
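A quick way to see this crossover is to compare `start` against `ttl - loop` for each plan. This is a rough sketch based on the analysis above; it only compares the core wait terms, so it does not reproduce the exact ±1s differences in the table.

```python
# Rough crossover check: active failure waits on `start`, passive failure on
# roughly `ttl - loop`, so the sign of the gap shows which recovers faster.
plans = {  # (ttl, loop, start) per pg_rto_plan preset
    "fast": (20, 5, 15),
    "norm": (30, 5, 25),
    "safe": (60, 10, 45),
    "wide": (120, 20, 95),
}
for name, (ttl, loop, start) in plans.items():
    print(f"{name}: start - (ttl - loop) = {start - (ttl - loop):+d}s")
# fast/norm: +0s (near parity), safe/wide: -5s (active failover is faster)
```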