Failure Model

Detailed analysis of worst-case, best-case, and average RTO calculation logic and results across the two classic failure detection/recovery paths

Patroni failures can be classified into 10 categories by failure target, and further consolidated into five categories based on detection path, which are detailed in this section.

| # | Failure Scenario | Description | Final Path |
|---|------------------|-------------|------------|
| 1 | PG process crash | Crash, OOM killed | Active Detection |
| 2 | PG connection refused | `max_connections` reached | Active Detection |
| 3 | PG zombie | Process alive but unresponsive | Active Detection (timeout) |
| 4 | Patroni process crash | `kill -9`, OOM | Passive Detection |
| 5 | Patroni zombie | Process alive but stuck | Watchdog |
| 6 | Node down | Power outage, hardware failure | Passive Detection |
| 7 | Node zombie | IO hang, CPU starvation | Watchdog |
| 8 | Primary ↔ DCS network failure | Firewall, switch failure | Network Partition |
| 9 | Storage failure | Disk failure, disk full, mount failure | Active Detection or Watchdog |
| 10 | Manual switchover | Switchover/Failover | Manual Trigger |

However, for RTO calculation purposes, all failures ultimately converge to two paths. This section explores the upper bound, lower bound, and average RTO for these two scenarios.
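
As a rough decomposition (the stage names here are illustrative notation, not Patroni terminology), the end-to-end RTO on the failover path can be written as a sum of stages:

```latex
T_{\text{RTO}} \;=\; T_{\text{detect}} \;+\; T_{\text{lock}} \;+\; T_{\text{promote}} \;+\; T_{\text{route}}
```

where $T_{\text{detect}}$ is the time to notice the failure (one HA-loop tick on the active path, TTL expiry on the passive path), $T_{\text{lock}}$ covers leader-lock release and replica election, $T_{\text{promote}}$ is the promotion itself, and $T_{\text{route}}$ is the time for HAProxy to detect the new primary and redirect traffic.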

```mermaid
flowchart LR
    A([Primary Failure]) --> B{Patroni<br/>Detected?}

    B -->|PG Crash| C[Attempt Local Restart]
    B -->|Node Down| D[Wait TTL Expiration]

    C -->|Success| E([Local Recovery])
    C -->|Fail/Timeout| F[Release Leader Lock]

    D --> F
    F --> G[Replica Election]
    G --> H[Execute Promote]
    H --> I[HAProxy Detects]
    I --> J([Service Restored])

    style A fill:#dc3545,stroke:#b02a37,color:#fff
    style E fill:#198754,stroke:#146c43,color:#fff
    style J fill:#198754,stroke:#146c43,color:#fff
```

Model of Patroni Passive Failure

The failover path triggered by a node crash: the leader lease expires in the DCS and the cluster elects a new primary.
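
A minimal sketch of the passive-path bounds, assuming the leader lease is renewed once per `loop_wait` and using Patroni's default `ttl=30` / `loop_wait=10`; the promotion and load-balancer detection costs (`t_promote`, `t_haproxy`) are illustrative assumptions, not measured values:

```python
# Passive-path RTO bounds. Parameter names follow Patroni defaults
# (ttl=30, loop_wait=10); t_promote and t_haproxy are assumed values
# for illustration only.

def passive_rto(ttl=30, loop_wait=10, t_promote=2, t_haproxy=3):
    """Return (best, worst, average) RTO in seconds for the passive path.

    The leader lease is renewed every loop_wait seconds, so at the
    moment of a node crash the residual TTL is uniformly distributed
    in [ttl - loop_wait, ttl].  After the lease expires, a replica
    wins the election, promotes, and HAProxy re-checks the backend.
    """
    tail = t_promote + t_haproxy          # election + promote + LB detection
    best = (ttl - loop_wait) + tail       # crash just before the next renewal
    worst = ttl + tail                    # crash right after a renewal
    avg = (ttl - loop_wait / 2) + tail    # mean residual TTL
    return best, worst, avg

print(passive_rto())   # → (25, 35, 30.0)
```

The worst case hits the full `ttl` because the crash can land immediately after a lease renewal, so the lock survives for the entire TTL before any replica may act.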

Model of Patroni Active Failure

The PostgreSQL primary process crashes while Patroni stays alive; Patroni first attempts a local restart, and triggers failover only after the restart times out.
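
A corresponding sketch for the active path. `loop_wait` and `primary_start_timeout` (formerly `master_start_timeout`) are real Patroni settings, but the restart and failover durations used here are assumed values for illustration:

```python
# Active-path RTO bounds: Patroni notices the dead postmaster within one
# HA-loop tick (loop_wait), tries a local restart, and fails over only if
# the restart exceeds primary_start_timeout.  t_restart and t_failover
# are illustrative assumptions.

def active_rto(loop_wait=10, t_restart=5, restart_ok=True,
               primary_start_timeout=300, t_failover=10):
    """Return (best, worst) RTO in seconds for the active path.

    Best case: the crash is caught immediately after it happens.
    Worst case: the crash lands just after a loop tick, adding a
    full loop_wait of detection delay.
    """
    if restart_ok:
        best = t_restart                                   # local recovery
        worst = loop_wait + t_restart
    else:
        best = primary_start_timeout + t_failover          # restart hangs
        worst = loop_wait + primary_start_timeout + t_failover
    return best, worst

print(active_rto())                    # → (5, 15)
print(active_rto(restart_ok=False))    # → (310, 320)
```

The asymmetry between the two branches is why a failed local restart dominates the active-path worst case: a generous `primary_start_timeout` keeps the cluster waiting on a primary that may never come back.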


Last Modified 2026-02-25: update homepage layout (a2dde14)