HA Drill: 2/3 Failure

How do you recover from an emergency where two of the three nodes in a 3-node setup are broken?

When two nodes (majority) in a classic 3-node HA deployment fail simultaneously, automatic failover becomes impossible. Time for some manual intervention - let’s roll up our sleeves!

First, assess the status of the failed nodes. If they can be restored quickly, prioritize bringing them back online. Otherwise, initiate the Emergency Response Protocol.

The Emergency Response Protocol assumes your management node is down, leaving only a single database node alive. In this “last man standing” scenario, here’s the fastest recovery path:

  • Adjust HAProxy configuration to direct traffic to the primary
  • Stop Patroni and manually promote the PostgreSQL replica to primary

Adjusting HAProxy Configuration

If you’re accessing the cluster through means other than HAProxy, you can skip this part (lucky you!). If you’re using HAProxy to access your database cluster, you’ll need to adjust the load balancer configuration to manually direct read/write traffic to the primary.

  • Edit /etc/haproxy/<pg_cluster>-primary.cfg, where <pg_cluster> is your PostgreSQL cluster name (e.g., pg-meta)
  • Comment out health check configurations
  • Comment out the server entries for the two failed nodes, keeping only the current primary
listen pg-meta-primary
    bind *:5433
    mode tcp
    maxconn 5000
    balance roundrobin
    # Comment out these four health check lines
    #option httpchk                               # <---- remove this
    #option http-keep-alive                       # <---- remove this
    #http-check send meth OPTIONS uri /primary    # <---- remove this
    #http-check expect status 200                 # <---- remove this
    default-server inter 3s fastinter 1s downinter 5s rise 3 fall 3 on-marked-down shutdown-sessions slowstart 30s maxconn 3000 maxqueue 128 weight 100
    server pg-meta-1 10.10.10.10:6432 check port 8008 weight 100
    # Comment out the failed nodes
    #server pg-meta-2 10.10.10.11:6432 check port 8008 weight 100    # <---- comment this
    #server pg-meta-3 10.10.10.12:6432 check port 8008 weight 100    # <---- comment this

Don’t rush to systemctl reload haproxy just yet - we’ll do that after promoting the primary. This configuration bypasses Patroni’s health checks and directs write traffic straight to our soon-to-be primary.
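Before that eventual reload, it's worth validating the edited file. A minimal sketch, assuming HAProxy loads /etc/haproxy/haproxy.cfg plus the per-cluster configs (adjust the -f arguments to match how your haproxy service actually loads its configuration):

haproxy -c -f /etc/haproxy/haproxy.cfg -f /etc/haproxy/pg-meta-primary.cfg   # check-only mode, nothing is reloaded

A zero exit status means the configuration parses cleanly.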


Manual Replica Promotion

SSH into the target server, switch to the dbsu (postgres) user, execute a CHECKPOINT to flush dirty buffers to disk, stop Patroni, restart PostgreSQL, and perform the promotion:

sudo su - postgres                        # Switch to database dbsu user
psql -c 'checkpoint; checkpoint;'         # Double CHECKPOINT for good luck (and clean buffers)
sudo systemctl stop patroni               # Bid farewell to Patroni
pg-restart                                # Restart PostgreSQL
pg-promote                                # Time for a promotion!
psql -c 'SELECT pg_is_in_recovery();'     # 'f' means we're primary - mission accomplished!
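The pg-restart and pg-promote shortcuts above are convenience aliases on Pigsty-managed nodes. If they're not available in your environment, a rough equivalent using plain pg_ctl (a sketch, assuming the default /pg/data data directory, run as the dbsu user) looks like this:

pg_ctl -D /pg/data restart                # Restart PostgreSQL outside of Patroni's control
pg_ctl -D /pg/data promote                # Promote the running replica to primary
psql -c 'SELECT pg_is_in_recovery();'     # Expect 'f' once promotion has completed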

If you modified the HAProxy config earlier, now’s the time to systemctl reload haproxy and direct traffic to our new primary.

systemctl reload haproxy # Route write traffic to our new primary

Preventing Split-Brain

After stopping the bleeding, priority #2 is: Prevent Split-Brain. We need to ensure the other two servers don’t come back online and start a civil war with our current primary.

The simple approach:

  • Pull the plug (power/network) on the other two servers - ensure they can’t surprise us with an unexpected comeback
  • Update application connection strings to point directly to our lone survivor primary
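What the connection-string change might look like, as a purely hypothetical example (placeholder user, password, and database; 10.10.10.12 stands in for whichever node survived):

# Before: via the HAProxy primary port from the example above
# postgres://dbuser_app:changeme@10.10.10.10:5433/app
# After: straight at the surviving primary
psql 'postgres://dbuser_app:changeme@10.10.10.12:5432/app' -c 'SELECT 1;'   # sanity check against the survivor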

Next steps depend on your situation:

  • A: The two servers have temporary issues (network/power outage) and can be restored in place
  • B: The two servers are permanently dead (hardware failure) and need to be decommissioned

Recovery from Temporary Failure

If the other two servers can be restored, follow these steps:

  • Handle one failed server at a time, prioritizing the management/INFRA node
  • Start the failed server and immediately stop Patroni
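Per restored server, the sequence looks roughly like this (a sketch, assuming the systemd units are named patroni and etcd, as in a default Pigsty install):

systemctl stop patroni      # keep Patroni from acting on its own right after boot
systemctl start etcd        # bring this ETCD member back so quorum can be restored
systemctl status etcd       # confirm the member is up before touching PostgreSQL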

Once ETCD quorum is restored, start Patroni on the surviving server (current primary) to take control of PostgreSQL and reclaim cluster leadership. Put Patroni in maintenance mode:

systemctl restart patroni
pg pause <pg_cluster>

On the other two instances, create a /pg/data/standby.signal file as the postgres user (e.g., touch /pg/data/standby.signal) to mark them as replicas, then start Patroni:

systemctl restart patroni
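To check the cluster view before leaving maintenance mode, a quick look at the member list helps (a sketch, assuming pg is Pigsty's patronictl alias and the cluster is named pg-meta):

pg list pg-meta    # expect one Leader (the survivor) and the restored nodes as Replicas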

After confirming Patroni cluster identity/roles are correct, exit maintenance mode:

pg resume <pg_cluster>

Recovery from Permanent Failure

After permanent failure, first recover the ~/pigsty directory on the management node - particularly the crucial pigsty.yml and files/pki/ca/ca.key files.

No backup of these files? You might need to deploy a fresh Pigsty and migrate your existing cluster over via a backup cluster.

Pro tip: Keep your pigsty directory under version control (Git). Learn from this experience - future you will thank present you.
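A minimal way to do that, as a sketch (the remote URL is a placeholder; whatever you use, keep it private, since files/pki/ca/ca.key is a secret):

cd ~/pigsty
git init && git add . && git commit -m 'snapshot pigsty config'
git remote add origin git@git.example.com:ops/pigsty.git   # placeholder private remote
git push -u origin HEAD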

Config Repair

Use your surviving node as the new management node. Copy the ~/pigsty directory there and adjust the configuration. For example, replace the default management node 10.10.10.10 with the surviving node 10.10.10.12:

all:
  vars:
    admin_ip: 10.10.10.12          # New management node IP
    node_etc_hosts: [10.10.10.12 h.pigsty a.pigsty p.pigsty g.pigsty sss.pigsty]
    infra_portal: {}               # Update other configs referencing old admin_ip
  children:
    infra:                         # Adjust Infra cluster
      hosts:
        # 10.10.10.10: { infra_seq: 1 }    # Old Infra node
        10.10.10.12: { infra_seq: 3 }      # New Infra node
    etcd:                          # Adjust ETCD cluster
      hosts:
        #10.10.10.10: { etcd_seq: 1 }      # Comment out failed node
        #10.10.10.11: { etcd_seq: 2 }      # Comment out failed node
        10.10.10.12: { etcd_seq: 3 }       # Keep survivor
      vars:
        etcd_cluster: etcd
    pg-meta:                       # Adjust PGSQL cluster config
      hosts:
        #10.10.10.10: { pg_seq: 1, pg_role: primary }
        #10.10.10.11: { pg_seq: 2, pg_role: replica }
        #10.10.10.12: { pg_seq: 3, pg_role: replica , pg_offline_query: true }
        10.10.10.12: { pg_seq: 3, pg_role: primary , pg_offline_query: true }
      vars:
        pg_cluster: pg-meta

ETCD Repair

Reset ETCD to a single-node cluster:

./etcd.yml -e etcd_safeguard=false -e etcd_clean=true
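Afterwards, sanity-check the now single-node cluster, for example (a sketch, assuming the etcdctl endpoint and certificate environment variables that Pigsty normally sets up for the admin user are in place):

etcdctl member list       # should list exactly one member: the survivor
etcdctl endpoint health   # the single endpoint should report healthy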

Follow ETCD Config Reload to adjust ETCD endpoint references.

INFRA Repair

If the surviving node lacks the INFRA module, configure and install it:

./infra.yml -l 10.10.10.12

Fix monitoring on the current node:

./node.yml -t node_monitor

PGSQL Repair

./pgsql.yml -t pg_conf      # Regenerate PG config
systemctl reload patroni    # Reload Patroni config on survivor

After module repairs, follow the standard scale-out procedure to add new nodes and restore HA.
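For reference, scaling back out is just the usual playbooks limited to the replacement hosts. A sketch with hypothetical new-node IPs (add them to the inventory first; growing ETCD back to three members follows the ETCD administration procedure):

./node.yml  -l 10.10.10.13,10.10.10.14    # bring the new servers under management
./pgsql.yml -l 10.10.10.13,10.10.10.14    # deploy PostgreSQL replicas on them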




