Database auto-failover with Heimdall proxy

Abeer Alotaibi
4 min readSep 20, 2020

We are using PostgreSQL for databases and we trying to design the architecture on high available mode with minimum changes on the application side and better performance on both the application and database sides. As a first step to design PostgreSQL on HA mode I have set up streaming replication (Master-Slaver), but this wasn’t enough, we should consider automating Failover in case of any failure is happened, a primary can become unavailable for a few reasons.

For example:

1- The operating system of the primary node can crash or become unresponsive.

2- The primary node can lose its network connectivity.

3- The PostgreSQL service in the primary node can crash, stop, or become unavailable unexpectedly.

4- The PostgreSQL service in the primary node can be stopped intentionally or accidentally.

Whenever a primary becomes unavailable, a standby does not automatically promote itself to the primary role. A standby continues to serve read-only queries, although the data will be current up to the last Log Sequence Number received from the primary. Any attempt for a write operation will fail. I started searching for Auto-Failover solutions, and I found a lot( PG-pool, pg_auto_failover extension, rempmgr, Heimdall Data proxy, and some third-party tools require to connect our local cloud to the provider’s cloud), the start was with pg_auto_failover, the flow of DB requests (shown on below diagram) The application runs SQL queries directly against the primary, with the slaver specified as a fallback connection in the application, this need to made some changes on the code level.

PG_auto_failover

Therefore, I excluded pg_auto_failover and any tool that will require me to do any change on the application side as we were working on major big changes on our application and it wasn’t the right time to made any extra changes to the code but the DB auto-failover is a must at the same time.

The reasons why I wanted to avoid application side changes are:

1- I wanted to save our super developers' efforts to focus on application functionalities refactoring without making any changes to the code related to adding fallback SQL instructions and so on.

2- After deploying our major change along with DB auto-failover solution the monitoring, troubleshooting, or rollback (either to rollback the application or DB changes)will be easier for us.

My superhero in this situation was Heimdall Data Proxy!

Heimdall Data is a solution that automated database failover and provides the application resiliency necessary for maximum uptime. The key benefit of using Heimdall is NO application or database code changes, scaling out the database tier always requires rewriting the application, not anymore with Heimdall Data! Heimdall is easy to install & configure, I installed it on Ubuntu18.04, and I found that many database types are supported: PostgreSQL, MySQL -Aurora, MariaDB, Greenplum- Oracle, and SQL server.

Heimdall Dashboard

After configuring Heimdall proxy (all from the UI), Heimdall by creating its own scheme on your database will continuously monitor replication and take recovery action when the primary is unavailable. The failover logic can be done via state change scripting, in my case, I added a single line on Heimdall statechange-pg.sh scripts for promotion, simply I followed PostgreSQL slave promotion official instructions.

promote_cmd = ‘ssh pg-slave "touch /tmp/postgresql.trigger.5432; pg_ctl promote -D /var/lib/postgresql/10/main" > ‘ + logfile
Heimdall Dashboard

To test the Auto-fallback with Heimdall , we causes a failure on primary database, after Heimdall detect the primary is not reachable anymore, the state change script is triggered and the slaver promotion is successfully done.

Note: If the primary server fails and the standby server becomes the new primary, and then the old primary restarts Heimdall by default will think the old primary is back to serve as primary again, you must have a STONITH (Shoot The Other Node In The Head) mechanism for informing the old primary that it is no longer the primary, which is necessary to avoid situations where both systems think they are the primary, which will lead to confusion and ultimately data loss.

My focus was to find a solution to automate the process of failover on PostgreSQL, but I found that Heimdall could increase the overall performance of the application by providing:

  • Automated caching.
  • Automated DB failover.
  • Read-Write split.
  • Connection pooling.
  • Query analysis.

Sign up to discover human stories that deepen your understanding of the world.

Free

Distraction-free reading. No ads.

Organize your knowledge with lists and highlights.

Tell your story. Find your audience.

Membership

Read member-only stories

Support writers you read most

Earn money for your writing

Listen to audio narrations

Read offline with the Medium app

No responses yet

Write a response