High Availability (HA) is a fantastic topic. It's often an overlooked topic in the world of performance. It's not as sexy as low latency nor as impressive as high throughput. Yet so many business processes can be significantly adversely impacted by an unavailable application. This is especially true in auto pricing systems employed by banks. Downtime for them means they're not competing in the market and they're not reaching their customer base. This not only represents a missed opportunity but in the more severe cases, loss of business and reputational damage.
Let me share some thoughts on how to think about HA.
Disclaimer: This is a simplified view of the world and considers a single failure in a pair of nodes running on separate servers only. This does not discuss network availability, platform issues such as garbage collection, persistent stores such as databases, or the fault tolerance of the consumers with regards to stream recovery.
Overview of available high availability mechanisms
Generally people speak of HA in terms of 4 policies:
The active/passive part denotes whether the node is doing processing or not. The hot/warm/cold part denotes how active a node is. Cold are inactive nodes that are not running. Warm nodes are running but may or may not be active. Hot nodes are always active.
hot-active/cold-passive: Upon failure the secondary is started, which in addition to booting up will need to load the last good state of the primary process before it can become active. This tends to result in more latent failovers but is one of the simplest HA mechanisms to get going with and frankly, is often one of the most reliable.
hot-active/warm-passive: When failover occurs the secondary must load the last good state of the failed primary process before it can become the active. Failover occurs with less latency than the previous policy because the secondary has already booted up.
hot-active/warm-active: The secondary is running and doing processing but either emitting no output or being ignored by consumers until a failover. When failover occurs the secondary node already has the last good state (thanks to being active) and so only needs to become hot, either by starting to emit output or by the consumers recognising the secondary as the hot node. Again, this improves upon latency of the previous policy because there's no need to boot up or load the last good state.
hot-active/hot-active: Often this accompanied by either load balancing or deduplication on the consumer. The idea is that failover occurs immediately with the lowest latency possible. Users typically shouldn't notice any outage. Imagine you've sent a request, both nodes receive it, process it and emit a response. The requester must then deduplicate the responses, this is fine so long as the responses are always identical. If the response is formed by an aggregation, random or time sensitive computation then this can become difficult and consistency of the response needs considering when deduplicating (My favourite paper that help reason about this is: High-Availability Algorithms for Distributed Stream Processing)
Detecting failure and triggering failover
Detecting a failure is a more significant topic than I've the patience to go into, but fundamentally there are a few simple ways to detect a failure.
- Is the process running? A monitoring process checks the process list periodically, if the application is not running, there's been a failure.
- Is the process heartbeating? The process on a periodic interval will send a heartbeat, this may take the form of a message that's broadcast, a file that's touched or even a database table that is updated. The monitor waits for a sensible period longer than the heartbeat interval (to allow for natural variance in sending and receiving the heartbeat) before marking the process as failed.
- Is the process in a good state? This is much easier to say than it is to do. A simple (read as naive) way to accomplish this is to allow the process to send a status message that contains it's own assessment of factors contributing to it's health. This could be anything from a flag to a list of logical status' of all subsystems. This does not allow you to detect deeper errors (i.e. a subsystem has failed but the process fails to determine this). Often the simplest way of determining if a process is in a good state is by humans inspecting the process and it's output such as logs.
Most of the systems I have worked on require the standby node(s) to observe the active, and upon detecting the failure makes a decision on how to failover then triggers it automatically. That said, this is not the only way to approach this. I see there being three main triggering mechanisms:
- Humans. Don't under estimate the value in having a human inspect the environment, determine it's health then trigger a failover manually (uh, in this world where I'm (optimistically) imagining humans without their propensity to err).
- Supervisor/Monitor processes. These processes are dedicated to watching the health of the HA group nodes. When it detects an unhealthy node the common approach is to kill it off and either start the secondary (hot/cold) or to send a signal to the secondary node to take over (hot/warm). In the hot/warm case this means that the process must have been developed with a component that can understand and act on the signal.
- Automatic standby failover. Here the standby processes monitor the active and take over on failure. This typically combines the a detection and a triggering mechanism and requires your application to have been developed with an embedded component to support both. This isn't suitable for hot/cold policies since the standby isn't running.
Reasoning about the performance of your high availability mechanism
Non-functional requirements for HA are usually expressed as.
The system must not be unavailable for more than a period in a period
People often speak of a a percentage of time available, e.g. 99.9% (also known as three nines - wikipedia) but this boils down to the same expression. Occasionally the more liberal requirement is stated:
The system must recover from a failure within period
These requirements relate to one another such that in any period the number of failures a system can experience, can be expressed as:
e.g. if you cannot be unavailable for more than 10 minutes in a 60 minute period and your system can recover within 2 minutes this means the total number of failures you can handle in any 60 minute interval is:
The time it takes to failover, can be further decomposed into the time it takes to detect a failure, and the time it takes to failover, .
If you're detecting failure using a process list then the interval you should refresh your list at is . If you're using a heartbeat then your last heartbeat must have been received within the period (this means your heartbeat interval itself will be , I'd recommend the interval should typically be ).
The time it takes to failover depends on the HA policy. If we decompose this into the time it takes to boot up, and the time it takes to initialise the last known good state of the active, . Then the time it takes to failover becomes:
Then the following summaries can be considered true
- hot-active/cold-passive; , and
- hot-active/warm-passive; and but
- hot-active/warm-active; but
Note the hot-active/hot-active policy where I've made the assumption that responses to a request are always identical and the consumers can safely deduplicate them. If we consider the case where they may not be identical then a load balancer or queue is required in front of the HA group. With a load balancer . In this case you will need to consider how long it takes for the load balancer to detect a failure and remove it from it's routes.
This could mean your value of (the time it takes to failover) in hot-active/hot-active may be higher or similar to an hot-active/warm-active policy. In spite of this the hot-active/hot-active policy is very good when your system would otherwise be overloaded and thus become unavailable since the hot-active/warm-active would be insufficient to cope with this scenario.
In addition to knowing how long your system may be unavailable over a period you should consider how many simultaneous failures your system can tolerate, this capability should be stated:
The system remains highly available for simultaneous failures
So far what I've spoke about assumes that it can cope with a single process failure. I have not considered more practical elements such as how the network or network topology affects the HA. Here for instance a using a single network switch can violate this requirement - if it blows then your entire system blows.
Nor have I discussed how to cope with failure in the monitoring processes or how to consider disaster recovery (where your entire server site is unavailable for one reason or another).