In today’s IT world, users expect online services to be available 24×7 so that they can access them whenever it’s most convenient. IT failures are inevitable, however, so rather than wondering what might happen if things go wrong, you need a business continuity plan (BCP) that determines what will happen when they do.
Failures can occur for all sorts of reasons, including:
- Infrastructure issues – hardware devices, networks and entire sites can fail
- Software issues – software and web services can fail for a variety of reasons, ranging from memory leaks and thread-safety issues through to poorly tested changes
- Human action – operators might make an error (misconfiguration) or cause a component to become unavailable, for example, by taking it offline for maintenance
When designing cloud applications, it’s important to consider how they will handle these failures and errors. You should engineer your applications to respond gracefully to failures and continue servicing requests.
For example, ask yourself how your applications will behave if:
- The application client can’t connect to the application server
- An application component is unavailable, for example the application can’t connect to its database
- The response generated by a component is not as expected, for example, a postcode lookup to a web service returns an unexpected value (such as an error code)
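One way to handle the situations above is to retry a failing call a few times and then fall back to a safe default instead of crashing. The following is a minimal sketch, not a definitive implementation: `call_with_retry`, its parameters and the choice of exceptions are all hypothetical names and assumptions made for illustration.

```python
import time
from typing import Callable, Optional


def call_with_retry(
    operation: Callable[[], str],
    attempts: int = 3,
    base_delay: float = 0.1,
    fallback: Optional[str] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `operation`, retrying with exponential backoff, then fall back.

    Assumption: the operation signals failure by raising ConnectionError,
    TimeoutError or ValueError (for example, an unexpected response).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError, ValueError):
            # Wait only if another attempt remains: 0.1s, 0.2s, 0.4s, ...
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: degrade gracefully rather than raise.
    return fallback
```

A caller might wrap a postcode lookup in `call_with_retry` and pass a fallback value that tells the user the lookup is temporarily unavailable, so the rest of the application keeps servicing requests.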
This blog highlights some of the main considerations you should think about when designing your applications for high availability.
Eliminate single points of failure
One of the key goals of high availability is to eliminate single points of failure in your infrastructure. A single point of failure is a component in your technology stack that would cause a service interruption if it became unavailable. Any component that your application needs in order to function properly, and that has no redundancy, is therefore a single point of failure.
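As a rough illustration of that definition, a simple inventory check can flag components that run as a single instance. This is only a sketch: the component names and the `find_single_points_of_failure` helper are hypothetical, and a real audit would also need to consider shared dependencies such as power, network and storage.

```python
def find_single_points_of_failure(instances: dict[str, int]) -> list[str]:
    """Return the components that have no redundant instance."""
    return sorted(name for name, count in instances.items() if count < 2)


# Hypothetical inventory: component name -> number of running instances.
stack = {"load_balancer": 2, "web": 3, "database": 1, "message_queue": 1}
print(find_single_points_of_failure(stack))  # ['database', 'message_queue']
```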
Distribute across multiple locations
By distributing your application across multiple locations, you can ensure that if anything goes wrong in one location, such as network connectivity or power issues, service can continue in another. In addition, by selecting geographically diverse locations, you can mitigate more extreme failures, such as a natural disaster, targeted terrorist attack or mass WAN failure.
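The benefit of a second location can be estimated with some simple arithmetic. The sketch below assumes that sites fail independently of each other and that any one healthy site can carry the full service; both are simplifying assumptions, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(site_availability: float, sites: int) -> float:
    """Availability of a service that stays up while at least one site is up.

    Assumes independent site failures, which real deployments only approximate.
    """
    return 1 - (1 - site_availability) ** sites


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR
```

Under these assumptions, two sites that are each available 99% of the time give a combined availability of 99.99%, which reduces expected downtime from roughly 87.6 hours a year to under one hour.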
Evaluate your risk appetite
Building redundancy into your environment to make it highly available can quickly become expensive, so you should carefully consider which risks you’re prepared to accept for some or all of your applications and what budget you have to mitigate the rest. High availability is a balance between risk appetite and cost, so evaluate your specific requirements carefully.
Ask yourself: if your application goes down, how quickly do you need it to come back up? Consider:
- Recovery Point Objective (RPO): how much data can you afford to lose?
- Recovery Time Objective (RTO): how long can you afford for your application to be down?
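These two objectives can be turned into simple checks against your current setup. The sketch below makes two assumptions: the worst-case data loss is roughly one full backup interval, and recovery time is the time to detect the failure plus the time to restore service. The function names are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst case, a failure happens just before the next backup runs,
    so up to one full interval of data could be lost."""
    return backup_interval <= rpo


def meets_rto(detection: timedelta, restore: timedelta, rto: timedelta) -> bool:
    """Downtime is approximated as detection time plus restore time."""
    return detection + restore <= rto


# Nightly backups cannot satisfy a one-hour RPO.
print(meets_rpo(timedelta(hours=24), timedelta(hours=1)))  # False
# Ten minutes to detect plus thirty to restore fits a one-hour RTO.
print(meets_rto(timedelta(minutes=10), timedelta(minutes=30), timedelta(hours=1)))  # True
```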
Monitor the health of your environment
By regularly checking your systems to make sure everything is behaving as it should, you can respond quickly to prevent potential outages before they have a chance to cause disruption. Some things you should be monitoring include:
- Availability – Can your users get to your applications?
- Performance – Are they responding quickly enough?
- Trending information – Are things getting better or worse over time?
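The three measures above can be derived from a series of health-check probes. The following is a minimal sketch: the `Probe` record and `summarise` function are hypothetical, and the trend is simply the difference in mean latency between the later and earlier halves of the successful probes.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Probe:
    ok: bool            # did the health check succeed?
    latency_ms: float   # how long the check took


def summarise(probes: list[Probe]) -> dict[str, float]:
    """Summarise availability, performance and trend from probe results."""
    if not probes:
        raise ValueError("at least one probe is required")
    successful = [p.latency_ms for p in probes if p.ok]
    half = len(successful) // 2
    # A positive trend means the later probes were slower than the earlier ones.
    trend = mean(successful[half:]) - mean(successful[:half]) if half else 0.0
    return {
        "availability": len(successful) / len(probes),
        "mean_latency_ms": mean(successful) if successful else float("nan"),
        "latency_trend_ms": trend,
    }
```

In practice you would feed `summarise` with results collected at regular intervals and alert when availability drops or the latency trend keeps rising.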
Be aware of potential cyber threats
What does your security posture look like? Do you have a strategy in place to counter attempted DDoS attacks? What happens if you’re struck by malware, ransomware or a virus?
Cyber attacks are big news. To avoid being the next headline, make sure you have some form of protective monitoring in place.
An effective protective monitoring service should:
- Detect that you’re being attacked
- Defend against it immediately
- Determine what happened and what impact it had
Consider the worst-case scenario
What happens if an entire site goes down? Does your high availability strategy cater for more catastrophic failures?
You should design your applications to expect an entire site to become unavailable and to gracefully resume operations at an alternative site until the original site becomes available again. The techniques used to deliver high availability can be extended to cover these more catastrophic failures.
There are two primary challenges when running an application actively across two sites:
- Directing users to each of the two sites using solutions such as global load balancers, content delivery networks or DNS-based solutions
- Handling data consistency so that transactions can occur at either site, for example by using application-layer replication technologies or application-layer intelligence
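The first challenge, directing users to a healthy site, can be illustrated with a small router that alternates between sites and skips any that are marked as down. This is only a sketch of the idea behind a global load balancer: the `SiteRouter` class and the site names are hypothetical, and it does not address the data consistency challenge.

```python
class SiteRouter:
    """Round-robin across sites, skipping any that are marked unhealthy."""

    def __init__(self, sites: list[str]) -> None:
        self._sites = list(sites)
        self._healthy = set(self._sites)
        self._next = 0

    def mark(self, site: str, healthy: bool) -> None:
        """Record the result of a health check for a site."""
        if healthy:
            self._healthy.add(site)
        else:
            self._healthy.discard(site)

    def route(self) -> str:
        """Return the next healthy site, trying each site at most once."""
        for _ in range(len(self._sites)):
            site = self._sites[self._next % len(self._sites)]
            self._next += 1
            if site in self._healthy:
                return site
        raise RuntimeError("no healthy site available")


router = SiteRouter(["site-a", "site-b"])
print([router.route() for _ in range(4)])  # ['site-a', 'site-b', 'site-a', 'site-b']
router.mark("site-a", healthy=False)
print([router.route() for _ in range(2)])  # ['site-b', 'site-b']
```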
Some final thoughts on how you can design your application architecture for high availability:
- Design your cloud applications to be loosely coupled so that, wherever possible, application components can continue to operate even if other application components fail or are unavailable
- Implement message queuing to enable components to interact with each other by submitting messages via a message queue, which is processed by the first available worker
- Make sure you have a robust backup strategy that takes a copy of your application or database data at appropriate intervals to meet your Recovery Point Objective (RPO) and stores the copy somewhere outside of your current site
- Implement an N+1 model for web, application and database services to ensure you always have enough resources available
- Distribute incoming traffic across several instances using load balancing
- Store data redundantly and separately from the instance so that, if an instance fails, the storage can be attached to a new one
- Store objects redundantly on multiple devices across multiple facilities and use versioning to protect against accidental deletion
- Consider how any additional tooling available in your environment could support your high availability strategy
- For Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS), you have less control over the availability of your applications, so when choosing a provider of these services you should pay particular attention to the HA features and service level agreement (SLA) offered to make sure they meet your requirements
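The message queuing point in the list above can be sketched with Python's standard `queue` and `threading` modules. This is an in-process illustration only; a production system would typically use a dedicated message broker. The `run_workers` function and the upper-casing "work" are hypothetical stand-ins.

```python
import queue
import threading


def run_workers(messages: list[str], worker_count: int = 2) -> list[tuple[str, str]]:
    """Process messages from a shared queue using the first available worker."""
    tasks: queue.Queue = queue.Queue()
    results: list[tuple[str, str]] = []
    lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            message = tasks.get()  # blocks until a message is available
            try:
                if message is None:  # sentinel: no more work for this worker
                    return
                with lock:
                    results.append((name, message.upper()))
            finally:
                tasks.task_done()

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for message in messages:
        tasks.put(message)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for thread in threads:
        thread.join()
    return results
```

Because producers only talk to the queue, they do not need to know which worker handles a message or how many workers are running, which is what keeps the components loosely coupled.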
The multi-cloud experts
It’s important to note that there is no one-size-fits-all design. If you’re looking to implement a BCP or improve the availability of your applications, you can talk to UKCloud’s multi-cloud experts, who can help you design the best high availability solution for your requirements.