AWS Disaster Planning

Disaster recovery (DR) is about preparing for and recovering from any event that has a negative impact on a company’s business continuity or finances. This includes hardware or software failure, a network outage, a power outage, physical damage to a building, human error, or natural disasters.

To minimize the impact of a disaster, companies invest time and resources to plan and prepare, train employees, and document and update processes. Companies that have traditional environments duplicate their infrastructure to ensure the availability of spare capacity. The infrastructure is under-utilized or over-provisioned during normal operations. AWS gives you the flexibility to optimize resources during a DR event, which can result in significant cost savings.

Disaster recovery plan failure

Not all Disaster Recovery (DR) plans are created equal, and many fail. Testing, resources, and planning are vital components of a successful DR plan.

  • Testing – Test your DR plan to validate the implementation. Regularly test failover to your workload’s DR Region to ensure that you are meeting recovery objectives. Avoid developing recovery paths that you rarely run.
  • Resources – Regularly run your recovery path in production. This will validate the recovery path and help you verify that resources are sufficient for operation throughout the event.
  • Planning – The only recovery that works is the path you test frequently. The capacity of the secondary resources, which might have been sufficient when you last tested, may no longer be able to tolerate your load. This is why it is best to have a small number of recovery paths. Establish recovery patterns and regularly test them.

Failover and Regions

AWS is available in multiple Regions around the globe, so you can choose the most appropriate location for your DR site in addition to the Region where your system is fully deployed. It is highly unlikely for a Region to become unavailable, but it is possible if a very large-scale event, such as a natural disaster, impacts an entire Region.

AWS maintains a page that inventories current products and services offered by Region. AWS maintains a strict Region isolation policy so that any large-scale event in one Region will not impact any other Region. We encourage our customers to take a similar multi-Region approach to their strategy. Each Region should be able to be taken offline with no impact to any other Region.

Recovery point objective (RPO) and Recovery time objective (RTO)

RECOVERY POINT OBJECTIVE (RPO)

Recovery Point Objective (RPO) is the acceptable amount of data loss measured in time. 

For example, if a disaster occurs at 1:00 p.m. (13:00) and the RPO is 12 hours, the system should recover all data that was in the system before 1:00 a.m. (01:00) that day. Data loss will, at most, span the 12 hours between 1:00 a.m. and 1:00 p.m.


RECOVERY TIME OBJECTIVE (RTO)

Recovery Time Objective (RTO) is the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). 

For example, if a disaster occurs at 1:00 p.m. (13:00) and the RTO is 1 hour, the DR process should restore the business process to the acceptable service level by 2:00 p.m. (14:00).

A company typically decides on an acceptable RPO and RTO based on the financial impact to the business when systems are unavailable. The company determines financial impact by considering many factors, such as the loss of business and damage to its reputation due to downtime and the lack of systems availability.

IT organizations plan solutions to provide cost-effective system recovery based on the RPO within the timeline and the service level established by the RTO.

Essential AWS services and features for DR

Before discussing the various approaches to DR, it is important to review the AWS services and features that are the most relevant to it. This section provides a summary. 

When planning for DR, it is important to consider the services and features that support data migration and durable storage. For some of the scenarios that involve either a scaled-down or a fully scaled deployment of your system in AWS, compute resources will be required as well. 

During a disaster, you need to either provision new resources or fail over to existing preconfigured resources. These resources include code and content, but they can also include other pieces, such as Domain Name System (DNS) entries, network firewall rules, and virtual machines or instances. The sections below describe the essential AWS services and features for DR.

AWS Backup

AWS Backup is a fully managed backup service that makes it easy to centralize and automate the backup of data across AWS services. AWS Backup also helps customers support their regulatory compliance obligations and meet business continuity goals. 

AWS Backup works with AWS Organizations. It centrally deploys data protection policies to configure, manage, and govern your backup activity across your AWS accounts and resources. This includes Amazon EC2 instances and Amazon EBS volumes. You can back up databases such as DynamoDB tables, Amazon DocumentDB and Amazon Neptune graph databases, and Amazon RDS databases, including Aurora database clusters. You can also back up Amazon EFS, Amazon S3, Storage Gateway volumes, and all versions of Amazon FSx, including FSx for Lustre and FSx for Windows File Server. 
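For example, a backup plan can be defined programmatically. The following is a minimal sketch using the AWS SDK for Python (boto3); the vault name, plan name, schedule, tag key, and IAM role ARN are hypothetical placeholders, not values from this post.

```python
import boto3

backup = boto3.client("backup")

# Create a vault to hold recovery points (name is a placeholder).
backup.create_backup_vault(BackupVaultName="dr-vault")

# Daily backups, moved to cold storage after 30 days, deleted after 365 days.
plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "daily-dr-plan",
        "Rules": [
            {
                "RuleName": "daily-0300-utc",
                "TargetBackupVaultName": "dr-vault",
                "ScheduleExpression": "cron(0 3 * * ? *)",
                "StartWindowMinutes": 60,
                "CompletionWindowMinutes": 360,
                "Lifecycle": {
                    "MoveToColdStorageAfterDays": 30,
                    "DeleteAfterDays": 365,
                },
            }
        ],
    }
)

# Select resources by tag; the tag key/value and role ARN are assumptions.
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "tagged-resources",
        "IamRoleArn": "arn:aws:iam::123456789012:role/BackupServiceRole",
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "backup",
                "ConditionValue": "true",
            }
        ],
    },
)
```

Tag-based selection keeps the plan hands-off: any new resource tagged backup=true is picked up automatically.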

Backup and restore example

In most traditional environments, data is backed up to tape and sent offsite regularly. If you use this method, it can take a long time to restore your system in the event of a disruption. Amazon S3 is an ideal destination for quick access to your backup. Transferring data to and from Amazon S3 is typically done through the network and is therefore accessible from any location. You can also use a lifecycle policy to move older backups to progressively more cost-efficient storage classes over time.
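As a sketch of such a lifecycle policy, the following boto3 call transitions older objects to cheaper storage classes and eventually expires them; the bucket name, prefix, and day counts are assumptions chosen for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Bucket name, prefix, and transition ages are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-dr-backups",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-backups",
                "Filter": {"Prefix": "backups/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```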

If the remote server fails, you can restore services by deploying a disaster recovery VPC. Use CloudFormation to automate deployment of core networking. Create an EC2 instance using an AMI that matches your remote server. Then restore your systems by retrieving your backups from Amazon S3. Finally, adjust DNS records to point to AWS.
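A minimal boto3 sketch of that restore sequence follows; the stack name, template URL, AMI ID, subnet, hosted zone ID, and record name are all hypothetical, and error handling is omitted.

```python
import boto3

cfn = boto3.client("cloudformation")
ec2 = boto3.client("ec2")
route53 = boto3.client("route53")

# 1. Deploy the core networking (VPC, subnets, routing) from a template.
cfn.create_stack(
    StackName="dr-network",
    TemplateURL="https://example-bucket.s3.amazonaws.com/dr-vpc.yaml",
)
cfn.get_waiter("stack_create_complete").wait(StackName="dr-network")

# 2. Launch an instance from an AMI that matches the failed server.
reservation = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",
)
instance_id = reservation["Instances"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# 3. Point DNS at the recovered instance. Restoring data from Amazon S3 onto
#    the instance would happen separately, for example via user data scripts.
desc = ec2.describe_instances(InstanceIds=[instance_id])
ip = desc["Reservations"][0]["Instances"][0]["PublicIpAddress"]  # assumes a public IP
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEFGHIJ",
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": ip}],
                },
            }
        ]
    },
)
```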

Disaster Recovery (DR) Architectures on AWS

Review the sections below to learn more about the pilot light, low-capacity standby, and multi-site active-active disaster recovery architectures.

Pilot Light

With the pilot light approach, you replicate your data from one environment to another and provision a copy of your core workload infrastructure.

PILOT LIGHT RECOVERY

When disaster strikes, the servers in the recovery environment start up and then Route 53 begins sending them production traffic. The essential infrastructure pieces include DNS, networking features, and various Amazon EC2 features.
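As an illustrative sketch of that recovery step with boto3, the following starts the pre-provisioned servers in a recovery Region; the Region name and instance IDs are assumptions.

```python
import boto3

# Recovery-Region client; the Region and instance IDs are placeholders.
ec2 = boto3.client("ec2", region_name="us-west-2")

# Start the pilot-light application servers that normally sit stopped,
# and wait until they reach the running state before shifting traffic.
instance_ids = ["i-0123456789abcdef0", "i-0fedcba9876543210"]
ec2.start_instances(InstanceIds=instance_ids)
ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

# Once the servers are up, a Route 53 record update (as in the
# backup-and-restore sketch above) points production traffic at them.
```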

Low-capacity standby

Low-capacity standby (also called warm standby) is similar to pilot light. It involves creating a scaled-down, but fully functional, copy of your production environment in a recovery environment. By identifying your business-critical systems, you can fully duplicate these systems on AWS and have them always on. This decreases the time to recovery because you do not have to wait for resources in the recovery environment to start up.

If the production environment is unavailable, Route 53 switches over to the recovery environment, which automatically scales its capacity out in the event of a failover from the primary system. 

For your business-critical workloads, which are already running in the low-capacity standby environment, the RTO is only as long as it takes to fail over. For all other workloads, the RTO also includes the time it takes to scale up. The RPO depends on the replication type.
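A minimal boto3 sketch of scaling out the standby environment during failover follows; the Auto Scaling group name, Region, and capacity numbers are assumptions.

```python
import boto3

# Recovery-Region Auto Scaling client (Region and group name are placeholders).
autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# Grow the scaled-down standby fleet to full production capacity.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="warm-standby-web",
    MinSize=4,
    DesiredCapacity=8,
    MaxSize=16,
)

# Route 53 then shifts traffic to the standby load balancer, either through
# health-check-based failover records or a manual record update.
```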

Multi-site active-active

In a disaster situation affecting Production A, you can adjust the DNS weighting and send all traffic to the Production B environment. The capacity of the AWS environment can be rapidly increased to handle the full production load. You can use Amazon EC2 Auto Scaling to automate this process. You might need some application logic to detect the failure of the primary database services and cut over to the parallel database services running in AWS. 
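The DNS weighting adjustment can be scripted. The following boto3 sketch shifts all weight to the hypothetical Production B record; the hosted zone ID, record name, set identifiers, and endpoints are assumptions.

```python
import boto3

route53 = boto3.client("route53")

def set_weight(set_identifier: str, endpoint: str, weight: int) -> dict:
    """Build an UPSERT change for one weighted CNAME record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "CNAME",
            "SetIdentifier": set_identifier,
            "Weight": weight,
            "TTL": 60,
            "ResourceRecords": [{"Value": endpoint}],
        },
    }

# Send 100% of traffic to Production B during the disaster in Production A.
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789ABCDEFGHIJ",
    ChangeBatch={
        "Changes": [
            set_weight("production-a", "prod-a-lb.example.com", 0),
            set_weight("production-b", "prod-b-lb.us-west-2.elb.amazonaws.com", 100),
        ]
    },
)
```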

This pattern potentially has the least downtime of all. It has more costs associated with it, because more systems are running. The cost of this scenario is determined by how much production traffic is handled by AWS during normal operation. In the recovery phase, you pay only for what you use for the duration that the DR environment is required at full scale. To further reduce cost, purchase Amazon EC2 Reserved Instances for AWS servers that must be always on. 

Regards

Osama
