Designing a Disaster Recovery Strategy on Oracle Cloud Infrastructure: A Practical Guide

Let me be honest with you. Nobody likes thinking about disasters. It’s one of those topics we all know is important, but it often gets pushed to the bottom of the priority list until something goes wrong. And when it does go wrong, it’s usually at 3 AM on a Saturday.

I’ve seen organizations lose days of productivity, thousands of dollars, and sometimes customer trust because they didn’t have a proper disaster recovery plan. The good news? OCI makes disaster recovery achievable without breaking the bank or requiring a dedicated team of engineers.

In this article, I’ll walk you through building a realistic DR strategy on OCI. Not the theoretical stuff you find in whitepapers, but the practical decisions you’ll actually face when setting this up.

Understanding Recovery Objectives

Before we touch any OCI console, we need to talk about two numbers that will drive every decision we make.

Recovery Time Objective (RTO) answers the question: How long can your business survive without this system? If your e-commerce platform goes down, can you afford to be offline for 4 hours? 1 hour? 5 minutes?

Recovery Point Objective (RPO) answers a different question: How much data can you afford to lose? If we restore from a backup taken 2 hours ago, is that acceptable? Or do you need every single transaction preserved?

These aren’t technical questions. They’re business questions. And honestly, the answers might surprise you. I’ve worked with clients who assumed they needed zero RPO for everything, only to realize that most of their systems could tolerate 15-30 minutes of data loss without significant business impact.

Here’s how I typically categorize systems:

Tier      | RTO        | RPO        | Examples
----------|------------|------------|-------------------------------------
Critical  | < 15 min   | Near zero  | Payment processing, core databases
Important | 1-4 hours  | < 1 hour   | Customer portals, internal apps
Standard  | 4-24 hours | < 24 hours | Dev environments, reporting systems

Once you know your tiers, the technical implementation becomes much clearer.
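
One way to make those tiers actionable is to record them as defined tags on your resources, so backup policies and failover automation can find systems by tier instead of by tribal knowledge. A minimal sketch with the OCI CLI; the Operations namespace and DRTier tag are names I've made up for illustration:

bash

# Create a tag namespace and a DR tier tag (names are illustrative)
oci iam tag-namespace create \
    --compartment-id $COMPARTMENT_ID \
    --name "Operations" \
    --description "Operational metadata"

oci iam tag create \
    --tag-namespace-id $TAG_NAMESPACE_ID \
    --name "DRTier" \
    --description "DR tier: critical, important, or standard"

# Tag an instance with its tier so DR automation can target it later
oci compute instance update \
    --instance-id $INSTANCE_ID \
    --defined-tags '{"Operations": {"DRTier": "critical"}}'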

OCI Regions and Availability Domains

OCI’s physical infrastructure is your foundation for DR. Let me explain how it works in plain terms.

Regions are geographically separate data center locations. Think Dubai, Jeddah, Frankfurt, London. They’re far enough apart that a natural disaster affecting one region won’t touch another.

Availability Domains (ADs) are independent data centers within a region. Not all regions have multiple ADs, but the larger ones do. Each AD has its own power, cooling, and networking.

Fault Domains are groupings within an AD that protect against hardware failures. Think of them as different racks or sections of the data center.

For disaster recovery, you’ll typically replicate across regions. For high availability within normal operations, you spread across ADs and fault domains.

Here’s what this looks like in practice:

Primary Region: Dubai (me-dubai-1)
├── Availability Domain 1
│   ├── Fault Domain 1: Web servers (set 1)
│   ├── Fault Domain 2: Web servers (set 2)
│   └── Fault Domain 3: Application servers
└── Availability Domain 2
    └── Database primary + standby

DR Region: Jeddah (me-jeddah-1)
└── Full replica (activated during disaster)
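
Before committing to a layout like this, it's worth confirming what your chosen regions actually offer, since not every region has multiple ADs. A quick check with the OCI CLI (the AD name shown is a placeholder; your tenancy prefix will differ):

bash

# Availability domains in the primary region
oci iam availability-domain list \
    --compartment-id $COMPARTMENT_ID

# Fault domains inside one of those ADs
oci iam fault-domain list \
    --compartment-id $COMPARTMENT_ID \
    --availability-domain "Uocm:ME-DUBAI-1-AD-1"

# Same check against the DR region
oci iam availability-domain list \
    --compartment-id $COMPARTMENT_ID \
    --region me-jeddah-1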

Database Disaster Recovery with Data Guard

Let’s start with databases because that’s usually where the most critical data lives. OCI Autonomous Database and Base Database Service both support Data Guard, which handles replication automatically.

For Autonomous Database, enabling DR is surprisingly simple:

bash

# Create a cross-region standby for Autonomous Database
oci db autonomous-database create-cross-region-disaster-recovery-details \
--autonomous-database-id ocid1.autonomousdatabase.oc1.me-dubai-1.xxx \
--disaster-recovery-type BACKUP_BASED \
--remote-disaster-recovery-type SNAPSHOT \
--dr-region-name me-jeddah-1

But here’s where it gets interesting. You have choices:

Backup-Based DR copies backups to the remote region. It’s cheaper but has a higher RPO: you can lose whatever was written after the most recent backup. Good for Important and Standard tier systems.

Real-Time DR uses Active Data Guard to replicate changes continuously. Near-zero RPO, but it costs more because you’re running a standby database. Essential for Critical tier systems.

For Base Database Service with Data Guard, you configure it like this:

bash

# Enable Data Guard for DB System
oci db data-guard-association create \
--database-id ocid1.database.oc1.me-dubai-1.xxx \
--creation-type NewDbSystem \
--database-admin-password "YourSecurePassword123!" \
--protection-mode MAXIMUM_PERFORMANCE \
--transport-type ASYNC \
--peer-db-system-id ocid1.dbsystem.oc1.me-jeddah-1.xxx

The protection modes matter:

  • Maximum Performance: Transactions commit without waiting for standby confirmation. Best performance, slight risk of data loss during failover.
  • Maximum Availability: Transactions wait for standby acknowledgment but fall back to Maximum Performance if standby is unreachable.
  • Maximum Protection: Transactions fail if standby is unreachable. Zero data loss, but availability depends on standby.

Most production systems use Maximum Performance or Maximum Availability. Maximum Protection is rare because it can halt your primary if the network between regions has issues.
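
Once the association is up, it's worth confirming what you actually got rather than assuming. A quick check with the CLI; the kebab-case field names in the query are my assumption about how the CLI renders them, so verify against your own output:

bash

# Confirm role, protection mode, and transport for the Data Guard association
oci db data-guard-association list \
    --database-id $PRIMARY_DB_ID \
    --query 'data[].{role:"role",mode:"protection-mode",transport:"transport-type",state:"lifecycle-state"}' \
    --output table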

Compute and Application Layer DR

Databases are just one piece. Your application servers, load balancers, and supporting infrastructure also need DR planning.

Option 1: Pilot Light

This is my favorite approach for most organizations. You keep a minimal footprint running in the DR region, just enough to start recovery quickly.

hcl

# Terraform for pilot light infrastructure in DR region
# Minimal compute that can be scaled up during disaster
resource "oci_core_instance" "dr_pilot" {
availability_domain = data.oci_identity_availability_domain.dr_ad.name
compartment_id = var.compartment_id
shape = "VM.Standard.E4.Flex"
shape_config {
ocpus = 1 # Minimal during normal ops
memory_in_gbs = 8
}
display_name = "dr-pilot-instance"
source_details {
source_type = "image"
source_id = var.application_image_id
}
metadata = {
ssh_authorized_keys = var.ssh_public_key
user_data = base64encode(file("./scripts/pilot-light-startup.sh"))
}
}
# Load balancer ready but with no backends attached
resource "oci_load_balancer" "dr_lb" {
compartment_id = var.compartment_id
display_name = "dr-load-balancer"
shape = "flexible"
shape_details {
minimum_bandwidth_in_mbps = 10
maximum_bandwidth_in_mbps = 100
}
subnet_ids = [oci_core_subnet.dr_public_subnet.id]
}

The startup script keeps the instance ready without consuming resources:

bash

#!/bin/bash
# pilot-light-startup.sh
# Install application but don't start it
yum install -y application-server
# Pull latest configuration from Object Storage
oci os object get \
--bucket-name dr-config-bucket \
--name app-config.tar.gz \
--file /opt/app/config.tar.gz
tar -xzf /opt/app/config.tar.gz -C /opt/app/
# Leave application stopped until failover activation
echo "Pilot light instance ready. Application not started."

Option 2: Warm Standby

For systems that need faster recovery, you run a scaled-down version of your production environment continuously:

hcl

# Warm standby with reduced capacity
resource "oci_core_instance_pool" "dr_app_pool" {
compartment_id = var.compartment_id
instance_configuration_id = oci_core_instance_configuration.app_config.id
placement_configurations {
availability_domain = data.oci_identity_availability_domain.dr_ad.name
primary_subnet_id = oci_core_subnet.dr_app_subnet.id
}
size = 2 # Production runs 6, DR runs 2
display_name = "dr-app-pool"
}
# Autoscaling policy to expand during failover
resource "oci_autoscaling_auto_scaling_configuration" "dr_scaling" {
compartment_id = var.compartment_id
auto_scaling_resources {
id = oci_core_instance_pool.dr_app_pool.id
type = "instancePool"
}
policies {
display_name = "failover-scale-up"
policy_type = "threshold"
rules {
action {
type = "CHANGE_COUNT_BY"
value = 4 # Add 4 instances to match production
}
metric {
metric_type = "CPU_UTILIZATION"
threshold {
operator = "GT"
value = 70
}
}
}
}
}

Object Storage Replication

Your files, backups, and static assets need protection too. OCI Object Storage supports cross-region replication:

bash

# Create replication policy
oci os replication create-replication-policy \
--bucket-name production-assets \
--destination-bucket-name dr-assets \
--destination-region me-jeddah-1 \
--name "prod-to-dr-replication"

One thing people often miss: replication is asynchronous. For critical files that absolutely cannot be lost, consider writing to both regions from your application:

python

# Python example: Writing to both regions
import oci

def upload_critical_file(file_path, object_name):
    config_primary = oci.config.from_file(profile_name="PRIMARY")
    config_dr = oci.config.from_file(profile_name="DR")
    primary_client = oci.object_storage.ObjectStorageClient(config_primary)
    dr_client = oci.object_storage.ObjectStorageClient(config_dr)

    with open(file_path, 'rb') as f:
        file_content = f.read()

    # Write to primary
    primary_client.put_object(
        namespace_name="your-namespace",
        bucket_name="critical-files",
        object_name=object_name,
        put_object_body=file_content
    )

    # Write to DR region
    dr_client.put_object(
        namespace_name="your-namespace",
        bucket_name="critical-files-dr",
        object_name=object_name,
        put_object_body=file_content
    )

    print(f"File {object_name} written to both regions")

DNS and Traffic Management

When disaster strikes, you need to redirect users to your DR region. OCI DNS with Traffic Management makes this manageable:

hcl

# Traffic Management Steering Policy
resource "oci_dns_steering_policy" "failover" {
compartment_id = var.compartment_id
display_name = "app-failover-policy"
template = "FAILOVER"
# Primary region answers
answers {
name = "primary"
rtype = "A"
rdata = var.primary_lb_ip
pool = "primary-pool"
is_disabled = false
}
# DR region answers
answers {
name = "dr"
rtype = "A"
rdata = var.dr_lb_ip
pool = "dr-pool"
is_disabled = false
}
rules {
rule_type = "FILTER"
}
rules {
rule_type = "HEALTH"
}
rules {
rule_type = "PRIORITY"
default_answer_data {
answer_condition = "answer.pool == 'primary-pool'"
value = 1
}
default_answer_data {
answer_condition = "answer.pool == 'dr-pool'"
value = 2
}
}
}
# Health check for primary region
resource "oci_health_checks_http_monitor" "primary_health" {
compartment_id = var.compartment_id
display_name = "primary-region-health"
interval_in_seconds = 30
targets = [var.primary_lb_ip]
protocol = "HTTPS"
port = 443
path = "/health"
timeout_in_seconds = 10
}

The Failover Runbook

All this infrastructure means nothing without a clear process. Here’s a realistic runbook:

Automated Detection

python

# OCI Function to detect and alert on regional issues
import io
import json
import oci

def handler(ctx, data: io.BytesIO = None):
    signer = oci.auth.signers.get_resource_principals_signer()
    monitoring_client = oci.monitoring.MonitoringClient(config={}, signer=signer)

    # Check critical metrics
    response = monitoring_client.summarize_metrics_data(
        compartment_id="ocid1.compartment.xxx",
        summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
            namespace="oci_lbaas",
            query='UnHealthyBackendServers[5m].sum() > 2'
        )
    )

    if response.data:
        # Trigger alert
        notifications_client = oci.ons.NotificationDataPlaneClient(config={}, signer=signer)
        notifications_client.publish_message(
            topic_id="ocid1.onstopic.xxx",
            message_details=oci.ons.models.MessageDetails(
                title="DR Alert: Primary Region Degraded",
                body="Multiple backend servers unhealthy. Consider initiating failover."
            )
        )

    return response

Manual Failover Steps

bash

#!/bin/bash
# failover.sh - Execute with caution
set -e

echo "=== OCI DISASTER RECOVERY FAILOVER ==="
echo "This will switch production traffic to the DR region."
read -p "Type 'FAILOVER' to confirm: " confirmation

if [ "$confirmation" != "FAILOVER" ]; then
    echo "Failover cancelled."
    exit 1
fi

echo "[1/5] Initiating database switchover..."
oci db data-guard-association switchover \
    --database-id $PRIMARY_DB_ID \
    --data-guard-association-id $DG_ASSOCIATION_ID

echo "[2/5] Scaling up DR compute instances..."
oci compute instance-pool update \
    --instance-pool-id $DR_INSTANCE_POOL_ID \
    --size 6

echo "[3/5] Waiting for instances to be running..."
sleep 120

echo "[4/5] Updating load balancer backends..."
oci lb backend-set update \
    --load-balancer-id $DR_LB_ID \
    --backend-set-name "app-backend-set" \
    --backends file://dr-backends.json

echo "[5/5] Updating DNS steering policy..."
oci dns steering-policy update \
    --steering-policy-id $STEERING_POLICY_ID \
    --rules file://failover-rules.json

echo "=== FAILOVER COMPLETE ==="
echo "Verify application at: https://app.example.com"

Testing Your DR Plan

Here’s the uncomfortable truth: a DR plan that hasn’t been tested is just documentation. You need to actually run failovers.

I recommend this schedule:

  • Monthly: Tabletop exercise. Walk through the runbook with your team without actually executing anything.
  • Quarterly: Partial failover. Switch one non-critical component to DR and back.
  • Annually: Full DR test. Fail over completely and run production from the DR region for at least 4 hours.
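
Between those scheduled tests, a lightweight readiness check catches the quiet failures: expired certificates, replication policies that stopped working, a Data Guard association in a bad state. Here's the kind of script I mean; treat the list commands and field names as assumptions to verify against the CLI reference, and the IDs, bucket name, and hostname as placeholders:

bash

#!/bin/bash
# dr-readiness-check.sh - lightweight checks to run between full DR tests
set -e

echo "Data Guard association status..."
oci db data-guard-association list \
    --database-id $PRIMARY_DB_ID \
    --query 'data[].{mode:"protection-mode",state:"lifecycle-state"}' \
    --output table

echo "Object Storage replication policy status..."
oci os replication list-replication-policies \
    --bucket-name production-assets \
    --output table

echo "DR load balancer certificate expiry..."
echo | openssl s_client -connect $DR_LB_IP:443 -servername app.example.com 2>/dev/null \
    | openssl x509 -noout -enddate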

Document everything:

markdown

## DR Test Report - Q4 2025
**Date**: December 15, 2025
**Participants**: Ahmed, Sarah, Mohammed
**Test Type**: Full failover
### Timeline
- 09:00 - Initiated failover sequence
- 09:03 - Database switchover complete
- 09:08 - Compute instances running in DR
- 09:12 - DNS propagation confirmed
- 09:15 - Application accessible from DR region
### Issues Discovered
1. SSL certificate for DR load balancer had expired
   - Resolution: Renewed certificate, added calendar reminder
2. One microservice had hardcoded primary region endpoint
   - Resolution: Updated to use DNS name instead
### RTO Achieved
15 minutes (Target: 30 minutes) ✓
### RPO Achieved
< 30 seconds of transaction loss ✓
### Action Items
- [ ] Automate certificate renewal monitoring
- [ ] Audit all services for hardcoded endpoints
- [ ] Update runbook with SSL verification step

Cost Optimization

DR doesn’t have to be expensive. Here are real strategies I use:

Right-size your DR tier: Not everything needs instant failover. Be honest about what’s truly critical.

Use preemptible instances for testing: When you’re just validating your DR setup works, you don’t need full-price compute:

hcl

resource "oci_core_instance" "dr_test" {
# ... other config ...
preemptible_instance_config {
preemption_action {
type = "TERMINATE"
preserve_boot_volume = false
}
}
}

Schedule DR resources: If you’re running warm standby, scale it down during your off-peak hours:

bash

# Scale down at night, scale up in morning
# Cron job or OCI Scheduler
0 22 * * * oci compute instance-pool update --instance-pool-id $POOL_ID --size 1
0 6 * * * oci compute instance-pool update --instance-pool-id $POOL_ID --size 2

Leverage reserved capacity: If you’re committed to DR, reserved capacity in your DR region is cheaper than on-demand.
