Let me be honest with you. Nobody likes thinking about disasters. It’s one of those topics we all know is important, but it often gets pushed to the bottom of the priority list until something goes wrong. And when it does go wrong, it’s usually at 3 AM on a Saturday.
I’ve seen organizations lose days of productivity, thousands of dollars, and sometimes customer trust because they didn’t have a proper disaster recovery plan. The good news? OCI makes disaster recovery achievable without breaking the bank or requiring a dedicated team of engineers.
In this article, I’ll walk you through building a realistic DR strategy on OCI. Not the theoretical stuff you find in whitepapers, but the practical decisions you’ll actually face when setting this up.
Understanding Recovery Objectives
Before we touch any OCI console, we need to talk about two numbers that will drive every decision we make.
Recovery Time Objective (RTO) answers the question: How long can your business survive without this system? If your e-commerce platform goes down, can you afford to be offline for 4 hours? 1 hour? 5 minutes?
Recovery Point Objective (RPO) answers a different question: How much data can you afford to lose? If we restore from a backup taken 2 hours ago, is that acceptable? Or do you need every single transaction preserved?
These aren’t technical questions. They’re business questions. And honestly, the answers might surprise you. I’ve worked with clients who assumed they needed zero RPO for everything, only to realize that most of their systems could tolerate 15-30 minutes of data loss without significant business impact.
Here’s how I typically categorize systems:
| Tier | RTO | RPO | Examples |
|---|---|---|---|
| Critical | < 15 min | Near zero | Payment processing, core databases |
| Important | 1-4 hours | < 1 hour | Customer portals, internal apps |
| Standard | 4-24 hours | < 24 hours | Dev environments, reporting systems |
Once you know your tiers, the technical implementation becomes much clearer.
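If you plan to automate any of this later, it helps to capture the tiers somewhere machine-readable rather than in a slide deck. Here's a minimal Python sketch; the thresholds mirror the table above, and the system-to-tier mapping is purely illustrative:

```python
# dr_tiers.py - encode recovery objectives so scripts can reference them
# Thresholds mirror the tier table above (upper bounds); example systems are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class DrTier:
    name: str
    rto_minutes: int   # maximum acceptable downtime
    rpo_minutes: int   # maximum acceptable data-loss window


TIERS = {
    "critical":  DrTier("critical", rto_minutes=15, rpo_minutes=0),      # RPO near zero
    "important": DrTier("important", rto_minutes=240, rpo_minutes=60),
    "standard":  DrTier("standard", rto_minutes=1440, rpo_minutes=1440),
}

# Example mapping of systems to tiers (replace with your own inventory)
SYSTEM_TIERS = {
    "payments-db": "critical",
    "customer-portal": "important",
    "reporting": "standard",
}

if __name__ == "__main__":
    for system, tier_name in SYSTEM_TIERS.items():
        tier = TIERS[tier_name]
        print(f"{system}: RTO <= {tier.rto_minutes} min, RPO <= {tier.rpo_minutes} min")
```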
OCI Regions and Availability Domains
OCI’s physical infrastructure is your foundation for DR. Let me explain how it works in plain terms.
Regions are geographically separate data center locations. Think Dubai, Jeddah, Frankfurt, London. They’re far enough apart that a natural disaster affecting one region won’t touch another.
Availability Domains (ADs) are independent data centers within a region. Not all regions have multiple ADs, but the larger ones do. Each AD has its own power, cooling, and networking.
Fault Domains are groupings within an AD that protect against hardware failures. Think of them as different racks or sections of the data center.
For disaster recovery, you’ll typically replicate across regions. For high availability within normal operations, you spread across ADs and fault domains.
Here’s what this looks like in practice:
```
Primary Region: Dubai (me-dubai-1)
├── Availability Domain 1
│   ├── Fault Domain 1: Web servers (set 1)
│   ├── Fault Domain 2: Web servers (set 2)
│   └── Fault Domain 3: Application servers
└── Availability Domain 2
    └── Database primary + standby

DR Region: Jeddah (me-jeddah-1)
└── Full replica (activated during disaster)
```
Database Disaster Recovery with Data Guard
Let’s start with databases because that’s usually where the most critical data lives. OCI Autonomous Database and Base Database Service both support Data Guard, which handles replication automatically.
For Autonomous Database, enabling DR is surprisingly simple:
```bash
# Create a cross-region standby for Autonomous Database
oci db autonomous-database create-cross-region-disaster-recovery-details \
  --autonomous-database-id ocid1.autonomousdatabase.oc1.me-dubai-1.xxx \
  --disaster-recovery-type BACKUP_BASED \
  --remote-disaster-recovery-type SNAPSHOT \
  --dr-region-name me-jeddah-1
```
But here’s where it gets interesting. You have choices:
Backup-Based DR copies backups to the remote region. It’s cheaper but has a higher RPO (you might lose any data written since the last backup). Good for the Important and Standard tiers.
Real-Time DR uses Active Data Guard to replicate changes continuously. Near-zero RPO, but it costs more because you’re running a standby database. Essential for the Critical tier.
For Base Database Service with Data Guard, you configure it like this:
```bash
# Enable Data Guard for DB System
oci db data-guard-association create \
  --database-id ocid1.database.oc1.me-dubai-1.xxx \
  --creation-type NewDbSystem \
  --database-admin-password "YourSecurePassword123!" \
  --protection-mode MAXIMUM_PERFORMANCE \
  --transport-type ASYNC \
  --peer-db-system-id ocid1.dbsystem.oc1.me-jeddah-1.xxx
```
The protection modes matter:
- Maximum Performance: Transactions commit without waiting for standby confirmation. Best performance, slight risk of data loss during failover.
- Maximum Availability: Transactions wait for standby acknowledgment but fall back to Maximum Performance if standby is unreachable.
- Maximum Protection: Transactions fail if standby is unreachable. Zero data loss, but availability depends on standby.
Most production systems use Maximum Performance or Maximum Availability. Maximum Protection is rare because it can halt your primary if the network between regions has issues.
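If you want to confirm which mode a database is actually running, you can read it back from its Data Guard association instead of trusting the runbook. Here's a small sketch with the OCI Python SDK; the database OCID is a placeholder, and the attribute names reflect my reading of the SDK's Data Guard association model, so verify them against your SDK version:

```python
# check_data_guard.py - read back protection mode and transport type
# Assumes the OCI Python SDK ("pip install oci") and a configured ~/.oci/config.
import oci

config = oci.config.from_file()  # default profile
db_client = oci.database.DatabaseClient(config)

# Placeholder OCID - replace with your primary database's OCID
database_id = "ocid1.database.oc1.me-dubai-1.xxx"

associations = db_client.list_data_guard_associations(database_id=database_id).data

for assoc in associations:
    print(f"Peer role:        {assoc.peer_role}")
    print(f"Protection mode:  {assoc.protection_mode}")
    print(f"Transport type:   {assoc.transport_type}")
    print(f"Lifecycle state:  {assoc.lifecycle_state}")
```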
Compute and Application Layer DR
Databases are just one piece. Your application servers, load balancers, and supporting infrastructure also need DR planning.
Option 1: Pilot Light
This is my favorite approach for most organizations. You keep a minimal footprint running in the DR region, just enough to start recovery quickly.
```hcl
# Terraform for pilot light infrastructure in DR region
# Minimal compute that can be scaled up during disaster
resource "oci_core_instance" "dr_pilot" {
  availability_domain = data.oci_identity_availability_domain.dr_ad.name
  compartment_id      = var.compartment_id
  shape               = "VM.Standard.E4.Flex"

  shape_config {
    ocpus         = 1 # Minimal during normal ops
    memory_in_gbs = 8
  }

  display_name = "dr-pilot-instance"

  source_details {
    source_type = "image"
    source_id   = var.application_image_id
  }

  metadata = {
    ssh_authorized_keys = var.ssh_public_key
    user_data           = base64encode(file("./scripts/pilot-light-startup.sh"))
  }
}

# Load balancer ready but with no backends attached
resource "oci_load_balancer" "dr_lb" {
  compartment_id = var.compartment_id
  display_name   = "dr-load-balancer"
  shape          = "flexible"

  shape_details {
    minimum_bandwidth_in_mbps = 10
    maximum_bandwidth_in_mbps = 100
  }

  subnet_ids = [oci_core_subnet.dr_public_subnet.id]
}
```
The startup script keeps the instance ready without consuming resources:
```bash
#!/bin/bash
# pilot-light-startup.sh

# Install application but don't start it
yum install -y application-server

# Pull latest configuration from Object Storage
oci os object get \
  --bucket-name dr-config-bucket \
  --name app-config.tar.gz \
  --file /opt/app/config.tar.gz

tar -xzf /opt/app/config.tar.gz -C /opt/app/

# Leave application stopped until failover activation
echo "Pilot light instance ready. Application not started."
```
Option 2: Warm Standby
For systems that need faster recovery, you run a scaled-down version of your production environment continuously:
```hcl
# Warm standby with reduced capacity
resource "oci_core_instance_pool" "dr_app_pool" {
  compartment_id            = var.compartment_id
  instance_configuration_id = oci_core_instance_configuration.app_config.id

  placement_configurations {
    availability_domain = data.oci_identity_availability_domain.dr_ad.name
    primary_subnet_id   = oci_core_subnet.dr_app_subnet.id
  }

  size         = 2 # Production runs 6, DR runs 2
  display_name = "dr-app-pool"
}

# Autoscaling policy to expand during failover
resource "oci_autoscaling_auto_scaling_configuration" "dr_scaling" {
  compartment_id = var.compartment_id

  auto_scaling_resources {
    id   = oci_core_instance_pool.dr_app_pool.id
    type = "instancePool"
  }

  policies {
    display_name = "failover-scale-up"
    policy_type  = "threshold"

    rules {
      action {
        type  = "CHANGE_COUNT_BY"
        value = 4 # Add 4 instances to match production
      }
      metric {
        metric_type = "CPU_UTILIZATION"
        threshold {
          operator = "GT"
          value    = 70
        }
      }
    }
  }
}
```
Object Storage Replication
Your files, backups, and static assets need protection too. OCI Object Storage supports cross-region replication:
```bash
# Create replication policy
oci os replication create-replication-policy \
  --bucket-name production-assets \
  --destination-bucket-name dr-assets \
  --destination-region me-jeddah-1 \
  --name "prod-to-dr-replication"
```
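Once the policy exists, it's worth checking that it stays healthy and how far behind it is. Here's a minimal sketch with the Python SDK; the bucket name matches the example above, and the status and last-sync attributes are based on my reading of the replication policy model, so confirm them against your SDK version:

```python
# check_replication.py - inspect cross-region replication status and lag
# Assumes the OCI Python SDK and a configured ~/.oci/config profile.
import oci

config = oci.config.from_file()
os_client = oci.object_storage.ObjectStorageClient(config)

namespace = os_client.get_namespace().data

policies = os_client.list_replication_policies(
    namespace_name=namespace,
    bucket_name="production-assets",
).data

for policy in policies:
    print(f"Policy:         {policy.name}")
    print(f"Status:         {policy.status}")          # e.g. ACTIVE
    print(f"Last sync time: {policy.time_last_sync}")  # rough indicator of replication lag
```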
One thing people often miss: replication is asynchronous. For critical files that absolutely cannot be lost, consider writing to both regions from your application:
```python
# Python example: Writing to both regions
import oci

def upload_critical_file(file_path, object_name):
    config_primary = oci.config.from_file(profile_name="PRIMARY")
    config_dr = oci.config.from_file(profile_name="DR")

    primary_client = oci.object_storage.ObjectStorageClient(config_primary)
    dr_client = oci.object_storage.ObjectStorageClient(config_dr)

    with open(file_path, 'rb') as f:
        file_content = f.read()

    # Write to primary
    primary_client.put_object(
        namespace_name="your-namespace",
        bucket_name="critical-files",
        object_name=object_name,
        put_object_body=file_content
    )

    # Write to DR region
    dr_client.put_object(
        namespace_name="your-namespace",
        bucket_name="critical-files-dr",
        object_name=object_name,
        put_object_body=file_content
    )

    print(f"File {object_name} written to both regions")
```
DNS and Traffic Management
When disaster strikes, you need to redirect users to your DR region. OCI DNS with Traffic Management makes this manageable:
```hcl
# Traffic Management Steering Policy
resource "oci_dns_steering_policy" "failover" {
  compartment_id = var.compartment_id
  display_name   = "app-failover-policy"
  template       = "FAILOVER"

  # Primary region answers
  answers {
    name        = "primary"
    rtype       = "A"
    rdata       = var.primary_lb_ip
    pool        = "primary-pool"
    is_disabled = false
  }

  # DR region answers
  answers {
    name        = "dr"
    rtype       = "A"
    rdata       = var.dr_lb_ip
    pool        = "dr-pool"
    is_disabled = false
  }

  rules {
    rule_type = "FILTER"
  }

  rules {
    rule_type = "HEALTH"
  }

  rules {
    rule_type = "PRIORITY"

    default_answer_data {
      answer_condition = "answer.pool == 'primary-pool'"
      value            = 1
    }
    default_answer_data {
      answer_condition = "answer.pool == 'dr-pool'"
      value            = 2
    }
  }
}

# Health check for primary region
resource "oci_health_checks_http_monitor" "primary_health" {
  compartment_id      = var.compartment_id
  display_name        = "primary-region-health"
  interval_in_seconds = 30
  targets             = [var.primary_lb_ip]
  protocol            = "HTTPS"
  port                = 443
  path                = "/health"
  timeout_in_seconds  = 10
}
```
The Failover Runbook
All this infrastructure means nothing without a clear process. Here’s a realistic runbook:
Automated Detection
```python
# OCI Function to detect and alert on regional issues
import io
import json
import oci

def handler(ctx, data: io.BytesIO = None):
    signer = oci.auth.signers.get_resource_principals_signer()
    monitoring_client = oci.monitoring.MonitoringClient(config={}, signer=signer)

    # Check critical metrics
    response = monitoring_client.summarize_metrics_data(
        compartment_id="ocid1.compartment.xxx",
        summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
            namespace="oci_lbaas",
            query='UnHealthyBackendServers[5m].sum() > 2'
        )
    )

    if response.data:
        # Trigger alert
        notifications_client = oci.ons.NotificationDataPlaneClient(config={}, signer=signer)
        notifications_client.publish_message(
            topic_id="ocid1.onstopic.xxx",
            message_details=oci.ons.models.MessageDetails(
                title="DR Alert: Primary Region Degraded",
                body="Multiple backend servers unhealthy. Consider initiating failover."
            )
        )

    return response
```
Manual Failover Steps
```bash
#!/bin/bash
# failover.sh - Execute with caution
set -e

echo "=== OCI DISASTER RECOVERY FAILOVER ==="
echo "This will switch production traffic to the DR region."
read -p "Type 'FAILOVER' to confirm: " confirmation

if [ "$confirmation" != "FAILOVER" ]; then
  echo "Failover cancelled."
  exit 1
fi

echo "[1/5] Initiating database switchover..."
oci db data-guard-association switchover \
  --database-id $PRIMARY_DB_ID \
  --data-guard-association-id $DG_ASSOCIATION_ID

echo "[2/5] Scaling up DR compute instances..."
oci compute instance-pool update \
  --instance-pool-id $DR_INSTANCE_POOL_ID \
  --size 6

echo "[3/5] Waiting for instances to be running..."
sleep 120

echo "[4/5] Updating load balancer backends..."
oci lb backend-set update \
  --load-balancer-id $DR_LB_ID \
  --backend-set-name "app-backend-set" \
  --backends file://dr-backends.json

echo "[5/5] Updating DNS steering policy..."
oci dns steering-policy update \
  --steering-policy-id $STEERING_POLICY_ID \
  --rules file://failover-rules.json

echo "=== FAILOVER COMPLETE ==="
echo "Verify application at: https://app.example.com"
```
Testing Your DR Plan
Here’s the uncomfortable truth: a DR plan that hasn’t been tested is just documentation. You need to actually run failovers.
I recommend this schedule:
- Monthly: Tabletop exercise. Walk through the runbook with your team without actually executing anything.
- Quarterly: Partial failover. Switch one non-critical component to DR and back.
- Annually: Full DR test. Fail over completely and run production from the DR region for at least 4 hours.
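During the quarterly and annual tests, a small verification script saves a lot of manual checking. Here's a minimal sketch using only the Python standard library; the hostname, DR load balancer IP, and health path are placeholders for your environment:

```python
# verify_failover.py - quick checks to run during a DR test
# Hostname, expected IP, and health path are placeholders for your environment.
import socket
import urllib.request

APP_HOSTNAME = "app.example.com"
DR_LB_IP = "203.0.113.50"  # the DR load balancer's public IP (example value)
HEALTH_URL = f"https://{APP_HOSTNAME}/health"

# 1. Confirm DNS now points at the DR region
resolved_ip = socket.gethostbyname(APP_HOSTNAME)
print(f"{APP_HOSTNAME} resolves to {resolved_ip} "
      f"({'DR' if resolved_ip == DR_LB_IP else 'NOT DR'})")

# 2. Confirm the application answers on its health endpoint
try:
    with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
        print(f"{HEALTH_URL} returned HTTP {response.status}")
except Exception as exc:  # any failure is worth reporting during a test
    print(f"Health check failed: {exc}")
```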
Document everything:
```markdown
## DR Test Report - Q4 2025

**Date**: December 15, 2025
**Participants**: Ahmed, Sarah, Mohammed
**Test Type**: Full failover

### Timeline
- 09:00 - Initiated failover sequence
- 09:03 - Database switchover complete
- 09:08 - Compute instances running in DR
- 09:12 - DNS propagation confirmed
- 09:15 - Application accessible from DR region

### Issues Discovered
1. SSL certificate for DR load balancer had expired
   - Resolution: Renewed certificate, added calendar reminder
2. One microservice had hardcoded primary region endpoint
   - Resolution: Updated to use DNS name instead

### RTO Achieved
15 minutes (Target: 30 minutes) ✓

### RPO Achieved
< 30 seconds of transaction loss ✓

### Action Items
- [ ] Automate certificate renewal monitoring
- [ ] Audit all services for hardcoded endpoints
- [ ] Update runbook with SSL verification step
```
Cost Optimization
DR doesn’t have to be expensive. Here are real strategies I use:
Right-size your DR tier: Not everything needs instant failover. Be honest about what’s truly critical.
Use preemptible instances for testing: When you’re just validating your DR setup works, you don’t need full-price compute:
```hcl
resource "oci_core_instance" "dr_test" {
  # ... other config ...

  preemptible_instance_config {
    preemption_action {
      type                 = "TERMINATE"
      preserve_boot_volume = false
    }
  }
}
```
Schedule DR resources: If you’re running warm standby, scale it down during your off-peak hours:
```bash
# Scale down at night, scale up in morning
# Cron job or OCI Scheduler
0 22 * * * oci compute instance-pool update --instance-pool-id $POOL_ID --size 1
0 6 * * * oci compute instance-pool update --instance-pool-id $POOL_ID --size 2
```
Leverage reserved capacity: If you’re committed to DR, reserved capacity in your DR region is cheaper than on-demand.
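As a rough sketch of what that could look like with the Python SDK (the capacity reservation model and field names here are my assumption of the current SDK shape, so verify them against the version you're running; the OCIDs, AD name, and counts are placeholders):

```python
# reserve_dr_capacity.py - sketch of reserving compute capacity in the DR region
# Assumes the OCI Python SDK's capacity reservation models; names may differ
# between SDK versions, so treat this as a starting point, not a reference.
import oci

config = oci.config.from_file(profile_name="DR")  # profile pointing at the DR region
compute_client = oci.core.ComputeClient(config)

details = oci.core.models.CreateComputeCapacityReservationDetails(
    compartment_id="ocid1.compartment.oc1..xxx",  # placeholder OCID
    availability_domain="AD-1",                   # placeholder AD name
    display_name="dr-app-capacity",
    instance_reservation_configs=[
        oci.core.models.InstanceReservationConfigDetails(
            instance_shape="VM.Standard.E4.Flex",
            reserved_count=6,  # enough to match production during failover
            instance_shape_config=oci.core.models.InstanceReservationShapeConfigDetails(
                ocpus=2,
                memory_in_gbs=16,
            ),
        )
    ],
)

reservation = compute_client.create_compute_capacity_reservation(details).data
print(f"Capacity reservation {reservation.display_name}: {reservation.lifecycle_state}")
```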