Building a Multi-Cloud Architecture with OCI and AWS: A Real-World Integration Guide

I’ll tell you something that might sound controversial in cloud circles: the best cloud is often more than one cloud.

I’ve worked with dozens of enterprises over the years, and here’s what I’ve noticed. Some started with AWS years ago and built their entire infrastructure there. Then they realized Oracle Autonomous Database or Exadata could dramatically improve their database performance. Others were Oracle shops that wanted to leverage AWS’s machine learning services or global edge network.

The question isn’t really “which cloud is better?” The question is “how do we get the best of both?”

In this article, I’ll walk you through building a practical multi-cloud architecture connecting OCI and AWS. We’ll cover secure networking, data synchronization, identity federation, and the operational realities of running workloads across both platforms.

Why Multi-Cloud Actually Makes Sense

Let me be clear about something. Multi-cloud for its own sake is a terrible idea. It adds complexity, increases operational burden, and creates more things that can break. But multi-cloud for the right reasons? That’s a different story.

Here are legitimate reasons I’ve seen organizations adopt OCI and AWS together:

Database Performance: Oracle Autonomous Database and Exadata Cloud Service are genuinely difficult to match for Oracle workloads. If you’re running complex OLTP or analytics on Oracle, OCI’s database offerings are purpose-built for that.

AWS Ecosystem: AWS has services that simply don’t exist elsewhere. SageMaker for ML, Lambda’s maturity, CloudFront’s global presence, or specialized services like Rekognition and Comprehend.

Vendor Negotiation: Having workloads on multiple clouds gives you negotiating leverage. I’ve seen organizations save millions in licensing by demonstrating they could move workloads.

Mergers and Acquisitions: Company A runs on AWS, Company B runs on OCI. Now they’re one company. Multi-cloud by necessity.

Regulatory Requirements: Some industries require data sovereignty or specific compliance certifications that might be easier to achieve with a particular provider in a particular region.

If none of these apply to you, stick with one cloud. Seriously. But if they do, keep reading.

Architecture Overview

Let’s design a realistic scenario. We have an e-commerce company with:

  • Application tier running on AWS (EKS, Lambda, API Gateway)
  • Core transactional database on OCI (Autonomous Transaction Processing)
  • Data warehouse on OCI (Autonomous Data Warehouse)
  • Machine learning workloads on AWS (SageMaker)
  • Shared data that needs to flow between both clouds


Setting Up Cross-Cloud Networking

The foundation of any multi-cloud architecture is networking. You need a secure, reliable, and performant connection between clouds.

Option 1: IPSec VPN (Good for Starting Out)

IPSec VPN is the quickest way to connect AWS and OCI. It runs over the public internet but encrypts everything. Good for development, testing, or low-bandwidth production workloads.

On OCI Side:

First, create a Dynamic Routing Gateway (DRG) and attach it to your VCN:

bash

# Create DRG
oci network drg create \
--compartment-id $COMPARTMENT_ID \
--display-name "aws-interconnect-drg"
# Attach DRG to VCN
oci network drg-attachment create \
--drg-id $DRG_ID \
--vcn-id $VCN_ID \
--display-name "vcn-attachment"

Create a Customer Premises Equipment (CPE) object representing AWS:

bash

# Create CPE for AWS VPN endpoint
oci network cpe create \
--compartment-id $COMPARTMENT_ID \
--ip-address $AWS_VPN_PUBLIC_IP \
--display-name "aws-vpn-endpoint"

Create the IPSec connection:

bash

# Create IPSec connection
oci network ip-sec-connection create \
--compartment-id $COMPARTMENT_ID \
--cpe-id $CPE_ID \
--drg-id $DRG_ID \
--static-routes '["10.1.0.0/16"]' \
--display-name "oci-to-aws-vpn"

On AWS Side:

Create a Customer Gateway pointing to OCI:

bash

# Create Customer Gateway
aws ec2 create-customer-gateway \
--type ipsec.1 \
--public-ip $OCI_VPN_PUBLIC_IP \
--bgp-asn 65000
# Create VPN Gateway
aws ec2 create-vpn-gateway \
--type ipsec.1
# Attach to VPC
aws ec2 attach-vpn-gateway \
--vpn-gateway-id $VGW_ID \
--vpc-id $VPC_ID
# Create VPN Connection
aws ec2 create-vpn-connection \
--type ipsec.1 \
--customer-gateway-id $CGW_ID \
--vpn-gateway-id $VGW_ID \
--options '{"StaticRoutesOnly": true}'

Update route tables on both sides:

bash

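# AWS: Tell the static-routing VPN connection about the OCI CIDR
aws ec2 create-vpn-connection-route \
--vpn-connection-id $VPN_CONNECTION_ID \
--destination-cidr-block 10.2.0.0/16
# AWS: Add route to OCI CIDR
aws ec2 create-route \
--route-table-id $ROUTE_TABLE_ID \
--destination-cidr-block 10.2.0.0/16 \
--gateway-id $VGW_ID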
# OCI: Add route to AWS CIDR
oci network route-table update \
--rt-id $ROUTE_TABLE_ID \
--route-rules '[{
"destination": "10.1.0.0/16",
"destinationType": "CIDR_BLOCK",
"networkEntityId": "'$DRG_ID'"
}]'
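
Static routing only works if the two address ranges don't collide, so it's worth checking that before you build anything. A minimal sketch using only the standard library; the function name is mine, and the CIDRs are the ones used in the routes above:

```python
import ipaddress


def find_overlaps(cidrs):
    """Return every pair of CIDR blocks in the list that overlap."""
    networks = [ipaddress.ip_network(c) for c in cidrs]
    overlaps = []
    for i, first in enumerate(networks):
        for second in networks[i + 1:]:
            if first.overlaps(second):
                overlaps.append((str(first), str(second)))
    return overlaps


# AWS VPC and OCI VCN ranges from the route examples
print(find_overlaps(["10.1.0.0/16", "10.2.0.0/16"]))
```

An empty list means the ranges are safe to route between. Anything else needs to be re-addressed before the tunnel comes up.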

Option 2: Private Connectivity (Production Recommended)

For production workloads, you want dedicated private connectivity. This means OCI FastConnect paired with AWS Direct Connect, meeting at a common colocation facility.

The good news is that Oracle and AWS both have presence in major colocation providers like Equinix. The setup involves:

  1. Establishing FastConnect to your colocation
  2. Establishing Direct Connect to the same colocation
  3. Connecting them via a cross-connect in the facility

hcl

# Terraform for FastConnect virtual circuit
resource "oci_core_virtual_circuit" "aws_interconnect" {
compartment_id = var.compartment_id
display_name = "aws-fastconnect"
type = "PRIVATE"
bandwidth_shape_name = "1 Gbps"
cross_connect_mappings {
customer_bgp_peering_ip = "169.254.100.1/30"
oracle_bgp_peering_ip = "169.254.100.2/30"
}
customer_asn = "65001"
gateway_id = oci_core_drg.main.id
provider_name = "Equinix"
region = "Dubai"
}

hcl

# Terraform for AWS Direct Connect
resource "aws_dx_connection" "oci_interconnect" {
name = "oci-direct-connect"
bandwidth = "1Gbps"
location = "Equinix DX1"
provider_name = "Equinix"
}
resource "aws_dx_private_virtual_interface" "oci" {
connection_id = aws_dx_connection.oci_interconnect.id
name = "oci-vif"
vlan = 4094
address_family = "ipv4"
bgp_asn = 65002
amazon_address = "169.254.100.5/30"
customer_address = "169.254.100.6/30"
dx_gateway_id = aws_dx_gateway.main.id
}

Honestly, setting this up involves coordination with both cloud providers and the colocation facility. Budget 4-8 weeks for the physical connectivity and plan for redundancy from day one.
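
BGP sessions refuse to come up when the two peering addresses don't sit in the same small subnet, and that's an easy typo to make across two Terraform files. Here's a rough sanity check, assuming you keep the peering addresses in CIDR notation as in the snippets above. The helper name is hypothetical.

```python
import ipaddress


def valid_peering_pair(local_ip, remote_ip):
    """True when both addresses are distinct hosts in the same subnet."""
    local = ipaddress.ip_interface(local_ip)
    remote = ipaddress.ip_interface(remote_ip)
    return local.network == remote.network and local.ip != remote.ip


# OCI side of the circuit
print(valid_peering_pair("169.254.100.1/30", "169.254.100.2/30"))
# AWS side of the circuit
print(valid_peering_pair("169.254.100.5/30", "169.254.100.6/30"))
```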

Database Connectivity from AWS to OCI

Now that we have network connectivity, let’s connect AWS applications to OCI databases.

Configuring Autonomous Database for External Access

First, restrict network access to your Autonomous Database so that only the AWS VPC range can connect:

bash

# Restrict access to the AWS VPC CIDR (10.1.0.0/16)
# and allow TLS without mTLS for simplicity
oci db autonomous-database update \
--autonomous-database-id $ADB_ID \
--is-access-control-enabled true \
--whitelisted-ips '["10.1.0.0/16"]' \
--is-mtls-connection-required false

Get the connection string:

bash

oci db autonomous-database get \
--autonomous-database-id $ADB_ID \
--query 'data."connection-strings".profiles[?consumer=="LOW"].value | [0]'

Application Configuration on AWS

Here’s a practical Python example for connecting from AWS Lambda to OCI Autonomous Database:

python

# lambda_function.py
import json

import boto3
import cx_Oracle


def get_db_credentials():
    """Retrieve database credentials from AWS Secrets Manager."""
    client = boto3.client("secretsmanager", region_name="us-east-1")
    # A ClientError propagates so the invocation fails visibly
    response = client.get_secret_value(SecretId="oci-adb-credentials")
    return json.loads(response["SecretString"])


def handler(event, context):
    # Get credentials
    creds = get_db_credentials()

    # Connection string format for Autonomous DB
    dsn = """(description=
        (retry_count=20)(retry_delay=3)
        (address=(protocol=tcps)(port=1522)
        (host=adb.me-dubai-1.oraclecloud.com))
        (connect_data=(service_name=xxx_atp_low.adb.oraclecloud.com))
        (security=(ssl_server_dn_match=yes)))"""

    connection = cx_Oracle.connect(
        user=creds["username"],
        password=creds["password"],
        dsn=dsn,
        encoding="UTF-8",
    )
    try:
        cursor = connection.cursor()
        # Name the columns so the row indexes below stay stable
        cursor.execute(
            "SELECT order_id, customer_id, amount FROM orders "
            "WHERE order_date = TRUNC(SYSDATE)"
        )
        results = [
            {
                "order_id": row[0],
                "customer_id": row[1],
                "amount": float(row[2]),
            }
            for row in cursor
        ]
        cursor.close()
    finally:
        # Always release the connection, even if the query fails
        connection.close()

    return {
        "statusCode": 200,
        "body": json.dumps(results),
    }

For containerized applications on EKS, use a connection pool:

python

# db_pool.py
import os

import cx_Oracle


class OCIDatabasePool:
    _pool = None

    @classmethod
    def get_pool(cls):
        # Create the pool lazily on first use and reuse it afterwards
        if cls._pool is None:
            cls._pool = cx_Oracle.SessionPool(
                user=os.environ['OCI_DB_USER'],
                password=os.environ['OCI_DB_PASSWORD'],
                dsn=os.environ['OCI_DB_DSN'],
                min=2,
                max=10,
                increment=1,
                encoding="UTF-8",
                threaded=True,
                getmode=cx_Oracle.SPOOL_ATTRVAL_WAIT
            )
        return cls._pool

    @classmethod
    def get_connection(cls):
        return cls.get_pool().acquire()

    @classmethod
    def release_connection(cls, connection):
        cls.get_pool().release(connection)

Kubernetes deployment for the application:

yaml

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:v1.0
          ports:
            - containerPort: 8080
          env:
            - name: OCI_DB_USER
              valueFrom:
                secretKeyRef:
                  name: oci-db-credentials
                  key: username
            - name: OCI_DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: oci-db-credentials
                  key: password
            - name: OCI_DB_DSN
              valueFrom:
                configMapKeyRef:
                  name: oci-db-config
                  key: dsn
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10

Data Synchronization Between Clouds

Real multi-cloud architectures need data flowing between clouds. Here are practical patterns:

Pattern 1: Event-Driven Sync with Kafka

Use a managed Kafka service as the bridge:

python

# AWS Lambda producer - sends events to Kafka
import json
import os

from kafka import KafkaProducer

# Created once per Lambda container and reused across invocations
producer = KafkaProducer(
    bootstrap_servers=['kafka-broker-1:9092', 'kafka-broker-2:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    security_protocol='SASL_SSL',
    sasl_mechanism='PLAIN',
    sasl_plain_username=os.environ['KAFKA_USER'],
    sasl_plain_password=os.environ['KAFKA_PASSWORD']
)


def handler(event, context):
    # Process order and send to Kafka for OCI consumption.
    # process_order() is your own business logic, defined elsewhere.
    order_data = process_order(event)
    producer.send(
        'orders-topic',
        key=str(order_data['order_id']).encode(),
        value=order_data
    )
    # Block until the message is actually delivered before returning
    producer.flush()
    return {'statusCode': 200}

OCI side consumer using OCI Functions:

python

# OCI Function consumer
import io
import json

from kafka import KafkaConsumer

MERGE_SQL = """
    MERGE INTO orders o
    USING (SELECT :order_id AS order_id FROM dual) src
    ON (o.order_id = src.order_id)
    WHEN MATCHED THEN
        UPDATE SET amount = :amount, status = :status, updated_at = SYSDATE
    WHEN NOT MATCHED THEN
        INSERT (order_id, customer_id, amount, status, created_at)
        VALUES (:order_id, :customer_id, :amount, :status, SYSDATE)
"""


def handler(ctx, data: io.BytesIO = None):
    consumer = KafkaConsumer(
        'orders-topic',
        bootstrap_servers=['kafka-broker-1:9092'],
        auto_offset_reset='earliest',
        # Commit offsets manually, only after the database commit succeeds
        enable_auto_commit=False,
        group_id='oci-order-processor',
        value_deserializer=lambda x: json.loads(x.decode('utf-8')),
        # Stop iterating once the topic is idle so the function can return
        consumer_timeout_ms=10000
    )
    # get_adb_connection() is assumed to be defined elsewhere in the function
    connection = get_adb_connection()
    cursor = connection.cursor()
    try:
        for message in consumer:
            # message.value must contain the bind names used in MERGE_SQL
            cursor.execute(MERGE_SQL, message.value)
            connection.commit()
            consumer.commit()
    finally:
        cursor.close()
        connection.close()
        consumer.close()
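
One practical wrinkle: the MERGE expects exactly four bind names, and in my experience the Oracle driver rejects a dictionary that carries extra keys. If your Kafka messages contain more fields than the statement uses, trim them first. A minimal sketch; the helper and the field tuple are my own naming, taken from the columns in the MERGE above:

```python
# Bind names used by the MERGE statement
REQUIRED_FIELDS = ("order_id", "customer_id", "amount", "status")


def to_bind_params(order):
    """Keep only the fields the MERGE binds, and fail early if one is missing."""
    missing = [field for field in REQUIRED_FIELDS if field not in order]
    if missing:
        raise ValueError(f"order is missing fields: {missing}")
    return {field: order[field] for field in REQUIRED_FIELDS}


message = {
    "order_id": 42,
    "customer_id": 7,
    "amount": 19.99,
    "status": "NEW",
    "source_region": "us-east-1",  # extra field the MERGE does not use
}
print(to_bind_params(message))
```

Failing on a missing field before touching the database also keeps a malformed message from half-updating a row.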

Pattern 2: Scheduled Batch Sync

For less time-sensitive data, batch synchronization is simpler and more cost-effective. An AWS Step Functions state machine can orchestrate the sync from AWS to OCI:

json

{
  "Comment": "Sync data from AWS to OCI",
  "StartAt": "ExtractFromAWS",
  "States": {
    "ExtractFromAWS": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:extract-data",
      "Next": "UploadToS3"
    },
    "UploadToS3": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:upload-to-s3",
      "Next": "CopyToOCI"
    },
    "CopyToOCI": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:copy-to-oci-bucket",
      "Next": "LoadToADB"
    },
    "LoadToADB": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:load-to-adb",
      "End": true
    }
  }
}

The Lambda function to copy data to OCI Object Storage:

python

# copy_to_oci.py
import boto3
import oci


def handler(event, context):
    # Get file from S3 (read fully into memory, so keep batch files modest)
    s3 = boto3.client('s3')
    s3_object = s3.get_object(
        Bucket=event['bucket'],
        Key=event['key']
    )
    file_content = s3_object['Body'].read()

    # Upload to OCI Object Storage.
    # from_file() reads ~/.oci/config by default, which a Lambda does not
    # have; package a config file with the function or build the config
    # dict from values stored in Secrets Manager.
    config = oci.config.from_file()
    object_storage = oci.object_storage.ObjectStorageClient(config)
    namespace = object_storage.get_namespace().data
    object_storage.put_object(
        namespace_name=namespace,
        bucket_name="data-sync-bucket",
        object_name=event['key'],
        put_object_body=file_content
    )
    return {
        'oci_bucket': 'data-sync-bucket',
        'object_name': event['key']
    }
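
The copy function trusts that what left S3 is what arrived in OCI. For batch loads I like a checksum on the way through. This is a sketch under one assumption: as far as I know, the OCI SDK's put_object accepts an optional content_md5 argument holding the base64-encoded MD5 of the body, so the service can reject a corrupted upload. The helper name is mine.

```python
import base64
import hashlib


def md5_base64(data: bytes) -> str:
    """Base64-encoded MD5 digest, the form used by Content-MD5 headers."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


payload = b"order_id,customer_id,amount\n1,7,19.99\n"
checksum = md5_base64(payload)
print(checksum)
```

In the Lambda you would compute this from file_content and pass it along with the upload. Check the SDK reference for the exact parameter before relying on it.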

Load into Autonomous Database using DBMS_CLOUD:

sql

-- Create credential for OCI Object Storage access
BEGIN
DBMS_CLOUD.CREATE_CREDENTIAL(
credential_name => 'OCI_CRED',
username => 'your_oci_username',
password => 'your_auth_token'
);
END;
/
-- Load data from Object Storage
BEGIN
DBMS_CLOUD.COPY_DATA(
table_name => 'ORDERS_STAGING',
credential_name => 'OCI_CRED',
file_uri_list => 'https://objectstorage.me-dubai-1.oraclecloud.com/n/namespace/b/data-sync-bucket/o/orders_*.csv',
format => JSON_OBJECT(
'type' VALUE 'CSV',
'skipheaders' VALUE '1',
'dateformat' VALUE 'YYYY-MM-DD'
)
);
END;
/
-- Merge staging into production
MERGE INTO orders o
USING orders_staging s
ON (o.order_id = s.order_id)
WHEN MATCHED THEN
UPDATE SET o.amount = s.amount, o.status = s.status
WHEN NOT MATCHED THEN
INSERT (order_id, customer_id, amount, status)
VALUES (s.order_id, s.customer_id, s.amount, s.status);

Identity Federation

Managing identities across clouds is a headache unless you set up proper federation. Here’s how to enable SSO between AWS and OCI using a common identity provider.

Using Azure AD as Common IdP (Yes, a Third Cloud)

This is actually quite common. Many enterprises use Azure AD for identity even if their workloads run elsewhere.

Configure OCI to Trust Azure AD:

bash

# Create Identity Provider in OCI
oci iam identity-provider create-saml2-identity-provider \
--compartment-id $TENANCY_ID \
--name "AzureAD-Federation" \
--description "Federation with Azure AD" \
--product-type "IDCS" \
--metadata-url "https://login.microsoftonline.com/$TENANT_ID/federationmetadata/2007-06/federationmetadata.xml"

Configure AWS to Trust Azure AD:

bash

# Create SAML provider in AWS
aws iam create-saml-provider \
--saml-metadata-document file://azure-ad-metadata.xml \
--name AzureAD-Federation
# Create role for federated users
aws iam create-role \
--role-name AzureAD-Admins \
--assume-role-policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Federated": "arn:aws:iam::123456789:saml-provider/AzureAD-Federation"},
"Action": "sts:AssumeRoleWithSAML",
"Condition": {
"StringEquals": {
"SAML:aud": "https://signin.aws.amazon.com/saml"
}
}
}]
}'

Now your team can use the same Azure AD credentials to access both clouds.

Monitoring Across Clouds

You need unified observability. Here’s a practical approach using Grafana as the common dashboard:

yaml

# docker-compose.yml for centralized Grafana
version: '3.8'
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=secure_password
      - GF_INSTALL_PLUGINS=oci-metrics-datasource
volumes:
  grafana-data:

Configure data sources:

yaml

# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: AWS-CloudWatch
    type: cloudwatch
    access: proxy
    jsonData:
      authType: keys
      defaultRegion: us-east-1
    secureJsonData:
      accessKey: ${AWS_ACCESS_KEY}
      secretKey: ${AWS_SECRET_KEY}
  - name: OCI-Monitoring
    type: oci-metrics-datasource
    access: proxy
    jsonData:
      tenancyOCID: ${OCI_TENANCY_OCID}
      userOCID: ${OCI_USER_OCID}
      region: me-dubai-1
    secureJsonData:
      privateKey: ${OCI_PRIVATE_KEY}

Create a unified dashboard that shows both clouds:

json

{
"title": "Multi-Cloud Overview",
"panels": [
{
"title": "AWS EKS CPU Utilization",
"datasource": "AWS-CloudWatch",
"targets": [{
"namespace": "ContainerInsights",
"metricName": "node_cpu_utilization",
"dimensions": {"ClusterName": "production"}
}]
},
{
"title": "OCI Autonomous DB Sessions",
"datasource": "OCI-Monitoring",
"targets": [{
"namespace": "oci_autonomous_database",
"metric": "CurrentOpenSessionCount",
"resourceGroup": "production-adb"
}]
},
{
"title": "Cross-Cloud Latency",
"datasource": "Prometheus",
"targets": [{
"expr": "histogram_quantile(0.95, rate(cross_cloud_request_duration_seconds_bucket[5m]))"
}]
}
]
}

Cost Management

Multi-cloud cost visibility is challenging. Here’s a practical approach:

python

# cost_aggregator.py
from datetime import datetime, timedelta, timezone

import boto3
import oci


def get_aws_costs(start_date, end_date):
    client = boto3.client('ce')
    response = client.get_cost_and_usage(
        TimePeriod={
            'Start': start_date.strftime('%Y-%m-%d'),
            'End': end_date.strftime('%Y-%m-%d')
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )
    return response['ResultsByTime']


def get_oci_costs(start_date, end_date):
    config = oci.config.from_file()
    usage_api = oci.usage_api.UsageapiClient(config)
    response = usage_api.request_summarized_usages(
        request_summarized_usages_details=oci.usage_api.models.RequestSummarizedUsagesDetails(
            tenant_id=config['tenancy'],
            time_usage_started=start_date,
            time_usage_ended=end_date,
            granularity="DAILY",
            group_by=["service"]
        )
    )
    return response.data.items


def generate_report():
    # The OCI Usage API expects day boundaries at midnight UTC
    end_date = datetime.now(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    start_date = end_date - timedelta(days=30)
    aws_costs = get_aws_costs(start_date, end_date)
    oci_costs = get_oci_costs(start_date, end_date)
    # With GroupBy set, Cost Explorer reports amounts per group
    # and leaves each day's 'Total' empty
    total_aws = sum(
        float(group['Metrics']['UnblendedCost']['Amount'])
        for day in aws_costs
        for group in day['Groups']
    )
    # computed_amount can be None for items without a cost
    total_oci = sum(item.computed_amount or 0 for item in oci_costs)
    # Assumes both accounts bill in USD
    print("30-Day Multi-Cloud Cost Summary")
    print("=" * 40)
    print(f"AWS Total: ${total_aws:,.2f}")
    print(f"OCI Total: ${total_oci:,.2f}")
    print(f"Combined Total: ${total_aws + total_oci:,.2f}")
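
Totals are a start, but what finance usually asks for is a per-service breakdown across both clouds. Once each provider's response is flattened into plain tuples, that's a few lines. A sketch with made-up sample figures; the (cloud, service, amount) record shape is my assumption, not something either API returns directly:

```python
from collections import defaultdict


def combine_costs(records):
    """Sum amounts per (cloud, service) from (cloud, service, amount) records."""
    totals = defaultdict(float)
    for cloud, service, amount in records:
        # Treat a missing amount as zero rather than failing the whole report
        totals[(cloud, service)] += float(amount or 0)
    return dict(totals)


sample = [
    ("aws", "Amazon EKS", 10.5),
    ("aws", "Amazon EKS", 4.5),
    ("oci", "Autonomous Database", 20.0),
    ("oci", "Autonomous Database", None),
]
for (cloud, service), total in sorted(combine_costs(sample).items()):
    print(f"{cloud:4} {service:25} ${total:,.2f}")
```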

Lessons Learned

After running multi-cloud architectures for several years, here’s what I’ve learned:

Network is everything. Invest in proper connectivity upfront. The $500/month you save on VPN versus dedicated connectivity will cost you thousands in debugging performance issues.

Pick one cloud for each workload type. Don’t run the same thing in both clouds. Use OCI for Oracle databases, AWS for its unique services. Avoid the temptation to replicate everything everywhere.

Standardize your tooling. Terraform works on both clouds. Use it. Same for monitoring, logging, and CI/CD. The more consistent your tooling, the less your team has to context-switch.

Document your data flows. Know exactly what data goes where and why. This will save you during security audits and incident response.

Test cross-cloud failures. What happens when the VPN goes down? Can your application degrade gracefully? Find out before your customers do.
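
Degrading gracefully usually means serving something slightly stale instead of an error page. Here's the shape of it as a minimal sketch. Both callables are placeholders for your own code: one that queries the remote database and one that reads a local cache.

```python
def fetch_with_fallback(fetch_live, fetch_cached):
    """Try the cross-cloud call first; fall back to cached data if the link is down."""
    try:
        return fetch_live(), "live"
    except (ConnectionError, TimeoutError):
        return fetch_cached(), "cached"


def link_is_down():
    # Simulates the VPN or interconnect being unavailable
    raise ConnectionError("cross-cloud link unavailable")


data, source = fetch_with_fallback(link_is_down, lambda: [{"order_id": 1}])
print(source, data)
```

Returning the source alongside the data lets the caller show a "data may be delayed" banner rather than pretending nothing is wrong.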

Conclusion

Multi-cloud between OCI and AWS isn’t simple, but it’s absolutely achievable. The key is having clear reasons for using each cloud, solid networking fundamentals, and consistent operational practices.

Start small. Connect one application to one database across clouds. Get that working reliably before expanding. Build your team’s confidence and expertise incrementally.

The organizations that succeed with multi-cloud are the ones that treat it as an architectural choice, not a checkbox. They know exactly why they need both clouds and have designed their systems accordingly.

Regards,
Osama

Designing a Disaster Recovery Strategy on Oracle Cloud Infrastructure: A Practical Guide

Let me be honest with you. Nobody likes thinking about disasters. It’s one of those topics we all know is important, but it often gets pushed to the bottom of the priority list until something goes wrong. And when it does go wrong, it’s usually at 3 AM on a Saturday.

I’ve seen organizations lose days of productivity, thousands of dollars, and sometimes customer trust because they didn’t have a proper disaster recovery plan. The good news? OCI makes disaster recovery achievable without breaking the bank or requiring a dedicated team of engineers.

In this article, I’ll walk you through building a realistic DR strategy on OCI. Not the theoretical stuff you find in whitepapers, but the practical decisions you’ll actually face when setting this up.

Understanding Recovery Objectives

Before we touch any OCI console, we need to talk about two numbers that will drive every decision we make.

Recovery Time Objective (RTO) answers the question: How long can your business survive without this system? If your e-commerce platform goes down, can you afford to be offline for 4 hours? 1 hour? 5 minutes?

Recovery Point Objective (RPO) answers a different question: How much data can you afford to lose? If we restore from a backup taken 2 hours ago, is that acceptable? Or do you need every single transaction preserved?

These aren’t technical questions. They’re business questions. And honestly, the answers might surprise you. I’ve worked with clients who assumed they needed zero RPO for everything, only to realize that most of their systems could tolerate 15-30 minutes of data loss without significant business impact.

Here’s how I typically categorize systems:

Tier | RTO | RPO | Examples
Critical | < 15 min | Near zero | Payment processing, core databases
Important | 1-4 hours | < 1 hour | Customer portals, internal apps
Standard | 4-24 hours | < 24 hours | Dev environments, reporting systems

Once you know your tiers, the technical implementation becomes much clearer.
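
If you have dozens of systems to sort, it helps to make the tiering rule explicit. A rough sketch of the table as code. Two assumptions of mine: "near zero" RPO means under one minute, and a system lands in the stricter tier if either its RTO or its RPO demands it.

```python
from datetime import timedelta

# Assumption: "near zero" RPO is anything under one minute
NEAR_ZERO = timedelta(minutes=1)


def classify_tier(rto, rpo):
    """Map recovery objectives onto the Critical / Important / Standard tiers."""
    if rto < timedelta(minutes=15) or rpo < NEAR_ZERO:
        return "Critical"
    if rto <= timedelta(hours=4) or rpo < timedelta(hours=1):
        return "Important"
    return "Standard"


print(classify_tier(timedelta(minutes=5), timedelta(0)))
print(classify_tier(timedelta(hours=2), timedelta(minutes=30)))
print(classify_tier(timedelta(hours=12), timedelta(hours=12)))
```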

OCI Regions and Availability Domains

OCI’s physical infrastructure is your foundation for DR. Let me explain how it works in plain terms.

Regions are geographically separate data center locations. Think Dubai, Jeddah, Frankfurt, London. They’re far enough apart that a natural disaster affecting one region won’t touch another.

Availability Domains (ADs) are independent data centers within a region. Not all regions have multiple ADs, but the larger ones do. Each AD has its own power, cooling, and networking.

Fault Domains are groupings within an AD that protect against hardware failures. Think of them as different racks or sections of the data center.

For disaster recovery, you’ll typically replicate across regions. For high availability within normal operations, you spread across ADs and fault domains.

Here’s what this looks like in practice:

Primary Region: Dubai (me-dubai-1, single availability domain)
└── Availability Domain 1
    ├── Fault Domain 1: Web servers (set 1), database primary
    ├── Fault Domain 2: Web servers (set 2), database standby
    └── Fault Domain 3: Application servers
DR Region: Jeddah (me-jeddah-1)
└── Full replica (activated during disaster)

Database Disaster Recovery with Data Guard

Let’s start with databases because that’s usually where the most critical data lives. OCI Autonomous Database and Base Database Service both support Data Guard, which handles replication automatically.

For Autonomous Database, enabling DR is surprisingly simple:

bash

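# Create a cross-region standby for Autonomous Database
# Check the exact subcommand and flags against `oci db autonomous-database --help` for your CLI version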
oci db autonomous-database create-cross-region-disaster-recovery-details \
--autonomous-database-id ocid1.autonomousdatabase.oc1.me-dubai-1.xxx \
--disaster-recovery-type BACKUP_BASED \
--remote-disaster-recovery-type SNAPSHOT \
--dr-region-name me-jeddah-1

But here’s where it gets interesting. You have choices:

Backup-Based DR copies backups to the remote region. It’s cheaper but has higher RPO (you might lose any data written since the last backup). Good for Important and Standard tier systems.

Real-Time DR uses Active Data Guard to replicate changes continuously. Near-zero RPO but costs more because you’re running a standby database. Essential for Critical tier systems.

For Base Database Service with Data Guard, you configure it like this:

bash

# Enable Data Guard, using an existing DB system in the DR region as the standby
# Read the admin password from the environment instead of typing it inline
oci db data-guard-association create from-existing-db-system \
--database-id ocid1.database.oc1.me-dubai-1.xxx \
--database-admin-password "$DB_ADMIN_PASSWORD" \
--protection-mode MAXIMUM_PERFORMANCE \
--transport-type ASYNC \
--peer-db-system-id ocid1.dbsystem.oc1.me-jeddah-1.xxx

The protection modes matter:

  • Maximum Performance: Transactions commit without waiting for standby confirmation. Best performance, slight risk of data loss during failover.
  • Maximum Availability: Transactions wait for standby acknowledgment but fall back to Maximum Performance if standby is unreachable.
  • Maximum Protection: Transactions fail if standby is unreachable. Zero data loss, but availability depends on standby.

Most production systems use Maximum Performance or Maximum Availability. Maximum Protection is rare because it can halt your primary if the network between regions has issues.
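
To make the trade-off concrete, here's the behavior of the three modes written out as a tiny decision function. It's purely illustrative: it models the descriptions above, not Data Guard itself, and the function name and return shape are mine.

```python
def commit_outcome(mode, standby_reachable):
    """Return (transaction_commits, zero_data_loss_guaranteed) for a protection mode."""
    if mode == "MAXIMUM_PERFORMANCE":
        # Never waits for the standby, so a failover can lose recent changes
        return True, False
    if mode == "MAXIMUM_AVAILABILITY":
        # Waits for the standby when it can, keeps committing when it cannot
        return True, standby_reachable
    if mode == "MAXIMUM_PROTECTION":
        # Refuses to commit without standby confirmation
        return standby_reachable, True
    raise ValueError(f"unknown protection mode: {mode}")


for mode in ("MAXIMUM_PERFORMANCE", "MAXIMUM_AVAILABILITY", "MAXIMUM_PROTECTION"):
    print(mode, commit_outcome(mode, standby_reachable=False))
```

The last line of output is the reason Maximum Protection is rare: with the standby unreachable, the primary stops accepting transactions.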

Compute and Application Layer DR

Databases are just one piece. Your application servers, load balancers, and supporting infrastructure also need DR planning.

Option 1: Pilot Light

This is my favorite approach for most organizations. You keep a minimal footprint running in the DR region, just enough to start recovery quickly.

hcl

# Terraform for pilot light infrastructure in DR region
# Minimal compute that can be scaled up during disaster
resource "oci_core_instance" "dr_pilot" {
availability_domain = data.oci_identity_availability_domain.dr_ad.name
compartment_id = var.compartment_id
shape = "VM.Standard.E4.Flex"
shape_config {
ocpus = 1 # Minimal during normal ops
memory_in_gbs = 8
}
display_name = "dr-pilot-instance"
source_details {
source_type = "image"
source_id = var.application_image_id
}
metadata = {
ssh_authorized_keys = var.ssh_public_key
user_data = base64encode(file("./scripts/pilot-light-startup.sh"))
}
}
# Load balancer ready but with no backends attached
resource "oci_load_balancer" "dr_lb" {
compartment_id = var.compartment_id
display_name = "dr-load-balancer"
shape = "flexible"
shape_details {
minimum_bandwidth_in_mbps = 10
maximum_bandwidth_in_mbps = 100
}
subnet_ids = [oci_core_subnet.dr_public_subnet.id]
}

The startup script prepares the instance but leaves the application stopped, so the only ongoing cost is the minimal compute shape:

bash

#!/bin/bash
# pilot-light-startup.sh
set -e
# Install application but don't start it
yum install -y application-server
# Make sure the target directory exists before downloading into it
mkdir -p /opt/app
# Pull latest configuration from Object Storage
# (instance principal auth, since cloud-init has no CLI config file)
oci os object get \
  --auth instance_principal \
  --bucket-name dr-config-bucket \
  --name app-config.tar.gz \
  --file /opt/app/config.tar.gz
tar -xzf /opt/app/config.tar.gz -C /opt/app/
# Leave application stopped until failover activation
echo "Pilot light instance ready. Application not started."

Option 2: Warm Standby

For systems that need faster recovery, you run a scaled-down version of your production environment continuously:

hcl

# Warm standby with reduced capacity
resource "oci_core_instance_pool" "dr_app_pool" {
  compartment_id            = var.compartment_id
  instance_configuration_id = oci_core_instance_configuration.app_config.id
  placement_configurations {
    availability_domain = data.oci_identity_availability_domain.dr_ad.name
    primary_subnet_id   = oci_core_subnet.dr_app_subnet.id
  }
  size         = 2 # Production runs 6, DR runs 2
  display_name = "dr-app-pool"
}
# Autoscaling policy to expand during failover
resource "oci_autoscaling_auto_scaling_configuration" "dr_scaling" {
  compartment_id = var.compartment_id
  auto_scaling_resources {
    id   = oci_core_instance_pool.dr_app_pool.id
    type = "instancePool"
  }
  policies {
    display_name = "failover-scale-up"
    policy_type  = "threshold"
    # Threshold policies need capacity bounds for the pool
    capacity {
      initial = 2
      min     = 2
      max     = 6 # Same size as production
    }
    rules {
      display_name = "scale-out-on-cpu"
      action {
        type  = "CHANGE_COUNT_BY"
        value = 4 # Add 4 instances to match production
      }
      metric {
        metric_type = "CPU_UTILIZATION"
        threshold {
          operator = "GT"
          value    = 70
        }
      }
    }
  }
}

Object Storage Replication

Your files, backups, and static assets need protection too. OCI Object Storage supports cross-region replication:

bash

# Create replication policy
oci os replication create-replication-policy \
--bucket-name production-assets \
--destination-bucket dr-assets \
--destination-region me-jeddah-1 \
--name "prod-to-dr-replication"

One thing people often miss: replication is asynchronous. For critical files that absolutely cannot be lost, consider writing to both regions from your application:

python

# Python example: Writing to both regions
import oci


def upload_critical_file(file_path, object_name):
    # Each profile in ~/.oci/config points at a different region
    config_primary = oci.config.from_file(profile_name="PRIMARY")
    config_dr = oci.config.from_file(profile_name="DR")
    primary_client = oci.object_storage.ObjectStorageClient(config_primary)
    dr_client = oci.object_storage.ObjectStorageClient(config_dr)

    with open(file_path, 'rb') as f:
        file_content = f.read()

    # Write to primary
    primary_client.put_object(
        namespace_name="your-namespace",
        bucket_name="critical-files",
        object_name=object_name,
        put_object_body=file_content
    )
    # Write to DR region
    dr_client.put_object(
        namespace_name="your-namespace",
        bucket_name="critical-files-dr",
        object_name=object_name,
        put_object_body=file_content
    )
    print(f"File {object_name} written to both regions")
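
The weak spot in any dual write is the second write failing after the first has succeeded. At minimum, retry the DR write and fail loudly if it never lands. A minimal sketch with the two writes passed in as callables, so it doesn't depend on any SDK. The function name, retry count, and backoff are my own choices.

```python
import time


def dual_write(write_primary, write_dr, retries=3, backoff_s=1.0):
    """Write to primary, then retry the DR write. Returns the attempt that succeeded."""
    write_primary()
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            write_dr()
            return attempt
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            time.sleep(backoff_s * attempt)
    raise RuntimeError("DR write failed; regions are out of sync") from last_error


attempts = {"dr": 0}


def flaky_dr_write():
    # Fails once, then succeeds, to show the retry path
    attempts["dr"] += 1
    if attempts["dr"] < 2:
        raise IOError("transient network error")


print(dual_write(lambda: None, flaky_dr_write, backoff_s=0))
```

When the retries run out, the exception is your signal to queue the object for later reconciliation instead of silently moving on.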

DNS and Traffic Management

When disaster strikes, you need to redirect users to your DR region. OCI DNS with Traffic Management makes this manageable:

hcl

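# Traffic Management Steering Policy
resource "oci_dns_steering_policy" "failover" {
compartment_id = var.compartment_id
display_name = "app-failover-policy"
template = "FAILOVER"
# The HEALTH rule needs a monitor to supply health data
health_check_monitor_id = oci_health_checks_http_monitor.primary_health.id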
# Primary region answers
answers {
name = "primary"
rtype = "A"
rdata = var.primary_lb_ip
pool = "primary-pool"
is_disabled = false
}
# DR region answers
answers {
name = "dr"
rtype = "A"
rdata = var.dr_lb_ip
pool = "dr-pool"
is_disabled = false
}
rules {
rule_type = "FILTER"
}
rules {
rule_type = "HEALTH"
}
rules {
rule_type = "PRIORITY"
default_answer_data {
answer_condition = "answer.pool == 'primary-pool'"
value = 1
}
default_answer_data {
answer_condition = "answer.pool == 'dr-pool'"
value = 2
}
}
}
# Health check for primary region
resource "oci_health_checks_http_monitor" "primary_health" {
compartment_id = var.compartment_id
display_name = "primary-region-health"
interval_in_seconds = 30
targets = [var.primary_lb_ip]
protocol = "HTTPS"
port = 443
path = "/health"
timeout_in_seconds = 10
}
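
If the rule chain feels abstract, this is roughly what a failover policy does on every DNS query: drop disabled answers, drop unhealthy ones, then hand out the highest-priority survivor. A simplified simulation, not the actual OCI implementation; the IP addresses are documentation examples.

```python
def pick_answer(answers, healthy):
    """Return the rdata of the best available answer, or None if nothing is usable."""
    candidates = [
        a for a in answers
        if not a["is_disabled"] and healthy.get(a["rdata"], False)
    ]
    if not candidates:
        return None
    # Lower priority value wins, as in the PRIORITY rule
    return min(candidates, key=lambda a: a["priority"])["rdata"]


answers = [
    {"name": "primary", "rdata": "203.0.113.10", "priority": 1, "is_disabled": False},
    {"name": "dr", "rdata": "198.51.100.20", "priority": 2, "is_disabled": False},
]

# Normal operations: both regions healthy
print(pick_answer(answers, {"203.0.113.10": True, "198.51.100.20": True}))
# Primary region fails its health check
print(pick_answer(answers, {"203.0.113.10": False, "198.51.100.20": True}))
```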

The Failover Runbook

All this infrastructure means nothing without a clear process. Here’s a realistic runbook:

Automated Detection

python

# OCI Function to detect and alert on regional issues
import io
import json

import oci
from fdk import response


def handler(ctx, data: io.BytesIO = None):
    signer = oci.auth.signers.get_resource_principals_signer()
    monitoring_client = oci.monitoring.MonitoringClient(config={}, signer=signer)

    # Check critical metrics
    metrics = monitoring_client.summarize_metrics_data(
        compartment_id="ocid1.compartment.xxx",
        summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
            namespace="oci_lbaas",
            query='UnHealthyBackendServers[5m].sum() > 2'
        )
    )
    degraded = bool(metrics.data)

    if degraded:
        # Trigger alert
        notifications_client = oci.ons.NotificationDataPlaneClient(config={}, signer=signer)
        notifications_client.publish_message(
            topic_id="ocid1.onstopic.xxx",
            message_details=oci.ons.models.MessageDetails(
                title="DR Alert: Primary Region Degraded",
                body="Multiple backend servers unhealthy. Consider initiating failover."
            )
        )

    # Return a small JSON summary instead of the raw SDK response object
    return response.Response(
        ctx,
        response_data=json.dumps({"degraded": degraded}),
        headers={"Content-Type": "application/json"}
    )

Manual Failover Steps

bash

#!/bin/bash
# failover.sh - Execute with caution
set -e
echo "=== OCI DISASTER RECOVERY FAILOVER ==="
echo "This will switch production traffic to the DR region."
read -p "Type 'FAILOVER' to confirm: " confirmation
if [ "$confirmation" != "FAILOVER" ]; then
echo "Failover cancelled."
exit 1
fi
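echo "[1/5] Initiating database switchover..."
# Switchover assumes the primary is still reachable; if it is down, the
# equivalent 'failover' operation has to be run against the standby instead.
oci db data-guard-association switchover \
--database-id "$PRIMARY_DB_ID" \
--data-guard-association-id "$DG_ASSOCIATION_ID" \
--database-admin-password "$DB_ADMIN_PASSWORD"  # hypothetical variable; supply it securely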
echo "[2/5] Scaling up DR compute instances..."
oci compute instance-pool update \
--instance-pool-id $DR_INSTANCE_POOL_ID \
--size 6
echo "[3/5] Waiting for instances to be running..."
sleep 120
echo "[4/5] Updating load balancer backends..."
oci lb backend-set update \
--load-balancer-id $DR_LB_ID \
--backend-set-name "app-backend-set" \
--backends file://dr-backends.json
echo "[5/5] Updating DNS steering policy..."
oci dns steering-policy update \
--steering-policy-id $STEERING_POLICY_ID \
--rules file://failover-rules.json
echo "=== FAILOVER COMPLETE ==="
echo "Verify application at: https://app.example.com"
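
Step 3 of the script waits a fixed 120 seconds, which is either too long or not long enough depending on the day. A polling loop with a timeout is one alternative. The sketch below is a generic helper in Python; `wait_until` and the `check` callable are hypothetical names, and in practice `check` would wrap whatever call tells you the instance pool has reached the running state.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float = 5.0) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in check that succeeds on the third poll; a real check would
# query the instance pool state instead.
attempts = {"n": 0}


def pool_is_running() -> bool:
    attempts["n"] += 1
    return attempts["n"] >= 3


print(wait_until(pool_is_running, timeout_s=2.0, interval_s=0.1))
```

A False return is the signal to stop the runbook and investigate rather than continue to the load balancer step.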

Testing Your DR Plan

Here’s the uncomfortable truth: a DR plan that hasn’t been tested is just documentation. You need to actually run failovers.

I recommend this schedule:

  • Monthly: Tabletop exercise. Walk through the runbook with your team without actually executing anything.
  • Quarterly: Partial failover. Switch one non-critical component to DR and back.
  • Annually: Full DR test. Fail over completely and run production from the DR region for at least 4 hours.

Document everything:

markdown

## DR Test Report - Q4 2025
**Date**: December 15, 2025
**Participants**: Ahmed, Sarah, Mohammed
**Test Type**: Full failover
### Timeline
- 09:00 - Initiated failover sequence
- 09:03 - Database switchover complete
- 09:08 - Compute instances running in DR
- 09:12 - DNS propagation confirmed
- 09:15 - Application accessible from DR region
### Issues Discovered
1. SSL certificate for DR load balancer had expired
   - Resolution: Renewed certificate, added calendar reminder
2. One microservice had hardcoded primary region endpoint
   - Resolution: Updated to use DNS name instead
### RTO Achieved
15 minutes (Target: 30 minutes) ✓
### RPO Achieved
< 30 seconds of transaction loss ✓
### Action Items
- [ ] Automate certificate renewal monitoring
- [ ] Audit all services for hardcoded endpoints
- [ ] Update runbook with SSL verification step
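
The RTO figure in a report like this can be derived from the timeline instead of being worked out by hand. Here is a minimal sketch, assuming timeline entries in the "HH:MM - description" form used above and a test that starts and ends on the same day; `rto_minutes` is a hypothetical helper name.

```python
from datetime import datetime

TIMELINE = """\
- 09:00 - Initiated failover sequence
- 09:03 - Database switchover complete
- 09:08 - Compute instances running in DR
- 09:12 - DNS propagation confirmed
- 09:15 - Application accessible from DR region
"""


def rto_minutes(timeline: str) -> int:
    """Minutes between the first and last timeline entries (same day assumed)."""
    times = []
    for line in timeline.splitlines():
        # Drop the leading list marker, then keep the timestamp before " - "
        line = line.strip().lstrip("-").strip()
        if not line:
            continue
        stamp = line.split(" - ", 1)[0]
        times.append(datetime.strptime(stamp, "%H:%M"))
    return int((times[-1] - times[0]).total_seconds() // 60)


achieved = rto_minutes(TIMELINE)
target = 30
print(f"RTO achieved: {achieved} min (target {target} min):",
      "PASS" if achieved <= target else "FAIL")
```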

Cost Optimization

DR doesn’t have to be expensive. Here are real strategies I use:

Right-size your DR tier: Not everything needs instant failover. Be honest about what’s truly critical.

Use preemptible instances for testing: When you’re just validating your DR setup works, you don’t need full-price compute:

hcl

resource "oci_core_instance" "dr_test" {
# ... other config ...
preemptible_instance_config {
preemption_action {
type = "TERMINATE"
preserve_boot_volume = false
}
}
}

Schedule DR resources: If you’re running warm standby, scale it down during your off-peak hours:

bash

# Scale down at night, scale up in morning
# Cron job or OCI Scheduler
0 22 * * * oci compute instance-pool update --instance-pool-id $POOL_ID --size 1
0 6 * * * oci compute instance-pool update --instance-pool-id $POOL_ID --size 2
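
Two cron entries only act at the moment they fire, so a missed run leaves the pool at the wrong size until the next one. One alternative is a small job that runs every few minutes and sets whatever size is right for the current hour. A minimal sketch of that calculation, assuming the 22:00 and 06:00 boundaries and the sizes from the cron lines above (`desired_pool_size` is a hypothetical helper; applying the result is still done with `oci compute instance-pool update`):

```python
def desired_pool_size(hour: int, day_size: int = 2, night_size: int = 1,
                      scale_up_hour: int = 6, scale_down_hour: int = 22) -> int:
    """Return the warm-standby pool size for a given hour of the day (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    in_daytime = scale_up_hour <= hour < scale_down_hour
    return day_size if in_daytime else night_size


for hour in (5, 6, 21, 22):
    print(hour, desired_pool_size(hour))
```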

Leverage reserved capacity: If you’re committed to DR, reserved capacity in your DR region is cheaper than on-demand.

Implementing GitOps with ArgoCD on Amazon EKS

GitOps has emerged as the dominant paradigm for managing Kubernetes deployments at scale. By treating Git as the single source of truth for declarative infrastructure and applications, teams achieve auditability, rollback capabilities, and consistent deployments across environments.

In this article, we’ll build a production-grade GitOps pipeline using ArgoCD on Amazon EKS, covering cluster setup, ArgoCD installation, application deployment patterns, secrets management, and multi-environment promotion strategies.

Why GitOps?

Traditional CI/CD pipelines push changes to clusters. GitOps inverts this model: the cluster pulls its desired state from Git. This approach provides:

  • Auditability: Every change is a Git commit with author, timestamp, and approval history
  • Declarative Configuration: The entire system state is version-controlled
  • Drift Detection: ArgoCD continuously reconciles actual vs. desired state
  • Simplified Rollbacks: Revert a deployment by reverting a commit
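
To make the drift detection idea concrete, here is a toy sketch in Python that compares a desired state (what Git declares) with a live state (what the cluster reports). It illustrates the concept only; ArgoCD's actual diffing is considerably more involved, and `find_drift` and the sample dictionaries are hypothetical.

```python
from typing import Any


def find_drift(desired: dict[str, Any], live: dict[str, Any], prefix: str = "") -> list[str]:
    """Return paths where the live state differs from the desired state.

    Only fields present in `desired` are compared, so extra fields set by
    the cluster (status, defaults) are ignored.
    """
    drift = []
    for key, want in desired.items():
        path = f"{prefix}.{key}" if prefix else key
        have = live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            drift.extend(find_drift(want, have, path))
        elif want != have:
            drift.append(f"{path}: desired={want!r} live={have!r}")
    return drift


desired = {"spec": {"replicas": 5, "image": "api-service:v1.2.3"}}
live = {"spec": {"replicas": 3, "image": "api-service:v1.2.3"}, "status": {"ready": 3}}
print(find_drift(desired, live))
```

With self-healing enabled, a non-empty result like this is what would trigger a sync back to the Git-declared state.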

Architecture Overview

The architecture consists of:

  • Amazon EKS cluster running ArgoCD
  • GitHub repository containing Kubernetes manifests
  • AWS Secrets Manager for sensitive configuration
  • External Secrets Operator for secret synchronization
  • ApplicationSets for multi-environment deployments

Step 1: EKS Cluster Setup

First, create an EKS cluster with the necessary add-ons:

eksctl create cluster \
  --name gitops-cluster \
  --version 1.29 \
  --region us-east-1 \
  --nodegroup-name workers \
  --node-type t3.large \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 5 \
  --managed

Enable OIDC provider for IAM Roles for Service Accounts (IRSA):

eksctl utils associate-iam-oidc-provider \
  --cluster gitops-cluster \
  --region us-east-1 \
  --approve

Step 2: Install ArgoCD

Create the ArgoCD namespace and install using the HA manifest:

kubectl create namespace argocd

kubectl apply -n argocd -f \
  https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

For production, configure ArgoCD with an AWS Application Load Balancer:

# argocd-server-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server
  namespace: argocd
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:ACCOUNT:certificate/CERT-ID
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    alb.ingress.kubernetes.io/backend-protocol: HTTPS
spec:
  rules:
  - host: argocd.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: argocd-server
            port:
              number: 443

Retrieve the initial admin password:

kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d

Base Deployment

# apps/base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      serviceAccountName: api-service
      containers:
      - name: api
        image: api-service:latest
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 3
        env:
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: api-secrets
              key: db-host

Environment Overlay (Production)

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: production

resources:
- ../../base

images:
- name: api-service
  newName: 123456789.dkr.ecr.us-east-1.amazonaws.com/api-service
  newTag: v1.2.3

patches:
- path: patches/replicas.yaml

commonLabels:
  environment: production

# patches/replicas.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 5

Step 4: Secrets Management with External Secrets Operator

Never store secrets in Git. Use External Secrets Operator to synchronize from AWS Secrets Manager:

helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets \
  -n external-secrets --create-namespace

Create an IAM role for the operator:

eksctl create iamserviceaccount \
  --cluster=gitops-cluster \
  --namespace=external-secrets \
  --name=external-secrets \
  --attach-policy-arn=arn:aws:iam::aws:policy/SecretsManagerReadWrite \
  --approve

Configure the ClusterSecretStore:

apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets
            namespace: external-secrets

Define an ExternalSecret for your application:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: api-secrets
    creationPolicy: Owner
  data:
  - secretKey: db-host
    remoteRef:
      key: prod/api-service/database
      property: host
  - secretKey: db-password
    remoteRef:
      key: prod/api-service/database
      property: password

Step 5: ArgoCD ApplicationSet for Multi-Environment

ApplicationSets enable templated, multi-environment deployments from a single definition:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: api-service
  namespace: argocd
spec:
  generators:
  - list:
      elements:
      - env: dev
        cluster: https://kubernetes.default.svc
        namespace: development
      - env: staging
        cluster: https://kubernetes.default.svc
        namespace: staging
      - env: prod
        cluster: https://prod-cluster.example.com
        namespace: production
  template:
    metadata:
      name: 'api-service-{{env}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/org/gitops-repo.git
        targetRevision: HEAD
        path: 'apps/overlays/{{env}}'
      destination:
        server: '{{cluster}}'
        namespace: '{{namespace}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
        - CreateNamespace=true
        retry:
          limit: 5
          backoff:
            duration: 5s
            factor: 2
            maxDuration: 3m
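
It helps to see what the list generator actually produces. The sketch below mimics the expansion in plain Python: every element in the list is substituted into the template, yielding one Application per environment. This is a simplified stand-in for ApplicationSet templating, and `render` is a hypothetical helper.

```python
ELEMENTS = [
    {"env": "dev", "cluster": "https://kubernetes.default.svc", "namespace": "development"},
    {"env": "staging", "cluster": "https://kubernetes.default.svc", "namespace": "staging"},
    {"env": "prod", "cluster": "https://prod-cluster.example.com", "namespace": "production"},
]

TEMPLATE = {
    "name": "api-service-{{env}}",
    "path": "apps/overlays/{{env}}",
    "server": "{{cluster}}",
    "namespace": "{{namespace}}",
}


def render(template: dict[str, str], params: dict[str, str]) -> dict[str, str]:
    """Substitute every {{key}} placeholder with the matching parameter."""
    rendered = {}
    for field, value in template.items():
        for key, replacement in params.items():
            value = value.replace("{{" + key + "}}", replacement)
        rendered[field] = value
    return rendered


applications = [render(TEMPLATE, element) for element in ELEMENTS]
for app in applications:
    print(app["name"], "->", app["server"], app["namespace"], app["path"])
```

Note that the rendered path for production is apps/overlays/prod, so each overlay directory has to be named after its `env` value.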

Step 6: Sync Waves and Hooks

Control deployment ordering using sync waves:

# Deploy secrets first (wave -1)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-secrets
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
# ...

# Deploy ConfigMaps second (wave 0)
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-config
  annotations:
    argocd.argoproj.io/sync-wave: "0"
# ...

# Deploy application third (wave 1)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  annotations:
    argocd.argoproj.io/sync-wave: "1"
# ...
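
The ordering these annotations produce can be sketched in a few lines of Python: resources are grouped by wave number and applied lowest wave first, with unannotated resources landing in wave 0. This is a simplified model (ArgoCD also considers sync phases and resource kinds), and the `Service` entry is a hypothetical addition to show the default.

```python
from itertools import groupby

WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave"

resources = [
    {"kind": "Deployment", "name": "api-service", "annotations": {WAVE_ANNOTATION: "1"}},
    {"kind": "ExternalSecret", "name": "api-secrets", "annotations": {WAVE_ANNOTATION: "-1"}},
    {"kind": "ConfigMap", "name": "api-config", "annotations": {WAVE_ANNOTATION: "0"}},
    {"kind": "Service", "name": "api-service", "annotations": {}},  # no annotation: wave 0
]


def wave_of(resource: dict) -> int:
    """Wave number from the annotation; resources without one default to 0."""
    return int(resource["annotations"].get(WAVE_ANNOTATION, "0"))


def plan(items: list[dict]) -> list[tuple[int, list[str]]]:
    """Group resources into waves, lowest wave first."""
    ordered = sorted(items, key=wave_of)
    return [
        (wave, [f'{r["kind"]}/{r["name"]}' for r in group])
        for wave, group in groupby(ordered, key=wave_of)
    ]


for wave, names in plan(resources):
    print(wave, names)
```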

Add a pre-sync hook for database migrations:

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: api-service:v1.2.3
        command: ["./migrate", "--apply"]
      restartPolicy: Never
  backoffLimit: 3

Step 7: Notifications and Monitoring

Configure ArgoCD notifications to Slack:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.slack: |
    token: $slack-token
  template.app-sync-status: |
    message: |
      Application {{.app.metadata.name}} sync status: {{.app.status.sync.status}}
      Health: {{.app.status.health.status}}
  trigger.on-sync-failed: |
    - when: app.status.operationState != nil and app.status.operationState.phase in ['Error', 'Failed']
      send: [app-sync-status]
  subscriptions: |
    - recipients:
      - slack:deployments
      triggers:
      - on-sync-failed

Production Best Practices

Repository Access

Use deploy keys with read-only access:

apiVersion: v1
kind: Secret
metadata:
  name: gitops-repo
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  type: git
  url: git@github.com:org/gitops-repo.git
  sshPrivateKey: |
    -----BEGIN OPENSSH PRIVATE KEY-----
    ...
    -----END OPENSSH PRIVATE KEY-----

Resource Limits for ArgoCD

apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  template:
    spec:
      containers:
      - name: argocd-repo-server
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 2
            memory: 2Gi

RBAC Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    p, role:developer, applications, get, */*, allow
    p, role:developer, applications, sync, dev/*, allow
    p, role:ops, applications, *, */*, allow
    g, dev-team, role:developer
    g, ops-team, role:ops
  policy.default: role:readonly
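
To sanity-check a policy like this before committing it, I find it useful to reason about it as data. The sketch below is a simplified evaluator in Python that uses glob matching on the action and on the `project/application` object. It ignores `policy.default` and deny rules and is not ArgoCD's actual matcher, so treat it as a thinking aid; `parse` and `is_allowed` are hypothetical helpers.

```python
from fnmatch import fnmatch

POLICY_CSV = """\
p, role:developer, applications, get, */*, allow
p, role:developer, applications, sync, dev/*, allow
p, role:ops, applications, *, */*, allow
g, dev-team, role:developer
g, ops-team, role:ops
"""


def parse(policy_csv: str):
    """Split policy.csv into permission rules and group-to-role bindings."""
    rules, bindings = [], {}
    for line in policy_csv.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if fields[0] == "p":
            rules.append(tuple(fields[1:]))
        elif fields[0] == "g":
            bindings.setdefault(fields[1], set()).add(fields[2])
    return rules, bindings


def is_allowed(group: str, resource: str, action: str, obj: str) -> bool:
    """True if any rule bound to the group's roles allows the request."""
    rules, bindings = parse(POLICY_CSV)
    roles = bindings.get(group, set())
    return any(
        role in roles
        and res == resource
        and fnmatch(action, act)
        and fnmatch(obj, pattern)
        and effect == "allow"
        for role, res, act, pattern, effect in rules
    )


print(is_allowed("dev-team", "applications", "sync", "dev/api-service"))
print(is_allowed("dev-team", "applications", "sync", "prod/api-service"))
```

The second call shows the intent of the policy: developers can view everything but can only sync applications in the dev project.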

Enjoy
Osama

Deep Dive into Oracle Kubernetes Engine Security and Networking in Production

Oracle Kubernetes Engine is often introduced as a managed Kubernetes service, but its real strength only becomes clear when you operate it in production. OKE tightly integrates with OCI networking, identity, and security services, which gives you a very different operational model compared to other managed Kubernetes platforms.

This article walks through OKE from a production perspective, focusing on security boundaries, networking design, ingress exposure, private access, and mutual TLS. The goal is not to explain Kubernetes basics, but to explain how OKE behaves when you run regulated, enterprise workloads.

Understanding the OKE Networking Model

OKE does not abstract networking away from you. Every cluster is deeply tied to OCI VCN constructs.

Core Components

An OKE cluster consists of:

  • A managed Kubernetes control plane
  • Worker nodes running in OCI subnets
  • OCI networking primitives controlling traffic flow

Key OCI resources involved:

  • Virtual Cloud Network
  • Subnets for control plane and workers
  • Network Security Groups
  • Route tables
  • OCI Load Balancers

Unlike some platforms, security in OKE is enforced at multiple layers simultaneously.

Worker Node and Pod Networking

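OKE supports VCN-native pod networking, in which pods receive IPs directly from a VCN subnet through the OCI CNI plugin (the alternative is the flannel overlay). The rest of this section assumes VCN-native networking.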

What this means in practice

  • Pods are first-class citizens on the VCN
  • Pod IPs are routable within the VCN
  • Network policies and OCI NSGs both apply

Example subnet design:

VCN: 10.0.0.0/16

Worker Subnet: 10.0.10.0/24
Load Balancer Subnet: 10.0.20.0/24
Private Endpoint Subnet: 10.0.30.0/24

This design allows you to:

  • Keep workers private
  • Expose only ingress through OCI Load Balancer
  • Control east-west traffic using Kubernetes NetworkPolicies and OCI NSGs together
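
Before creating anything, a subnet plan like this can be checked with Python's standard `ipaddress` module. The sketch below verifies that each subnet sits inside the VCN and that none overlap; the subnet names and the `validate` helper are hypothetical. The last line estimates usable addresses in the worker subnet, which matters if pods draw their IPs from it, given that OCI reserves three addresses per subnet.

```python
import ipaddress
from itertools import combinations

VCN = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "worker": ipaddress.ip_network("10.0.10.0/24"),
    "load-balancer": ipaddress.ip_network("10.0.20.0/24"),
    "private-endpoint": ipaddress.ip_network("10.0.30.0/24"),
}


def validate(vcn, subnets) -> list[str]:
    """Return a list of problems: subnets outside the VCN or overlapping."""
    problems = []
    for name, net in subnets.items():
        if not net.subnet_of(vcn):
            problems.append(f"{name} ({net}) is outside {vcn}")
    for (a, net_a), (b, net_b) in combinations(subnets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} overlaps {b}")
    return problems


print(validate(VCN, SUBNETS) or "layout is consistent")
# Addresses left for nodes and pods after the three OCI-reserved ones
print(SUBNETS["worker"].num_addresses - 3)
```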

Security Boundaries in OKE

Security in OKE is layered by design.

Layer 1: OCI IAM and Compartments

OKE clusters live inside OCI compartments. IAM policies control:

  • Who can create or modify clusters
  • Who can access worker nodes
  • Who can manage load balancers and subnets

Example IAM policy snippet:

Allow group OKE-Admins to manage cluster-family in compartment OKE-PROD
Allow group OKE-Admins to manage virtual-network-family in compartment OKE-PROD

This separation is critical for regulated environments.

Layer 2: Network Security Groups

Network Security Groups act as virtual firewalls at the VNIC level.

Typical NSG rules:

  • Allow node-to-node communication
  • Allow ingress from load balancer subnet only
  • Block all public inbound traffic

Example inbound NSG rule:

Source: 10.0.20.0/24
Protocol: TCP
Port: 443

This ensures only the OCI Load Balancer can reach your ingress controller.

Layer 3: Kubernetes Network Policies

NetworkPolicies control pod-level traffic.

Example policy allowing traffic only from ingress namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: app-prod
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              role: ingress

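Because the empty podSelector selects every pod in app-prod, any ingress that does not originate from a namespace labeled role: ingress is denied, which blocks lateral movement into these pods. Egress is not restricted by this policy.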
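
The matching rule is easy to get wrong, so here is a small Python model of it. It only handles `namespaceSelector.matchLabels` peers, which is all the policy above uses, and it is a simplified illustration rather than how a CNI plugin evaluates policies; `ingress_allowed` is a hypothetical helper.

```python
POLICY = {
    "metadata": {"name": "allow-from-ingress", "namespace": "app-prod"},
    "spec": {
        "podSelector": {},
        "ingress": [
            {"from": [{"namespaceSelector": {"matchLabels": {"role": "ingress"}}}]}
        ],
    },
}


def ingress_allowed(policy: dict, source_namespace_labels: dict[str, str]) -> bool:
    """Check whether traffic from a namespace passes the policy's ingress rules.

    Only namespaceSelector.matchLabels peers are modelled here.
    """
    for rule in policy["spec"].get("ingress", []):
        for peer in rule.get("from", []):
            if "namespaceSelector" not in peer:
                continue  # other peer types are out of scope for this sketch
            wanted = peer["namespaceSelector"].get("matchLabels", {})
            if all(source_namespace_labels.get(k) == v for k, v in wanted.items()):
                return True
    return False


print(ingress_allowed(POLICY, {"role": "ingress"}))
print(ingress_allowed(POLICY, {"role": "batch"}))
```

A practical consequence: the ingress controller's namespace must actually carry the `role: ingress` label, or its traffic is dropped along with everything else.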

Ingress Design in OKE

OKE integrates natively with OCI Load Balancer.

Public vs Private Ingress

You can deploy ingress in two modes:

  • Public Load Balancer
  • Internal Load Balancer

For production workloads, private ingress is strongly recommended.

Example service annotation for private ingress:

service.beta.kubernetes.io/oci-load-balancer-internal: "true"
service.beta.kubernetes.io/oci-load-balancer-subnet1: ocid1.subnet.oc1..

This ensures the load balancer has no public IP.

Private Access to the Cluster Control Plane

OKE supports private API endpoints.

When enabled:

  • The Kubernetes API is accessible only from the VCN
  • No public endpoint exists

This is critical for Zero Trust environments.

Operational impact:

  • kubectl access requires VPN, Bastion, or OCI Cloud Shell inside the VCN
  • CI/CD runners must have private connectivity

This dramatically reduces the attack surface.

Mutual TLS Inside OKE

TLS termination at ingress is not enough for sensitive workloads. Many enterprises require mTLS between services.

Typical mTLS Architecture

  • TLS termination at ingress
  • Internal mTLS between services
  • Certificate management via Vault or cert-manager

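Example cert-manager issuer using a Vault PKI backend (cert-manager's vault issuer type targets the HashiCorp Vault API, so treat the server URL below as a placeholder for a Vault-compatible endpoint):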

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: oci-vault-issuer
spec:
  vault:
    server: https://vault.oci.oraclecloud.com
    path: pki/sign/oke

Each service receives:

  • Its own certificate
  • Short-lived credentials
  • Automatic rotation

Traffic Flow Example

End-to-end request path:

  1. Client connects to OCI Load Balancer
  2. Load Balancer forwards traffic to NGINX Ingress
  3. Ingress enforces TLS and headers
  4. Service-to-service traffic uses mTLS
  5. NetworkPolicy restricts lateral movement
  6. NSGs enforce VCN-level boundaries

Every hop is authenticated and encrypted.


Observability and Security Visibility

OKE integrates with:

  • OCI Logging
  • OCI Flow Logs
  • Kubernetes audit logs

This allows:

  • Tracking ingress traffic
  • Detecting unauthorized access attempts
  • Correlating pod-level events with network flows

Regards
Osama

Basic Guide to Build a Production-Architecture on OCI

1. Why OCI for Modern Architecture?

Many architects underestimate how much OCI has matured. Today, OCI offers:

  • Low-latency networking with deterministic performance.
  • Flexible compute shapes (standard, dense I/O, high memory).
  • A Kubernetes service (OKE) with enterprise-level resilience.
  • Cloud-native storage (Block, Object, File).
  • A full security stack (Vault, Cloud Guard, WAF, IAM policies).
  • A pricing model that is often 30–50% cheaper than equivalent hyperscaler deployments.

Reference: OCI Overview
https://docs.oracle.com/en-us/iaas/Content/home.htm

2. Multi-Tier Production Architecture Overview

A typical production workload on OCI includes:

  • Network Layer: VCN, subnets, NAT, DRG, Load Balancers
  • Compute Layer: OKE, VMs, Functions
  • Data Layer: Autonomous DB, PostgreSQL, MySQL, Object Storage
  • Security Layer: OCI Vault, WAF, IAM policies
  • Observability Layer: Logging, Monitoring, Alarms, Prometheus/Grafana
  • Automation Layer: Terraform, OCI CLI, GitHub Actions/Azure DevOps

3. Networking Foundation

You start with a Virtual Cloud Network (VCN), structured in a way that isolates traffic properly:

VCN Example Layout

  • 10.10.0.0/16 — VCN Root
    • 10.10.1.0/24 — Public Subnet (Load Balancers)
    • 10.10.2.0/24 — Private Subnet (Applications / OKE Nodes)
    • 10.10.3.0/24 — DB Subnet
    • 10.10.4.0/24 — Bastion Subnet

Terraform Example

resource "oci_core_vcn" "main" {
  cidr_block = "10.10.0.0/16"
  compartment_id = var.compartment_ocid
  display_name = "prod-vcn"
}

resource "oci_core_subnet" "private_app" {
  compartment_id = var.compartment_ocid
  vcn_id = oci_core_vcn.main.id
  cidr_block = "10.10.2.0/24"
  prohibit_public_ip_on_vnic = true
  display_name = "app-private-subnet"
}

Reference: OCI Networking Concepts
https://docs.oracle.com/en-us/iaas/Content/Network/Concepts/overview.htm


4. Deploying Workloads on OKE (Oracle Kubernetes Engine)

OKE is one of OCI’s strongest services due to:

  • Native integration with VCN
  • Worker nodes running inside your own subnets
  • The ability to use OCI Load Balancers or NGINX ingress
  • Strong security by default

Cluster Creation Example (CLI)

oci ce cluster create \
  --name prod-oke \
  --vcn-id ocid1.vcn.oc1... \
  --kubernetes-version "v1.30.1" \
  --compartment-id <compartment_ocid>

Node Pool Example

oci ce node-pool create \
  --name prod-nodepool \
  --cluster-id <cluster_ocid> \
  --compartment-id <compartment_ocid> \
  --node-shape VM.Standard3.Flex \
  --node-shape-config '{"ocpus":4,"memoryInGBs":32}' \
  --subnet-ids '["<subnet_ocid>"]'

5. Adding Ingress Traffic: OCI LB + NGINX

In multi-cloud architectures (Azure, GCP, OCI), it’s common to use Cloudflare or F5 for global routing, but within OCI you typically rely on:

  • OCI Load Balancer (Layer 4/7)
  • NGINX Ingress Controller on OKE

Example: Basic Ingress for Microservices

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: payments.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: payments-svc
            port:
              number: 8080

6. Secure Secrets With OCI Vault

Never store secrets in ConfigMaps or Docker images.
OCI Vault integrates tightly with:

  • Kubernetes Secrets via CSI Driver
  • Database credential rotation
  • Key management (KMS)

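Example: Using OCI Vault with Kubernetes

Kubernetes does not expand ${...} placeholders in manifests, so the value below has to be substituted from OCI Vault by your deployment tooling before the manifest is applied: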

apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
stringData:
  username: appuser
  password: ${OCI_VAULT_SECRET_DB_PASSWORD}

7. Observability: Logging + Monitoring + Prometheus

OCI Monitoring handles metrics out of the box (CPU, memory, LB metrics, OKE metrics).
But for application-level observability, you deploy Prometheus/Grafana.

Prometheus Helm Install

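helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace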

Add ServiceMonitor for your applications:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments-monitor
spec:
  selector:
    matchLabels:
      app: payments
  endpoints:
  - port: http

8. Disaster Recovery and Multi-Region Strategy

OCI provides:

  • Block Volume replication
  • Object Storage Cross-Region Replication
  • Multi-AD (Availability Domain) deployment
  • Cross-region DR using Remote Peering

Example: Autonomous DB Cross-Region DR

oci db autonomous-database create-adb-cross-region-disaster-recovery \
  --autonomous-database-id <db_ocid> \
  --disaster-recovery-region "eu-frankfurt-1"

9. CI/CD on OCI Using GitHub Actions

Example pipeline to build a Docker image and deploy to OKE:

name: Deploy to OKE

on:
  push:
    branches: [ "main" ]

jobs:
  build-deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3

    - name: Build Docker Image
      run: docker build -t myapp:${{ github.sha }} .

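    - name: Configure OCI CLI
      run: |
        # 'oci session authenticate' needs a browser, so CI uses API-key auth.
        # OCI_CLI_CONFIG and OCI_API_KEY are hypothetical repository secrets.
        mkdir -p ~/.oci && echo "${{ secrets.OCI_CLI_CONFIG }}" > ~/.oci/config
        echo "${{ secrets.OCI_API_KEY }}" > ~/.oci/key.pem && chmod 600 ~/.oci/*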

    - name: Push Image to OCIR
      run: |
        docker tag myapp:${{ github.sha }} \
        iad.ocir.io/tenancy/myapp:${{ github.sha }}
        docker push iad.ocir.io/tenancy/myapp:${{ github.sha }}

    - name: Deploy to OKE
      run: |
        kubectl set image deployment/myapp myapp=iad.ocir.io/tenancy/myapp:${{ github.sha }}

The Final Architecture will look like this

Building a Fully Private, Zero-Trust API Platform on OCI Using API Gateway, Private Endpoints, and VCN Integration

1. Why a Private API Gateway Matters

A typical API Gateway sits at the edge and exposes public REST endpoints.
But some environments require:

  • APIs callable only from internal systems
  • Backend microservices running in private subnets
  • Zero inbound public access
  • Authentication and authorization enforced at gateway level
  • Isolation between dev, test, pre-prod, and prod

These requirements push you toward a private deployment using Private Endpoint Mode.

This means:

  • The API Gateway receives traffic only from inside your VCN
  • Clients must be inside the private network (on-prem, FastConnect, VPN, or private OCI services)
  • The entire flow stays within the private topology

2. Architecture Overview

A private API Gateway requires several OCI components working together:

  • API Gateway (Private Endpoint Mode)
  • VCN with private subnets
  • Service Gateway for private object storage access
  • Private Load Balancer for backend microservices
  • IAM policies controlling which groups can deploy APIs
  • VCN routing configuration to direct requests correctly
  • Optional WAF (private) for east-west inspection inside the VCN

The call flow:

  1. A client inside your VCN sends a request to the Gateway’s private IP.
  2. The Gateway handles authentication, request validation, and OCI IAM signature checks.
  3. The Gateway forwards traffic to a backend private LB or private OKE services.
  4. Logs go privately to Logging service via the service gateway.

All traffic stays private. No NAT, no public egress.

3. Deploying the Gateway in Private Endpoint Mode

When creating the API Gateway:

  • Choose Private Gateway Type
  • Select the VCN and Private Subnet
  • Ensure the subnet has no internet gateway
  • Disable public routing

You will receive a private IP instead of a public endpoint.

Example shape:

Private Gateway IP: 10.0.4.15
Subnet: app-private-subnet-1
VCN CIDR: 10.0.0.0/16

Only systems inside the VCN's 10.0.0.0/16 range (or in connected networks) can call it.

4. Routing APIs to Private Microservices

Your backend might be:

  • A microservice running in OKE
  • A VM instance
  • A container on Container Instances
  • A private load balancer
  • A function in a private subnet
  • An internal Oracle DB REST endpoint

For reliable routing:

a. Attach a Private Load Balancer

It’s best practice to put microservices behind an internal load balancer.

Example LB private IP: 10.0.20.10

b. Add Route Table Entries

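Ensure the subnet hosting the API Gateway can route to the backend. Traffic between subnets in the same VCN is routed automatically, so no rule is needed when the backend shares the gateway's VCN (as with 10.0.20.0/24 here). Add a route rule only when the backend lives in a peered VCN or on-premises, pointing its CIDR at the DRG or local peering gateway.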

If OKE is involved, ensure proper security list or NSG rules:

  • Allow port 80 or 443 from Gateway subnet to LB subnet
  • Allow health checks

5. Creating an API Deployment (Technical Example)

Here is a minimal private deployment that routes to a backend behind the internal load balancer:

Deployment specification

{
  "routes": [
    {
      "path": "/v1/customer",
      "methods": ["GET"],
      "backend": {
        "type": "HTTP_BACKEND",
        "url": "http://10.0.20.10:8080/api/customer"
      }
    }
  ]
}

Upload this JSON file and create a new deployment under your private API Gateway.

The Gateway privately calls 10.0.20.10 using internal routing.
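
Since the whole point is that nothing leaves the private network, it is worth checking deployment specifications for backends that point outside the VCN before uploading them. A minimal sketch in Python, assuming the specification format shown above and the 10.0.0.0/16 VCN from the earlier example; `backends_outside_vcn` is a hypothetical helper, and hostnames are flagged for manual review because this sketch does not resolve DNS.

```python
import ipaddress
import json
from urllib.parse import urlparse

SPEC = json.loads("""
{
  "routes": [
    {
      "path": "/v1/customer",
      "methods": ["GET"],
      "backend": {
        "type": "HTTP_BACKEND",
        "url": "http://10.0.20.10:8080/api/customer"
      }
    }
  ]
}
""")

VCN_CIDR = ipaddress.ip_network("10.0.0.0/16")


def backends_outside_vcn(spec: dict, vcn) -> list[str]:
    """Return route paths whose HTTP backend host is not an IP inside the VCN."""
    offenders = []
    for route in spec["routes"]:
        backend = route["backend"]
        if backend.get("type") != "HTTP_BACKEND":
            continue
        host = urlparse(backend["url"]).hostname
        try:
            inside = ipaddress.ip_address(host) in vcn
        except ValueError:
            inside = False  # hostname, not an IP literal: flag for manual review
        if not inside:
            offenders.append(route["path"])
    return offenders


print(backends_outside_vcn(SPEC, VCN_CIDR) or "all backends are inside the VCN")
```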

6. Adding Authentication and Authorization

OCI API Gateway supports:

  • OCI IAM Authorization (for IAM-authenticated clients)
  • JWT validation (OIDC tokens)
  • Custom authorizers using Functions

Example: validate a token from an internal identity provider.

"authentication": {
  "type": "JWT_AUTHENTICATION",
  "tokenHeader": "Authorization",
  "tokenAuthScheme": "Bearer",
  "publicKeys": {
    "type": "REMOTE_JWKS",
    "uri": "https://id.internal.example.com/.well-known/jwks.json"
  }
}

This ensures zero-trust by requiring token validation even inside the private network.

7. Logging, Metrics, and Troubleshooting 100 Percent Privately

Because we are running in private-only mode, logs and metrics must also stay private.

Use:

  • Service Gateway for Logging service
  • VCN Flow Logs for traffic inspection
  • WAF (private deployment) if deeper L7 filtering is needed

Enable Access Logs:

Enable access logs: Yes
Retention: 90 days

You will see logs in the Logging service with no public egress.

8. Common Mistakes and How to Avoid Them

Route table missing entries

Most issues come from mismatched route tables between:

  • Gateway subnet
  • Backend subnet
  • OKE node pools

Security Lists or NSGs blocking traffic

Ensure the backend allows inbound traffic from the Gateway subnet.

Incorrect backend URL

Use private IP or private LB hostname.

Backend certificate errors

If you use HTTPS internally, ensure the CA that issued the backend certificate is trusted by the Gateway.

Regards

Osama

Building a Real-Time Data Enrichment & Inference Pipeline on AWS Using Kinesis, Lambda, DynamoDB, and SageMaker

Modern cloud applications increasingly depend on real-time processing, especially when dealing with fraud detection, personalization, IoT telemetry, or operational monitoring.
In this post, we’ll build a fully functional AWS pipeline that:

  • Streams events using Amazon Kinesis
  • Enriches and transforms them via AWS Lambda
  • Stores real-time feature data in Amazon DynamoDB
  • Performs machine-learning inference using a SageMaker Endpoint

1. Architecture Overview

2. Step-By-Step Pipeline Build


2.1. Create a Kinesis Data Stream

aws kinesis create-stream \
  --stream-name RealtimeEvents \
  --shard-count 2 \
  --region us-east-1

This stream will accept incoming events from your apps, IoT devices, or microservices.
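
How many shards do you need? In provisioned mode each shard accepts up to 1 MB/s or 1,000 records/s of writes, whichever limit is hit first. The sketch below turns that into a rough sizing function; the 25% headroom default and the example traffic figures are assumptions of mine, not recommendations from AWS.

```python
import math

# Per-shard write limits for provisioned-mode Kinesis Data Streams
SHARD_WRITE_BYTES_PER_S = 1_000_000   # 1 MB/s, taken as decimal to stay conservative
SHARD_WRITE_RECORDS_PER_S = 1_000


def shards_needed(records_per_s: float, avg_record_bytes: float, headroom: float = 0.25) -> int:
    """Estimate the shard count from write volume, with spare capacity on top."""
    by_count = records_per_s / SHARD_WRITE_RECORDS_PER_S
    by_bytes = records_per_s * avg_record_bytes / SHARD_WRITE_BYTES_PER_S
    return max(1, math.ceil(max(by_count, by_bytes) * (1 + headroom)))


# Hypothetical workload: 1,500 events/s averaging 500 bytes each
print(shards_needed(1500, 500))
```

For small records the record-count limit dominates; for large records the byte limit does, which is why the function takes the larger of the two.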


2.2. DynamoDB Table for Real-Time Features

aws dynamodb create-table \
  --table-name UserFeatureStore \
  --attribute-definitions AttributeName=userId,AttributeType=S \
  --key-schema AttributeName=userId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

This table holds live user features, updated every time an event arrives.


2.3. Lambda Function (Real-Time Data Enrichment)

This Lambda:

  • Reads events from Kinesis
  • Computes simple features (e.g., last event time, rolling count)
  • Saves enriched data to DynamoDB
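
import base64
import json
from datetime import datetime, timedelta
from decimal import Decimal

import boto3

ddb = boto3.resource("dynamodb")
table = ddb.Table("UserFeatureStore")

def lambda_handler(event, context):

    for record in event["Records"]:
        # Kinesis delivers record data base64-encoded; parse floats as Decimal
        # because the DynamoDB resource API rejects Python floats
        raw = base64.b64decode(record["kinesis"]["data"])
        payload = json.loads(raw, parse_float=Decimal)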

        user = payload["userId"]
        metric = payload["metric"]
        ts = datetime.fromisoformat(payload["timestamp"])

        # Fetch old features
        old = table.get_item(Key={"userId": user}).get("Item", {})

        last_ts = old.get("lastTimestamp")
        count = old.get("count", 0)

        # Update rolling 5-minute count
        if last_ts:
            prev_ts = datetime.fromisoformat(last_ts)
            if ts - prev_ts < timedelta(minutes=5):
                count += 1
            else:
                count = 1
        else:
            count = 1

        # Save new enriched features
        table.put_item(Item={
            "userId": user,
            "lastTimestamp": ts.isoformat(),
            "count": count,
            "lastMetric": metric
        })

    return {"status": "ok"}

Attach the Lambda to the Kinesis stream.
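
Before wiring the function to a live stream, you can exercise its logic locally. Kinesis hands record data to Lambda base64-encoded, so a test event has to be encoded the same way. The sketch below builds such an event and isolates the counter rule as a pure function; `make_kinesis_event`, `decode_record`, and `next_count` are hypothetical helpers, and the event contains only the fields this handler reads.

```python
import base64
import json
from datetime import datetime, timedelta
from typing import Optional


def make_kinesis_event(payloads: list[dict]) -> dict:
    """Build a minimal Kinesis-style Lambda event with base64-encoded data."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(json.dumps(p).encode()).decode()}}
            for p in payloads
        ]
    }


def decode_record(record: dict) -> dict:
    """Decode the base64 data field back into the original JSON payload."""
    return json.loads(base64.b64decode(record["kinesis"]["data"]))


def next_count(last_ts: Optional[str], count: int, ts: datetime) -> int:
    """Counter rule from the handler: add one if the previous event was
    under five minutes ago, otherwise restart at 1."""
    if last_ts and ts - datetime.fromisoformat(last_ts) < timedelta(minutes=5):
        return count + 1
    return 1


event = make_kinesis_event(
    [{"userId": "u1", "metric": 3, "timestamp": "2025-01-01T10:00:00"}]
)
print(decode_record(event["Records"][0]))
print(next_count("2025-01-01T10:00:00", 2, datetime(2025, 1, 1, 10, 3)))
```

Note that the counter restarts after any gap of five minutes or more, so it measures a burst of activity rather than a true sliding window.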


2.4. Creating a SageMaker Endpoint for Inference

Train your model offline, then deploy it:

aws sagemaker create-endpoint-config \
  --endpoint-config-name RealtimeInferenceConfig \
  --production-variants VariantName=AllInOne,ModelName=MyInferenceModel,InitialInstanceCount=1,InstanceType=ml.m5.large

aws sagemaker create-endpoint \
  --endpoint-name RealtimeInference \
  --endpoint-config-name RealtimeInferenceConfig


2.5. API Layer Performing Live Inference

Your application now requests predictions like this:

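import boto3
import json
from decimal import Decimal

runtime = boto3.client("sagemaker-runtime")
ddb = boto3.resource("dynamodb").Table("UserFeatureStore")

def _json_default(value):
    # DynamoDB returns numbers as Decimal, which json.dumps cannot serialize
    if isinstance(value, Decimal):
        return float(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")

def predict(user_id, extra_input):

    # A user with no stored features yields an empty dict instead of None
    user_features = ddb.get_item(Key={"userId": user_id}).get("Item", {})

    payload = {
        "userId": user_id,
        "features": user_features,
        "input": extra_input
    }

    response = runtime.invoke_endpoint(
        EndpointName="RealtimeInference",
        ContentType="application/json",
        Body=json.dumps(payload, default=_json_default)
    )

    return json.loads(response["Body"].read())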

This combines freshly computed features with model inference, so each prediction reflects the user's most recent activity.


3. Production Considerations

Performance

  • Configure Lambda reserved or provisioned concurrency
  • Use DynamoDB DAX caching
  • Use Kinesis Enhanced Fan-Out for high throughput

Security

  • Use IAM roles with least privilege
  • Encrypt Kinesis streams, Lambda environment variables, DynamoDB tables, and SageMaker endpoint storage with KMS keys

Monitoring

  • CloudWatch Metrics
  • CloudWatch Logs Insights queries
  • DynamoDB capacity alarms
  • SageMaker endpoint error monitoring (for example, the Invocation4XXErrors and Invocation5XXErrors metrics)

Cost Optimization

  • Use DynamoDB on-demand (PAY_PER_REQUEST) billing for spiky or unpredictable traffic
  • Use Lambda Power Tuning
  • Scale SageMaker endpoints with autoscaling
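
As a rough illustration of the endpoint autoscaling item, the sketch below builds the two Application Auto Scaling requests for a SageMaker variant: one to register the scalable target and one for a target-tracking policy. The capacity limits and the target of 100 invocations per instance are assumptions chosen for illustration, not tuned values.

```python
def scaling_requests(endpoint_name, variant_name, min_instances=1,
                     max_instances=4, invocations_per_instance=100.0):
    """Build register_scalable_target and put_scaling_policy arguments."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    policy = {
        "PolicyName": f"{endpoint_name}-invocations",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


def apply_scaling(endpoint_name, variant_name):
    """Send both requests; needs boto3 and AWS credentials at call time."""
    import boto3  # lazy import keeps scaling_requests usable without the SDK

    client = boto3.client("application-autoscaling")
    target, policy = scaling_requests(endpoint_name, variant_name)
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)


if __name__ == "__main__":
    for request in scaling_requests("RealtimeInference", "AllInOne"):
        print(request)
```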

Implementing a Real-Time Anomaly Detection Pipeline on OCI Using Streaming Data, Oracle Autonomous Database & ML

Detecting unusual patterns in real time is critical to preventing outages, catching fraud, ensuring SLA compliance, and maintaining high-quality user experiences.
In this post, we build a real working pipeline on OCI that:

  • Ingests streaming data
  • Computes features in near-real time
  • Stores results in Autonomous Database
  • Runs anomaly detection logic
  • Sends alerts and exposes dashboards

This guide contains every technical step, including:
Streaming → Function → Autonomous DB → Anomaly Logic → Notifications → Dashboards

1. Architecture Overview

Components Used

  • OCI Streaming
  • OCI Functions
  • Oracle Autonomous Database
  • DBMS_SCHEDULER for anomaly detection job
  • OCI Notifications
  • Oracle Analytics Cloud / Grafana

2. Step-by-Step Implementation


2.1 Create OCI Streaming Stream

oci streaming admin stream create \
  --compartment-id $COMPARTMENT_OCID \
  --name "anomaly-events-stream" \
  --partitions 3

2.2 Autonomous Database Table

CREATE TABLE raw_events (
  event_id       VARCHAR2(50),
  event_time     TIMESTAMP,
  metric_value   NUMBER,
  feature1       NUMBER,
  feature2       NUMBER,
  processed_flag CHAR(1) DEFAULT 'N',
  anomaly_flag   CHAR(1) DEFAULT 'N',
  CONSTRAINT pk_raw_events PRIMARY KEY(event_id)
);

2.3 OCI Function – Feature Extraction

func.py:

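import io
import json
from datetime import datetime

import cx_Oracle

def handler(ctx, data: io.BytesIO = None):
    # The Fn Python FDK passes the request body as a BytesIO object
    event = json.loads(data.getvalue())

    evt_id = event['id']
    evt_time = datetime.fromisoformat(event['time'])
    value = event['metric']

    # DB connection (placeholders; load real credentials from function config or a vault)
    conn = cx_Oracle.connect(user='USER', password='PWD', dsn='dsn')
    cur = conn.cursor()
    try:
        # Fetch the most recent earlier event's value, if one exists.
        # Looking up the incoming event_id itself would either find nothing
        # or collide with the primary key on insert.
        cur.execute("""
            SELECT metric_value FROM raw_events
            WHERE event_time < :1
            ORDER BY event_time DESC
            FETCH FIRST 1 ROW ONLY
        """, (evt_time,))
        prev = cur.fetchone()
        prev_val = prev[0] if prev else 1.0

        # Compute features: absolute change and ratio to the previous value
        feature1 = value - prev_val
        feature2 = value / prev_val if prev_val else None  # avoid division by zero

        # Insert new event
        cur.execute("""
            INSERT INTO raw_events(event_id, event_time, metric_value, feature1, feature2)
            VALUES(:1, :2, :3, :4, :5)
        """, (evt_id, evt_time, value, feature1, feature2))

        conn.commit()
    finally:
        cur.close()
        conn.close()

    return "ok"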

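Deploy the function, then connect it to the stream (for example, with a Service Connector that uses the stream as its source and the function as its target).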
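
If the function is fed through a Service Connector, the request body is typically a JSON array of stream messages with base64-encoded values rather than a single event. The sketch below assumes that shape (a list of objects with a "value" field); check it against a real invocation before relying on it. The decode_batch helper is a name introduced here for illustration.

```python
import base64
import json


def decode_batch(body: bytes):
    """Decode a batch of stream messages into a list of event dicts.

    Assumes `body` is a JSON array whose items carry a base64-encoded
    JSON document in their "value" field.
    """
    events = []
    for message in json.loads(body):
        raw = base64.b64decode(message["value"])
        events.append(json.loads(raw))
    return events


if __name__ == "__main__":
    # Build a sample batch the same way a producer would encode it
    event = {"id": "1", "time": "2025-01-01T10:00:00", "metric": 58}
    encoded = base64.b64encode(json.dumps(event).encode("utf-8")).decode("ascii")
    body = json.dumps([{"key": None, "value": encoded}]).encode("utf-8")
    print(decode_batch(body))
```

Each decoded event can then go through the same feature computation and insert shown in the handler.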


2.4 Anomaly Detection Job (DBMS_SCHEDULER)

CREATE OR REPLACE PROCEDURE anomaly_detection_proc AS
  meanv  NUMBER;
  stdv   NUMBER;
  zscore NUMBER;
BEGIN
  -- Compute the baseline once per run rather than once per row
  SELECT AVG(feature1), STDDEV(feature1) INTO meanv, stdv FROM raw_events;

  FOR rec IN (
    SELECT event_id, feature1
    FROM raw_events
    WHERE processed_flag = 'N'
  ) LOOP
    -- NULLIF makes the z-score NULL when stdv is 0, so the row is not flagged
    zscore := (rec.feature1 - meanv) / NULLIF(stdv, 0);

    IF ABS(zscore) > 3 THEN
      UPDATE raw_events SET anomaly_flag='Y' WHERE event_id=rec.event_id;
    END IF;

    UPDATE raw_events SET processed_flag='Y' WHERE event_id=rec.event_id;
  END LOOP;

  COMMIT;
END;
/

Schedule this to run every 2 minutes:

BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
    job_name        => 'ANOMALY_JOB',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN anomaly_detection_proc; END;',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=2;',
    enabled         => TRUE
  );
END;
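
To sanity-check the threshold before scheduling the job, the same z-score rule can be reproduced outside the database. This is a minimal sketch using the sample standard deviation; the data in the usage example is made up.

```python
from statistics import mean, stdev


def flag_anomalies(values, threshold=3.0):
    """Return one boolean per value: True when |z-score| exceeds threshold."""
    if len(values) < 2:
        return [False] * len(values)  # not enough data for a deviation
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return [False] * len(values)  # mirrors the NULLIF(stdv, 0) guard
    return [abs((v - mu) / sigma) > threshold for v in values]


if __name__ == "__main__":
    feature1 = [10] * 20 + [100]  # made-up data with one obvious outlier
    flags = flag_anomalies(feature1)
    print([v for v, is_anomaly in zip(feature1, flags) if is_anomaly])
```

One thing the sketch makes visible: with very few rows a single outlier cannot reach a z-score of 3, so the job will stay quiet until the table has accumulated some history.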


2.5 Notifications

oci ons topic create \
  --compartment-id $COMPARTMENT_OCID \
  --name "AnomalyAlerts"

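In the DB, add a trigger that fires when a row is newly flagged. DBMS_OUTPUT only writes to the session buffer, so this trigger is a placeholder: publishing to the AnomalyAlerts topic requires a separate call to the Notifications API.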

CREATE OR REPLACE TRIGGER notify_anomaly
AFTER UPDATE ON raw_events
FOR EACH ROW
WHEN (NEW.anomaly_flag='Y' AND OLD.anomaly_flag='N')
BEGIN
  DBMS_OUTPUT.PUT_LINE('Anomaly detected for event ' || :NEW.event_id);
END;
/
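
To deliver an alert to the AnomalyAlerts topic, something outside the trigger has to call the Notifications service. Here is a minimal sketch using the OCI Python SDK; the topic OCID is a placeholder, and the idea of running this from a poller or function that reads newly flagged rows is my assumption rather than part of the setup above.

```python
def build_alert(event_id, metric_value):
    """Build the title and body for an anomaly notification."""
    return {
        "title": f"Anomaly detected: event {event_id}",
        "body": f"Event {event_id} was flagged with metric_value={metric_value}.",
    }


def publish_alert(topic_ocid, alert):
    """Publish the alert; needs the oci package and a valid config at call time."""
    import oci  # lazy import so build_alert works without the SDK installed

    config = oci.config.from_file()
    client = oci.ons.NotificationDataPlaneClient(config)
    details = oci.ons.models.MessageDetails(
        title=alert["title"], body=alert["body"]
    )
    return client.publish_message(topic_ocid, details)


if __name__ == "__main__":
    # Prints the message only; call publish_alert with a real topic OCID to send it
    print(build_alert("1", 58))
```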


2.6 Dashboarding

You may use:

  • Oracle Analytics Cloud (OAC)
  • Grafana + ADW Integration
  • Any BI tool with SQL

Example Query:

SELECT event_time, metric_value, anomaly_flag 
FROM raw_events
ORDER BY event_time;

3. Terraform + OCI CLI Script Bundle

Terraform – Streaming + Functions Application

resource "oci_streaming_stream" "anomaly" {
  name           = "anomaly-events-stream"
  partitions     = 3
  compartment_id = var.compartment_id
}

resource "oci_functions_application" "anomaly_app" {
  compartment_id = var.compartment_id
  display_name   = "anomaly-function-app"
  subnet_ids     = var.subnets
}

Terraform Notification Topic

resource "oci_ons_notification_topic" "anomaly" {
  compartment_id = var.compartment_id
  name           = "AnomalyAlerts"
}

CLI Insert Test Events

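# Keys and values must be base64-encoded; STREAM_ENDPOINT is the stream's messages endpoint
VALUE=$(printf '%s' '{"id":"1","time":"2025-01-01T10:00:00","metric":58}' | base64 | tr -d '\n')
oci streaming stream message put \
  --stream-id $STREAM_OCID \
  --endpoint $STREAM_ENDPOINT \
  --messages "[{\"key\":\"MQ==\",\"value\":\"$VALUE\"}]"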

Deploying Real-Time Feature Store on Amazon SageMaker Feature Store with Amazon Kinesis Data Streams & Amazon DynamoDB for Low-Latency ML Inference

Modern ML inference often depends on up-to-date features (customer behaviour, session counts, recent events) that need to be available in low-latency operations. In this article you’ll learn how to build a real-time feature store on AWS using:

  • Amazon Kinesis Data Streams for streaming events
  • AWS Lambda for processing and feature computation
  • Amazon DynamoDB (or SageMaker Feature Store) for storage of feature vectors
  • Amazon SageMaker Endpoint for low-latency inference

You’ll see end-to-end code snippets and architecture guidance so you can implement this in your environment.

1. Architecture Overview

The pipeline works like this:

  1. Front-end/app produces events (e.g., user click, transaction) → published to Kinesis.
  2. A Lambda function consumes from Kinesis, computes derived features (for example: rolling window counts, recency, session features).
  3. The Lambda writes/updates these features into a DynamoDB table (or directly into SageMaker Feature Store).
  4. When a request arrives for inference, the application fetches the current feature set from DynamoDB (or Feature Store) and calls a SageMaker endpoint.
  5. Optionally, after inference you can stream feedback events for model refinement.

This architecture provides real-time feature freshness and low-latency inference.
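
For step 1, a producer might look like the sketch below. It assumes the event fields userId, eventType, and timestamp that the Lambda in section 2.3 reads, and the UserEventsStream name from section 2.1. Partitioning by user ID is my choice here so that one user's events stay ordered within a shard.

```python
import json
from datetime import datetime, timezone


def build_record(user_id, event_type, stream_name="UserEventsStream"):
    """Build the arguments for a Kinesis put_record call."""
    event = {
        "userId": user_id,
        "eventType": event_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": user_id,  # keeps one user's events on one shard
    }


def publish(record):
    """Send the record; needs boto3 and AWS credentials at call time."""
    import boto3  # lazy import so build_record works without the SDK

    return boto3.client("kinesis").put_record(**record)


if __name__ == "__main__":
    print(build_record("user-123", "click"))
```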

2. Setup & Implementation

2.1 Create the Kinesis data stream

aws kinesis create-stream \
  --stream-name UserEventsStream \
  --shard-count 2 \
  --region us-east-1

2.2 Create DynamoDB table for features

aws dynamodb create-table \
  --table-name RealTimeFeatures \
  --attribute-definitions AttributeName=userId,AttributeType=S \
  --key-schema AttributeName=userId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

2.3 Lambda function to compute features

Here is a Python snippet (using boto3) which will be triggered by Kinesis:

import base64
import json
import boto3
from datetime import datetime, timedelta

dynamo = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamo.Table('RealTimeFeatures')

def lambda_handler(event, context):
    for record in event['Records']:
        # Kinesis delivers record data base64-encoded
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        user_id = payload['userId']
        event_type = payload['eventType']
        ts = datetime.fromisoformat(payload['timestamp'])

        # Fetch current features
        resp = table.get_item(Key={'userId': user_id})
        item = resp.get('Item', {})
        
        # Derive features: e.g., event_count_last_5min, last_event_type
        last_update = item.get('lastUpdate', ts.isoformat())
        count_5min = item.get('count5min', 0)
        then = datetime.fromisoformat(last_update)
        if ts - then < timedelta(minutes=5):
            count_5min += 1
        else:
            count_5min = 1
        
        # Update feature item
        new_item = {
            'userId': user_id,
            'lastEventType': event_type,
            'count5min': count_5min,
            'lastUpdate': ts.isoformat()
        }
        table.put_item(Item=new_item)
    return {'statusCode': 200}
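
Note that the count above resets whenever two consecutive events are more than five minutes apart, which only approximates a rolling window. If you need an exact sliding window, one option is to keep the recent timestamps per user and prune them on every event. The helper below is a sketch with a hypothetical name, independent of DynamoDB.

```python
from datetime import datetime, timedelta


def update_window(timestamps, new_ts, window=timedelta(minutes=5)):
    """Drop timestamps older than the window, add the new one, return both.

    Returns (kept_timestamps, count) where count is the number of events
    inside the window ending at new_ts.
    """
    cutoff = new_ts - window
    kept = [t for t in timestamps if t > cutoff]
    kept.append(new_ts)
    return kept, len(kept)


if __name__ == "__main__":
    start = datetime(2025, 1, 1, 10, 0, 0)
    stamps = []
    for offset in (0, 2, 6):  # minutes after start
        stamps, count = update_window(stamps, start + timedelta(minutes=offset))
        print(offset, count)
```

The trade-off is item size: storing a list of timestamps per user costs more than a single counter, so this fits best when per-user event rates are modest.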

2.4 Deploy and connect Lambda to Kinesis

  • Create Lambda function in AWS console or via CLI.
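  • Add the Kinesis stream UserEventsStream as an event source, with a batch size (for example, 100) and starting position TRIM_HORIZON.
  • Assign an IAM role that allows kinesis:DescribeStream, kinesis:GetShardIterator, kinesis:GetRecords, and kinesis:ListShards on the stream, plus dynamodb:GetItem and dynamodb:PutItem on the table.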
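
A least-privilege policy for that role could be generated as in the sketch below. The ARNs in the usage example are placeholders, and the action list covers only what the handler above needs for reading the stream and reading and writing the table.

```python
import json


def lambda_policy(stream_arn, table_arn):
    """Return an IAM policy document scoped to one stream and one table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "kinesis:DescribeStream",
                    "kinesis:GetShardIterator",
                    "kinesis:GetRecords",
                    "kinesis:ListShards",
                ],
                "Resource": stream_arn,
            },
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
                "Resource": table_arn,
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and ARNs, for illustration only
    doc = lambda_policy(
        "arn:aws:kinesis:us-east-1:123456789012:stream/UserEventsStream",
        "arn:aws:dynamodb:us-east-1:123456789012:table/RealTimeFeatures",
    )
    print(json.dumps(doc, indent=2))
```

The function also needs permission to write CloudWatch logs, which is usually granted through a separate managed policy.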

2.5 Prepare SageMaker endpoint for inference

  • Train the model offline (out of scope here) on a training dataset whose features match the real-time features.
  • Deploy model as endpoint, e.g., arn:aws:sagemaker:us-east-1:123456789012:endpoint/RealtimeModel.
  • In your application code, fetch the features from DynamoDB and then invoke the endpoint:

import json
from decimal import Decimal

import boto3

sagemaker = boto3.client('sagemaker-runtime', region_name='us-east-1')
dynamo = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamo.Table('RealTimeFeatures')

def _json_default(value):
    # DynamoDB returns numbers as Decimal, which json.dumps cannot serialize
    if isinstance(value, Decimal):
        return float(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")

def get_prediction(user_id, input_payload):
    resp = table.get_item(Key={'userId': user_id})
    # A user with no stored features yields an empty dict instead of None
    features = resp.get('Item', {})
    payload = {
        'features': features,
        'input': input_payload
    }
    response = sagemaker.invoke_endpoint(
        EndpointName='RealtimeModel',
        ContentType='application/json',
        Body=json.dumps(payload, default=_json_default)
    )
    result = json.loads(response['Body'].read().decode())
    return result

Conclusion

In this blog post you learned how to build a real-time feature store on AWS: streaming event ingestion with Kinesis, real-time feature computation with Lambda, storage in DynamoDB, and serving via SageMaker. You got specific code examples and operational considerations for production readiness. With this setup, you’re well-positioned to deliver low-latency, ML-powered applications.

Enjoy the cloud
Osama

Automating Cost-Governance Workflows in Oracle Cloud Infrastructure (OCI) with APIs & Infrastructure as Code

Introduction

Cloud cost management isn’t just about checking invoices once a month; it’s about embedding automation, governance, and insights into your infrastructure so that your engineering teams make cost-aware decisions in real time. With OCI, you have native tools (Cost Analysis, Usage APIs, Budgets, etc.) and infrastructure-as-code (IaC) tooling that can help turn cost governance from an afterthought into a proactive part of your DevOps workflow.

In this article you’ll learn how to:

  1. Extract usage and cost data via the OCI Usage API / Cost Reports.
  2. Define IaC workflows (e.g., with Terraform) that enforce budget/usage guardrails.
  3. Build a simple example where you automatically tag resources, monitor spend by tag, and alert/correct when thresholds are exceeded.
  4. Apply best practices, avoid common pitfalls, and follow governance recommendations for embedding FinOps into OCI operations.

1. Understanding OCI Cost & Usage Data

What data is available?

OCI provides several cost/usage-data mechanisms:

  • The Cost Analysis tool in the console, which lets you view trends by service, compartment, tag, etc.
  • The Usage/Cost Reports (CSV format), which you can download from the console or fetch programmatically from Object Storage.
  • The Usage API (CLI/SDK), which lets you query usage and cost programmatically.

Why this matters

By surfacing cost data at a resource, compartment, or tag level, teams can answer questions like:

  • “Which tag values are consuming cost disproportionately?”
  • “Which compartments have heavy spend growth month-over-month?”
  • “Which services (Compute, Storage, Database, etc.) are the highest spenders and require optimization?”

Example: Downloading a cost report via CLI

Here are a CLI command and a Python snippet that each download a cost-report CSV from your tenancy. Cost reports live under reports/cost-csv/ (usage reports are under reports/usage-csv/):

oci os object get \
  --namespace-name bling \
  --bucket-name <your-tenancy-OCID> \
  --name reports/cost-csv/<report_name>.csv.gz \
  --file local_report.csv.gz

The same download with the Python SDK:

import oci
config = oci.config.from_file("~/.oci/config", "DEFAULT")
os_client = oci.object_storage.ObjectStorageClient(config)
namespace = "bling"
bucket = "<your-tenancy-OCID>"
object_name = "reports/cost-csv/2025-10-19-report-00001.csv.gz"

# Stream the object to disk in 1 MiB chunks
resp = os_client.get_object(namespace, bucket, object_name)
with open("report-2025-10-19.csv.gz", "wb") as f:
    for chunk in resp.data.raw.stream(1024*1024, decode_content=False):
        f.write(chunk)
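
Once a report is downloaded, a few lines of standard-library Python can total the cost by any column. The column names used below ("product/service" and "cost/myCost") are assumptions about the report header; open one of your own reports and adjust them if they differ.

```python
import csv
import gzip
import io
from collections import defaultdict


def cost_by_column(gz_bytes, group_col="product/service", cost_col="cost/myCost"):
    """Sum the cost column of a gzipped CSV report, grouped by another column."""
    totals = defaultdict(float)
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            group = row.get(group_col) or "unknown"
            totals[group] += float(row.get(cost_col) or 0)
    return dict(totals)


if __name__ == "__main__":
    # Tiny in-memory report with made-up rows, standing in for a real download
    sample = (
        "product/service,cost/myCost\n"
        "COMPUTE,1.5\n"
        "COMPUTE,2.5\n"
        "OBJECT_STORAGE,0.25\n"
    )
    print(cost_by_column(gzip.compress(sample.encode("utf-8"))))
```

For a real file, read it with open("report-2025-10-19.csv.gz", "rb").read() and pass the bytes in.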

2. Defining Cost-Governance Workflows with IaC

Once you have data flowing in, you can enforce guardrails and automate actions. Here’s one example pattern.

a) Enforce tagging rules

Ensure that every resource created in a compartment has a cost_center tag (for example). You can do this via policy + IaC.

# Example Terraform: tag namespace and tag key for cost tracking
resource "oci_identity_tag_namespace" "governance" {
  compartment_id = var.compartment_id
  name           = "governance_tags"
  description    = "Tags used for cost governance"
  is_retired     = false
}

resource "oci_identity_tag" "cost_center" {
  tag_namespace_id = oci_identity_tag_namespace.governance.id
  name             = "cost_center"
  description      = "Cost Center code for FinOps tracking"
  is_retired       = false
}

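You can then add an IAM policy that restricts resource creation when the tag isn’t applied (or fails to meet allowed values). Treat the statement below as illustrative pseudo-syntax rather than something to paste as-is: OCI policy conditions have no null test, and the native way to require a tag at creation time is a tag default marked as required. For example:

Allow group ComputeAdmins to manage instance-family in compartment Prod
  where request.operation = 'LaunchInstance'
  and request.resource.tag.cost_center is not null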

b) Monitor vs budget

Use the Usage API or Cost Reports to pull monthly spend per tag, then compare against defined budgets. If thresholds are exceeded, trigger an alert or remediation.

Here’s an example Python pseudo-code:

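from collections import defaultdict
from datetime import datetime, timezone
import oci

config = oci.config.from_file()
usage_client = oci.usage_api.UsageapiClient(config)

# The Usage API expects timestamps aligned to midnight UTC for DAILY granularity
today = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
start = today.replace(day=1)
end = today  # on the 1st of the month start == end, so widen the window in that case

req = oci.usage_api.models.RequestSummarizedUsagesDetails(
    tenant_id=config["tenancy"],
    time_usage_started=start,
    time_usage_ended=end,
    granularity="DAILY",
    query_type="COST",
    # Group by the cost_center tag in the governance_tags namespace
    group_by_tag=[oci.usage_api.models.Tag(namespace="governance_tags", key="cost_center")]
)

resp = usage_client.request_summarized_usages(req)

# DAILY granularity returns one item per day per tag value, so sum them per tag
month_to_date = defaultdict(float)
for item in resp.data.items:
    tag_value = next(
        (t.value for t in (item.tags or []) if t.key == "cost_center" and t.value),
        "untagged",
    )
    month_to_date[tag_value] += float(item.computed_amount or 0)

for tag_value, cost in month_to_date.items():
    print(f"Cost for cost_center={tag_value}: {cost}")

    # budget_for, send_alert, and take_remediation are placeholders you supply
    if cost > budget_for(tag_value):
        send_alert(tag_value, cost)
        take_remediation(tag_value)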

c) Automated remediation

Remediation could mean:

  • Auto-shut down non-production instances in compartments after hours.
  • Resize or terminate idle resources.
  • Notify owners of over-spend via email/Slack.

Terraform, OCI Functions, and the Events service can help orchestrate that. For example, when an alert fires because cost by compartment exceeds X → invoke a Function → tag resources with “cost_alerted” → optional shutdown.
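
As one possible shape for the after-hours shutdown, the sketch below decides from an instance's freeform tags and the current time whether it should be stopped. The "environment" tag key, its dev/test values, and the 08:00 to 19:00 working hours are all assumptions for illustration. Untagged instances are deliberately left alone.

```python
from datetime import time

NON_PROD = {"dev", "test"}  # hypothetical values of an "environment" tag


def should_stop(freeform_tags, now, day_start=time(8, 0), day_end=time(19, 0)):
    """Return True for explicitly non-production instances outside working hours."""
    environment = freeform_tags.get("environment", "").lower()
    in_working_hours = day_start <= now < day_end
    return environment in NON_PROD and not in_working_hours


def stop_instance(instance_ocid):
    """Gracefully stop one instance; needs the oci package and config at call time."""
    import oci  # lazy import so should_stop works without the SDK installed

    client = oci.core.ComputeClient(oci.config.from_file())
    return client.instance_action(instance_ocid, "SOFTSTOP")


if __name__ == "__main__":
    print(should_stop({"environment": "dev"}, time(22, 0)))
    print(should_stop({"environment": "prod"}, time(22, 0)))
```

Requiring an explicit non-production tag means a missing tag never triggers a shutdown, which is the safer default for an automated action.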

3. Putting It All Together

Here is a step-by-step scenario:

  1. Define budget categories – e.g., cost_center codes: CC-101, CC-202, CC-303.
  2. Tag resources on creation – via policy/IaC ensure all resources include cost_center tag with one of those codes.
  3. Collect cost data – using Usage API daily, group by tag.cost_center.
  4. Evaluate current spend vs budget – for each code, compare cumulative cost for current month against budget.
  5. If over budget – then:
    • send an alert to the team (via OCI Notifications, email, Slack)
    • optionally trigger remediation: e.g., stop non-critical compute in that cost center’s compartments.
  6. Dashboard & visibility – load cost data into a BI tool (such as Oracle Analytics Cloud) with trends, forecasts, and anomaly detection. The “Show cost” option in OCI Ops Insights also displays usage and forecast cost.
  7. Continuous improvement – right-size instances, pause dev/test at night, switch to cheaper shapes or reserved/commit models (depending on your discount model). See the OCI best practices guide for optimizing cost.
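
Steps 4 and 5 boil down to a small comparison, sketched below. The cost center codes come from step 1; the budget amounts, the month-to-date figures, and the 80% warning ratio are made-up values for illustration.

```python
def evaluate_budgets(spend, budgets, warn_ratio=0.8):
    """Classify each cost center as ok, warning, over, or unbudgeted.

    spend and budgets map cost center codes to amounts in the same currency.
    """
    status = {}
    for code, budget in budgets.items():
        cost = spend.get(code, 0.0)
        if cost > budget:
            status[code] = "over"
        elif cost >= budget * warn_ratio:
            status[code] = "warning"
        else:
            status[code] = "ok"
    # Spend under a code with no budget deserves attention too
    for code in spend:
        if code not in budgets:
            status[code] = "unbudgeted"
    return status


if __name__ == "__main__":
    budgets = {"CC-101": 100.0, "CC-202": 100.0, "CC-303": 100.0}
    month_to_date = {"CC-101": 120.0, "CC-202": 85.0, "CC-303": 10.0}
    for code, state in evaluate_budgets(month_to_date, budgets).items():
        print(code, state)
```

A warning tier gives owners time to react before remediation kicks in, which tends to make automated shutdowns far less contentious.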

Example snippet – alerting logic in CLI

# example command to get summarized cost for the last 7 days (GNU date syntax)
oci usage-api usage-summary request-summarized-usages \
  --tenant-id $TENANCY_OCID \
  --time-usage-started $(date -u -d '-7 days' +%Y-%m-%dT00:00:00Z) \
  --time-usage-ended   $(date -u +%Y-%m-%dT00:00:00Z) \
  --granularity DAILY \
  --query-type COST \
  --group-by-tag '[{"namespace":"governance_tags","key":"cost_center"}]' \
  --query "data.items[?tags[?value=='CC-101']].\"computed-amount\"" \
  --raw-output

Enjoy the OCI
Osama