Cross-Cloud Secret Synchronization: AWS Secrets Manager and OCI Vault in a Production Multi-Cloud Setup

Posted on April 24, 2026 by Osama Mustafa in Uncategorized

One of the most overlooked problems in multi-cloud environments is secrets management across providers. Teams usually solve it badly: they store the same secret in both clouds manually, forget to rotate one of them, and find out during an outage that the credentials have been out of sync for three months.

In this post I will walk through building an automated secrets synchronization pipeline between AWS Secrets Manager and OCI Vault. When a secret rotates in AWS, the pipeline detects the rotation event, retrieves the new value, and pushes it into OCI Vault automatically. Everything is built with Terraform, an AWS Lambda function, and OCI IAM. No manual steps after the initial deployment.

This is a pattern I have used in environments where the database layer runs on OCI (leveraging Oracle Database pricing and performance) while the application layer runs on AWS. Both sides need the same database credentials, and both sides need to stay in sync without human intervention.

Architecture

The flow works like this:

When a secret rotates, AWS Secrets Manager emits an event to EventBridge, which triggers a Lambda function. The Lambda retrieves the new secret value, authenticates to OCI with an API key that it loads from AWS Secrets Manager at runtime (never hardcoded or placed in plaintext environment variables), and calls the OCI Vault API to create a new version of the corresponding secret. OCI Vault stores the new value and makes it available to workloads running in OCI.

Prerequisites

Before starting you need:

AWS account with permissions to manage Secrets Manager, Lambda, EventBridge, and IAM
OCI tenancy with permissions to manage Vault, Keys, and IAM policies
Terraform 1.5 or later
Python 3.11 for the Lambda function
An existing OCI Vault and master encryption key (or we will create one)

Step 1: OCI Vault and IAM Setup

Start with OCI. We need a Vault, a master key, and an IAM user whose API key the Lambda will use to authenticate.

hcl

			
# OCI Vault
resource "oci_kms_vault" "app_vault" {
  compartment_id = var.compartment_id
  display_name   = "multi-cloud-secrets-vault"
  vault_type     = "DEFAULT"
}
# Master Encryption Key inside the Vault
resource "oci_kms_key" "secrets_key" {
  compartment_id      = var.compartment_id
  display_name        = "secrets-master-key"
  management_endpoint = oci_kms_vault.app_vault.management_endpoint
  key_shape {
    algorithm = "AES"
    length    = 32
  }
}
# IAM user for cross-cloud access
resource "oci_identity_user" "sync_user" {
  compartment_id = var.tenancy_ocid
  name           = "aws-secrets-sync-user"
  description    = "Service user for AWS Lambda to push secrets into OCI Vault"
  email          = "sync-user@internal.example.com"
}
# API key for the sync user (you will generate the actual key pair separately)
resource "oci_identity_api_key" "sync_user_key" {
  user_id   = oci_identity_user.sync_user.id
  key_value = var.oci_sync_user_public_key_pem
}
# IAM group for the sync user
resource "oci_identity_group" "sync_group" {
  compartment_id = var.tenancy_ocid
  name           = "secrets-sync-group"
  description    = "Group for cross-cloud secrets sync service users"
}
resource "oci_identity_user_group_membership" "sync_membership" {
  group_id = oci_identity_group.sync_group.id
  user_id  = oci_identity_user.sync_user.id
}
# Minimal IAM policy - only what is needed, nothing more
resource "oci_identity_policy" "sync_policy" {
  compartment_id = var.compartment_id
  name           = "secrets-sync-policy"
  description    = "Allows sync user to manage secrets in the app vault only"
  statements = [
    "Allow group secrets-sync-group to manage secret-family in compartment id ${var.compartment_id} where target.vault.id = '${oci_kms_vault.app_vault.id}'",
    "Allow group secrets-sync-group to use keys in compartment id ${var.compartment_id} where target.key.id = '${oci_kms_key.secrets_key.id}'"
  ]
}

		

The policy scope is intentionally narrow. The sync user can only manage secrets inside this specific vault and can only use this specific key. If the AWS Lambda credentials are ever compromised, the blast radius is limited to this vault.

Step 2: Create the Initial Secret in OCI Vault

We need a secret placeholder in OCI Vault that the Lambda will update. The initial value does not matter since it will be overwritten on the first sync.

hcl

			
resource "oci_vault_secret" "db_password" {
  compartment_id = var.compartment_id
  vault_id       = oci_kms_vault.app_vault.id
  key_id         = oci_kms_key.secrets_key.id
  secret_name    = "prod-db-password"
  secret_content {
    content_type = "BASE64"
    content      = base64encode("initial-placeholder-value")
    name         = "v1"
    stage        = "CURRENT"
  }
  metadata = {
    source      = "aws-secrets-manager"
    aws_secret  = "prod/database/password"
    environment = "production"
  }
}

		

Step 3: AWS Secrets Manager and the Source Secret

On the AWS side, create the authoritative secret and enable automatic rotation.

hcl

			
resource "aws_secretsmanager_secret" "db_password" {
  name                    = "prod/database/password"
  description             = "Production database password - synced to OCI Vault"
  recovery_window_in_days = 7
  tags = {
    Environment   = "production"
    SyncTarget    = "oci-vault"
    OciSecretName = "prod-db-password"
  }
}
resource "aws_secretsmanager_secret_version" "db_password_v1" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = jsonencode({
    username = "db_admin",
    password = var.initial_db_password,
    host     = var.db_host,
    port     = 1521,
    database = "PRODDB"
  })
}
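# Rotation configuration - rotate every 30 days
# db_rotation_lambda is the rotation function itself; it is assumed to exist and is not defined in this post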
resource "aws_secretsmanager_secret_rotation" "db_password_rotation" {
  secret_id           = aws_secretsmanager_secret.db_password.id
  rotation_lambda_arn = aws_lambda_function.db_rotation_lambda.arn
  rotation_rules {
    automatically_after_days = 30
  }
}

		

Step 4: Store OCI Credentials in AWS Secrets Manager

The Lambda needs OCI API credentials to authenticate. Store them as a secret in AWS Secrets Manager so they never appear in Lambda environment variables in plaintext.

hcl

			
resource "aws_secretsmanager_secret" "oci_credentials" {
  name        = "internal/oci-sync-credentials"
  description = "OCI API key credentials for secrets sync Lambda"
  tags = {
    Environment = "production"
    Purpose     = "cross-cloud-sync"
  }
}
resource "aws_secretsmanager_secret_version" "oci_credentials_v1" {
  secret_id = aws_secretsmanager_secret.oci_credentials.id
  secret_string = jsonencode({
    tenancy_ocid  = var.oci_tenancy_ocid,
    user_ocid     = var.oci_sync_user_ocid,
    fingerprint   = var.oci_api_key_fingerprint,
    private_key   = var.oci_private_key_pem,
    region        = var.oci_region
  })
}

		

Step 5: The Lambda Function

This is the core of the pipeline. The Lambda retrieves the rotated secret from AWS Secrets Manager, loads OCI credentials from its own secrets store, and calls the OCI Vault API to create a new secret version.

python

			
import boto3
import json
import base64
import oci
import logging
import os
from datetime import datetime, timezone
from botocore.exceptions import ClientError
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def get_oci_config():
    """Retrieve OCI credentials from AWS Secrets Manager."""
    client = boto3.client("secretsmanager", region_name=os.environ["AWS_REGION"])
    
    try:
        response = client.get_secret_value(
            SecretId=os.environ["OCI_CREDENTIALS_SECRET_ARN"]
        )
        creds = json.loads(response["SecretString"])
        
        return {
            "tenancy": creds["tenancy_ocid"],
            "user": creds["user_ocid"],
            "fingerprint": creds["fingerprint"],
            "key_content": creds["private_key"],
            "region": creds["region"]
        }
    except ClientError as e:
        logger.error(f"Failed to retrieve OCI credentials: {e}")
        raise
def get_aws_secret(secret_arn: str) -> str:
    """Retrieve the current value of an AWS secret."""
    client = boto3.client("secretsmanager", region_name=os.environ["AWS_REGION"])
    
    try:
        response = client.get_secret_value(SecretId=secret_arn)
        return response.get("SecretString") or base64.b64decode(
            response["SecretBinary"]
        ).decode("utf-8")
    except ClientError as e:
        logger.error(f"Failed to retrieve AWS secret {secret_arn}: {e}")
        raise
def push_to_oci_vault(
    oci_config: dict,
    vault_id: str,
    key_id: str,
    secret_ocid: str,
    secret_value: str
):
    """Create a new version of an OCI Vault secret."""
    vaults_client = oci.vault.VaultsClient(oci_config)
    
    encoded_value = base64.b64encode(secret_value.encode("utf-8")).decode("utf-8")
    
    update_details = oci.vault.models.UpdateSecretDetails(
        secret_content=oci.vault.models.Base64SecretContentDetails(
            content_type=oci.vault.models.SecretContentDetails.CONTENT_TYPE_BASE64,
            content=encoded_value,
            name=f"sync-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}",
            stage="CURRENT"
        ),
        metadata={
            "synced_from": "aws-secrets-manager",
            "synced_at": datetime.now(timezone.utc).isoformat()
        }
    )
    
    response = vaults_client.update_secret(
        secret_id=secret_ocid,
        update_secret_details=update_details
    )
    
    logger.info(
        f"OCI secret updated. OCID: {secret_ocid}, "
        f"New version: {response.data.current_version_number}"
    )
    
    return response.data
def handler(event, context):
    """
    EventBridge trigger handler.
    Expects event detail to contain:
      - aws_secret_arn: ARN of the rotated AWS secret
      - oci_secret_ocid: OCID of the target OCI Vault secret
      - oci_vault_id: OCID of the target OCI Vault
      - oci_key_id: OCID of the OCI KMS key
    """
    logger.info(f"Received event: {json.dumps(event)}")
    
    detail = event.get("detail", {})
    aws_secret_arn  = detail.get("aws_secret_arn")
    oci_secret_ocid = detail.get("oci_secret_ocid")
    oci_vault_id    = detail.get("oci_vault_id")
    oci_key_id      = detail.get("oci_key_id")
    
    if not all([aws_secret_arn, oci_secret_ocid, oci_vault_id, oci_key_id]):
        logger.error("Missing required fields in event detail")
        raise ValueError("Event detail must include aws_secret_arn, oci_secret_ocid, oci_vault_id, oci_key_id")
    
    logger.info(f"Syncing secret: {aws_secret_arn} to OCI: {oci_secret_ocid}")
    
    # Step 1: Get OCI credentials
    oci_config = get_oci_config()
    
    # Step 2: Retrieve the rotated AWS secret
    secret_value = get_aws_secret(aws_secret_arn)
    
    # Step 3: Push to OCI Vault
    result = push_to_oci_vault(
        oci_config=oci_config,
        vault_id=oci_vault_id,
        key_id=oci_key_id,
        secret_ocid=oci_secret_ocid,
        secret_value=secret_value
    )
    
    return {
        "statusCode": 200,
        "body": {
            "message": "Secret synced successfully",
            "oci_secret_ocid": oci_secret_ocid,
            "oci_version": result.current_version_number
        }
    }

		

Step 6: Lambda IAM Role and Deployment

hcl

			
data "aws_iam_policy_document" "lambda_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}
data "aws_iam_policy_document" "lambda_permissions" {
  statement {
    effect = "Allow"
    actions = [
      "secretsmanager:GetSecretValue",
      "secretsmanager:DescribeSecret"
    ]
    resources = [
      aws_secretsmanager_secret.db_password.arn,
      aws_secretsmanager_secret.oci_credentials.arn
    ]
  }
  statement {
    effect = "Allow"
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ]
    resources = ["arn:aws:logs:*:*:*"]
  }
}
resource "aws_iam_role" "sync_lambda_role" {
  name               = "secrets-sync-lambda-role"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}
resource "aws_iam_role_policy" "sync_lambda_policy" {
  name   = "secrets-sync-lambda-policy"
  role   = aws_iam_role.sync_lambda_role.id
  policy = data.aws_iam_policy_document.lambda_permissions.json
}
resource "aws_lambda_function" "secrets_sync" {
  filename         = "${path.module}/lambda/secrets_sync.zip"
  function_name    = "oci-secrets-sync"
  role             = aws_iam_role.sync_lambda_role.arn
  handler          = "main.handler"
  runtime          = "python3.11"
  timeout          = 60
  memory_size      = 256
  source_code_hash = filebase64sha256("${path.module}/lambda/secrets_sync.zip")
  environment {
    variables = {
      # AWS_REGION is a reserved key that Lambda sets automatically; defining it here fails the deployment
      OCI_CREDENTIALS_SECRET_ARN = aws_secretsmanager_secret.oci_credentials.arn
    }
  }
  layers = [aws_lambda_layer_version.oci_sdk_layer.arn]
}

		

Bundle the OCI Python SDK as a Lambda Layer so the function does not need to package it inline:

bash

			
mkdir -p lambda_layer/python
pip install oci --target lambda_layer/python
cd lambda_layer && zip -r ../oci_sdk_layer.zip python/

hcl

			
resource "aws_lambda_layer_version" "oci_sdk_layer" {
  filename            = "${path.module}/oci_sdk_layer.zip"
  layer_name          = "oci-python-sdk"
  compatible_runtimes = ["python3.11"]
  source_code_hash    = filebase64sha256("${path.module}/oci_sdk_layer.zip")
}

		

Step 7: EventBridge Rule to Trigger on Rotation

hcl

			
resource "aws_cloudwatch_event_rule" "secret_rotation_rule" {
  name        = "detect-secret-rotation"
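  # Matches the RotateSecret and PutSecretValue API calls recorded by CloudTrail.
  # These are logged when the call is made, which can be before the new version is
  # promoted to AWSCURRENT, and the pattern is not scoped to a single secret.
  # Matching the RotationSucceeded service event may be a more reliable trigger.
  description = "Fires on Secrets Manager RotateSecret and PutSecretValue calls"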
  event_pattern = jsonencode({
    source      = ["aws.secretsmanager"],
    detail-type = ["AWS API Call via CloudTrail"],
    detail = {
      eventSource = ["secretsmanager.amazonaws.com"],
      eventName   = ["RotateSecret", "PutSecretValue"]
    }
  })
}
resource "aws_cloudwatch_event_target" "sync_lambda_target" {
  rule      = aws_cloudwatch_event_rule.secret_rotation_rule.name
  target_id = "SyncToOCI"
  arn       = aws_lambda_function.secrets_sync.arn
  input_transformer {
    input_paths = {
      secret_arn = "$.detail.requestParameters.secretId"
    }
    input_template = <<EOF
{
  "detail": {
    "aws_secret_arn": "<secret_arn>",
    "oci_secret_ocid": "${var.oci_db_password_secret_ocid}",
    "oci_vault_id": "${oci_kms_vault.app_vault.id}",
    "oci_key_id": "${oci_kms_key.secrets_key.id}"
  }
}
EOF
  }
}
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.secrets_sync.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.secret_rotation_rule.arn
}

		

Step 8: Verifying the Pipeline

Manually trigger a rotation to test the full pipeline without waiting 30 days:

bash

			
# Force a rotation in AWS
aws secretsmanager rotate-secret \
  --secret-id prod/database/password \
  --region us-east-1
# Check Lambda execution logs
aws logs tail /aws/lambda/oci-secrets-sync --follow
# Verify the new version appeared in OCI Vault
oci vault secret get \
  --secret-id <your-oci-secret-ocid> \
  --query 'data.{name:"secret-name", version:"current-version-number", updated:"time-of-current-version-need-rotation"}' \
  --output table

		

A successful sync produces output similar to this in the Lambda logs:

			
INFO: Syncing secret: arn:aws:secretsmanager:us-east-1:123456789:secret:prod/database/password to OCI: ocid1.vaultsecret.oc1...
INFO: OCI secret updated. OCID: ocid1.vaultsecret.oc1..., New version: 3

Handling Failures and Drift

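The pipeline as built is purely event-driven: EventBridge invokes the Lambda asynchronously, so if the function still fails after its retries are exhausted, the OCI secret does not get updated. Add a dead-letter queue for failed invocations and a reconciliation function that runs on a schedule to catch any drift. The Lambda execution role also needs sqs:SendMessage on the queue for the failure destination to work.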

hcl

			
resource "aws_sqs_queue" "sync_dlq" {
  name                      = "secrets-sync-dlq"
  message_retention_seconds = 86400
}
resource "aws_lambda_function_event_invoke_config" "sync_retry" {
  function_name                = aws_lambda_function.secrets_sync.function_name
  maximum_retry_attempts       = 2
  maximum_event_age_in_seconds = 300
  destination_config {
    on_failure {
      destination = aws_sqs_queue.sync_dlq.arn
    }
  }
}

		

For reconciliation, a scheduled Lambda that runs every hour compares the LastRotatedDate on the AWS secret against the synced_at metadata tag on the OCI secret. If they differ by more than five minutes, it triggers a forced sync.

Security Considerations

A few things to keep in mind when running this in production.

The OCI private key stored in AWS Secrets Manager should be rotated periodically, just like any other credential. Add it to your rotation schedule.

Enable CloudTrail in AWS and OCI Audit logging so every access to both secrets stores is recorded. If something is off with the sync, the audit logs tell you exactly which principal made the change and when.

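Use VPC endpoints for Secrets Manager in AWS so the Lambda traffic never crosses the public internet when retrieving credentials. This requires attaching the Lambda to a VPC, and the function then needs an egress path such as a NAT gateway to reach the OCI API endpoints.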

On the OCI side, enable Vault audit logging to the OCI Logging service so every secret version write is captured.

Wrapping Up

This pipeline solves a real operational problem without requiring a third-party secrets broker. AWS Secrets Manager stays the authoritative source. OCI Vault stays current automatically. The only manual step is the initial deployment.

The pattern extends to other cross-cloud credential types. Database connection strings, API tokens, TLS certificates — any secret that needs to exist on both clouds can follow the same EventBridge to Lambda to OCI Vault flow. Extend the Lambda to support a mapping table of AWS secret ARNs to OCI secret OCIDs and one function handles your entire secrets estate across both providers.

Regards,
Osama

Building Kubernetes Sentinel: An AI-Powered Cluster Health Dashboard

Posted on April 6, 2026 (updated April 13, 2026) by Osama Mustafa in Application

When you manage Kubernetes clusters at scale, the hardest part is not keeping things running. It is knowing when something is about to break, understanding why it broke, and fixing it before it affects users. Traditional monitoring tools give you metrics and alerts, but they leave the diagnosis entirely up to you. You still have to correlate events, read logs, cross-reference namespaces, and figure out the right kubectl commands to run.

I wanted to change that. So I built Kubernetes Sentinel, an open-source dashboard that not only watches your entire cluster in real time but also uses Claude AI to explain what went wrong and tell you exactly how to fix it.

The Problem with Kubernetes Observability

Anyone who has been on call for a Kubernetes cluster knows the feeling. Your phone goes off at 2am. A pod is crashlooping. You open your terminal, start running kubectl commands, and spend the next twenty minutes piecing together what happened from logs, events, and resource descriptions spread across multiple namespaces.

The tooling has not kept up with the complexity. Prometheus and Grafana are powerful, but they require significant setup and expertise to use effectively. Most teams end up with dashboards full of graphs they never look at and alerts that fire so often they get ignored.

What I wanted was something simpler. A single view of the entire cluster, automatic detection of anything that looks wrong, and an AI that could look at the same data an experienced SRE would look at and tell me what is happening in plain English.

What Kubernetes Sentinel Does

Kubernetes Sentinel is a FastAPI backend that runs either locally or as a pod inside your cluster. It polls the Kubernetes API every 15 seconds across all namespaces, not just one, and stores the current state in memory. A React frontend connects to it over HTTP and receives live updates via Server-Sent Events.

The dashboard gives you four things at once. A health score from 0 to 100 that reflects the overall state of your cluster. A live pod table showing every pod across every namespace with restart counts, phase, and node assignment. An event stream showing everything Kubernetes has logged, filtered and color-coded by severity. And a resources view covering your nodes, deployments, services, and persistent volume claims.

On top of that, the backend continuously runs seven anomaly detection rules: CrashLoopBackOff, OOMKilled, NodeNotReady, FailedMount, BackOff, CPUThrottling, and high restart counts. When any of these fires, an anomaly banner appears at the top of the dashboard immediately.
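
To make the rule idea concrete, here is a minimal sketch of how three of those checks could look. It is not the project's actual code: the flattened pod dictionary shape, the Anomaly dataclass, and the restart threshold of 5 are assumptions for illustration. The reason strings themselves are the ones Kubernetes reports in container statuses.

```python
from dataclasses import dataclass

# Assumed threshold; the post does not state the value the project uses.
RESTART_THRESHOLD = 5


@dataclass
class Anomaly:
    rule: str
    namespace: str
    pod: str
    detail: str


def detect_pod_anomalies(pods: list[dict]) -> list[Anomaly]:
    """Apply three pod-level rules to a simplified, hypothetical pod snapshot."""
    found: list[Anomaly] = []
    for pod in pods:
        ns, name = pod["namespace"], pod["name"]
        for c in pod.get("containers", []):
            # state.waiting.reason in the real container status
            if c.get("waiting_reason") == "CrashLoopBackOff":
                found.append(Anomaly("CrashLoopBackOff", ns, name, c["name"]))
            # lastState.terminated.reason in the real container status
            if c.get("last_terminated_reason") == "OOMKilled":
                found.append(Anomaly("OOMKilled", ns, name, c["name"]))
            restarts = c.get("restart_count", 0)
            if restarts >= RESTART_THRESHOLD:
                found.append(
                    Anomaly("HighRestartCount", ns, name,
                            f"{c['name']} restarted {restarts} times")
                )
    return found
```

Keeping each rule as a plain check over a snapshot means the same function can run against mock data or a live poll result.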

The AI diagnosis feature is where it gets interesting. When you click Run Diagnosis, the backend assembles the current cluster state into a structured prompt and sends it to Claude. Within seconds you get back a plain-English summary of what is wrong, a root cause explanation, and three kubectl commands you can copy and run immediately to fix it. No more correlating events manually. No more searching Stack Overflow for the right flags.
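
As a rough illustration of what "assembling the cluster state into a structured prompt" might involve, the sketch below serializes a trimmed snapshot and wraps it in instructions. The function name build_diagnosis_prompt, the state keys, and the prompt wording are all assumptions; the project's real prompt may look quite different.

```python
import json


def build_diagnosis_prompt(cluster_state: dict, max_events: int = 20) -> str:
    """Turn a cluster snapshot into a single prompt string for the model."""
    trimmed = {
        "health_score": cluster_state.get("health_score"),
        "anomalies": cluster_state.get("anomalies", []),
        # Keep only the most recent events so the prompt stays small.
        "events": cluster_state.get("events", [])[-max_events:],
    }
    return (
        "You are an experienced Kubernetes SRE. Using the cluster state below, reply with:\n"
        "1. A plain-English summary of what is wrong.\n"
        "2. The most likely root cause.\n"
        "3. Exactly three kubectl commands to investigate or fix it.\n\n"
        "Cluster state (JSON):\n"
        + json.dumps(trimmed, indent=2, sort_keys=True)
    )
```

Trimming the event list before serializing is one simple way to bound prompt size on a noisy cluster.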

The Technical Decisions

I made a few deliberate architectural choices that I think are worth explaining.

The backend runs as a single process with one Uvicorn worker. This is intentional. The background polling thread lives inside the same process, so multiple workers would each start their own independent loop and you would end up with redundant API calls and inconsistent state. One process, one source of truth.
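
The single-process design can be sketched as one background thread that owns the shared state. This is a simplified illustration, not the project's code: the ClusterPoller class and the fetch_state callable are hypothetical names, and the 15-second default mirrors the polling interval mentioned earlier.

```python
import threading


class ClusterPoller:
    """One polling loop per process, holding the latest snapshot in memory."""

    def __init__(self, fetch_state, interval_seconds: float = 15.0):
        self._fetch_state = fetch_state      # callable returning a dict snapshot
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._state: dict = {}
        self._stop = threading.Event()
        self._thread: threading.Thread | None = None

    def poll_once(self) -> None:
        snapshot = self._fetch_state()
        with self._lock:
            self._state = snapshot

    def snapshot(self) -> dict:
        # Return a shallow copy so request handlers never mutate shared state.
        with self._lock:
            return dict(self._state)

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self.poll_once()
            except Exception as exc:  # keep the loop alive on transient API errors
                print(f"poll failed: {exc}")
            self._stop.wait(self._interval)

    def start(self) -> None:
        if self._thread is None:
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=self._interval + 1)
```

With several Uvicorn workers, each process would construct its own poller and hold its own copy of the state, which is exactly the duplication the single-worker choice avoids.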

Authentication with the Kubernetes API uses the official Python client, which handles both scenarios automatically. When the sentinel runs inside a cluster as a pod, it reads the ServiceAccount token that Kubernetes mounts automatically at a well-known path. When you run it locally for development, it falls back to your kubeconfig. The same code works in both environments without any changes.
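
With the official client, that fallback is commonly written as a try/except around the two loaders. The sketch below shows one explicit way to do it; the function name load_kubernetes_config and the returned labels are assumptions, and the project may structure this differently.

```python
def load_kubernetes_config() -> str:
    """Configure the Kubernetes client and report which credential source was used."""
    # Imported lazily so this module can be loaded without the dependency installed.
    from kubernetes import config

    try:
        # Inside a pod: uses the mounted ServiceAccount token and CA bundle.
        config.load_incluster_config()
        return "in-cluster"
    except config.ConfigException:
        # Outside a cluster: fall back to the local kubeconfig
        # (KUBECONFIG if set, otherwise ~/.kube/config).
        config.load_kube_config()
        return "kubeconfig"
```

Returning the source as a string makes it easy to log at startup which mode the backend is running in.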

The RBAC configuration is strictly read-only. The ClusterRole I wrote grants get, list, and watch on pods, events, nodes, services, persistent volume claims, configmaps, secrets, deployments, statefulsets, daemonsets, and replicasets. Nothing else. The sentinel can observe everything but change nothing. This was a hard requirement for me. A monitoring tool should never have write access to the cluster it is watching.

For the frontend I deliberately chose a single React file with no build step. The dashboard runs as a Claude.ai artifact or drops straight into any React project. There is nothing to compile, no node_modules to install, no webpack config to debug. The entire UI is one file you can read and understand in an afternoon.

I also added a DEV_MODE flag that bypasses the Kubernetes connection entirely and loads realistic mock data instead. This means anyone can clone the repo, set DEV_MODE=true, start the backend, and see the full dashboard working within five minutes even if they have never touched Kubernetes before. It made development much faster and makes the project far more accessible for contributors.

The Stack

The backend is Python 3.12 with FastAPI and the official Kubernetes client library. I used sse-starlette for Server-Sent Events, httpx for calling the Claude API, and Pydantic v2 for data validation. The Docker image is a two-stage build that ends up running as a non-root user.

The frontend is React 18 with no external UI library. All styling is plain inline JavaScript objects, which makes it trivially portable and means there are zero CSS conflicts when you embed it somewhere else.

Kubernetes manifests cover the full production deployment: namespace, ClusterRole, ClusterRoleBinding, ServiceAccount, ConfigMap, Deployment with liveness and readiness probes, and a ClusterIP Service. The Anthropic API key is never stored in any manifest file. It goes into a Kubernetes Secret created directly with kubectl.

What I Learned Building This

The biggest challenge was not the Kubernetes integration or the AI features. It was the import path problem. Claude Code generated all the backend files correctly, but because the server is started from inside the backend directory, every import had to be relative to that directory as the root. Files using from backend.core.x import y worked fine in isolation but crashed immediately when uvicorn tried to load them. Once I understood the issue it was a one-line fix in every file, but it cost me an hour of debugging.

The second thing I learned is that mock data is not optional for a project like this. Without DEV_MODE, you need a running Kubernetes cluster to develop against, which means either paying for cloud infrastructure or running a local cluster with kind. Adding ten lines of mock data to the poller made the development loop dramatically faster and opened the project up to contributors who want to work on the frontend without needing any cluster at all.

The AI diagnosis feature turned out to be far more useful than I expected. I assumed it would be a nice addition but not something I would rely on. After running it against realistic failure scenarios, the quality of the root cause analysis was genuinely impressive. It correctly identified memory limit misconfiguration from OOMKill events, correlated restart back-off with recent image pull failures, and suggested the right sequence of commands to investigate and resolve each issue.

Running It Yourself

The project is open source and available on GitHub. There are three ways to run it.

If you just want to see the dashboard without any cluster setup, clone the repo, copy .env.example to .env, set DEV_MODE=true and your Anthropic API key, then run uvicorn from the backend directory. The whole setup takes under five minutes.

If you have a Kubernetes cluster, set DEV_MODE=false and point it at your kubeconfig. The backend will start polling your real cluster immediately and the dashboard will show live data.

If you want to run it inside your cluster, build the Docker image, push it to your registry, create a Kubernetes Secret with your API key, and apply the manifests with kubectl. The deploy script handles the apply order automatically.

The repository is at https://github.com/OsamaOracle/k8s-sentinel/. Contributions, issues, and feedback are welcome.

Regards
Osama

Kubernetes in the Multi-cloud: Orchestrating Workloads Across AWS and OCI

Posted on March 8, 2026 by Osama Mustafa in Uncategorized

Why Multicloud Kubernetes Is No Longer Optional

The conversation has shifted. Running Kubernetes on a single cloud provider was once considered best practice: simpler networking, unified IAM, one support contract. But modern enterprise reality tells a different story.

Vendor lock-in risk, regional compliance mandates, cost arbitrage opportunities, and resilience requirements are pushing engineering teams to operate Kubernetes clusters across multiple clouds simultaneously. Among the most compelling combinations today is AWS (EKS) paired with Oracle Cloud Infrastructure (OCI/OKE), two providers with fundamentally different strengths that, when combined, can form a genuinely powerful platform.

This post walks through the architectural decisions, tooling choices, and operational patterns for running a production-grade multicloud Kubernetes setup spanning AWS EKS and OCI OKE.

Understanding What Each Cloud Brings

Before designing a multicloud strategy, you need to be honest about why you’re using each provider, not just “for redundancy.”

AWS EKS is mature, battle-tested, and has the richest ecosystem of Kubernetes-native tooling. Its managed node groups, Karpenter autoscaler, and deep integration with IAM Roles for Service Accounts (IRSA) make it a natural fit for compute-heavy, stateless microservices. The tradeoff: cost can escalate fast at scale.

OCI OKE (Oracle Container Engine for Kubernetes) is increasingly competitive on price, particularly for compute and egress, and has genuine strengths in Oracle Database integrations, bare metal instances, and deterministic network performance via its RDMA fabric. For workloads that touch Oracle DB, Exadata, or need high-throughput interconnects, OKE is not just a fallback; it’s the right tool.

The insight that unlocks a real multicloud strategy: stop treating one cloud as primary and the other as DR. Design for active-active.

The Core Architecture

A production multicloud Kubernetes setup across EKS and OKE requires solving four problems:

Cluster federation or virtual cluster abstraction
Cross-cloud networking
Unified identity and secrets management
Consistent GitOps delivery

Let’s break each down.

1. Cluster Federation: Choosing Your Control Plane Philosophy

There are two schools of thought:

Option A: Independent clusters, unified GitOps (recommended). Each cluster (EKS, OKE) is fully autonomous. A GitOps tool, typically Flux or Argo CD, manages both from a single source of truth. No shared control plane exists between clusters. Workloads are deployed to each cluster independently based on targeting labels or Kustomize overlays.

Option B: Virtual cluster mesh (Liqo, Admiralty, or Karmada). Tools like Karmada introduce a meta-control plane that federates multiple clusters. You submit workloads to the Karmada API server, and it distributes them across member clusters based on propagation policies.

For most teams, Option A is the right starting point. Karmada adds power but also operational complexity. The GitOps approach keeps the blast radius contained: a misconfiguration in one cluster doesn’t cascade.

2. Cross-Cloud Networking: The Hard Problem

Kubernetes pods in EKS can’t natively reach pods in OKE, and vice versa. You need a data plane that spans both clouds.

Recommended approach: WireGuard-based mesh with Cilium Cluster Mesh

Cilium’s Cluster Mesh feature allows pods across clusters to communicate using their native pod IPs, with WireGuard encryption in transit. The setup requires:

Each cluster runs Cilium as its CNI (replacing the default VPC CNI on EKS and the flannel-based CNI on OKE)
The two clusters are connected through Cilium’s clustermesh-apiserver (for example with the cilium clustermesh enable and cilium clustermesh connect commands)
Cross-cluster ServiceExport and ServiceImport resources (via the Kubernetes MCS API) expose services across the mesh

On the infrastructure layer, you need an encrypted tunnel between your AWS VPC and OCI VCN. Options:

Site-to-site VPN (quickest to set up, ~1.25 Gbps cap)
AWS Direct Connect + OCI FastConnect (for production: private, dedicated bandwidth)
Overlay via Tailscale or Netbird (great for dev/staging multicloud setups, not production-grade for high-throughput)

yaml

			
# Example: Cilium network policy allowing ingress from the OKE cluster over the mesh
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-cross-cluster-services
spec:
  endpointSelector: {}
  ingress:
    - fromEndpoints:
        - matchLabels:
            io.cilium.k8s.policy.cluster: oci-oke-prod

		

3. Unified Identity: IRSA on AWS, Workload Identity on OCI

This is where multicloud gets philosophically interesting. Each cloud has its own identity system, and they don’t speak the same language.

On AWS (EKS): Use IRSA (IAM Roles for Service Accounts). Your pod’s service account is annotated with an IAM role ARN. The Pod Identity Webhook injects environment variables that allow the AWS SDK to exchange a projected service account token for temporary AWS credentials.

On OCI (OKE): Use OCI Workload Identity, introduced in recent OKE versions. It works analogously to IRSA: an OCI IAM policy grants access to a specific workload, identified by its cluster, namespace, and Kubernetes service account, and the pod receives a workload identity token that can be exchanged for OCI API credentials.

The challenge: your application code should not need to know which cloud it’s running on. Use a secrets abstraction layer.

External Secrets Operator (ESO) elegantly solves this. Deploy ESO on both clusters. Point the EKS instance at AWS Secrets Manager; point the OKE instance at OCI Vault. Your application consumes a SecretStore resource with a consistent name. ESO handles the transparent fetching of backend-specific credentials.

			
# SecretStore on EKS (AWS Secrets Manager backend)
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: app-secrets
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
---
# SecretStore on OKE (OCI Vault backend): same name, different spec
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: app-secrets
spec:
  provider:
    oracle:
      vault: ocid1.vault.oc1...
      region: us-ashburn-1
      # Authenticate with OKE Workload Identity
      principalType: Workload

Your application's ExternalSecret resources reference app-secrets in both environments; the YAML is identical.

4. GitOps: One Repository, Multiple Targets

Use Argo CD ApplicationSets or Flux's Kustomization with cluster selectors to manage both clusters from a monorepo.

A typical repo layout:

/clusters
  /eks-us-east-1
    kustomization.yaml    # EKS-specific patches
  /oke-us-ashburn-1
    kustomization.yaml    # OKE-specific patches
/base
  /apps
    deployment.yaml
    service.yaml
  /infra
    external-secrets.yaml
    cilium-config.yaml

		

Flux’s Kustomization resource lets you target specific clusters using the cluster’s kubeconfig context or label selectors. Argo CD’s ApplicationSet with a list generator can enumerate your clusters and deploy the same app with environment-specific values.

The key rule: the base layer must be cloud-agnostic. Patches in cluster-specific overlays handle anything that diverges: storage classes, ingress annotations, node selectors.
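
To make the base/overlay split concrete, here is a minimal Python sketch of a cluster-specific patch layering on top of a cloud-agnostic base. It is a simplified recursive dictionary merge, not Kustomize's actual strategic-merge algorithm, and the field names and storage class values are assumptions chosen for illustration.

```python
from copy import deepcopy


def apply_overlay(base: dict, patch: dict) -> dict:
    """Return a copy of base with patch merged on top (patch wins on conflicts)."""
    merged = deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Cloud-agnostic base: nothing in here names a provider.
base = {"replicas": 3, "storage": {"size": "20Gi"}, "nodeSelector": {}}

# Hypothetical per-cluster overlays carrying only what diverges.
eks_patch = {"storage": {"class": "gp3"}}
oke_patch = {"storage": {"class": "oci-bv"}, "replicas": 2}

eks = apply_overlay(base, eks_patch)
oke = apply_overlay(base, oke_patch)
print(eks)
print(oke)
```

If an overlay has to restate something that is already in the base, that is usually a sign the base is not as cloud-agnostic as it should be.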

Observability Across Clouds

A multicloud cluster setup with no unified observability is an incident waiting to happen.

Recommended stack:

Prometheus + Thanos for metrics: each cluster runs Prometheus; Thanos Sidecar ships blocks to object storage (S3 on AWS, OCI Object Storage on OCI); Thanos Querier federates across both
Grafana with both Thanos endpoints as datasources: a single pane of glass
OpenTelemetry Collector deployed as a DaemonSet on each cluster, shipping traces to a common backend (Grafana Tempo, Jaeger, or Honeycomb)
Loki for logs, with agents on each cluster shipping to a common Loki instance

Label discipline is critical: ensure every metric, trace, and log carries cluster, cloud_provider, and region labels from the source. Without this, correlation during incidents across clouds becomes extremely difficult.
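
One way to enforce this is a small check in CI or in a relabeling test. The sketch below is only an illustration of the idea: the three required label names come from the paragraph above, while the sample label sets are made up.

```python
REQUIRED_LABELS = ("cluster", "cloud_provider", "region")


def missing_labels(labels: dict) -> list:
    """Return the required labels that are absent or empty, in a stable order."""
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


# Made-up label sets, one per cluster.
samples = [
    {"__name__": "http_requests_total", "cluster": "eks-us-east-1",
     "cloud_provider": "aws", "region": "us-east-1"},
    {"__name__": "http_requests_total", "cluster": "oke-us-ashburn-1",
     "region": "us-ashburn-1"},
]

for sample in samples:
    gaps = missing_labels(sample)
    status = "ok" if not gaps else "missing " + ", ".join(gaps)
    print(sample["cluster"], "->", status)
```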

Cost Management: The Overlooked Dimension

Multicloud adds a new cost vector: egress. Data leaving AWS costs money. Data entering OCI is free. Cross-cloud service calls that seemed free in a single-cloud setup now carry per-GB charges.

Practical rules:

Colocate tightly coupled services in the same cluster/cloud; don’t split microservices that call each other thousands of times per second across clouds
Use Cilium’s network policy to audit cross-cluster traffic volume before enabling services in the mesh
Consider OCI’s generous free egress allowance (the first 10 TB per month at the time of writing) for user-facing workloads where latency to OCI regions is acceptable
Tag every namespace with cost center labels and use Kubecost or OpenCost deployed on each cluster with a shared object storage backend for unified cost attribution
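
To see why chatty cross-cloud calls hurt, it helps to run the arithmetic. This is a back-of-the-envelope sketch: the request rate, response size, and the 0.09 USD per GB price are all assumptions, so plug in your own traffic figures and your provider's current rate.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def monthly_egress_gb(requests_per_second: float, bytes_per_response: int) -> float:
    """Data leaving the source cloud per month, in decimal gigabytes."""
    return requests_per_second * bytes_per_response * SECONDS_PER_MONTH / 1e9


def monthly_egress_cost(requests_per_second: float, bytes_per_response: int,
                        price_per_gb: float) -> float:
    """Monthly egress charge for a steady cross-cloud call pattern."""
    return monthly_egress_gb(requests_per_second, bytes_per_response) * price_per_gb


# Assumed figures: 1,000 calls/s, 10 KB responses, 0.09 USD per GB.
gb = monthly_egress_gb(1000, 10_000)
cost = monthly_egress_cost(1000, 10_000, 0.09)
print(f"{gb:,.0f} GB/month, about {cost:,.0f} USD/month")
```

With those assumed numbers a single service pair moves roughly 26 TB a month across the boundary, which is real money for a call that was free inside one cloud.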

Operational Runbook Considerations

A few things that will bite you if not planned for:

Clock skew: mTLS certificates and OIDC token validation are sensitive to time drift. Ensure NTP is configured identically on all nodes across both clouds. A 5-minute clock skew will silently break IRSA on EKS and workload identity on OKE.
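
Here is a small sketch of why skew bites. It is a generic check of a token's validity window, not the actual validation logic of AWS STS or OCI, and the timestamps and leeway values are made up.

```python
def token_time_valid(claims: dict, now: float, leeway: float = 0.0) -> bool:
    """Check a token's nbf/exp window against the verifier's clock."""
    not_before = claims.get("nbf", claims.get("iat", 0))
    expires = claims["exp"]
    return (not_before - leeway) <= now < (expires + leeway)


issued = 1_700_000_000  # issuer's clock when the token was minted
claims = {"iat": issued, "nbf": issued, "exp": issued + 3600}

# A verifier whose clock runs 5 minutes behind sees a token "from the future".
behind = issued - 300
strict = token_time_valid(claims, behind)
lenient = token_time_valid(claims, behind, leeway=600)
print(strict, lenient)
```

The strict check rejects a perfectly good token, and nothing in the error tells you the clock is the problem. That is why I treat NTP as part of the identity setup.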

DNS: Use ExternalDNS on both clusters pointing to a shared DNS provider (Route 53, Cloudflare). Services that need cross-cloud discoverability get DNS entries automatically on deploy.

Cluster upgrades: EKS and OKE release Kubernetes versions on different schedules. Maintain a maximum one-minor-version skew between clusters. Use a canary upgrade pattern: upgrade your OKE cluster first (typically lower blast radius), validate for 48 hours, then upgrade EKS.
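
The skew rule is easy to turn into a pre-upgrade gate. A minimal sketch, assuming version strings shaped like 1.29 or v1.30.2 (how you fetch them from each cluster is up to you):

```python
def minor_version(version: str) -> int:
    """Extract the minor number from strings like '1.29' or 'v1.30.2'."""
    parts = version.lstrip("v").split(".")
    return int(parts[1])


def within_skew(version_a: str, version_b: str, max_skew: int = 1) -> bool:
    """True when the two clusters are at most max_skew minor versions apart."""
    return abs(minor_version(version_a) - minor_version(version_b)) <= max_skew


print(within_skew("1.29", "1.30"))
print(within_skew("v1.28.5", "1.30"))
```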

Node image parity: Your application containers are cloud-agnostic, but your node OS images are not. Use Bottlerocket on EKS and Oracle Linux 8 on OKE; both are minimal, hardened, and have predictable patching cycles.

When NOT to Do This

Multicloud Kubernetes is a force multiplier, but only if your team has the operational maturity to support it.

Don’t pursue this architecture if:

Your team is still stabilizing single-cluster Kubernetes operations
Your workloads have no actual cross-cloud requirement (cost, compliance, or resilience)
You lack dedicated platform engineering capacity to maintain the toolchain
Your application isn’t designed for network partitioning tolerance

A well-run single-cloud EKS or OKE setup will outperform a poorly-run multicloud one every time. Add complexity only when you’ve exhausted simpler options.

Closing Thoughts

The multicloud Kubernetes story has matured considerably. Tools like Cilium Cluster Mesh, External Secrets Operator, Karmada, and OpenTelemetry have closed most of the operational gaps that made this approach impractical two years ago.

The AWS + OCI combination in particular is underrated. AWS brings ecosystem breadth; OCI brings pricing, Oracle database integration, and a network fabric that punches above its weight. For the right workloads, and with the right tooling discipline, the combination is genuinely compelling.

The architecture isn’t magic. It’s plumbing. But when it’s done right, it disappears and your developers ship to two clouds the same way they ship to one.

Have questions about multicloud Kubernetes design or EKS/OKE specifics? Reach out or leave a comment below.

Building a Multi-Cloud Secrets Management Strategy with HashiCorp Vault

Posted on February 22, 2026 by Osama Mustafa in Application

Let me ask you something. Where are your database passwords right now? Your API keys? Your TLS certificates?

If you’re like most teams I’ve worked with, the honest answer is “scattered everywhere.” Some are in environment variables. Some are in Kubernetes secrets (base64 encoded, which isn’t encryption by the way). A few are probably still hardcoded in configuration files that someone committed to Git three years ago.

I’m not judging. We’ve all been there. But as your infrastructure grows across multiple clouds, this approach becomes a ticking time bomb. One leaked credential can compromise everything.

In this article, I’ll show you how to build a centralized secrets management strategy using HashiCorp Vault. We’ll deploy it properly, integrate it with AWS, Azure, and GCP, and set up dynamic secrets that rotate automatically. No more shared passwords. No more “who has access to what” mysteries.

Why Vault? Why Now?

Before we dive into implementation, let me explain why I recommend Vault over cloud-native solutions like AWS Secrets Manager, Azure Key Vault, or GCP Secret Manager.

Don’t get me wrong. Those services are excellent. If you’re running entirely on one cloud, they might be all you need. But here’s the reality for most organizations:

You have workloads on AWS. Your data team uses GCP for BigQuery. Your enterprise applications run on Azure. Maybe you still have some on-premises systems. And you need a consistent way to manage secrets across all of them.

Vault gives you that single control plane. One audit log. One policy engine. One place to rotate credentials. And it integrates with everything.

Architecture Overview

Here’s what we’re building:

The key principle here is that applications never store long-lived credentials. Instead, they authenticate to Vault and receive short-lived, automatically rotated credentials for the specific resources they need.

Step 1: Deploy Vault on Kubernetes

I prefer running Vault on Kubernetes because it gives you high availability, easy scaling, and integrates beautifully with your existing workloads. We’ll use the official Helm chart.

Prerequisites

You’ll need a Kubernetes cluster. Any managed Kubernetes service works: EKS, AKS, GKE, or even OKE. For this guide, I’ll use commands that work across all of them.

Create the Namespace and Storage

bash

			
kubectl create namespace vault
# Create storage class for Vault data
# This example uses AWS EBS, adjust for your cloud
cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vault-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
EOF

		

Configure Vault Helm Values

yaml

			
# vault-values.yaml
global:
  enabled: true
  tlsDisable: false
injector:
  enabled: true
  replicas: 2
  
  resources:
    requests:
      memory: 256Mi
      cpu: 250m
    limits:
      memory: 512Mi
      cpu: 500m
server:
  enabled: true
  
  # Run 3 replicas for high availability
  ha:
    enabled: true
    replicas: 3
    
    # Use Raft for integrated storage
    raft:
      enabled: true
      setNodeId: true
      
      config: |
        ui = true
        
        listener "tcp" {
          tls_disable = false
          address = "[::]:8200"
          cluster_address = "[::]:8201"
          tls_cert_file = "/vault/userconfig/vault-tls/tls.crt"
          tls_key_file = "/vault/userconfig/vault-tls/tls.key"
        }
        
        storage "raft" {
          path = "/vault/data"
          
          retry_join {
            leader_api_addr = "https://vault-0.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/vault-tls/ca.crt"
          }
          retry_join {
            leader_api_addr = "https://vault-1.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/vault-tls/ca.crt"
          }
          retry_join {
            leader_api_addr = "https://vault-2.vault-internal:8200"
            leader_ca_cert_file = "/vault/userconfig/vault-tls/ca.crt"
          }
        }
        
        service_registration "kubernetes" {}
        
        seal "awskms" {
          region     = "us-east-1"
          kms_key_id = "alias/vault-unseal-key"
        }
  
  resources:
    requests:
      memory: 1Gi
      cpu: 500m
    limits:
      memory: 2Gi
      cpu: 2000m
  
  dataStorage:
    enabled: true
    size: 20Gi
    storageClass: vault-storage
  
  auditStorage:
    enabled: true
    size: 10Gi
    storageClass: vault-storage
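  # Mount the vault-tls secret; the chart places it under /vault/userconfig/vault-tls
  extraVolumes:
    - type: secret
      name: vault-tls
  # Service account for cloud integrations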
  serviceAccount:
    create: true
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/vault-server-role
ui:
  enabled: true
  serviceType: LoadBalancer
  
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"

		

Generate TLS Certificates

Vault should always use TLS. Here’s how to create certificates using cert-manager:

yaml

			
# vault-certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: vault-tls
  namespace: vault
spec:
  secretName: vault-tls
  duration: 8760h # 1 year
  renewBefore: 720h # 30 days
  subject:
    organizations:
      - YourCompany
  commonName: vault.vault.svc.cluster.local
  dnsNames:
    - vault
    - vault.vault
    - vault.vault.svc
    - vault.vault.svc.cluster.local
    - vault-0.vault-internal
    - vault-1.vault-internal
    - vault-2.vault-internal
    - "*.vault-internal"
  ipAddresses:
    - 127.0.0.1
  issuerRef:
    name: cluster-issuer
    kind: ClusterIssuer

		

Install Vault

bash

			
helm repo add hashicorp https://helm.releases.hashicorp.com
helm repo update
helm install vault hashicorp/vault \
  --namespace vault \
  --values vault-values.yaml \
  --version 0.27.0

		

Initialize and Unseal

This is a one-time operation. Keep these keys safe. I mean really safe. Like offline, in multiple secure locations.

bash

			
# Initialize Vault
# With the awskms seal configured above, init returns recovery keys rather
# than unseal keys, so the recovery flags apply here
kubectl exec -n vault vault-0 -- vault operator init \
  -recovery-shares=5 \
  -recovery-threshold=3 \
  -format=json > vault-init.json
# The output contains your recovery keys and root token
# Store these securely!
# Without auto-unseal you would init with -key-shares/-key-threshold instead
# and unseal each pod manually:
# kubectl exec -n vault vault-0 -- vault operator unseal <key1>
# kubectl exec -n vault vault-0 -- vault operator unseal <key2>
# kubectl exec -n vault vault-0 -- vault operator unseal <key3>
# With AWS KMS auto-unseal configured, Vault unseals automatically

		

Step 2: Configure Authentication Methods

Now we need to tell Vault how applications will authenticate. This is where it gets interesting.

Kubernetes Authentication

Applications running in Kubernetes can authenticate using their service account tokens. No passwords needed.

bash

			
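# Run these from a shell inside a Vault pod: the token, CA file, and
# KUBERNETES_* variable used below only exist there
# Enable Kubernetes auth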
vault auth enable kubernetes
# Configure it to trust our cluster
vault write auth/kubernetes/config \
  kubernetes_host="https://$KUBERNETES_PORT_443_TCP_ADDR:443" \
  token_reviewer_jwt="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
  kubernetes_ca_cert=@/var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  issuer="https://kubernetes.default.svc.cluster.local"

		

AWS IAM Authentication

Workloads running on EC2, Lambda, or ECS can authenticate using their IAM roles.

bash

			
# Enable AWS auth
vault auth enable aws
# Configure AWS credentials for Vault to verify requests
vault write auth/aws/config/client \
  secret_key=$AWS_SECRET_KEY \
  access_key=$AWS_ACCESS_KEY
# Create a role that EC2 instances can use
vault write auth/aws/role/ec2-app-role \
  auth_type=iam \
  bound_iam_principal_arn="arn:aws:iam::ACCOUNT_ID:role/app-server-role" \
  policies=app-policy \
  ttl=1h

		

Azure Authentication

For Azure workloads using Managed Identities:

bash

			
# Enable Azure auth
vault auth enable azure
# Configure Azure
vault write auth/azure/config \
  tenant_id=$AZURE_TENANT_ID \
  resource="https://management.azure.com/" \
  client_id=$AZURE_CLIENT_ID \
  client_secret=$AZURE_CLIENT_SECRET
# Create a role for Azure VMs
vault write auth/azure/role/azure-app-role \
  policies=app-policy \
  bound_subscription_ids=$AZURE_SUBSCRIPTION_ID \
  bound_resource_groups=production-rg \
  ttl=1h

		

GCP Authentication

For GCP workloads using service accounts:

bash

			
# Enable GCP auth
vault auth enable gcp
# Configure GCP
vault write auth/gcp/config \
  credentials=@gcp-credentials.json
# Create a role for GCE instances
vault write auth/gcp/role/gce-app-role \
  type="gce" \
  policies=app-policy \
  bound_projects="my-project-id" \
  bound_zones="us-central1-a,us-central1-b" \
  ttl=1h

		

Step 3: Set Up Dynamic Secrets

Here’s where the magic happens. Instead of storing static database passwords, Vault can generate unique credentials on demand and revoke them automatically when they expire.

Dynamic AWS Credentials

bash

			
# Enable AWS secrets engine
vault secrets enable aws
# Configure root credentials (Vault uses these to create dynamic creds)
vault write aws/config/root \
  access_key=$AWS_ACCESS_KEY \
  secret_key=$AWS_SECRET_KEY \
  region=us-east-1
# Create a role that generates S3 read-only credentials
vault write aws/roles/s3-reader \
  credential_type=iam_user \
  policy_document=-<<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}
EOF
# Now any authenticated client can get temporary AWS credentials
vault read aws/creds/s3-reader
# Returns:
# access_key     AKIA...
# secret_key     xyz123...
# lease_duration 1h
# These credentials will be automatically revoked after 1 hour

		

Dynamic Database Credentials

This is probably my favorite feature. Every time an application needs to connect to a database, it gets a unique username and password that only it knows.

bash

			
# Enable database secrets engine
vault secrets enable database
# Configure PostgreSQL connection
vault write database/config/production-postgres \
  plugin_name=postgresql-database-plugin \
  allowed_roles="app-readonly,app-readwrite" \
  connection_url="postgresql://{{username}}:{{password}}@db.example.com:5432/appdb?sslmode=require" \
  username="vault_admin" \
  password="vault_admin_password"
# Create a read-only role
vault write database/roles/app-readonly \
  db_name=production-postgres \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; \
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  revocation_statements="DROP ROLE IF EXISTS \"{{name}}\";" \
  default_ttl="1h" \
  max_ttl="24h"
# Create a read-write role
vault write database/roles/app-readwrite \
  db_name=production-postgres \
  creation_statements="CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; \
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  revocation_statements="DROP ROLE IF EXISTS \"{{name}}\";" \
  default_ttl="1h" \
  max_ttl="24h"

		

Now when your application requests credentials:

bash

			
vault read database/creds/app-readonly
# Returns:
# username    v-kubernetes-app-readonly-abc123
# password    A1B2C3D4E5F6...
# lease_duration 1h

		

Every request gets a different username and password. If credentials are compromised, they expire automatically. And you have a complete audit trail of who accessed what, when.
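
Short leases mean your application has to plan for renewal. The sketch below shows one simple policy: renew or re-request once two thirds of the lease has elapsed. The two-thirds fraction is my assumption for illustration, not something Vault mandates, and the timestamps are made up.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(issued_at: datetime, lease_seconds: int) -> datetime:
    """When to renew or re-request: after two thirds of the lease has elapsed."""
    return issued_at + timedelta(seconds=lease_seconds * 2 // 3)


def is_expired(issued_at: datetime, lease_seconds: int, now: datetime) -> bool:
    """True once the full lease duration has passed."""
    return now >= issued_at + timedelta(seconds=lease_seconds)


issued = datetime(2026, 2, 22, 12, 0, tzinfo=timezone.utc)
renew_at = renewal_time(issued, 3600)  # lease_duration of 1h, as in the output above
print(renew_at.isoformat())
print(is_expired(issued, 3600, issued + timedelta(minutes=61)))
```

Renewing well before expiry leaves room for a retry if Vault or the network is briefly unavailable.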

Dynamic Azure Credentials

bash

			
# Enable Azure secrets engine
vault secrets enable azure
# Configure Azure
vault write azure/config \
  subscription_id=$AZURE_SUBSCRIPTION_ID \
  tenant_id=$AZURE_TENANT_ID \
  client_id=$AZURE_CLIENT_ID \
  client_secret=$AZURE_CLIENT_SECRET
# Create a role that generates Azure Service Principals
vault write azure/roles/contributor \
  ttl=1h \
  azure_roles=-<<EOF
[
  {
    "role_name": "Contributor",
    "scope": "/subscriptions/$AZURE_SUBSCRIPTION_ID/resourceGroups/production-rg"
  }
]
EOF

		

Step 4: Application Integration

Let’s see how applications actually use Vault. I’ll show you several patterns.

Pattern 1: Vault Agent Sidecar (Kubernetes)

This is my recommended approach for Kubernetes. Vault Agent runs alongside your application and handles authentication and secret retrieval automatically.

yaml

			
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
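spec:
  # A Deployment requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations: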
        # These annotations tell Vault Agent what to do
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "my-app-role"
        vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/app-readonly"
        vault.hashicorp.com/agent-inject-template-db-creds: |
          {{- with secret "database/creds/app-readonly" -}}
          export DB_USERNAME="{{ .Data.username }}"
          export DB_PASSWORD="{{ .Data.password }}"
          {{- end }}
    spec:
      serviceAccountName: my-app
      containers:
      - name: my-app
        image: my-app:latest
        command: ["/bin/sh", "-c"]
        args:
          - source /vault/secrets/db-creds && ./start-app.sh

		

When this pod starts, Vault Agent automatically:

Authenticates to Vault using the Kubernetes service account
Retrieves database credentials
Writes them to /vault/secrets/db-creds
Renews the credentials before they expire
Updates the file when credentials change

Your application just reads from a file. It doesn’t need to know anything about Vault.
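
If your application is not started from a shell, it can parse the rendered file itself. A minimal sketch, assuming the file contains the export KEY="value" lines produced by the template above; the sample values are made up, and in a real pod you would read the text from /vault/secrets/db-creds.

```python
import shlex


def parse_exports(text: str) -> dict:
    """Parse lines of the form export KEY="value" into a dict."""
    values = {}
    for line in text.splitlines():
        tokens = shlex.split(line, comments=True)
        if len(tokens) == 2 and tokens[0] == "export" and "=" in tokens[1]:
            key, _, value = tokens[1].partition("=")
            values[key] = value
    return values


# Stand-in for the contents of /vault/secrets/db-creds
rendered = '''export DB_USERNAME="v-kubernetes-app-readonly-abc123"
export DB_PASSWORD="A1B2C3D4E5F6"
'''
creds = parse_exports(rendered)
print(creds["DB_USERNAME"])
```

Re-read the file on reconnect rather than caching the values at startup, since Vault Agent rewrites it when the credentials change.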

Pattern 2: Direct SDK Integration

For applications that need more control, you can use the Vault SDK directly:

python

			
# Python example
import os

import hvac
import psycopg2


def get_vault_client():
    """Create Vault client using Kubernetes auth."""
    client = hvac.Client(url=os.environ['VAULT_ADDR'])
    
    # Read the service account token
    with open('/var/run/secrets/kubernetes.io/serviceaccount/token') as f:
        jwt = f.read()
    
    # Authenticate to Vault
    client.auth.kubernetes.login(
        role='my-app-role',
        jwt=jwt,
        mount_point='kubernetes'
    )
    
    return client
def get_database_credentials():
    """Get dynamic database credentials."""
    client = get_vault_client()
    
    # Request new database credentials
    response = client.secrets.database.generate_credentials(
        name='app-readonly',
        mount_point='database'
    )
    
    return {
        'username': response['data']['username'],
        'password': response['data']['password'],
        'lease_id': response['lease_id'],
        'lease_duration': response['lease_duration']
    }
def connect_to_database():
    """Connect to database with dynamic credentials."""
    creds = get_database_credentials()
    
    connection = psycopg2.connect(
        host='db.example.com',
        database='appdb',
        user=creds['username'],
        password=creds['password']
    )
    
    return connection

		

Pattern 3: External Secrets Operator

If you prefer Kubernetes-native secrets, use External Secrets Operator to sync Vault secrets to Kubernetes:

yaml

			
# external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: app-secrets
    creationPolicy: Owner
  data:
  - secretKey: api-key
    remoteRef:
      key: secret/data/app/api-key
      property: value
  - secretKey: db-password
    remoteRef:
      key: secret/data/app/database
      property: password

		

Step 5: Policies and Access Control

Vault policies determine who can access what. Be specific and follow the principle of least privilege.

hcl

			
# app-policy.hcl
# Allow reading dynamic database credentials
path "database/creds/app-readonly" {
  capabilities = ["read"]
}
# Allow reading application secrets
path "secret/data/app/*" {
  capabilities = ["read", "list"]
}
# Deny access to admin paths
path "sys/*" {
  capabilities = ["deny"]
}
# Allow the app to renew its own token
path "auth/token/renew-self" {
  capabilities = ["update"]
}

		

Apply the policy:

bash

			
vault policy write app-policy app-policy.hcl
# Create a Kubernetes auth role that uses this policy
vault write auth/kubernetes/role/my-app-role \
  bound_service_account_names=my-app \
  bound_service_account_namespaces=production \
  policies=app-policy \
  ttl=1h

		

Step 6: Monitoring and Audit

You need visibility into who’s accessing secrets. Enable audit logging:

bash

			
# Enable file audit device
vault audit enable file file_path=/vault/audit/vault-audit.log
# Enable syslog for centralized logging
vault audit enable syslog tag="vault" facility="AUTH"

For monitoring, Vault exposes Prometheus metrics:

yaml

			
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vault
  namespace: vault
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: vault
  endpoints:
  - port: https  # the Vault chart names the service port "https" when TLS is enabled
    path: /v1/sys/metrics
    params:
      format: ["prometheus"]
    scheme: https
    tlsConfig:
      insecureSkipVerify: true

		

Key metrics to alert on:

yaml

			
# Prometheus alerting rules
groups:
- name: vault
  rules:
  - alert: VaultSealed
    expr: vault_core_unsealed == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Vault is sealed"
      description: "Vault instance {{ $labels.instance }} is sealed and unable to serve requests"
  
  - alert: VaultTooManyPendingTokens
    expr: vault_token_count > 10000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Too many Vault tokens"
      description: "Vault has {{ $value }} active tokens. Consider reducing TTLs."
  
  - alert: VaultLeadershipLost
    expr: increase(vault_core_leadership_lost_count[5m]) > 0
    labels:
      severity: warning
    annotations:
      summary: "Vault leadership changes detected"

		

Common Mistakes to Avoid

Let me save you some headaches by sharing mistakes I’ve seen (and made):

Mistake 1: Using the root token for applications

The root token has unlimited access. Create specific policies and tokens for each application.

Mistake 2: Not rotating the root token

After initial setup, generate a new root token and revoke the original:

bash

			
vault operator generate-root -init
# Follow the process to generate a new root token
vault token revoke <old-root-token>

Mistake 3: Setting TTLs too long

Short TTLs mean compromised credentials are valid for less time. Start with 1 hour and adjust based on your needs.

Mistake 4: Not testing recovery procedures

Practice unsealing Vault. Practice recovering from backup. Do it regularly. The worst time to learn is during an actual incident.

Mistake 5: Storing unseal keys together

Distribute unseal keys to different people in different locations. Use a threshold scheme (3 of 5) so no single person can unseal Vault.
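
If you have never looked at how a 3-of-5 threshold scheme works, here is a small illustration of Shamir's Secret Sharing. This is a teaching sketch over a large prime field. Vault's own implementation splits the key byte by byte over a different field, so treat this as the idea, not Vault's code.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; all arithmetic happens modulo this


def split_secret(secret: int, shares: int, threshold: int) -> list:
    """Split secret into `shares` points; any `threshold` of them recover it."""
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    points = []
    for x in range(1, shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        points.append((x, y))
    return points


def recover_secret(points: list) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = secrets.randbelow(PRIME)
shares = split_secret(key, shares=5, threshold=3)
print(recover_secret(shares[:3]) == key)
print(recover_secret([shares[0], shares[2], shares[4]]) == key)
```

Any three shares rebuild the key, while one or two shares on their own reveal nothing about it. That is why spreading them across people and locations actually buys you something.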

Regards, Enjoy the Cloud
Osama

Building a Multi-Cloud Architecture with OCI and AWS: A Real-World Integration Guide

Posted on February 19, 2026 by Osama Mustafa in Cloud

I’ll tell you something that might sound controversial in cloud circles: the best cloud is often more than one cloud.

I’ve worked with dozens of enterprises over the years, and here’s what I’ve noticed. Some started with AWS years ago and built their entire infrastructure there. Then they realized Oracle Autonomous Database or Exadata could dramatically improve their database performance. Others were Oracle shops that wanted to leverage AWS’s machine learning services or global edge network.

The question isn’t really “which cloud is better?” The question is “how do we get the best of both?”

In this article, I’ll walk you through building a practical multi-cloud architecture connecting OCI and AWS. We’ll cover secure networking, data synchronization, identity federation, and the operational realities of running workloads across both platforms.

Why Multi-Cloud Actually Makes Sense

Let me be clear about something. Multi-cloud for its own sake is a terrible idea. It adds complexity, increases operational burden, and creates more things that can break. But multi-cloud for the right reasons? That’s a different story.

Here are legitimate reasons I’ve seen organizations adopt OCI and AWS together:

Database Performance: Oracle Autonomous Database and Exadata Cloud Service are genuinely difficult to match for Oracle workloads. If you’re running complex OLTP or analytics on Oracle, OCI’s database offerings are purpose-built for that.

AWS Ecosystem: AWS has services that simply don’t exist elsewhere. SageMaker for ML, Lambda’s maturity, CloudFront’s global presence, or specialized services like Rekognition and Comprehend.

Vendor Negotiation: Having workloads on multiple clouds gives you negotiating leverage. I’ve seen organizations save millions in licensing by demonstrating they could move workloads.

Acquisition and Mergers: Company A runs on AWS, Company B runs on OCI. Now they’re one company. Multi-cloud by necessity.

Regulatory Requirements: Some industries require data sovereignty or specific compliance certifications that might be easier to achieve with a particular provider in a particular region.

If none of these apply to you, stick with one cloud. Seriously. But if they do, keep reading.

Architecture Overview

Let’s design a realistic scenario. We have an e-commerce company with:

Application tier running on AWS (EKS, Lambda, API Gateway)
Core transactional database on OCI (Autonomous Transaction Processing)
Data warehouse on OCI (Autonomous Data Warehouse)
Machine learning workloads on AWS (SageMaker)
Shared data that needs to flow between both clouds

Setting Up Cross-Cloud Networking

The foundation of any multi-cloud architecture is networking. You need a secure, reliable, and performant connection between clouds.

Option 1: IPSec VPN (Good for Starting Out)

IPSec VPN is the quickest way to connect AWS and OCI. It runs over the public internet but encrypts everything. Good for development, testing, or low-bandwidth production workloads.

On OCI Side:

First, create a Dynamic Routing Gateway (DRG) and attach it to your VCN:

bash

			
# Create DRG
oci network drg create \
  --compartment-id $COMPARTMENT_ID \
  --display-name "aws-interconnect-drg"
# Attach DRG to VCN
oci network drg-attachment create \
  --drg-id $DRG_ID \
  --vcn-id $VCN_ID \
  --display-name "vcn-attachment"

		

Create a Customer Premises Equipment (CPE) object representing AWS:

bash

			
# Create CPE for AWS VPN endpoint
oci network cpe create \
  --compartment-id $COMPARTMENT_ID \
  --ip-address $AWS_VPN_PUBLIC_IP \
  --display-name "aws-vpn-endpoint"

		

Create the IPSec connection:

bash

			
# Create IPSec connection
oci network ip-sec-connection create \
  --compartment-id $COMPARTMENT_ID \
  --cpe-id $CPE_ID \
  --drg-id $DRG_ID \
  --static-routes '["10.1.0.0/16"]' \
  --display-name "oci-to-aws-vpn"

		

On AWS Side:

Create a Customer Gateway pointing to OCI:

bash

			
# Create Customer Gateway
aws ec2 create-customer-gateway \
  --type ipsec.1 \
  --public-ip $OCI_VPN_PUBLIC_IP \
  --bgp-asn 65000
# Create VPN Gateway
aws ec2 create-vpn-gateway \
  --type ipsec.1
# Attach to VPC
aws ec2 attach-vpn-gateway \
  --vpn-gateway-id $VGW_ID \
  --vpc-id $VPC_ID
# Create VPN Connection
aws ec2 create-vpn-connection \
  --type ipsec.1 \
  --customer-gateway-id $CGW_ID \
  --vpn-gateway-id $VGW_ID \
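  --options '{"StaticRoutesOnly": true}'
# With static routing, add the OCI VCN CIDR to the VPN connection explicitly
# ($VPN_CONNECTION_ID is the ID returned by the previous command)
aws ec2 create-vpn-connection-route \
  --vpn-connection-id $VPN_CONNECTION_ID \
  --destination-cidr-block 10.2.0.0/16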

		

Update route tables on both sides:

bash

			
# AWS: Add route to OCI CIDR
aws ec2 create-route \
  --route-table-id $ROUTE_TABLE_ID \
  --destination-cidr-block 10.2.0.0/16 \
  --gateway-id $VGW_ID
# OCI: Add route to AWS CIDR
oci network route-table update \
  --rt-id $ROUTE_TABLE_ID \
  --route-rules '[{
    "destination": "10.1.0.0/16",
    "destinationType": "CIDR_BLOCK",
    "networkEntityId": "'$DRG_ID'"
  }]'
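
Before you create any of these routes, check that the address spaces on both sides do not overlap, because overlapping CIDRs cannot be routed across the tunnel. A quick sketch using Python's standard ipaddress module; the two CIDRs are the ones used in the commands above, while the names and the third, conflicting range are hypothetical.

```python
import ipaddress
from itertools import combinations


def overlapping_pairs(cidrs: dict) -> list:
    """Return the name pairs whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]


# CIDRs from the commands above; the names are illustrative.
plan = {"aws-vpc": "10.1.0.0/16", "oci-vcn": "10.2.0.0/16"}
clean = overlapping_pairs(plan)
print(clean)

# A hypothetical extra network that collides with the AWS VPC range.
plan["oci-vcn-dr"] = "10.1.128.0/20"
conflicts = overlapping_pairs(plan)
print(conflicts)
```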

		

Option 2: Private Connectivity (Production Recommended)

For production workloads, you want dedicated private connectivity. This means OCI FastConnect paired with AWS Direct Connect, meeting at a common colocation facility.

The good news is that Oracle and AWS both have presence in major colocation providers like Equinix. The setup involves:

Establishing FastConnect to your colocation
Establishing Direct Connect to the same colocation
Connecting them via a cross-connect in the facility

hcl

			
# Terraform for FastConnect virtual circuit
resource "oci_core_virtual_circuit" "aws_interconnect" {
  compartment_id         = var.compartment_id
  display_name           = "aws-fastconnect"
  type                   = "PRIVATE"
  bandwidth_shape_name   = "1 Gbps"
  
  cross_connect_mappings {
    customer_bgp_peering_ip = "169.254.100.1/30"
    oracle_bgp_peering_ip   = "169.254.100.2/30"
  }
  
  customer_asn    = "65001"
  gateway_id      = oci_core_drg.main.id
  provider_name   = "Equinix"
  region          = "Dubai"
}

		

hcl

			
# Terraform for AWS Direct Connect
resource "aws_dx_connection" "oci_interconnect" {
  name            = "oci-direct-connect"
  bandwidth       = "1Gbps"
  location        = "Equinix DX1"
  provider_name   = "Equinix"
}
resource "aws_dx_private_virtual_interface" "oci" {
  connection_id    = aws_dx_connection.oci_interconnect.id
  name             = "oci-vif"
  vlan             = 4094
  address_family   = "ipv4"
  bgp_asn          = 65002
  amazon_address   = "169.254.100.5/30"
  customer_address = "169.254.100.6/30"
  dx_gateway_id    = aws_dx_gateway.main.id
}

		

Honestly, setting this up involves coordination with both cloud providers and the colocation facility. Budget 4-8 weeks for the physical connectivity and plan for redundancy from day one.
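One cheap sanity check before you open tickets with the colocation provider: confirm that each BGP peering pair sits in the same /30 and that the two links do not overlap. Below is a minimal sketch using only the Python standard library. The helper name `check_peering_pair` is mine (hypothetical), and the addresses are the ones from the Terraform above.

```python
import ipaddress


def check_peering_pair(customer_ip: str, provider_ip: str) -> ipaddress.IPv4Network:
    """Return the shared subnet if both peering addresses are distinct usable hosts in it."""
    customer = ipaddress.ip_interface(customer_ip)
    provider = ipaddress.ip_interface(provider_ip)
    if customer.network != provider.network:
        raise ValueError(f"{customer_ip} and {provider_ip} are not in the same subnet")
    usable = set(customer.network.hosts())
    for iface in (customer, provider):
        if iface.ip not in usable:
            raise ValueError(f"{iface} is a network or broadcast address")
    if customer.ip == provider.ip:
        raise ValueError("both sides use the same address")
    return customer.network


# FastConnect side: customer .1, Oracle .2
oci_link = check_peering_pair("169.254.100.1/30", "169.254.100.2/30")
# Direct Connect side: customer .6, Amazon .5
aws_link = check_peering_pair("169.254.100.6/30", "169.254.100.5/30")

# The two point-to-point links must not share address space
assert not oci_link.overlaps(aws_link)
```

Running a check like this in CI next to `terraform plan` catches copy-paste mistakes in peering addresses before they turn into a BGP session that never comes up.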

Database Connectivity from AWS to OCI

Now that we have network connectivity, let’s connect AWS applications to OCI databases.

Configuring Autonomous Database for External Access

First, restrict network access to your Autonomous Database with an access control list and allow TLS connections without mTLS:

bash

			
# Restrict ADB access to the AWS VPC CIDR (10.1.0.0/16)
# and allow TLS without mTLS for simplicity.
# Comments must not follow a trailing backslash, or the shell breaks the command.
oci db autonomous-database update \
  --autonomous-database-id $ADB_ID \
  --is-access-control-enabled true \
  --whitelisted-ips '["10.1.0.0/16"]' \
  --is-mtls-connection-required false

		

Get the connection string:

bash

			
oci db autonomous-database get \
  --autonomous-database-id $ADB_ID \
  --query 'data."connection-strings".profiles[?consumer=="LOW"].value | [0]'

Application Configuration on AWS

Here’s a practical Python example for connecting from AWS Lambda to OCI Autonomous Database:

python

			
# lambda_function.py
import json
import os

import boto3
import cx_Oracle
from botocore.exceptions import ClientError


def get_db_credentials():
    """Retrieve database credentials from AWS Secrets Manager"""
    secret_name = "oci-adb-credentials"
    region_name = "us-east-1"
    
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )
    
    try:
        response = client.get_secret_value(SecretId=secret_name)
        # SecretString is a JSON document with 'username' and 'password' keys
        return json.loads(response['SecretString'])
    except ClientError:
        # Re-raise so the invocation fails visibly instead of continuing without credentials
        raise

def handler(event, context):
    # Get credentials
    creds = get_db_credentials()
    
    # Connection string format for Autonomous DB
    dsn = """(description= 
        (retry_count=20)(retry_delay=3)
        (address=(protocol=tcps)(port=1522)
        (host=adb.me-dubai-1.oraclecloud.com))
        (connect_data=(service_name=xxx_atp_low.adb.oraclecloud.com))
        (security=(ssl_server_dn_match=yes)))"""
    
    connection = cx_Oracle.connect(
        user=creds['username'],
        password=creds['password'],
        dsn=dsn,
        encoding="UTF-8"
    )
    
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM orders WHERE order_date = TRUNC(SYSDATE)")
    
    results = []
    for row in cursor:
        results.append({
            'order_id': row[0],
            'customer_id': row[1],
            'amount': float(row[2])
        })
    
    cursor.close()
    connection.close()
    
    return {
        'statusCode': 200,
        'body': json.dumps(results)
    }
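The handler above calls Secrets Manager on every invocation. Since the credentials only change on rotation, a small cache with a time-to-live and a forced refresh is usually enough. This is a sketch under my own assumptions: `CachedSecret` is a hypothetical helper, the fetch callable is injected so the example runs without AWS, and the 300-second TTL is an arbitrary choice.

```python
import json
import time


class CachedSecret:
    """Cache a JSON secret for ttl_seconds so warm invocations skip the API call."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # callable returning the raw SecretString
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = 0.0

    def get(self, force_refresh=False):
        now = self._clock()
        if force_refresh or self._value is None or now >= self._expires_at:
            self._value = json.loads(self._fetch())
            self._expires_at = now + self._ttl
        return self._value


# Demo with a fake fetcher standing in for Secrets Manager
calls = []


def fake_fetch():
    calls.append(1)
    return json.dumps({"username": "app_user", "password": "example-only"})


secret = CachedSecret(fake_fetch, ttl_seconds=300)
secret.get()
secret.get()                    # served from the cache
secret.get(force_refresh=True)  # e.g. after a login failure that suggests a rotation
```

In the Lambda, the fetch callable would wrap `client.get_secret_value(SecretId=...)["SecretString"]`, and the instance would live at module level so it survives across warm invocations.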

		

For containerized applications on EKS, use a connection pool:

python

			
# db_pool.py
import cx_Oracle
import os
class OCIDatabasePool:
    _pool = None
    
    @classmethod
    def get_pool(cls):
        if cls._pool is None:
            cls._pool = cx_Oracle.SessionPool(
                user=os.environ['OCI_DB_USER'],
                password=os.environ['OCI_DB_PASSWORD'],
                dsn=os.environ['OCI_DB_DSN'],
                min=2,
                max=10,
                increment=1,
                encoding="UTF-8",
                threaded=True,
                getmode=cx_Oracle.SPOOL_ATTRVAL_WAIT
            )
        return cls._pool
    
    @classmethod
    def get_connection(cls):
        return cls.get_pool().acquire()
    
    @classmethod
    def release_connection(cls, connection):
        cls.get_pool().release(connection)
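Callers of this pool have to remember to release every connection, including on error paths. A context manager makes that automatic. The sketch below is an assumption of mine, not part of the original class: `pooled_connection` is a hypothetical helper, and `FakePool` only exists so the example runs without a database.

```python
from contextlib import contextmanager


@contextmanager
def pooled_connection(pool):
    """Acquire a connection from the pool and always release it, even on error."""
    connection = pool.acquire()
    try:
        yield connection
    finally:
        pool.release(connection)


class FakePool:
    """Stand-in for cx_Oracle.SessionPool, used only to exercise the helper."""

    def __init__(self):
        self.in_use = 0

    def acquire(self):
        self.in_use += 1
        return object()

    def release(self, connection):
        self.in_use -= 1


pool = FakePool()
with pooled_connection(pool) as conn:
    pass  # run queries with conn.cursor() here
```

With the real pool you would write `with pooled_connection(OCIDatabasePool.get_pool()) as conn:` and drop the explicit `release_connection` calls.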

		

Kubernetes deployment for the application:

yaml

			
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:v1.0
        ports:
        - containerPort: 8080
        env:
        - name: OCI_DB_USER
          valueFrom:
            secretKeyRef:
              name: oci-db-credentials
              key: username
        - name: OCI_DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: oci-db-credentials
              key: password
        - name: OCI_DB_DSN
          valueFrom:
            configMapKeyRef:
              name: oci-db-config
              key: dsn
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

		

Data Synchronization Between Clouds

Real multi-cloud architectures need data flowing between clouds. Here are practical patterns:

Pattern 1: Event-Driven Sync with Kafka

Use a managed Kafka service as the bridge:

python

			
# AWS Lambda producer - sends events to Kafka
import json
import os

from kafka import KafkaProducer

# Created once per execution environment and reused across warm invocations
producer = KafkaProducer(
    bootstrap_servers=['kafka-broker-1:9092', 'kafka-broker-2:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    security_protocol='SASL_SSL',
    sasl_mechanism='PLAIN',
    sasl_plain_username=os.environ['KAFKA_USER'],
    sasl_plain_password=os.environ['KAFKA_PASSWORD']
)
def handler(event, context):
    # Process order and send to Kafka for OCI consumption
    # (process_order is your own business logic and is not shown here)
    order_data = process_order(event)
    
    producer.send(
        'orders-topic',
        key=str(order_data['order_id']).encode(),
        value=order_data
    )
    producer.flush()
    
    return {'statusCode': 200}

		

OCI side consumer using OCI Functions:

python

			
# OCI Function consumer
import io
import json
import logging
import cx_Oracle
from kafka import KafkaConsumer
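def handler(ctx, data: io.BytesIO = None):
    consumer = KafkaConsumer(
        'orders-topic',
        bootstrap_servers=['kafka-broker-1:9092'],
        auto_offset_reset='earliest',
        enable_auto_commit=True,
        group_id='oci-order-processor',
        value_deserializer=lambda x: json.loads(x.decode('utf-8')),
        # Stop iterating after 10 seconds without messages so the function can return
        consumer_timeout_ms=10000
    )
    
    # get_adb_connection() is your own helper returning a cx_Oracle connection (not shown)
    connection = get_adb_connection()
    cursor = connection.cursor()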
    
    for message in consumer:
        order = message.value
        
        cursor.execute("""
            MERGE INTO orders o
            USING (SELECT :order_id AS order_id FROM dual) src
            ON (o.order_id = src.order_id)
            WHEN MATCHED THEN
                UPDATE SET amount = :amount, status = :status, updated_at = SYSDATE
            WHEN NOT MATCHED THEN
                INSERT (order_id, customer_id, amount, status, created_at)
                VALUES (:order_id, :customer_id, :amount, :status, SYSDATE)
        """, order)
        
        connection.commit()
    
    cursor.close()
    connection.close()
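One thing to watch in the consumer above: the whole message dict is passed as bind variables. As far as I know, cx_Oracle rejects named binds that do not appear in the statement, so any extra field in the event would break the MERGE. A minimal sketch of filtering the message first follows; `order_binds` and the sample message are hypothetical.

```python
# Bind names referenced by the MERGE statement
ORDER_BIND_KEYS = ("order_id", "customer_id", "amount", "status")


def order_binds(order: dict) -> dict:
    """Return only the bind values the MERGE statement references."""
    missing = [key for key in ORDER_BIND_KEYS if key not in order]
    if missing:
        raise ValueError(f"order is missing fields: {missing}")
    return {key: order[key] for key in ORDER_BIND_KEYS}


# Illustrative message with one extra field that the SQL does not use
message = {
    "order_id": 42,
    "customer_id": 7,
    "amount": 19.5,
    "status": "NEW",
    "source": "aws-lambda",
}
binds = order_binds(message)
```

In the consumer loop this would become `cursor.execute(merge_sql, order_binds(message.value))`, which also fails fast on malformed events instead of surfacing a database error.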

		

Pattern 2: Scheduled Batch Sync

For less time-sensitive data, batch synchronization is simpler and more cost-effective. The following AWS Step Functions state machine drives the batch sync:

json

			
{
  "Comment": "Sync data from AWS to OCI",
  "StartAt": "ExtractFromAWS",
  "States": {
    "ExtractFromAWS": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:extract-data",
      "Next": "UploadToS3"
    },
    "UploadToS3": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:upload-to-s3",
      "Next": "CopyToOCI"
    },
    "CopyToOCI": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:copy-to-oci-bucket",
      "Next": "LoadToADB"
    },
    "LoadToADB": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:load-to-adb",
      "End": true
    }
  }
}
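The state machine above has no retry handling, so one transient failure in the cross-cloud copy fails the whole run. Here is a small sketch that adds a Retry block to every Task state before deployment. The retry values and the `with_retries` helper are my own assumptions; tune them to your workload.

```python
import copy
import json

# Assumed defaults: three attempts with exponential backoff starting at 5 seconds
DEFAULT_RETRY = [{
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 5,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}]


def with_retries(definition, retry=None):
    """Return a copy of the definition with a Retry block on each Task state that lacks one."""
    retry = DEFAULT_RETRY if retry is None else retry
    result = copy.deepcopy(definition)
    for state in result["States"].values():
        if state.get("Type") == "Task":
            state.setdefault("Retry", copy.deepcopy(retry))
    return result


# Shortened version of the state machine from this post
definition = {
    "StartAt": "CopyToOCI",
    "States": {
        "CopyToOCI": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789:function:copy-to-oci-bucket",
            "Next": "LoadToADB",
        },
        "LoadToADB": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789:function:load-to-adb",
            "End": True,
        },
    },
}

print(json.dumps(with_retries(definition), indent=2))
```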

		

The Lambda function to copy data to OCI Object Storage:

python

			
# copy_to_oci.py
import boto3
import oci
import os
def handler(event, context):
    # Get file from S3
    s3 = boto3.client('s3')
    s3_object = s3.get_object(
        Bucket=event['bucket'],
        Key=event['key']
    )
    file_content = s3_object['Body'].read()
    
    # Upload to OCI Object Storage
    config = oci.config.from_file()
    object_storage = oci.object_storage.ObjectStorageClient(config)
    
    namespace = object_storage.get_namespace().data
    
    object_storage.put_object(
        namespace_name=namespace,
        bucket_name="data-sync-bucket",
        object_name=event['key'],
        put_object_body=file_content
    )
    
    return {
        'oci_bucket': 'data-sync-bucket',
        'object_name': event['key']
    }
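Note that `oci.config.from_file()` expects a config file under `~/.oci`, which a Lambda function does not have by default. One option is to build the config dict from environment variables instead. The sketch below assumes the OCI Python SDK accepts a plain dict with a `key_content` entry in place of `key_file`; the environment variable names are hypothetical.

```python
import os

# Hypothetical environment variable names mapped to OCI SDK config keys
ENV_TO_CONFIG = {
    "OCI_USER_OCID": "user",
    "OCI_TENANCY_OCID": "tenancy",
    "OCI_KEY_FINGERPRINT": "fingerprint",
    "OCI_REGION": "region",
    "OCI_PRIVATE_KEY_PEM": "key_content",
}


def oci_config_from_env(env=None):
    """Build an OCI SDK style config dict from environment variables."""
    env = os.environ if env is None else env
    missing = sorted(name for name in ENV_TO_CONFIG if not env.get(name))
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {config_key: env[name] for name, config_key in ENV_TO_CONFIG.items()}


# Placeholder values only, to show the resulting shape
sample_env = {
    "OCI_USER_OCID": "ocid1.user.oc1..example",
    "OCI_TENANCY_OCID": "ocid1.tenancy.oc1..example",
    "OCI_KEY_FINGERPRINT": "aa:bb:cc",
    "OCI_REGION": "me-dubai-1",
    "OCI_PRIVATE_KEY_PEM": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----",
}
config = oci_config_from_env(sample_env)
```

In the handler you would then pass the result to the client, for example `oci.object_storage.ObjectStorageClient(oci_config_from_env())`. For the private key itself, loading it from Secrets Manager at cold start is a safer choice than a plain environment variable.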

		

Load into Autonomous Database using DBMS_CLOUD:

sql

			
-- Create credential for OCI Object Storage access
BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'OCI_CRED',
    username        => 'your_oci_username',
    password        => 'your_auth_token'
  );
END;
/
-- Load data from Object Storage
BEGIN
  DBMS_CLOUD.COPY_DATA(
    table_name      => 'ORDERS_STAGING',
    credential_name => 'OCI_CRED',
    file_uri_list   => 'https://objectstorage.me-dubai-1.oraclecloud.com/n/namespace/b/data-sync-bucket/o/orders_*.csv',
    format          => JSON_OBJECT(
      'type' VALUE 'CSV',
      'skipheaders' VALUE '1',
      'dateformat' VALUE 'YYYY-MM-DD'
    )
  );
END;
/
-- Merge staging into production
MERGE INTO orders o
USING orders_staging s
ON (o.order_id = s.order_id)
WHEN MATCHED THEN
  UPDATE SET o.amount = s.amount, o.status = s.status
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount, status)
  VALUES (s.order_id, s.customer_id, s.amount, s.status);

		

Identity Federation

Managing identities across clouds is a headache unless you set up proper federation. Here’s how to enable SSO between AWS and OCI using a common identity provider.

Using Azure AD as Common IdP (Yes, a Third Cloud)

This is actually quite common. Many enterprises use Azure AD for identity even if their workloads run elsewhere.

Configure OCI to Trust Azure AD:

bash

			
# Create Identity Provider in OCI
oci iam identity-provider create-saml2-identity-provider \
  --compartment-id $TENANCY_ID \
  --name "AzureAD-Federation" \
  --description "Federation with Azure AD" \
  --product-type "IDCS" \
  --metadata-url "https://login.microsoftonline.com/$TENANT_ID/federationmetadata/2007-06/federationmetadata.xml"

		

Configure AWS to Trust Azure AD:

bash

			
# Create SAML provider in AWS
aws iam create-saml-provider \
  --saml-metadata-document file://azure-ad-metadata.xml \
  --name AzureAD-Federation
# Create role for federated users
aws iam create-role \
  --role-name AzureAD-Admins \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Federated": "arn:aws:iam::123456789:saml-provider/AzureAD-Federation"},
      "Action": "sts:AssumeRoleWithSAML",
      "Condition": {
        "StringEquals": {
          "SAML:aud": "https://signin.aws.amazon.com/saml"
        }
      }
    }]
  }'

		

Now your team can use the same Azure AD credentials to access both clouds.

Monitoring Across Clouds

You need unified observability. Here’s a practical approach using Grafana as the common dashboard:

yaml

			
# docker-compose.yml for centralized Grafana
version: '3.8'
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=secure_password
      - GF_INSTALL_PLUGINS=oci-metrics-datasource
volumes:
  grafana-data:

		

Configure data sources:

yaml

			
# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: AWS-CloudWatch
    type: cloudwatch
    access: proxy
    jsonData:
      authType: keys
      defaultRegion: us-east-1
    secureJsonData:
      accessKey: ${AWS_ACCESS_KEY}
      secretKey: ${AWS_SECRET_KEY}
  
  - name: OCI-Monitoring
    type: oci-metrics-datasource
    access: proxy
    jsonData:
      tenancyOCID: ${OCI_TENANCY_OCID}
      userOCID: ${OCI_USER_OCID}
      region: me-dubai-1
    secureJsonData:
      privateKey: ${OCI_PRIVATE_KEY}

		

Create a unified dashboard that shows both clouds:

json

			
{
  "title": "Multi-Cloud Overview",
  "panels": [
    {
      "title": "AWS EKS CPU Utilization",
      "datasource": "AWS-CloudWatch",
      "targets": [{
        "namespace": "ContainerInsights",
        "metricName": "node_cpu_utilization",
        "dimensions": {"ClusterName": "production"}
      }]
    },
    {
      "title": "OCI Autonomous DB Sessions",
      "datasource": "OCI-Monitoring",
      "targets": [{
        "namespace": "oci_autonomous_database",
        "metric": "CurrentOpenSessionCount",
        "resourceGroup": "production-adb"
      }]
    },
    {
      "title": "Cross-Cloud Latency",
      "datasource": "Prometheus",
      "targets": [{
        "expr": "histogram_quantile(0.95, rate(cross_cloud_request_duration_seconds_bucket[5m]))"
      }]
    }
  ]
}

		

Cost Management

Multi-cloud cost visibility is challenging. Here’s a practical approach:

python

			
# cost_aggregator.py
import boto3
import oci
from datetime import datetime, timedelta
def get_aws_costs(start_date, end_date):
    client = boto3.client('ce')
    response = client.get_cost_and_usage(
        TimePeriod={
            'Start': start_date.strftime('%Y-%m-%d'),
            'End': end_date.strftime('%Y-%m-%d')
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )
    return response['ResultsByTime']
def get_oci_costs(start_date, end_date):
    config = oci.config.from_file()
    usage_api = oci.usage_api.UsageapiClient(config)
    
    response = usage_api.request_summarized_usages(
        request_summarized_usages_details=oci.usage_api.models.RequestSummarizedUsagesDetails(
            tenant_id=config['tenancy'],
            time_usage_started=start_date,
            time_usage_ended=end_date,
            granularity="DAILY",
            group_by=["service"]
        )
    )
    return response.data.items
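def generate_report():
    # Align the window to midnight; the OCI Usage API expects day-aligned
    # timestamps when granularity is DAILY
    end_date = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
    start_date = end_date - timedelta(days=30)
    
    aws_costs = get_aws_costs(start_date, end_date)
    oci_costs = get_oci_costs(start_date, end_date)
    
    # With GroupBy set, Cost Explorer leaves 'Total' empty and reports amounts per group
    total_aws = sum(
        float(group['Metrics']['UnblendedCost']['Amount'])
        for day in aws_costs
        for group in day['Groups']
    )
    # computed_amount can be None for some usage items
    total_oci = sum(item.computed_amount or 0 for item in oci_costs)
    
    print("30-Day Multi-Cloud Cost Summary")
    print(f"{'='*40}")
    print(f"AWS Total: ${total_aws:,.2f}")
    print(f"OCI Total: ${total_oci:,.2f}")
    print(f"Combined Total: ${total_aws + total_oci:,.2f}")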
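A single total per cloud hides where the money goes. Because the AWS query groups by SERVICE, the per-service amounts are already in the response. The sketch below totals them per service. `aws_cost_by_service` is a hypothetical helper, and the sample response contains made-up figures purely to show the shape of `ResultsByTime`.

```python
from collections import defaultdict
from decimal import Decimal


def aws_cost_by_service(results_by_time):
    """Sum UnblendedCost per service across all days of a grouped Cost Explorer response."""
    totals = defaultdict(Decimal)
    for day in results_by_time:
        for group in day.get("Groups", []):
            service = group["Keys"][0]
            totals[service] += Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)


# Made-up sample in the shape returned by get_cost_and_usage with GroupBy SERVICE
sample = [
    {"Groups": [
        {"Keys": ["AWS Lambda"], "Metrics": {"UnblendedCost": {"Amount": "12.50", "Unit": "USD"}}},
        {"Keys": ["Amazon EKS"], "Metrics": {"UnblendedCost": {"Amount": "73.00", "Unit": "USD"}}},
    ]},
    {"Groups": [
        {"Keys": ["AWS Lambda"], "Metrics": {"UnblendedCost": {"Amount": "7.25", "Unit": "USD"}}},
    ]},
]

by_service = aws_cost_by_service(sample)
for service, amount in sorted(by_service.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service}: ${amount:,.2f}")
```

Using `Decimal` avoids accumulating float rounding errors across a month of daily rows.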

		

Lessons Learned

After running multi-cloud architectures for several years, here’s what I’ve learned:

Network is everything. Invest in proper connectivity upfront. The $500/month you save on VPN versus dedicated connectivity will cost you thousands in debugging performance issues.

Pick one cloud for each workload type. Don’t run the same thing in both clouds. Use OCI for Oracle databases, AWS for its unique services. Avoid the temptation to replicate everything everywhere.

Standardize your tooling. Terraform works on both clouds. Use it. Same for monitoring, logging, and CI/CD. The more consistent your tooling, the less your team has to context-switch.

Document your data flows. Know exactly what data goes where and why. This will save you during security audits and incident response.

Test cross-cloud failures. What happens when the VPN goes down? Can your application degrade gracefully? Find out before your customers do.

Conclusion

Multi-cloud between OCI and AWS isn’t simple, but it’s absolutely achievable. The key is having clear reasons for using each cloud, solid networking fundamentals, and consistent operational practices.

Start small. Connect one application to one database across clouds. Get that working reliably before expanding. Build your team’s confidence and expertise incrementally.

The organizations that succeed with multi-cloud are the ones that treat it as an architectural choice, not a checkbox. They know exactly why they need both clouds and have designed their systems accordingly.

Regards,
Osama

Deep Dive into Oracle Kubernetes Engine Security and Networking in Production

Posted on December 22, 2025 by Osama Mustafa in Cloud, OCI

Oracle Kubernetes Engine is often introduced as a managed Kubernetes service, but its real strength only becomes clear when you operate it in production. OKE tightly integrates with OCI networking, identity, and security services, which gives you a very different operational model compared to other managed Kubernetes platforms.

This article walks through OKE from a production perspective, focusing on security boundaries, networking design, ingress exposure, private access, and mutual TLS. The goal is not to explain Kubernetes basics, but to explain how OKE behaves when you run regulated, enterprise workloads.

Understanding the OKE Networking Model

OKE does not abstract networking away from you. Every cluster is deeply tied to OCI VCN constructs.

Core Components

An OKE cluster consists of:

A managed Kubernetes control plane
Worker nodes running in OCI subnets
OCI networking primitives controlling traffic flow

Key OCI resources involved:

Virtual Cloud Network
Subnets for control plane and workers
Network Security Groups
Route tables
OCI Load Balancers

Unlike some platforms, security in OKE is enforced at multiple layers simultaneously.

Worker Node and Pod Networking

OKE uses OCI VCN-native networking. Pods receive IPs from the subnet CIDR through the OCI CNI plugin.

What this means in practice

Pods are first-class citizens on the VCN
Pod IPs are routable within the VCN
Network policies and OCI NSGs both apply

Example subnet design:

VCN: 10.0.0.0/16

Worker Subnet: 10.0.10.0/24
Load Balancer Subnet: 10.0.20.0/24
Private Endpoint Subnet: 10.0.30.0/24

This design allows you to:

Keep workers private
Expose only ingress through OCI Load Balancer
Control east-west traffic using Kubernetes NetworkPolicies and OCI NSGs together
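A subnet plan like this is easy to verify before any resource is created. The following is a minimal sketch using Python's `ipaddress` module that checks each subnet sits inside the VCN and that no two subnets overlap. The helper name `validate_subnets` is hypothetical; the CIDRs are the ones from the example design above.

```python
import ipaddress
from itertools import combinations


def validate_subnets(vcn_cidr, subnets):
    """Check that every subnet is inside the VCN and that no two subnets overlap."""
    vcn = ipaddress.ip_network(vcn_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    for name, net in nets.items():
        if not net.subnet_of(vcn):
            raise ValueError(f"{name} ({net}) is outside the VCN {vcn}")
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            raise ValueError(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")
    return nets


nets = validate_subnets("10.0.0.0/16", {
    "worker": "10.0.10.0/24",
    "load-balancer": "10.0.20.0/24",
    "private-endpoint": "10.0.30.0/24",
})

# With VCN-native pod networking, pods consume addresses from the pod subnet,
# so subnet size puts a ceiling on pod count
print(nets["worker"].num_addresses)
```

Keep in mind that a /24 worker subnet caps the combined number of nodes and pods that can draw addresses from it, which is worth checking against your scaling targets.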

Security Boundaries in OKE

Security in OKE is layered by design.

Layer 1: OCI IAM and Compartments

OKE clusters live inside OCI compartments. IAM policies control:

Who can create or modify clusters
Who can access worker nodes
Who can manage load balancers and subnets

Example IAM policy snippet:

Allow group OKE-Admins to manage cluster-family in compartment OKE-PROD
Allow group OKE-Admins to manage virtual-network-family in compartment OKE-PROD

This separation is critical for regulated environments.

Layer 2: Network Security Groups

Network Security Groups act as virtual firewalls at the VNIC level.

Typical NSG rules:

Allow node-to-node communication
Allow ingress from load balancer subnet only
Block all public inbound traffic

Example inbound NSG rule:

Source: 10.0.20.0/24
Protocol: TCP
Port: 443

This ensures only the OCI Load Balancer can reach your ingress controller.

Layer 3: Kubernetes Network Policies

NetworkPolicies control pod-level traffic.

Example policy allowing traffic only from ingress namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: app-prod
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              role: ingress

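Because the policy selects every pod in app-prod, inbound traffic from any namespace not labeled role: ingress is dropped, which limits lateral movement into the namespace. Egress is not restricted by this policy.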

Ingress Design in OKE

OKE integrates natively with OCI Load Balancer.

Public vs Private Ingress

You can deploy ingress in two modes:

Public Load Balancer
Internal Load Balancer

For production workloads, private ingress is strongly recommended.

Example service annotation for private ingress:

service.beta.kubernetes.io/oci-load-balancer-internal: "true"
service.beta.kubernetes.io/oci-load-balancer-subnet1: ocid1.subnet.oc1..

This ensures the load balancer has no public IP.

Private Access to the Cluster Control Plane

OKE supports private API endpoints.

When enabled:

The Kubernetes API is accessible only from the VCN
No public endpoint exists

This is critical for Zero Trust environments.

Operational impact:

kubectl access requires VPN, Bastion, or OCI Cloud Shell inside the VCN
CI/CD runners must have private connectivity

This dramatically reduces the attack surface.

Mutual TLS Inside OKE

TLS termination at ingress is not enough for sensitive workloads. Many enterprises require mTLS between services.

Typical mTLS Architecture

TLS termination at ingress
Internal mTLS between services
Certificate management via Vault or cert-manager

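Example cert-manager issuer backed by a Vault PKI endpoint. Note that cert-manager's vault issuer type speaks the HashiCorp Vault API, so the server URL below is a placeholder for a compatible endpoint, and a working issuer also needs an auth section: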

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: oci-vault-issuer
spec:
  vault:
    server: https://vault.oci.oraclecloud.com
    path: pki/sign/oke

Each service receives:

Its own certificate
Short-lived credentials
Automatic rotation

Traffic Flow Example

End-to-end request path:

Client connects to OCI Load Balancer
Load Balancer forwards traffic to NGINX Ingress
Ingress enforces TLS and headers
Service-to-service traffic uses mTLS
NetworkPolicy restricts lateral movement
NSGs enforce VCN-level boundaries

Every hop is authenticated and encrypted.

Observability and Security Visibility

OKE integrates with:

OCI Logging
OCI Flow Logs
Kubernetes audit logs

This allows:

Tracking ingress traffic
Detecting unauthorized access attempts
Correlating pod-level events with network flows

Regards
Osama

Building a Real-Time Data Enrichment & Inference Pipeline on AWS Using Kinesis, Lambda, DynamoDB, and SageMaker

Posted on November 25, 2025 by Osama Mustafa in AWS, Cloud

Modern cloud applications increasingly depend on real-time processing, especially when dealing with fraud detection, personalization, IoT telemetry, or operational monitoring.
In this post, we’ll build a fully functional AWS pipeline that:

Streams events using Amazon Kinesis
Enriches and transforms them via AWS Lambda
Stores real-time feature data in Amazon DynamoDB
Performs machine-learning inference using a SageMaker Endpoint

1. Architecture Overview

2. Step-By-Step Pipeline Build

2.1. Create a Kinesis Data Stream

aws kinesis create-stream \
  --stream-name RealtimeEvents \
  --shard-count 2 \
  --region us-east-1

This stream will accept incoming events from your apps, IoT devices, or microservices.

2.2. DynamoDB Table for Real-Time Features

aws dynamodb create-table \
  --table-name UserFeatureStore \
  --attribute-definitions AttributeName=userId,AttributeType=S \
  --key-schema AttributeName=userId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

This table holds live user features, updated every time an event arrives.

2.3. Lambda Function (Real-Time Data Enrichment)

This Lambda:

Reads events from Kinesis
Computes simple features (e.g., last event time, rolling count)
Saves enriched data to DynamoDB

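import base64
import json
import boto3
from datetime import datetime, timedelta
from decimal import Decimal

ddb = boto3.resource("dynamodb")
table = ddb.Table("UserFeatureStore")

def lambda_handler(event, context):

    for record in event["Records"]:
        # Kinesis delivers record data base64-encoded. Floats are parsed as Decimal
        # because the DynamoDB resource API rejects Python floats.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]), parse_float=Decimal)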

        user = payload["userId"]
        metric = payload["metric"]
        ts = datetime.fromisoformat(payload["timestamp"])

        # Fetch old features
        old = table.get_item(Key={"userId": user}).get("Item", {})

        last_ts = old.get("lastTimestamp")
        count = old.get("count", 0)

        # Update rolling 5-minute count
        if last_ts:
            prev_ts = datetime.fromisoformat(last_ts)
            if ts - prev_ts < timedelta(minutes=5):
                count += 1
            else:
                count = 1
        else:
            count = 1

        # Save new enriched features
        table.put_item(Item={
            "userId": user,
            "lastTimestamp": ts.isoformat(),
            "count": count,
            "lastMetric": metric
        })

    return {"status": "ok"}

Attach the Lambda to the Kinesis stream.
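One caveat about the counter in this Lambda: it resets only when the gap between two consecutive events reaches 5 minutes, so a steady trickle of events keeps it growing indefinitely. If you need a true sliding-window count, one option is to store the recent timestamps and prune them on every event. This is a sketch of that idea under my own assumptions; `update_window` is a hypothetical helper, and the list would live in an extra item attribute.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)


def update_window(recent, ts, window=WINDOW):
    """Return the ISO timestamps that fall within `window` of ts, including ts itself."""
    cutoff = ts - window
    kept = [stamp for stamp in recent if datetime.fromisoformat(stamp) > cutoff]
    kept.append(ts.isoformat())
    return kept


# Three events, three minutes apart
recent = []
for stamp in ("2025-01-01T10:00:00", "2025-01-01T10:03:00", "2025-01-01T10:06:00"):
    recent = update_window(recent, datetime.fromisoformat(stamp))

# Only the last two events are within 5 minutes of 10:06
count = len(recent)
```

The gap-based counter would report 3 for this sequence, while the sliding window reports 2. For very active users, cap the list length so the item stays small.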

2.4. Creating a SageMaker Endpoint for Inference

Train your model offline, then deploy it:

aws sagemaker create-endpoint-config \
  --endpoint-config-name RealtimeInferenceConfig \
  --production-variants VariantName=AllInOne,ModelName=MyInferenceModel,InitialInstanceCount=1,InstanceType=ml.m5.large

aws sagemaker create-endpoint \
  --endpoint-name RealtimeInference \
  --endpoint-config-name RealtimeInferenceConfig

2.5. API Layer Performing Live Inference

Your application now requests predictions like this:

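import boto3
import json
from decimal import Decimal

runtime = boto3.client("sagemaker-runtime")
ddb = boto3.resource("dynamodb").Table("UserFeatureStore")

def _json_default(value):
    # DynamoDB returns numbers as Decimal, which json cannot serialize natively
    if isinstance(value, Decimal):
        return float(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")

def predict(user_id, extra_input):

    # Item is None if the user has no stored features yet
    user_features = ddb.get_item(Key={"userId": user_id}).get("Item")

    payload = {
        "userId": user_id,
        "features": user_features,
        "input": extra_input
    }

    response = runtime.invoke_endpoint(
        EndpointName="RealtimeInference",
        ContentType="application/json",
        Body=json.dumps(payload, default=_json_default)
    )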

    return json.loads(response["Body"].read())

This combines live enriched features with model inference, so each prediction reflects the user's most recent activity.

3. Production Considerations

Performance

Tune Lambda concurrency (reserved concurrency and the Kinesis parallelization factor)
Use DynamoDB DAX caching
Use Kinesis Enhanced Fan-Out for high throughput

Security

Use IAM roles with least privilege
Encrypt Kinesis, Lambda, DynamoDB, and SageMaker with KMS

Monitoring

CloudWatch Metrics
CloudWatch Logs Insights queries
DynamoDB capacity alarms
SageMaker Model error monitoring

Cost Optimization

Use PAY_PER_REQUEST DynamoDB
Use Lambda Power Tuning
Scale SageMaker endpoints with autoscaling

Implementing a Real-Time Anomaly Detection Pipeline on OCI Using Streaming Data, Oracle Autonomous Database & ML

Posted on November 22, 2025 by Osama Mustafa in Cloud, OCI

Detecting unusual patterns in real time is critical to preventing outages, catching fraud, ensuring SLA compliance, and maintaining high-quality user experiences.
In this post, we build a real working pipeline on OCI that:

Ingests streaming data
Computes features in near-real time
Stores results in Autonomous Database
Runs anomaly detection logic
Sends alerts and exposes dashboards

This guide contains every technical step, including:
Streaming → Function → Autonomous DB → Anomaly Logic → Notifications → Dashboards

1. Architecture Overview

Components Used

OCI Streaming
OCI Functions
Oracle Autonomous Database
DBMS_SCHEDULER for anomaly detection job
OCI Notifications
Oracle Analytics Cloud / Grafana

2. Step-by-Step Implementation

2.1 Create OCI Streaming Stream

oci streaming stream create \
  --compartment-id $COMPARTMENT_OCID \
  --display-name "anomaly-events-stream" \
  --partitions 3

2.2 Autonomous Database Table

CREATE TABLE raw_events (
  event_id       VARCHAR2(50),
  event_time     TIMESTAMP,
  metric_value   NUMBER,
  feature1       NUMBER,
  feature2       NUMBER,
  processed_flag CHAR(1) DEFAULT 'N',
  anomaly_flag   CHAR(1) DEFAULT 'N',
  CONSTRAINT pk_raw_events PRIMARY KEY(event_id)
);

2.3 OCI Function – Feature Extraction

func.py:

import io
import json
import cx_Oracle
from datetime import datetime

def handler(ctx, data: io.BytesIO = None):
    # The Fn Python FDK passes the request body as a BytesIO object
    event = json.loads(data.getvalue())

    evt_id = event['id']
    evt_time = datetime.fromisoformat(event['time'])
    value = event['metric']

    # DB Connection (placeholders: load real credentials from function config or OCI Vault)
    conn = cx_Oracle.connect(user='USER', password='PWD', dsn='dsn')
    cur = conn.cursor()

    # Fetch the most recent earlier metric value, if any.
    # Looking up by event_id would never match a new event, since event_id is the primary key.
    cur.execute("""
        SELECT metric_value FROM raw_events
        WHERE event_time < :1
        ORDER BY event_time DESC
        FETCH FIRST 1 ROW ONLY
    """, (evt_time,))
    prev = cur.fetchone()
    prev_val = prev[0] if prev else 1.0

    # Compute features (guard against division by zero)
    feature1 = value - prev_val
    feature2 = value / prev_val if prev_val else None

    # Insert new event
    cur.execute("""
        INSERT INTO raw_events(event_id, event_time, metric_value, feature1, feature2)
        VALUES(:1, :2, :3, :4, :5)
    """, (evt_id, evt_time, value, feature1, feature2))

    conn.commit()
    cur.close()
    conn.close()

    return "ok"

Deploy the function and attach the streaming trigger.

2.4 Anomaly Detection Job (DBMS_SCHEDULER)

CREATE OR REPLACE PROCEDURE anomaly_detection_proc AS
  meanv  NUMBER;
  stdv   NUMBER;
  zscore NUMBER;
BEGIN
  -- Compute the baseline once per run
  SELECT AVG(feature1), STDDEV(feature1) INTO meanv, stdv FROM raw_events;

  FOR rec IN (
    SELECT event_id, feature1
    FROM raw_events
    WHERE processed_flag = 'N'
  ) LOOP
    -- NULLIF avoids division by zero; a NULL z-score never flags an anomaly
    zscore := (rec.feature1 - meanv) / NULLIF(stdv, 0);

    IF ABS(zscore) > 3 THEN
      UPDATE raw_events SET anomaly_flag='Y' WHERE event_id=rec.event_id;
    END IF;

    UPDATE raw_events SET processed_flag='Y' WHERE event_id=rec.event_id;
  END LOOP;

  COMMIT;
END anomaly_detection_proc;
/

Schedule this to run every 2 minutes:

BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
    job_name        => 'ANOMALY_JOB',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN anomaly_detection_proc; END;',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=2;',
    enabled         => TRUE
  );
END;
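The z-score rule used by the detection job is easy to prototype outside the database before committing to a threshold. Here is a minimal Python sketch of the same rule. `zscore_anomalies` is a hypothetical name, the data is synthetic, and `statistics.stdev` is the sample standard deviation, which is what Oracle's STDDEV computes.

```python
import statistics


def zscore_anomalies(values, threshold=3.0):
    """Return indexes of values whose z-score magnitude exceeds the threshold."""
    if len(values) < 2:
        return []
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs((v - mean) / stdev) > threshold]


# Synthetic data: 20 normal readings and one spike
readings = [10.0] * 20 + [100.0]
flagged = zscore_anomalies(readings)

# With only 10 rows, the same spike stays under the threshold
small_sample = zscore_anomalies([10.0] * 9 + [100.0])
```

The second call illustrates a practical limit: with a sample standard deviation, a single outlier among n rows can reach a z-score of at most (n - 1) / sqrt(n), which is below 3 until the table holds at least 11 rows. Expect no alerts from a threshold of 3 while the table is nearly empty.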

2.5 Notifications

oci ons topic create \
  --compartment-id $COMPARTMENT_OCID \
  --name "AnomalyAlerts"

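In the DB, add a trigger that records newly flagged rows. DBMS_OUTPUT only writes to the session buffer, so this trigger on its own does not publish to the AnomalyAlerts topic; a separate process still has to send the notification: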

CREATE OR REPLACE TRIGGER notify_anomaly
AFTER UPDATE ON raw_events
FOR EACH ROW
WHEN (NEW.anomaly_flag='Y' AND OLD.anomaly_flag='N')
BEGIN
  DBMS_OUTPUT.PUT_LINE('Anomaly detected for event ' || :NEW.event_id);
END;
/

2.6 Dashboarding

You may use:

Oracle Analytics Cloud (OAC)
Grafana + ADW Integration
Any BI tool with SQL

Example Query:

SELECT event_time, metric_value, anomaly_flag 
FROM raw_events
ORDER BY event_time;

3. Terraform + OCI CLI Script Bundle

Terraform – Streaming + Function + Policies

resource "oci_streaming_stream" "anomaly" {
  name           = "anomaly-events-stream"
  partitions     = 3
  compartment_id = var.compartment_id
}

resource "oci_functions_application" "anomaly_app" {
  compartment_id = var.compartment_id
  display_name   = "anomaly-function-app"
  subnet_ids     = var.subnets
}

Terraform Notification Topic

resource "oci_ons_notification_topic" "anomaly" {
  compartment_id = var.compartment_id
  name           = "AnomalyAlerts"
}

CLI Insert Test Events

oci streaming stream message put \
  --stream-id $STREAM_OCID \
  --messages '[{"key":"1","value":"{\"id\":\"1\",\"time\":\"2025-01-01T10:00:00\",\"metric\":58}"}]'
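As far as I know, the Streaming API expects the message key and value to be base64-encoded, and the CLI also needs the stream's messages endpoint, so the plain JSON above may be rejected. Below is a small sketch for generating an encoded test message; `stream_message` is a hypothetical helper, and the payload matches the event shape the function reads.

```python
import base64
import json


def stream_message(key: str, payload: dict) -> dict:
    """Build one message entry with base64-encoded key and value."""
    def encode(text: str) -> str:
        return base64.b64encode(text.encode("utf-8")).decode("ascii")

    return {"key": encode(key), "value": encode(json.dumps(payload))}


payload = {"id": "1", "time": "2025-01-01T10:00:00", "metric": 58}
msg = stream_message("1", payload)

# Paste this output as the value of --messages
print(json.dumps([msg]))
```

If your function receives messages through a connector rather than directly, check whether the value arrives still encoded, and decode it before parsing.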

Deploying Real-Time Feature Store on Amazon SageMaker Feature Store with Amazon Kinesis Data Streams & Amazon DynamoDB for Low-Latency ML Inference

Posted on November 10, 2025 by Osama Mustafa in AWS, Cloud

Modern ML inference often depends on up-to-date features (customer behaviour, session counts, recent events) that need to be available in low-latency operations. In this article you’ll learn how to build a real-time feature store on AWS using:

Amazon Kinesis Data Streams for streaming events
AWS Lambda for processing and feature computation
Amazon DynamoDB (or SageMaker Feature Store) for storage of feature vectors
Amazon SageMaker Endpoint for low-latency inference
You’ll see end-to-end code snippets and architecture guidance so you can implement this in your environment.

1. Architecture Overview

The pipeline works like this:

Front-end/app produces events (e.g., user click, transaction) → published to Kinesis.
A Lambda function consumes from Kinesis, computes derived features (for example: rolling window counts, recency, session features).
The Lambda writes/updates these features into a DynamoDB table (or directly into SageMaker Feature Store).
When a request arrives for inference, the application fetches the current feature set from DynamoDB (or Feature Store) and calls a SageMaker endpoint.
Optionally, after inference you can stream feedback events for model refinement.

This architecture provides real-time feature freshness and low-latency inference.
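To make the first step of the pipeline concrete, here is a rough sketch of what the producer side could look like. It assumes the event fields (userId, eventType, timestamp) that the Lambda function later in this post reads. The helper names build_record and publish are hypothetical, and the Kinesis client is passed in so you can plug in boto3.client('kinesis'):

```python
import json
from datetime import datetime, timezone


def build_record(user_id: str, event_type: str,
                 stream_name: str = "UserEventsStream") -> dict:
    """Build the keyword arguments for a Kinesis put_record call."""
    event = {
        "userId": user_id,
        "eventType": event_type,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        # Partitioning by user keeps each user's events ordered within a shard
        "PartitionKey": user_id,
    }


def publish(kinesis_client, user_id: str, event_type: str):
    """Send one event; kinesis_client is expected to be a boto3 Kinesis client."""
    return kinesis_client.put_record(**build_record(user_id, event_type))
```

In an application you would call something like publish(boto3.client('kinesis', region_name='us-east-1'), 'u-42', 'click').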

2. Setup & Implementation

2.1 Create the Kinesis data stream

aws kinesis create-stream \
  --stream-name UserEventsStream \
  --shard-count 2 \
  --region us-east-1

2.2 Create DynamoDB table for features

aws dynamodb create-table \
  --table-name RealTimeFeatures \
  --attribute-definitions AttributeName=userId,AttributeType=S \
  --key-schema AttributeName=userId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

2.3 Lambda function to compute features

Here is a Python snippet (using boto3) which will be triggered by Kinesis:

import base64
import json
import boto3
from datetime import datetime, timedelta

dynamo = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamo.Table('RealTimeFeatures')

def lambda_handler(event, context):
    for record in event['Records']:
        # Kinesis delivers the record payload Base64-encoded
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        user_id = payload['userId']
        event_type = payload['eventType']
        ts = datetime.fromisoformat(payload['timestamp'])

        # Fetch current features
        resp = table.get_item(Key={'userId': user_id})
        item = resp.get('Item', {})
        
        # Derive features: e.g., event_count_last_5min, last_event_type
        last_update = item.get('lastUpdate', ts.isoformat())
        count_5min = item.get('count5min', 0)
        then = datetime.fromisoformat(last_update)
        if ts - then < timedelta(minutes=5):
            count_5min += 1
        else:
            count_5min = 1
        
        # Update feature item
        new_item = {
            'userId': user_id,
            'lastEventType': event_type,
            'count5min': count_5min,
            'lastUpdate': ts.isoformat()
        }
        table.put_item(Item=new_item)
    return {'statusCode': 200}
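The counter in the handler above is reset based on the gap since the previous event, so it only approximates a 5-minute window. If you need a stricter sliding window, one option is to keep the recent event timestamps and prune them on each update. This is a sketch of that idea rather than the method used above. The update_window helper is hypothetical, and it assumes you store the list in an extra item attribute:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)


def update_window(recent, ts, window=WINDOW):
    """Return the ISO timestamps still inside the window ending at ts.

    recent is a list of ISO-8601 strings from earlier events; ts is the
    datetime of the new event, which is always included in the result.
    """
    cutoff = ts - window
    kept = [t for t in recent if datetime.fromisoformat(t) > cutoff]
    kept.append(ts.isoformat())
    return kept
```

The feature value is then len(update_window(...)), and the returned list is written back to the item alongside it.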

2.4 Deploy and connect Lambda to Kinesis

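Create the Lambda function in the AWS console or via the CLI.
Add the Kinesis stream UserEventsStream as an event source, setting a batch size and a starting position of TRIM_HORIZON.
Assign an IAM role that allows kinesis:DescribeStream, kinesis:GetShardIterator, kinesis:GetRecords, kinesis:ListShards, dynamodb:GetItem, dynamodb:PutItem, and write access to CloudWatch Logs.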
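If you prefer to script the event source step, it could look roughly like the sketch below. The connect_stream helper and the batch size of 100 are assumptions for illustration. The Lambda client is passed in so you can supply boto3.client('lambda'):

```python
def connect_stream(lambda_client, stream_arn, function_name, batch_size=100):
    """Attach a Kinesis stream to a Lambda function as an event source.

    lambda_client is expected to be a boto3 Lambda client.
    """
    return lambda_client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName=function_name,
        StartingPosition="TRIM_HORIZON",  # start from the oldest available record
        BatchSize=batch_size,
    )
```

For the stream created earlier, the ARN would have the form arn:aws:kinesis:us-east-1:<account-id>:stream/UserEventsStream, and function_name is whatever you named your Lambda function.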

2.5 Prepare SageMaker endpoint for inference

Train the model offline (outside the scope of this post) on a training dataset whose features match the real-time features.
Deploy the model as an endpoint, e.g., arn:aws:sagemaker:us-east-1:123456789012:endpoint/RealtimeModel.
In your application code, fetch the features from DynamoDB and then invoke the endpoint:

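import json
from decimal import Decimal

import boto3

sagemaker = boto3.client('sagemaker-runtime', region_name='us-east-1')
dynamo = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamo.Table('RealTimeFeatures')

def to_jsonable(value):
    # DynamoDB returns numbers as Decimal, which json.dumps cannot serialize
    if isinstance(value, Decimal):
        return float(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")

def get_prediction(user_id, input_payload):
    resp = table.get_item(Key={'userId': user_id})
    # Empty dict if the user has no stored features yet
    features = resp.get('Item', {})
    payload = {
        'features': features,
        'input': input_payload
    }
    response = sagemaker.invoke_endpoint(
        EndpointName='RealtimeModel',
        ContentType='application/json',
        Body=json.dumps(payload, default=to_jsonable)
    )
    result = json.loads(response['Body'].read().decode())
    return result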

Conclusion

In this blog post you learned how to build a real-time feature store on AWS: streaming event ingestion with Kinesis, real-time feature computation with Lambda, storage in DynamoDB, and serving via SageMaker. You saw code examples for each stage of the pipeline. With this setup, you’re well-positioned to deliver low-latency, ML-powered applications.

Enjoy the cloud
Osama

Automating Cost-Governance Workflows in Oracle Cloud Infrastructure (OCI) with APIs & Infrastructure as Code

Posted on October 24, 2025 by Osama Mustafa in Cloud, OCI

Introduction

Cloud cost management isn’t just about checking invoices once a month. It’s about embedding automation, governance, and insights into your infrastructure so that your engineering teams make cost-aware decisions in real time. With OCI, you have native tools (Cost Analysis, Usage APIs, Budgets, etc.) and infrastructure-as-code (IaC) tooling that can help turn cost governance from an afterthought into a proactive part of your DevOps workflow.

In this article you’ll learn how to:

Extract usage and cost data via the OCI Usage API / Cost Reports.
Define IaC workflows (e.g., with Terraform) that enforce budget/usage guardrails.
Build a simple example where you automatically tag resources, monitor spend by tag, and alert/correct when thresholds are exceeded.
Discuss best practices, pitfalls, and governance recommendations for embedding FinOps into OCI operations.

1. Understanding OCI Cost & Usage Data

What data is available?

OCI provides several cost/usage-data mechanisms:

The Cost Analysis tool in the console allows you to view trends by service, compartment, tag, etc.
The Usage/Cost Reports (CSV format) which you can download or programmatically access via the Usage API.
The Usage API (CLI/SDK) to query usage-and-cost programmatically.

Why this matters

By surfacing cost data at a resource, compartment, or tag level, teams can answer questions like:

“Which tag values are consuming cost disproportionately?”
“Which compartments have heavy spend growth month-over-month?”
“Which services (Compute, Storage, Database, etc.) are the highest spenders and require optimization?”

Example: Downloading a cost report via CLI

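The two snippets below show how to download a report CSV from your tenancy, first with the OCI CLI and then with the Python SDK. Reports live in the Oracle-owned bling namespace, in a bucket named after your tenancy OCID. Usage reports sit under reports/usage-csv/ and cost reports under reports/cost-csv/, so adjust the prefix to the report type you need: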

oci os object get \
  --namespace-name bling \
  --bucket-name <your-tenancy-OCID> \
  --name reports/usage-csv/<report_name>.csv.gz \
  --file local_report.csv.gz

import oci
config = oci.config.from_file("~/.oci/config", "DEFAULT")
os_client = oci.object_storage.ObjectStorageClient(config)
namespace = "bling"
bucket = "<your-tenancy-OCID>"
object_name = "reports/usage-csv/2025-10-19-report-00001.csv.gz"

resp = os_client.get_object(namespace, bucket, object_name)
with open("report-2025-10-19.csv.gz", "wb") as f:
    for chunk in resp.data.raw.stream(1024*1024, decode_content=False):
        f.write(chunk)
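Once the file is on disk, answering questions such as "which services are the highest spenders" is mostly a matter of grouping rows. The sketch below assumes a cost report (not a usage report) and assumes the column names product/service and cost/myCost, so check them against the header row of your own file. The cost_by_column helper is hypothetical:

```python
import csv
import gzip
from collections import defaultdict


def cost_by_column(path, group_col="product/service", cost_col="cost/myCost"):
    """Sum the cost column of a gzipped report CSV, grouped by another column."""
    totals = defaultdict(float)
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = row.get(group_col) or "unknown"
            # Empty or missing cost cells count as zero
            totals[key] += float(row.get(cost_col) or 0)
    return dict(totals)
```

Passing a tag column as group_col gives the per-tag view instead of the per-service view.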

2. Defining Cost-Governance Workflows with IaC

Once you have data flowing in, you can enforce guardrails and automate actions. Here’s one example pattern.

a) Enforce tagging rules

Ensure that every resource created in a compartment has a cost_center tag (for example). You can do this via policy + IaC.

# Example Terraform tag namespace and tag definition for the tagging requirement
resource "oci_identity_tag_namespace" "governance" {
  compartment_id = var.compartment_id
  name           = "governance_tags"
  description    = "Tags used for cost governance"
  is_retired     = false
}

resource "oci_identity_tag" "cost_center" {
  tag_namespace_id = oci_identity_tag_namespace.governance.id
  name             = "cost_center"
  description      = "Cost Center code for FinOps tracking"
  is_retired       = false
}

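You can then add an IAM policy that prevents creation of resources if the tag isn't applied (or fails to meet allowed values). The statement below is illustrative, so check the exact condition syntax against the IAM policy reference before using it. Tag defaults with a required value are another way to enforce the tag at creation time. For example:

Allow group ComputeAdmins to manage instance-family in compartment Prod
  where request.operation = 'LaunchInstance'
  and request.resource.tag.cost_center is not null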

b) Monitor vs budget

Use the Usage API or Cost Reports to pull monthly spend per tag, then compare against defined budgets. If thresholds are exceeded, trigger an alert or remediation.

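Here's an example Python sketch. The budget_for, send_alert, and take_remediation helpers are placeholders for your own logic, and the budget figure is illustrative:

from collections import defaultdict
from datetime import datetime, timedelta, timezone
import oci

BUDGETS = {"CC-101": 1000.0}  # placeholder monthly budgets per cost center

def budget_for(tag_value):
    return BUDGETS.get(tag_value, float("inf"))

def send_alert(tag_value, cost):
    print(f"ALERT: cost_center={tag_value} is over budget at {cost:.2f}")

def take_remediation(tag_value):
    pass  # placeholder: e.g., stop non-critical compute for this cost center

config = oci.config.from_file()
usage_client = oci.usage_api.UsageapiClient(config)

# The Usage API expects timestamps truncated to midnight UTC
today = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
start = today.replace(day=1)
end = today + timedelta(days=1)

req = oci.usage_api.models.RequestSummarizedUsagesDetails(
    tenant_id=config["tenancy"],
    time_usage_started=start,
    time_usage_ended=end,
    granularity="DAILY",
    query_type="COST",
    # Group by a single defined tag (namespace + key)
    group_by_tag=[oci.usage_api.models.Tag(namespace="governance_tags", key="cost_center")],
)

resp = usage_client.request_summarized_usages(req)

# DAILY granularity returns one row per day, so sum them for month-to-date spend
totals = defaultdict(float)
for item in resp.data.items:
    values = [t.value for t in (item.tags or []) if t.key == "cost_center" and t.value]
    totals[values[0] if values else "untagged"] += float(item.computed_amount or 0)

for tag_value, cost in totals.items():
    print(f"Cost for cost_center={tag_value}: {cost}")
    if cost > budget_for(tag_value):
        send_alert(tag_value, cost)
        take_remediation(tag_value)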

c) Automated remediation

Remediation could mean:

Auto-shut down non-production instances in compartments after hours.
Resize or terminate idle resources.
Notify owners of over-spend via email/Slack.

Terraform, OCI Functions, and the Events service can help orchestrate that. For example, set up an Event when “cost by compartment exceeds X” → invoke Function → tag resources with “cost_alerted” → optional shutdown.
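As a rough illustration of the first remediation option, the sketch below stops running non-production instances outside business hours. Everything about the policy is an assumption: the 08:00 to 19:00 window, the environment field on each instance record, and the helper names. The compute client is passed in, and the only SDK call it relies on is ComputeClient.instance_action with the SOFTSTOP action:

```python
from datetime import time


def is_after_hours(now, start=time(8, 0), end=time(19, 0)):
    """True when the given time of day falls outside the business window."""
    return not (start <= now < end)


def stop_non_production(compute_client, instances, now):
    """Soft-stop running non-prod instances after hours; return the stopped IDs.

    instances is an iterable of dicts with the keys id, environment and
    lifecycle_state; compute_client is expected to be an oci.core.ComputeClient.
    """
    stopped = []
    if not is_after_hours(now):
        return stopped
    for inst in instances:
        if inst["environment"] != "prod" and inst["lifecycle_state"] == "RUNNING":
            compute_client.instance_action(inst["id"], "SOFTSTOP")
            stopped.append(inst["id"])
    return stopped
```

Inside an OCI Function you would build the instances list from your own inventory or tag lookup and pass the current time of day as now.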

3. Putting It All Together

Here is a step-by-step scenario:

Define budget categories – e.g., cost_center codes: CC-101, CC-202, CC-303.
Tag resources on creation – via policy/IaC ensure all resources include cost_center tag with one of those codes.
Collect cost data – using Usage API daily, group by tag.cost_center.
Evaluate current spend vs budget – for each code, compare cumulative cost for current month against budget.
If over budget – then:
- send an alert to the team (via OCI Notifications, email, Slack)
- optionally trigger remediation: e.g., stop non-critical compute in that cost center’s compartments.
Dashboard & visibility – load cost data into a BI tool (such as Oracle Analytics Cloud) with trends, forecasts, anomaly detection. Use the “Show cost” option in OCI Ops Insights to view usage and forecast cost.
Continuous improvement – right-size instances, pause dev/test at night, switch to cheaper shapes or reserved/commit models (depending on your discount model). See OCI best practice guide for optimizing cost.
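The evaluation step in the scenario above can be sketched as a small pure function. The 80% warning ratio and the classify_spend name are assumptions, and the budget amounts you pass in are your own:

```python
def classify_spend(spend, budgets, warn_ratio=0.8):
    """Label each cost center as ok, warning or over.

    spend and budgets map cost_center codes to month-to-date cost and
    monthly budget respectively; cost centers with no spend count as 0.
    """
    status = {}
    for cost_center, budget in budgets.items():
        used = spend.get(cost_center, 0.0)
        if used > budget:
            status[cost_center] = "over"
        elif used >= warn_ratio * budget:
            status[cost_center] = "warning"
        else:
            status[cost_center] = "ok"
    return status
```

A "warning" status is a natural trigger for the alert step, while "over" can feed the optional remediation step.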

Example snippet – alerting logic in CLI

# example command to get summarized usage for last 7 days (GNU date syntax)
oci usage-api usage-summary request-summarized-usages \
  --tenant-id $TENANCY_OCID \
  --time-usage-started $(date -u -d '-7 days' +%Y-%m-%dT00:00:00Z) \
  --time-usage-ended   $(date -u +%Y-%m-%dT00:00:00Z) \
  --granularity DAILY \
  --group-by-tag '[{"namespace":"governance_tags","key":"cost_center"}]' \
  --query "data.items[?tags[?value=='CC-101']].\"computed-amount\"" \
  --raw-output

Enjoy the OCI
Osama