Orchestrating Production Workflows with AWS Step Functions

Posted on June 1, 2026 by Osama Mustafa in Cloud, AWS

I want to tell you about a production incident that still bothers me.

We had a payment processing system built on Lambda. Each function did one thing: validate the card, charge the customer, update the order, send the receipt, trigger fulfillment. Clean separation of concerns. Looked great on paper.

Then a Lambda timed out in the middle of the charge step. The card had been charged. The order had not been updated. The receipt never went out. Fulfillment never started. And because there was no central record of what had run, we had no way to resume from where things broke. We ended up with a manual cleanup process, a refund, and an angry customer.

The root problem was not the timeout. The root problem was that we had orchestration logic scattered across function calls, SQS queues, and environment variables. When something went wrong, we had no visibility and no way to recover cleanly.

AWS Step Functions exists to solve exactly this problem. It gives you a managed, visual, stateful orchestration layer that sits above your compute. In this article I will walk you through how Step Functions actually works, the patterns that matter in production, and the mistakes I see teams make when they first adopt it.

What Step Functions Actually Does

Step Functions is a serverless orchestration service. You define a workflow as a state machine using Amazon States Language, a JSON-based specification. Each state in the machine can invoke a Lambda function, call an AWS service directly, wait for a human approval, run a parallel branch, or retry on failure with configurable backoff.

The key thing that separates Step Functions from gluing Lambdas together with SQS is that the state machine itself is the source of truth. Every execution has a complete audit trail. You can look at any execution and see exactly which states ran, what input and output they received, when they ran, and whether they succeeded or failed. When something goes wrong you have a complete picture.

There are two workflow types and the choice matters.

Standard Workflows are designed for long-running, durable processes. They can run for up to a year. Every state transition is recorded in the execution history. You pay per state transition. This is what you want for anything involving payments, order processing, document workflows, or human approvals.

Express Workflows are designed for high-volume, short-duration workloads. They run for up to five minutes, have at-least-once execution semantics, and you pay per execution duration. Use them for event processing pipelines where you need to handle thousands of events per second and idempotency is handled at the application level.

Your First Production State Machine

Let me walk through a real example: an e-commerce order processing workflow. This is a Standard Workflow since order processing is exactly the kind of thing you need full durability and auditability for.

			
{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:validate-order",
      "Next": "CheckInventory",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["OrderValidationError"],
          "Next": "OrderRejected",
          "ResultPath": "$.error"
        }
      ]
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:check-inventory",
      "Next": "ProcessPayment",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 5,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["InsufficientInventoryError"],
          "Next": "NotifyOutOfStock",
          "ResultPath": "$.error"
        }
      ]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:process-payment",
      "Next": "FulfillmentAndNotification",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException"],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["PaymentDeclinedError"],
          "Next": "NotifyPaymentFailed",
          "ResultPath": "$.error"
        },
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "OrderProcessingFailed",
          "ResultPath": "$.error"
        }
      ]
    },
    "FulfillmentAndNotification": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "TriggerFulfillment",
          "States": {
            "TriggerFulfillment": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789:function:trigger-fulfillment",
              "End": true
            }
          }
        },
        {
          "StartAt": "SendConfirmationEmail",
          "States": {
            "SendConfirmationEmail": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789:function:send-email",
              "End": true
            }
          }
        }
      ],
      "Next": "OrderComplete"
    },
    "OrderComplete":         { "Type": "Succeed" },
    "OrderRejected":         { "Type": "Fail", "Error": "OrderRejected" },
    "NotifyOutOfStock":      { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789:function:notify-out-of-stock", "End": true },
    "NotifyPaymentFailed":   { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789:function:notify-payment-failed", "End": true },
    "OrderProcessingFailed": { "Type": "Fail", "Error": "ProcessingFailed" }
  }
}

		

A few things worth pointing out in this definition.

The Retry blocks on each Task state handle transient failures automatically. The configuration above retries on Lambda service exceptions with exponential backoff. You get this behavior for free without writing any retry logic in your Lambda functions themselves.

The Catch blocks handle business-logic failures separately from infrastructure failures. A PaymentDeclinedError routes to a notification state. An unhandled exception routes to a generic failure state. The ResultPath ensures the error detail is written into the execution context alongside the original input, not replacing it.

The Parallel state in FulfillmentAndNotification runs fulfillment and email simultaneously. Both branches must complete before the workflow advances to OrderComplete. If either branch fails, the entire Parallel state fails. This is often exactly the behavior you want: do not mark the order complete until both downstream systems have been notified.

SDK Integrations: Stop Writing Wrapper Lambdas

One of the most common mistakes I see is writing Lambda functions whose only job is to call another AWS service. A Lambda that calls DynamoDB to write a record. A Lambda that sends an SNS message. A Lambda that starts a Glue job.

Step Functions has optimized integrations with over 220 AWS services. You can call these services directly from a state definition without a Lambda in the middle.

Here is a state that writes directly to DynamoDB:

			
"SaveOrderToDynamo": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Parameters": {
    "TableName": "orders",
    "Item": {
      "orderId":    { "S.$": "$.orderId" },
      "customerId": { "S.$": "$.customerId" },
      "status":     { "S": "CONFIRMED" },
      "totalAmount":{ "N.$": "States.Format('{}', $.totalAmount)" },
      "createdAt":  { "S.$": "$$.Execution.StartTime" }
    }
  },
  "Next": "SendToSNS"
}

		

And a state that publishes to SNS:

			
"SendToSNS": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish",
  "Parameters": {
    "TopicArn": "arn:aws:sns:us-east-1:123456789:order-events",
    "Message": {
      "orderId.$":    "$.orderId",
      "customerId.$": "$.customerId",
      "status":       "CONFIRMED"
    }
  },
  "Next": "OrderComplete"
}

		

The .$ suffix on a key means “resolve this from the state input.” The $$.Execution.StartTime is a context object reference that gives you metadata about the current execution. These small conveniences add up significantly when building real workflows.

Removing wrapper Lambdas reduces cold starts, lowers your Lambda invocation costs, simplifies your IAM surface, and makes the workflow easier to read because every state’s purpose is self-evident.

The Wait for Callback Pattern

Some workflows cannot move forward until something external happens. A human needs to approve a refund. A third-party payment processor needs to call back. A document needs to pass a review queue.

Step Functions handles this with the waitForTaskToken integration pattern. The state machine pauses, sends a token to an external system, and resumes only when that token is returned.

Here is the state definition:

			
"WaitForManagerApproval": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789/approval-queue",
    "MessageBody": {
      "taskToken.$":  "$$.Task.Token",
      "orderId.$":    "$.orderId",
      "amount.$":     "$.totalAmount",
      "requestedBy.$":"$.customerId"
    }
  },
  "HeartbeatSeconds": 3600,
  "Next": "ProcessApprovedRefund",
  "Catch": [
    {
      "ErrorEquals": ["ApprovalRejected"],
      "Next": "NotifyRejected"
    },
    {
      "ErrorEquals": ["States.HeartbeatTimeout"],
      "Next": "EscalateApproval"
    }
  ]
}

		

The approval service picks up the message, presents it to a manager, and then calls back:

			
import boto3
sfn = boto3.client("stepfunctions")
def handle_approval_decision(task_token: str, approved: bool, reason: str):
    if approved:
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "approvedBy": "manager@company.com"})
        )
    else:
        sfn.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",
            cause=reason
        )

		

The HeartbeatSeconds field is important. If the external system does not send a heartbeat or complete the task within that window, the state fails with a HeartbeatTimeout. In the example above that routes to an escalation state rather than silently hanging forever. Always set a heartbeat on any waitForTaskToken state.

Deploying with Terraform

Defining your state machine in the console is fine for exploration. In production, everything should be in code.

			
resource "aws_sfn_state_machine" "order_processing" {
  name     = "order-processing-workflow"
  role_arn = aws_iam_role.step_functions_role.arn
  type     = "STANDARD"
  definition = templatefile("${path.module}/state_machine.json", {
    validate_order_arn    = aws_lambda_function.validate_order.arn
    check_inventory_arn   = aws_lambda_function.check_inventory.arn
    process_payment_arn   = aws_lambda_function.process_payment.arn
    trigger_fulfillment_arn = aws_lambda_function.trigger_fulfillment.arn
    send_email_arn        = aws_lambda_function.send_email.arn
  })
  logging_configuration {
    level                  = "ALL"
    include_execution_data = true
    log_destination        = "${aws_cloudwatch_log_group.sfn_logs.arn}:*"
  }
  tracing_configuration {
    enabled = true
  }
}
resource "aws_iam_role" "step_functions_role" {
  name = "step-functions-order-processing-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "states.amazonaws.com" }
    }]
  })
}
resource "aws_iam_role_policy" "sfn_policy" {
  name = "sfn-order-processing-policy"
  role = aws_iam_role.step_functions_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["lambda:InvokeFunction"]
        Resource = [
          aws_lambda_function.validate_order.arn,
          aws_lambda_function.check_inventory.arn,
          aws_lambda_function.process_payment.arn,
          aws_lambda_function.trigger_fulfillment.arn,
          aws_lambda_function.send_email.arn
        ]
      },
      {
        Effect   = "Allow"
        Action   = ["logs:CreateLogDelivery", "logs:PutLogEvents", "logs:GetLogDelivery"]
        Resource = "*"
      },
      {
        Effect   = "Allow"
        Action   = ["xray:PutTraceSegments", "xray:PutTelemetryRecords"]
        Resource = "*"
      }
    ]
  })
}
resource "aws_cloudwatch_log_group" "sfn_logs" {
  name              = "/aws/states/order-processing"
  retention_in_days = 30
}

		

Using templatefile to inject Lambda ARNs into the state machine definition keeps your infrastructure code clean and makes it easy to reference the correct function ARN for each environment without hardcoding anything.

Observability in Production

Step Functions gives you three layers of observability out of the box when you configure them properly.

CloudWatch Metrics publishes execution counts, failure rates, and durations for every state machine automatically. Set alarms on ExecutionsFailed and ExecutionsTimedOut. For payment or order workflows, a single failed execution is worth an alert. For high-volume event pipelines, set a threshold based on your acceptable failure rate.

CloudWatch Logs with include_execution_data = true captures the full input and output of every state transition. This is the setting that makes debugging possible. Without it, you know a state failed but not what data it received. With it, you can replay the exact scenario that caused the failure.

X-Ray tracing propagates trace context through Lambda invocations triggered by your state machine. In the AWS console, you get a service map showing exactly where time was spent across each execution. For workflows where latency matters, this is the fastest way to identify the bottleneck.

One practical tip: write a CloudWatch Insights query that you can run immediately when an incident starts.

			
fields @timestamp, execution_arn, type, details.name, details.status
| filter type in ["ExecutionFailed", "TaskFailed", "TaskStateExited"]
| sort @timestamp desc
| limit 50

Save this query before you need it. Running it during an incident is much faster than clicking through individual executions.

Common Mistakes

Not setting ResultPath on Catch handlers. By default, a Catch block replaces the entire state input with the error object. Your downstream states then receive only the error, not the original order data they need. Always use "ResultPath": "$.error" to merge the error into the existing input.

Using Express Workflows for payment processing. Express Workflows have at-least-once semantics. A state can execute more than once under failure conditions. For anything involving money or external side effects, use Standard Workflows with idempotency keys in your Lambda functions, or use Standard Workflows period.

Ignoring the execution history limit. Standard Workflow execution history is capped at 25,000 events. For very long-running workflows with many state transitions, you can hit this limit. If your workflow runs for days or weeks with thousands of steps, use the Map state with chunking to keep individual execution histories manageable.

Hardcoding ARNs in state machine definitions. Environment-specific ARNs belong in Terraform variables or SSM Parameter Store, not in your state machine JSON. The pattern shown above with templatefile keeps this clean.

Step Functions does not eliminate complexity. What it does is make complexity visible and manageable. Your business logic lives in Lambda. Your orchestration logic lives in the state machine. When something fails, you have a complete, queryable record of exactly what happened and where.

The teams that get the most value from Step Functions are the ones that resist the temptation to build orchestration logic into their Lambda functions. Keep each function focused on a single responsibility. Let the state machine handle sequencing, retries, error routing, and parallelism. The result is a system where debugging takes minutes instead of hours and where new team members can understand the full workflow by reading a single JSON file.

Enjoy the cloud.

Osama

Cross-Cloud Secret Synchronization: AWS Secrets Manager and OCI Vault in a Production Multi-Cloud Setup

Posted on April 24, 2026 by Osama Mustafa in Uncategorized

One of the most overlooked problems in multi-cloud environments is secrets management across providers. Teams usually solve it badly: they store the same secret in both clouds manually, forget to rotate one of them, and find out during an outage that the credentials have been out of sync for three months.

In this post I will walk through building an automated secrets synchronization pipeline between AWS Secrets Manager and OCI Vault. When a secret rotates in AWS, the pipeline detects the rotation event, retrieves the new value, and pushes it into OCI Vault automatically. Everything is built with Terraform, an AWS Lambda function, and OCI IAM. No manual steps after the initial deployment.

This is a pattern I have used in environments where the database layer runs on OCI (leveraging Oracle Database pricing and performance) while the application layer runs on AWS. Both sides need the same database credentials, and both sides need to stay in sync without human intervention.

Architecture

The flow works like this:

AWS Secrets Manager rotation event fires via EventBridge, which triggers a Lambda function. The Lambda retrieves the new secret value, authenticates to OCI using an API key stored in its own environment (not hardcoded), and calls the OCI Vault API to update the corresponding secret version. OCI Vault stores the new value and makes it available to workloads running in OCI.

Prerequisites

Before starting you need:

AWS account with permissions to manage Secrets Manager, Lambda, EventBridge, and IAM
OCI tenancy with permissions to manage Vault, Keys, and IAM policies
Terraform 1.5 or later
Python 3.11 for the Lambda function
An existing OCI Vault and master encryption key (or we will create one)

Step 1: OCI Vault and IAM Setup

Start with OCI. We need a Vault, a master key, and an IAM user whose API key the Lambda will use to authenticate.

hcl

			
# OCI Vault
resource "oci_kms_vault" "app_vault" {
  compartment_id = var.compartment_id
  display_name   = "multi-cloud-secrets-vault"
  vault_type     = "DEFAULT"
}
# Master Encryption Key inside the Vault
resource "oci_kms_key" "secrets_key" {
  compartment_id      = var.compartment_id
  display_name        = "secrets-master-key"
  management_endpoint = oci_kms_vault.app_vault.management_endpoint
  key_shape {
    algorithm = "AES"
    length    = 32
  }
}
# IAM user for cross-cloud access
resource "oci_identity_user" "sync_user" {
  compartment_id = var.tenancy_ocid
  name           = "aws-secrets-sync-user"
  description    = "Service user for AWS Lambda to push secrets into OCI Vault"
  email          = "sync-user@internal.example.com"
}
# API key for the sync user (you will generate the actual key pair separately)
resource "oci_identity_api_key" "sync_user_key" {
  user_id   = oci_identity_user.sync_user.id
  key_value = var.oci_sync_user_public_key_pem
}
# IAM group for the sync user
resource "oci_identity_group" "sync_group" {
  compartment_id = var.tenancy_ocid
  name           = "secrets-sync-group"
  description    = "Group for cross-cloud secrets sync service users"
}
resource "oci_identity_user_group_membership" "sync_membership" {
  group_id = oci_identity_group.sync_group.id
  user_id  = oci_identity_user.sync_user.id
}
# Minimal IAM policy - only what is needed, nothing more
resource "oci_identity_policy" "sync_policy" {
  compartment_id = var.compartment_id
  name           = "secrets-sync-policy"
  description    = "Allows sync user to manage secrets in the app vault only"
  statements = [
    "Allow group secrets-sync-group to manage secret-family in compartment id ${var.compartment_id} where target.vault.id = '${oci_kms_vault.app_vault.id}'",
    "Allow group secrets-sync-group to use keys in compartment id ${var.compartment_id} where target.key.id = '${oci_kms_key.secrets_key.id}'"
  ]
}

		

The policy scope is intentionally narrow. The sync user can only manage secrets inside this specific vault and can only use this specific key. If the AWS Lambda credentials are ever compromised, the blast radius is limited to this vault.

Step 2: Create the Initial Secret in OCI Vault

We need a secret placeholder in OCI Vault that the Lambda will update. The initial value does not matter since it will be overwritten on the first sync.

hcl

			
resource "oci_vault_secret" "db_password" {
  compartment_id = var.compartment_id
  vault_id       = oci_kms_vault.app_vault.id
  key_id         = oci_kms_key.secrets_key.id
  secret_name    = "prod-db-password"
  secret_content {
    content_type = "BASE64"
    content      = base64encode("initial-placeholder-value")
    name         = "v1"
    stage        = "CURRENT"
  }
  metadata = {
    source      = "aws-secrets-manager"
    aws_secret  = "prod/database/password"
    environment = "production"
  }
}

		

Step 3: AWS Secrets Manager and the Source Secret

On the AWS side, create the authoritative secret and enable automatic rotation.

hcl

			
resource "aws_secretsmanager_secret" "db_password" {
  name                    = "prod/database/password"
  description             = "Production database password - synced to OCI Vault"
  recovery_window_in_days = 7
  tags = {
    Environment   = "production"
    SyncTarget    = "oci-vault"
    OciSecretName = "prod-db-password"
  }
}
resource "aws_secretsmanager_secret_version" "db_password_v1" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = jsonencode({
    username = "db_admin",
    password = var.initial_db_password,
    host     = var.db_host,
    port     = 1521,
    database = "PRODDB"
  })
}
# Rotation configuration - rotate every 30 days
resource "aws_secretsmanager_secret_rotation" "db_password_rotation" {
  secret_id           = aws_secretsmanager_secret.db_password.id
  rotation_lambda_arn = aws_lambda_function.db_rotation_lambda.arn
  rotation_rules {
    automatically_after_days = 30
  }
}

		

Step 4: Store OCI Credentials in AWS Secrets Manager

The Lambda needs OCI API credentials to authenticate. Store them as a secret in AWS Secrets Manager so they never appear in Lambda environment variables in plaintext.

hcl

			
resource "aws_secretsmanager_secret" "oci_credentials" {
  name        = "internal/oci-sync-credentials"
  description = "OCI API key credentials for secrets sync Lambda"
  tags = {
    Environment = "production"
    Purpose     = "cross-cloud-sync"
  }
}
resource "aws_secretsmanager_secret_version" "oci_credentials_v1" {
  secret_id = aws_secretsmanager_secret.oci_credentials.id
  secret_string = jsonencode({
    tenancy_ocid  = var.oci_tenancy_ocid,
    user_ocid     = var.oci_sync_user_ocid,
    fingerprint   = var.oci_api_key_fingerprint,
    private_key   = var.oci_private_key_pem,
    region        = var.oci_region
  })
}

		

Step 5: The Lambda Function

This is the core of the pipeline. The Lambda retrieves the rotated secret from AWS Secrets Manager, loads OCI credentials from its own secrets store, and calls the OCI Vault API to create a new secret version.

python

			
import boto3
import json
import base64
import oci
import logging
import os
from datetime import datetime, timezone
from botocore.exceptions import ClientError
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def get_oci_config():
    """Retrieve OCI credentials from AWS Secrets Manager."""
    client = boto3.client("secretsmanager", region_name=os.environ["AWS_REGION"])
    
    try:
        response = client.get_secret_value(
            SecretId=os.environ["OCI_CREDENTIALS_SECRET_ARN"]
        )
        creds = json.loads(response["SecretString"])
        
        return {
            "tenancy": creds["tenancy_ocid"],
            "user": creds["user_ocid"],
            "fingerprint": creds["fingerprint"],
            "key_content": creds["private_key"],
            "region": creds["region"]
        }
    except ClientError as e:
        logger.error(f"Failed to retrieve OCI credentials: {e}")
        raise
def get_aws_secret(secret_arn: str) -> str:
    """Retrieve the current value of an AWS secret."""
    client = boto3.client("secretsmanager", region_name=os.environ["AWS_REGION"])
    
    try:
        response = client.get_secret_value(SecretId=secret_arn)
        return response.get("SecretString") or base64.b64decode(
            response["SecretBinary"]
        ).decode("utf-8")
    except ClientError as e:
        logger.error(f"Failed to retrieve AWS secret {secret_arn}: {e}")
        raise
def push_to_oci_vault(
    oci_config: dict,
    vault_id: str,
    key_id: str,
    secret_ocid: str,
    secret_value: str
):
    """Create a new version of an OCI Vault secret."""
    vaults_client = oci.vault.VaultsClient(oci_config)
    
    encoded_value = base64.b64encode(secret_value.encode("utf-8")).decode("utf-8")
    
    update_details = oci.vault.models.UpdateSecretDetails(
        secret_content=oci.vault.models.Base64SecretContentDetails(
            content_type=oci.vault.models.SecretContentDetails.CONTENT_TYPE_BASE64,
            content=encoded_value,
            name=f"sync-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}",
            stage="CURRENT"
        ),
        metadata={
            "synced_from": "aws-secrets-manager",
            "synced_at": datetime.now(timezone.utc).isoformat()
        }
    )
    
    response = vaults_client.update_secret(
        secret_id=secret_ocid,
        update_secret_details=update_details
    )
    
    logger.info(
        f"OCI secret updated. OCID: {secret_ocid}, "
        f"New version: {response.data.current_version_number}"
    )
    
    return response.data
def handler(event, context):
    """
    EventBridge trigger handler.
    Expects event detail to contain:
      - aws_secret_arn: ARN of the rotated AWS secret
      - oci_secret_ocid: OCID of the target OCI Vault secret
      - oci_vault_id: OCID of the target OCI Vault
      - oci_key_id: OCID of the OCI KMS key
    """
    logger.info(f"Received event: {json.dumps(event)}")
    
    detail = event.get("detail", {})
    aws_secret_arn  = detail.get("aws_secret_arn")
    oci_secret_ocid = detail.get("oci_secret_ocid")
    oci_vault_id    = detail.get("oci_vault_id")
    oci_key_id      = detail.get("oci_key_id")
    
    if not all([aws_secret_arn, oci_secret_ocid, oci_vault_id, oci_key_id]):
        logger.error("Missing required fields in event detail")
        raise ValueError("Event detail must include aws_secret_arn, oci_secret_ocid, oci_vault_id, oci_key_id")
    
    logger.info(f"Syncing secret: {aws_secret_arn} to OCI: {oci_secret_ocid}")
    
    # Step 1: Get OCI credentials
    oci_config = get_oci_config()
    
    # Step 2: Retrieve the rotated AWS secret
    secret_value = get_aws_secret(aws_secret_arn)
    
    # Step 3: Push to OCI Vault
    result = push_to_oci_vault(
        oci_config=oci_config,
        vault_id=oci_vault_id,
        key_id=oci_key_id,
        secret_ocid=oci_secret_ocid,
        secret_value=secret_value
    )
    
    return {
        "statusCode": 200,
        "body": {
            "message": "Secret synced successfully",
            "oci_secret_ocid": oci_secret_ocid,
            "oci_version": result.current_version_number
        }
    }

		

Step 6: Lambda IAM Role and Deployment

hcl

			
data "aws_iam_policy_document" "lambda_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}
data "aws_iam_policy_document" "lambda_permissions" {
  statement {
    effect = "Allow"
    actions = [
      "secretsmanager:GetSecretValue",
      "secretsmanager:DescribeSecret"
    ]
    resources = [
      aws_secretsmanager_secret.db_password.arn,
      aws_secretsmanager_secret.oci_credentials.arn
    ]
  }
  statement {
    effect = "Allow"
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ]
    resources = ["arn:aws:logs:*:*:*"]
  }
}
resource "aws_iam_role" "sync_lambda_role" {
  name               = "secrets-sync-lambda-role"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}
resource "aws_iam_role_policy" "sync_lambda_policy" {
  name   = "secrets-sync-lambda-policy"
  role   = aws_iam_role.sync_lambda_role.id
  policy = data.aws_iam_policy_document.lambda_permissions.json
}
resource "aws_lambda_function" "secrets_sync" {
  filename         = "${path.module}/lambda/secrets_sync.zip"
  function_name    = "oci-secrets-sync"
  role             = aws_iam_role.sync_lambda_role.arn
  handler          = "main.handler"
  runtime          = "python3.11"
  timeout          = 60
  memory_size      = 256
  source_code_hash = filebase64sha256("${path.module}/lambda/secrets_sync.zip")
  environment {
    variables = {
      OCI_CREDENTIALS_SECRET_ARN = aws_secretsmanager_secret.oci_credentials.arn
      AWS_REGION                 = var.aws_region
    }
  }
  layers = [aws_lambda_layer_version.oci_sdk_layer.arn]
}

		

Bundle the OCI Python SDK as a Lambda Layer so the function does not need to package it inline:

bash

			
mkdir -p lambda_layer/python
pip install oci --target lambda_layer/python
cd lambda_layer && zip -r ../oci_sdk_layer.zip python/

hcl

			
resource "aws_lambda_layer_version" "oci_sdk_layer" {
  filename            = "${path.module}/oci_sdk_layer.zip"
  layer_name          = "oci-python-sdk"
  compatible_runtimes = ["python3.11"]
  source_code_hash    = filebase64sha256("${path.module}/oci_sdk_layer.zip")
}

		

Step 7: EventBridge Rule to Trigger on Rotation

hcl

			
resource "aws_cloudwatch_event_rule" "secret_rotation_rule" {
  name        = "detect-secret-rotation"
  description = "Fires when a Secrets Manager secret rotation completes"
  event_pattern = jsonencode({
    source      = ["aws.secretsmanager"],
    detail-type = ["AWS API Call via CloudTrail"],
    detail = {
      eventSource = ["secretsmanager.amazonaws.com"],
      eventName   = ["RotateSecret", "PutSecretValue"]
    }
  })
}
resource "aws_cloudwatch_event_target" "sync_lambda_target" {
  rule      = aws_cloudwatch_event_rule.secret_rotation_rule.name
  target_id = "SyncToOCI"
  arn       = aws_lambda_function.secrets_sync.arn
  input_transformer {
    input_paths = {
      secret_arn = "$.detail.requestParameters.secretId"
    }
    input_template = <<EOF
{
  "detail": {
    "aws_secret_arn": "<secret_arn>",
    "oci_secret_ocid": "${var.oci_db_password_secret_ocid}",
    "oci_vault_id": "${oci_kms_vault.app_vault.id}",
    "oci_key_id": "${oci_kms_key.secrets_key.id}"
  }
}
EOF
  }
}
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.secrets_sync.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.secret_rotation_rule.arn
}

		

Step 8: Verifying the Pipeline

Manually trigger a rotation to test the full pipeline without waiting 30 days:

bash

			
# Force a rotation in AWS
aws secretsmanager rotate-secret \
  --secret-id prod/database/password \
  --region us-east-1
# Check Lambda execution logs
aws logs tail /aws/lambda/oci-secrets-sync --follow
# Verify the new version appeared in OCI Vault
oci vault secret get \
  --secret-id <your-oci-secret-ocid> \
  --query 'data.{name:secret-name, version:"current-version-number", updated:"time-of-current-version-need-rotation"}' \
  --output table

		

A successful sync produces output similar to this in the Lambda logs:

			
INFO: Syncing secret: arn:aws:secretsmanager:us-east-1:123456789:secret:prod/database/password to OCI: ocid1.vaultsecret.oc1...
INFO: OCI secret updated. OCID: ocid1.vaultsecret.oc1..., New version: 3

Handling Failures and Drift

The pipeline as built is synchronous and event-driven, which means if the Lambda fails, the OCI secret does not get updated. Add a dead-letter queue and a reconciliation function that runs on a schedule to catch any drift.

hcl

			
resource "aws_sqs_queue" "sync_dlq" {
  name                      = "secrets-sync-dlq"
  message_retention_seconds = 86400
}
resource "aws_lambda_function_event_invoke_config" "sync_retry" {
  function_name                = aws_lambda_function.secrets_sync.function_name
  maximum_retry_attempts       = 2
  maximum_event_age_in_seconds = 300
  destination_config {
    on_failure {
      destination = aws_sqs_queue.sync_dlq.arn
    }
  }
}

		

For reconciliation, a scheduled Lambda that runs every hour compares the LastRotatedDate on the AWS secret against the synced_at metadata tag on the OCI secret. If they differ by more than five minutes, it triggers a forced sync.

Security Considerations

A few things to keep in mind when running this in production.

The OCI private key stored in AWS Secrets Manager should be rotated periodically, just like any other credential. Add it to your rotation schedule.

Enable CloudTrail in AWS and OCI Audit logging so every access to both secrets stores is recorded. If something is off with the sync, the audit logs tell you exactly which principal made the change and when.

Use VPC endpoints for Secrets Manager in AWS so the Lambda traffic never crosses the public internet when retrieving credentials.

On the OCI side, enable Vault audit logging to the OCI Logging service so every secret version write is captured.

Wrapping Up

This pipeline solves a real operational problem without requiring a third-party secrets broker. AWS Secrets Manager stays the authoritative source. OCI Vault stays current automatically. The only manual step is the initial deployment.

The pattern extends to other cross-cloud credential types. Database connection strings, API tokens, TLS certificates — any secret that needs to exist on both clouds can follow the same EventBridge to Lambda to OCI Vault flow. Extend the Lambda to support a mapping table of AWS secret ARNs to OCI secret OCIDs and one function handles your entire secrets estate across both providers.

Regards,
Osama

Building Generative AI Applications with Vector Databases on AWS

Posted on April 3, 2026 by Osama Mustafa in Uncategorized

A few months ago, I was helping a team that had just integrated an LLM into their product. The use case was straightforward: users ask questions, the LLM answers. They had it running. The demos looked great. Then they went to production.

The model kept confidently making things up. It had no idea about the company’s internal documentation, the latest product specs, or anything that happened after its training cutoff. The team was frustrated. They had the right model, the right infrastructure, but the wrong architecture.

The fix was not fine-tuning. Fine-tuning is expensive, slow, and you have to redo it every time your data changes. The fix was Retrieval Augmented Generation, or RAG. And at the heart of RAG is something called a vector database.

In this article, I will walk you through building a production-grade RAG architecture on AWS. We will cover what vector databases actually are, when to use Aurora pgvector versus OpenSearch versus Amazon Bedrock Knowledge Bases, and how to wire everything together with real code.

What Is a Vector Database and Why Does It Matter

Before writing any infrastructure code, let me explain what problem we are actually solving.

When you work with text, images, or audio in AI systems, the raw data is not what gets compared. Instead, you pass the data through an embedding model, which converts it into a list of numbers called a vector. That vector captures the semantic meaning of the content.

Two sentences that mean the same thing will have vectors that are close to each other in vector space, even if they use completely different words. “The server is down” and “the system is not responding” will be closer to each other than “the server is down” and “I had pasta for lunch.”

A vector database is optimized for one specific operation: given a query vector, find me the N closest vectors in the collection. This is called approximate nearest neighbor search, and it is fundamentally different from SQL WHERE clauses or text search.

In a RAG architecture, the flow looks like this:

You chunk your documents and generate embeddings for each chunk
You store those embeddings in a vector database
When a user asks a question, you generate an embedding for the question
You query the vector database to retrieve the most semantically similar chunks
You pass the question plus those chunks to your LLM as context
The LLM answers based on actual, grounded information

The result is a model that knows your data, stays current as your data changes, and does not hallucinate facts from your knowledge base because the facts are right there in the prompt.

Options on AWS

AWS gives you three serious paths for vector storage, and choosing the wrong one will cost you performance and money.

Amazon Aurora PostgreSQL with pgvector

pgvector is an open source PostgreSQL extension that adds native vector storage and similarity search. If you already run Aurora PostgreSQL, this is often the right starting point.

The extension supports three distance metrics: L2 (Euclidean), inner product, and cosine similarity. For most text embedding use cases, cosine similarity is what you want.

Here is a minimal setup to get you started:

			
-- Enable the extension on your Aurora instance
CREATE EXTENSION vector;
-- Create a table for your document chunks
CREATE TABLE document_chunks (
    id          BIGSERIAL PRIMARY KEY,
    doc_id      TEXT NOT NULL,
    chunk_text  TEXT NOT NULL,
    source_url  TEXT,
    embedding   vector(1536),   -- 1536 dims for text-embedding-3-small
    created_at  TIMESTAMPTZ DEFAULT NOW()
);
-- IVFFlat index for approximate nearest neighbor search
-- lists = sqrt(number of rows) is a good starting point
CREATE INDEX ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

		

			
SELECT
    chunk_text,
    source_url,
    1 - (embedding <=> $1::vector) AS similarity_score
FROM document_chunks
ORDER BY embedding <=> $1::vector
LIMIT 5;

		

The <=> operator computes cosine distance. One minus that gives you similarity.

For production, tune the ivfflat.probes parameter at query time. Higher probes means more accuracy but slower queries. For most use cases, setting it between 10 and 20 is a reasonable balance:

Aurora pgvector is the right choice when your team already knows PostgreSQL, you want to join vector search results with relational data in the same query, or you have an existing Aurora cluster and want to avoid managing another service.

The limitation is scale. Once you push past 10 to 20 million vectors, or you need sub-10ms latency at high concurrency, you will start to feel the ceiling.

Amazon OpenSearch Service with Vector Engine

OpenSearch’s vector engine is built for scale. It uses the HNSW (Hierarchical Navigable Small World) algorithm, which delivers excellent recall and latency even at hundreds of millions of vectors.

Setting up an index for vector search:

			
PUT /documents
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 512
    }
  },
  "mappings": {
    "properties": {
      "doc_id":     { "type": "keyword" },
      "chunk_text": { "type": "text" },
      "source_url": { "type": "keyword" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name":       "hnsw",
          "space_type": "cosinesimil",
          "engine":     "nmslib",
          "parameters": {
            "ef_construction": 512,
            "m": 16
          }
        }
      }
    }
  }
}

		

The ef_construction and m parameters control the index build quality. Higher values give better recall but increase memory usage and indexing time. For most production workloads, m=16 and ef_construction=512 is a solid baseline.

Indexing a document:

			
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
region = "us-east-1"
service = "es"
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                  region, service, session_token=credentials.token)
client = OpenSearch(
    hosts=[{"host": your_opensearch_endpoint, "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)
document = {
    "doc_id":     "product-manual-v3-page-42",
    "chunk_text": "The power button is located on the right side of the device...",
    "source_url": "s3://your-bucket/manuals/product-v3.pdf",
    "embedding":  generate_embedding("The power button is located...")
}
client.index(index="documents", body=document)

		

Querying for semantic similarity:

			
query = {
    "size": 5,
    "query": {
        "knn": {
            "embedding": {
                "vector": generate_embedding(user_question),
                "k": 5
            }
        }
    },
    "_source": ["chunk_text", "source_url"]
}
response = client.search(index="documents", body=query)

		

OpenSearch also lets you combine vector search with traditional filters, which is something pgvector struggles with at scale:

			
hybrid_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": [
                {
                    "knn": {
                        "embedding": {
                            "vector": generate_embedding(user_question),
                            "k": 20
                        }
                    }
                }
            ],
            "filter": [
                { "term": { "product_line": "enterprise" } },
                { "range": { "doc_date": { "gte": "2024-01-01" } } }
            ]
        }
    }
}

		

Retrieving 20 candidates via vector search, then filtering down with metadata, is called pre-filtering, and it is critical when your knowledge base spans multiple products, teams, or access tiers.

Amazon Bedrock Knowledge Bases

If you want the fastest path to production and do not want to manage chunking, embedding, or indexing yourself, Bedrock Knowledge Bases handles all of it.

You point it at an S3 bucket. It crawls your documents, chunks them, generates embeddings using your chosen model, and stores them in an OpenSearch Serverless collection. When you query it, it handles the retrieval and optionally the generation too.

			
resource "aws_bedrockagent_knowledge_base" "product_docs" {
  name     = "product-documentation-kb"
  role_arn = aws_iam_role.bedrock_kb_role.arn
  knowledge_base_configuration {
    type = "VECTOR"
    vector_knowledge_base_configuration {
      embedding_model_arn = "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
    }
  }
  storage_configuration {
    type = "OPENSEARCH_SERVERLESS"
    opensearch_serverless_configuration {
      collection_arn    = aws_opensearchserverless_collection.kb_vectors.arn
      vector_index_name = "bedrock-knowledge-base-default-index"
      field_mapping {
        vector_field   = "bedrock-knowledge-base-default-vector"
        text_field     = "AMAZON_BEDROCK_TEXT_CHUNK"
        metadata_field = "AMAZON_BEDROCK_METADATA"
      }
    }
  }
}
resource "aws_bedrockagent_data_source" "s3_docs" {
  knowledge_base_id = aws_bedrockagent_knowledge_base.product_docs.id
  name              = "s3-product-documentation"
  data_source_configuration {
    type = "S3"
    s3_configuration {
      bucket_arn = aws_s3_bucket.documentation.arn
    }
  }
  vector_ingestion_configuration {
    chunking_configuration {
      chunking_strategy = "SEMANTIC"
      semantic_chunking_configuration {
        max_token       = 300
        buffer_size     = 0
        breakpoint_percentile_threshold = 95
      }
    }
  }
}

		

Querying it from your application:

			
import boto3
bedrock_agent = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
response = bedrock_agent.retrieve_and_generate(
    input={
        "text": user_question
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    "overrideSearchType": "HYBRID"
                }
            }
        }
    }
)
answer = response["output"]["text"]
citations = response["citations"]

		

The HYBRID search type combines vector similarity with keyword search under the hood, which improves recall for queries that contain specific product names, version numbers, or technical terms that embeddings alone sometimes miss.

Chunking Strategy: The Part Everyone Gets Wrong

The quality of your RAG system depends more on how you chunk your documents than on which vector database you choose. I have seen teams spend weeks optimizing their similarity search while their chunking strategy was destroying recall.

A few rules that hold up in practice:

Chunk size matters. Too small and you lose context. Too large and you dilute the semantic signal. For most document types, 300 to 500 tokens with a 50-token overlap between chunks is a reasonable starting point. The overlap ensures that sentences that fall on chunk boundaries are still retrievable.

Chunk by structure when you can. If your documents have headers, sections, or natural breaks, use those as chunk boundaries rather than fixed token counts. A section about “Troubleshooting Network Errors” should stay together rather than getting split at 400 tokens.

Store metadata with every chunk. The chunk text alone is not enough. You need the source document, the section title, the creation date, the product version. This metadata enables the filtering patterns we covered in OpenSearch and prevents your model from citing a three-year-old document when a current one exists.

Test with real queries. The only way to validate your chunking strategy is to run the queries your users will actually ask and check whether the right chunks are being retrieved. Build a small evaluation set early, before you optimize anything else.

Embedding Model Selection

For AWS workloads, you have two main options through Bedrock:

Amazon Titan Text Embeddings V2 produces 1024-dimensional vectors. It is fast, cheap, and fine for general English text. If you are building an internal knowledge base over English documents, this is the right default.

Cohere Embed v3 supports multilingual embeddings and produces 1024-dimensional vectors with better performance on technical and domain-specific text. If your documents cover specialized subject matter legal, medical, engineering Cohere will typically outperform Titan on retrieval quality.

A critical point that is easy to overlook: you must use the same embedding model at indexing time and query time. If you indexed your documents with Titan and query with Cohere, the vectors live in different spaces and your similarity scores will be meaningless. Build this constraint into your infrastructure from day one.

Architecture Summary

For a production RAG system on AWS, here is the architecture that has worked well for teams I have worked with.

Document ingestion: an S3 bucket triggers a Lambda function, or Step Functions for large files. The function chunks the document, generates embeddings via Bedrock, and writes to your vector store with metadata.

Vector storage: Aurora pgvector for under 5 million vectors with heavy relational joins. OpenSearch for everything larger, or when you need metadata filtering at scale. Bedrock Knowledge Bases when you want fully managed infrastructure and your team does not want to own the pipeline.

Query path: API Gateway triggers a Lambda function that embeds the user query, retrieves top-k chunks from the vector store, builds a context-enriched prompt, and calls Claude or another Bedrock model for the final response.

Observability: CloudWatch captures embedding latency, retrieval similarity scores, and end-to-end response time. Set alerts if retrieval quality drops since that is usually a signal that something changed in your document pipeline.

Regards
Osama

Enforcing SLA Compliance with SQL Assertions in Oracle 23ai: A Real-World Use Case

Posted on March 13, 2026 by Osama Mustafa in Uncategorized

One of the most frustrating things I’ve dealt with as a DBA is cleaning up data that should never have existed in the first place. Orphaned records, overlapping date ranges, business rules violated because some batch job skipped a validation step. We’ve all been there.

The traditional solution was triggers. And if you’ve written cross-table validation triggers in Oracle, you know the pain: mutating table errors (ORA-04091), complex exception handling, scattered logic across multiple trigger bodies, and debugging sessions that make you question your career choices.

Starting with Oracle Database 23ai (release 23.26.1), Oracle introduced SQL Assertions, and they change everything about how we enforce cross-table business rules.

What Are SQL Assertions?

An assertion is a schema-level integrity constraint defined by a boolean expression. If that expression evaluates to false during a transaction, the transaction fails. That’s it. The concept has been part of the SQL standard since SQL-92, but no major database vendor actually implemented it until Oracle did it in 23.26.1.

There are two types of assertion expressions:

Existential expressions use [NOT] EXISTS with a subquery. If the condition is true, the transaction proceeds.

Universal expressions use the new ALL ... SATISFY syntax. This lets you say “for every row matching this query, this condition must hold.” It’s Oracle’s elegant alternative to the awkward double-negation pattern (NOT EXISTS ... WHERE NOT EXISTS ...) that SQL traditionally requires for universal quantification.

The Scenario: SLA Compliance for a Ticketing System

Let me show you a real-world use case that goes beyond toy examples. Imagine you run a support ticketing system for an enterprise. You have service level agreements (SLAs) with your customers, and the database needs to enforce these rules:

Every customer must have an active SLA before they can submit a ticket. No SLA, no support.
Tickets can only be created while the customer’s SLA is active (between start and end dates).
High-priority tickets must be assigned to a senior engineer. You can’t assign a critical production issue to a junior team member.
Every SLA must cover at least one service category. An SLA with no covered services is meaningless.

In a traditional Oracle setup, enforcing these rules would require at least four separate triggers across three tables, careful handling of mutating table errors, and a lot of testing to make sure they don’t interfere with each other.

With assertions, each rule is a single declarative statement.

Building the Schema

sql

			
DROP TABLE IF EXISTS tickets       CASCADE CONSTRAINTS PURGE;
DROP TABLE IF EXISTS sla_services  CASCADE CONSTRAINTS PURGE;
DROP TABLE IF EXISTS slas          CASCADE CONSTRAINTS PURGE;
DROP TABLE IF EXISTS engineers     CASCADE CONSTRAINTS PURGE;
DROP TABLE IF EXISTS customers     CASCADE CONSTRAINTS PURGE;
CREATE TABLE customers (
    id          NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name        VARCHAR2(200) NOT NULL,
    company     VARCHAR2(200),
    created_at  TIMESTAMP DEFAULT SYSTIMESTAMP
);
CREATE TABLE engineers (
    id            NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name          VARCHAR2(200) NOT NULL,
    seniority     VARCHAR2(20) CHECK (
                    seniority IN ('junior','mid','senior','lead')
                  ),
    specialization VARCHAR2(100)
);
CREATE TABLE slas (
    id            NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id   NUMBER NOT NULL REFERENCES customers(id),
    sla_tier      VARCHAR2(20) CHECK (
                    sla_tier IN ('bronze','silver','gold','platinum')
                  ),
    start_date    DATE NOT NULL,
    end_date      DATE NOT NULL,
    CONSTRAINT sla_dates_valid CHECK (end_date > start_date)
);
CREATE TABLE sla_services (
    id            NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    sla_id        NUMBER NOT NULL REFERENCES slas(id),
    service_name  VARCHAR2(100) NOT NULL
);
CREATE TABLE tickets (
    id            NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id   NUMBER NOT NULL REFERENCES customers(id),
    engineer_id   NUMBER REFERENCES engineers(id),
    priority      VARCHAR2(20) CHECK (
                    priority IN ('low','medium','high','critical')
                  ),
    subject       VARCHAR2(500) NOT NULL,
    created_at    TIMESTAMP DEFAULT SYSTIMESTAMP,
    status        VARCHAR2(20) DEFAULT 'open' CHECK (
                    status IN ('open','in_progress','resolved','closed')
                  )
);

		

Assertion 1: Customers Need an Active SLA to Submit Tickets

This is the core business rule. No active SLA, no ticket creation.

sql

			
CREATE ASSERTION ticket_requires_active_sla
CHECK (
    ALL (SELECT customer_id, created_at FROM tickets) SATISFY
        EXISTS (
            SELECT 1 FROM slas
            WHERE slas.customer_id = tickets.customer_id
              AND tickets.created_at 
                  BETWEEN slas.start_date AND slas.end_date
        )
);

		

Read that in plain English: “For all tickets, there must exist an SLA for that customer where the ticket creation date falls within the SLA period.”

If someone tries to insert a ticket for a customer whose SLA has expired, the database will reject the transaction. No application code needed. No trigger needed. The rule is declarative and self-documenting.

Assertion 2: High-Priority Tickets Need Senior Engineers

This is a cross-table constraint that would be especially painful with triggers because it spans tickets and engineers.

sql

			
CREATE ASSERTION critical_tickets_need_senior_engineer
CHECK (
    NOT EXISTS (
        SELECT 1
        FROM tickets t
        JOIN engineers e ON t.engineer_id = e.id
        WHERE t.priority IN ('high', 'critical')
          AND e.seniority IN ('junior', 'mid')
    )
);

		

This uses the existential pattern. It looks for any high-priority ticket assigned to a junior or mid-level engineer. If it finds one, the transaction fails. Simple, clear, and impossible to bypass from any application that touches this database.

Assertion 3: Every SLA Must Cover at Least One Service

An SLA without any covered services is a data integrity problem waiting to happen.

sql

			
CREATE ASSERTION sla_must_have_services
CHECK (
    ALL (SELECT id FROM slas) SATISFY
        EXISTS (
            SELECT 1 FROM sla_services
            WHERE sla_services.sla_id = slas.id
        )
)
DEFERRABLE INITIALLY DEFERRED;

		

This one uses DEFERRABLE INITIALLY DEFERRED because of the chicken-and-egg problem: the foreign key on sla_services requires the SLA to exist first, but this assertion requires services to exist when an SLA exists. By deferring validation to commit time, you can insert both the SLA and its services in a single transaction.

Testing It Out

Let’s load some data and see the assertions in action:

sql

			
-- Insert customers
INSERT INTO customers (name, company) 
VALUES ('Ahmad Hassan', 'TechCorp Jordan');
INSERT INTO customers (name, company) 
VALUES ('Sara Ali', 'DataFlow ME');
-- Insert engineers
INSERT INTO engineers (name, seniority, specialization)
VALUES ('Omar Khalid', 'senior', 'Database');
INSERT INTO engineers (name, seniority, specialization)
VALUES ('Lina Nasser', 'junior', 'Networking');
-- Insert SLA with services (in one transaction 
-- because of deferred assertion)
INSERT INTO slas (customer_id, sla_tier, start_date, end_date)
VALUES (1, 'gold', DATE '2025-01-01', DATE '2026-12-31');
INSERT INTO sla_services (sla_id, service_name)
VALUES (1, 'Database Support');
INSERT INTO sla_services (sla_id, service_name)
VALUES (1, '24/7 Monitoring');
COMMIT;  -- Assertion validates here: SLA has services, OK
-- This should succeed: customer has active SLA, 
-- senior engineer assigned
INSERT INTO tickets 
  (customer_id, engineer_id, priority, subject)
VALUES 
  (1, 1, 'critical', 'Production database performance issue');
COMMIT;

		

Now let’s try violating the rules:

sql

			
-- This should FAIL: assigning critical ticket 
-- to junior engineer
INSERT INTO tickets 
  (customer_id, engineer_id, priority, subject)
VALUES 
  (1, 2, 'critical', 'Server outage');
COMMIT;
-- ERROR: assertion CRITICAL_TICKETS_NEED_SENIOR_ENGINEER violated
-- This should FAIL: customer 2 has no SLA
INSERT INTO tickets 
  (customer_id, engineer_id, priority, subject)
VALUES 
  (2, 1, 'low', 'General question');
COMMIT;
-- ERROR: assertion TICKET_REQUIRES_ACTIVE_SLA violated

		

The database enforces the rules. Every time. Regardless of which application, API, or batch job is inserting the data.

Why This Matters

The traditional approach to these rules would involve:

Four or more BEFORE INSERT triggers across multiple tables
Careful handling of ORA-04091 mutating table errors (probably using compound triggers or package variables)
Testing every combination of insert/update/delete across all tables
Documentation that explains what each trigger does and how they interact
A maintenance burden that grows with every new business rule

With assertions, each rule is one statement. They live in the data dictionary alongside your other constraints. You can query USER_CONSTRAINTS to see them. They are self-documenting. And Oracle’s internal incremental checking mechanism ensures they perform well because the database only validates the data that actually changed, not the entire table.

Practical Notes

Grant the privilege. CREATE ASSERTION is not included in RESOURCE. Use GRANT DB_DEVELOPER_ROLE TO your_user; or grant it explicitly.

Assertions share the constraint namespace. You cannot have an assertion and a constraint with the same name in the same schema.

Cross-schema assertions need ASSERTION REFERENCES. If your assertion references tables in another schema, you need this object privilege on those tables, and you must use fully qualified table names (synonyms are not supported).

Start with ENABLE NOVALIDATE on existing systems. This lets you add an assertion without checking existing data, which is essential when adding rules to a database that might already contain violations.

Subqueries can nest up to three levels. For most business rules, this is more than enough.

Resources

CREATE ASSERTION documentation
Assertion concepts documentation
How to define cross-table constraints with assertions by Chris Saxon
Oracle AI Database Free container for local testing
FreeSQL for browser-based experimentation

Thank you

Osama

Building a Multi-Cloud Architecture with OCI and AWS: A Real-World Integration Guide

Posted on February 19, 2026February 19, 2026 by Osama Mustafa in Cloud

I’ll tell you something that might sound controversial in cloud circles: the best cloud is often more than one cloud.

I’ve worked with dozens of enterprises over the years, and here’s what I’ve noticed. Some started with AWS years ago and built their entire infrastructure there. Then they realized Oracle Autonomous Database or Exadata could dramatically improve their database performance. Others were Oracle shops that wanted to leverage AWS’s machine learning services or global edge network.

The question isn’t really “which cloud is better?” The question is “how do we get the best of both?”

In this article, I’ll walk you through building a practical multi-cloud architecture connecting OCI and AWS. We’ll cover secure networking, data synchronization, identity federation, and the operational realities of running workloads across both platforms.

Why Multi-Cloud Actually Makes Sense

Let me be clear about something. Multi-cloud for its own sake is a terrible idea. It adds complexity, increases operational burden, and creates more things that can break. But multi-cloud for the right reasons? That’s a different story.

Here are legitimate reasons I’ve seen organizations adopt OCI and AWS together:

Database Performance: Oracle Autonomous Database and Exadata Cloud Service are genuinely difficult to match for Oracle workloads. If you’re running complex OLTP or analytics on Oracle, OCI’s database offerings are purpose-built for that.

AWS Ecosystem: AWS has services that simply don’t exist elsewhere. SageMaker for ML, Lambda’s maturity, CloudFront’s global presence, or specialized services like Rekognition and Comprehend.

Vendor Negotiation: Having workloads on multiple clouds gives you negotiating leverage. I’ve seen organizations save millions in licensing by demonstrating they could move workloads.

Acquisition and Mergers: Company A runs on AWS, Company B runs on OCI. Now they’re one company. Multi-cloud by necessity.

Regulatory Requirements: Some industries require data sovereignty or specific compliance certifications that might be easier to achieve with a particular provider in a particular region.

If none of these apply to you, stick with one cloud. Seriously. But if they do, keep reading.

Architecture Overview

Let’s design a realistic scenario. We have an e-commerce company with:

Application tier running on AWS (EKS, Lambda, API Gateway)
Core transactional database on OCI (Autonomous Transaction Processing)
Data warehouse on OCI (Autonomous Data Warehouse)
Machine learning workloads on AWS (SageMaker)
Shared data that needs to flow between both clouds

Setting Up Cross-Cloud Networking

The foundation of any multi-cloud architecture is networking. You need a secure, reliable, and performant connection between clouds.

Option 1: IPSec VPN (Good for Starting Out)

IPSec VPN is the quickest way to connect AWS and OCI. It runs over the public internet but encrypts everything. Good for development, testing, or low-bandwidth production workloads.

On OCI Side:

First, create a Dynamic Routing Gateway (DRG) and attach it to your VCN:

bash

			
# Create DRG
oci network drg create \
  --compartment-id $COMPARTMENT_ID \
  --display-name "aws-interconnect-drg"
# Attach DRG to VCN
oci network drg-attachment create \
  --drg-id $DRG_ID \
  --vcn-id $VCN_ID \
  --display-name "vcn-attachment"

		

Create a Customer Premises Equipment (CPE) object representing AWS:

bash

			
# Create CPE for AWS VPN endpoint
oci network cpe create \
  --compartment-id $COMPARTMENT_ID \
  --ip-address $AWS_VPN_PUBLIC_IP \
  --display-name "aws-vpn-endpoint"

		

Create the IPSec connection:

bash

			
# Create IPSec connection
oci network ip-sec-connection create \
  --compartment-id $COMPARTMENT_ID \
  --cpe-id $CPE_ID \
  --drg-id $DRG_ID \
  --static-routes '["10.1.0.0/16"]' \
  --display-name "oci-to-aws-vpn"

		

On AWS Side:

Create a Customer Gateway pointing to OCI:

bash

			
# Create Customer Gateway
aws ec2 create-customer-gateway \
  --type ipsec.1 \
  --public-ip $OCI_VPN_PUBLIC_IP \
  --bgp-asn 65000
# Create VPN Gateway
aws ec2 create-vpn-gateway \
  --type ipsec.1
# Attach to VPC
aws ec2 attach-vpn-gateway \
  --vpn-gateway-id $VGW_ID \
  --vpc-id $VPC_ID
# Create VPN Connection
aws ec2 create-vpn-connection \
  --type ipsec.1 \
  --customer-gateway-id $CGW_ID \
  --vpn-gateway-id $VGW_ID \
  --options '{"StaticRoutesOnly": true}'

		

Update route tables on both sides:

bash

			
# AWS: Add route to OCI CIDR
aws ec2 create-route \
  --route-table-id $ROUTE_TABLE_ID \
  --destination-cidr-block 10.2.0.0/16 \
  --gateway-id $VGW_ID
# OCI: Add route to AWS CIDR
oci network route-table update \
  --rt-id $ROUTE_TABLE_ID \
  --route-rules '[{
    "destination": "10.1.0.0/16",
    "destinationType": "CIDR_BLOCK",
    "networkEntityId": "'$DRG_ID'"
  }]'

		

Option 2: Private Connectivity (Production Recommended)

For production workloads, you want dedicated private connectivity. This means OCI FastConnect paired with AWS Direct Connect, meeting at a common colocation facility.

The good news is that Oracle and AWS both have presence in major colocation providers like Equinix. The setup involves:

Establishing FastConnect to your colocation
Establishing Direct Connect to the same colocation
Connecting them via a cross-connect in the facility

hcl

			
# Terraform for FastConnect virtual circuit
resource "oci_core_virtual_circuit" "aws_interconnect" {
  compartment_id         = var.compartment_id
  display_name           = "aws-fastconnect"
  type                   = "PRIVATE"
  bandwidth_shape_name   = "1 Gbps"
  
  cross_connect_mappings {
    customer_bgp_peering_ip = "169.254.100.1/30"
    oracle_bgp_peering_ip   = "169.254.100.2/30"
  }
  
  customer_asn    = "65001"
  gateway_id      = oci_core_drg.main.id
  provider_name   = "Equinix"
  region          = "Dubai"
}

		

hcl

			
# Terraform for AWS Direct Connect
resource "aws_dx_connection" "oci_interconnect" {
  name            = "oci-direct-connect"
  bandwidth       = "1Gbps"
  location        = "Equinix DX1"
  provider_name   = "Equinix"
}
resource "aws_dx_private_virtual_interface" "oci" {
  connection_id    = aws_dx_connection.oci_interconnect.id
  name             = "oci-vif"
  vlan             = 4094
  address_family   = "ipv4"
  bgp_asn          = 65002
  amazon_address   = "169.254.100.5/30"
  customer_address = "169.254.100.6/30"
  dx_gateway_id    = aws_dx_gateway.main.id
}

		

Honestly, setting this up involves coordination with both cloud providers and the colocation facility. Budget 4-8 weeks for the physical connectivity and plan for redundancy from day one.

Database Connectivity from AWS to OCI

Now that we have network connectivity, let’s connect AWS applications to OCI databases.

Configuring Autonomous Database for External Access

First, enable private endpoint access for your Autonomous Database:

bash

			
# Update ADB to use private endpoint
oci db autonomous-database update \
  --autonomous-database-id $ADB_ID \
  --is-access-control-enabled true \
  --whitelisted-ips '["10.1.0.0/16"]' \  # AWS VPC CIDR
  --is-mtls-connection-required false     # Allow TLS without mTLS for simplicity

		

Get the connection string:

bash

			
oci db autonomous-database get \
  --autonomous-database-id $ADB_ID \
  --query 'data."connection-strings".profiles[?consumer=="LOW"].value | [0]'

Application Configuration on AWS

Here’s a practical Python example for connecting from AWS Lambda to OCI Autonomous Database:

python

			
# lambda_function.py
import cx_Oracle
import os
import boto3
from botocore.exceptions import ClientError
def get_db_credentials():
    """Retrieve database credentials from AWS Secrets Manager"""
    secret_name = "oci-adb-credentials"
    region_name = "us-east-1"
    
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )
    
    try:
        response = client.get_secret_value(SecretId=secret_name)
        return json.loads(response['SecretString'])
    except ClientError as e:
        raise e
def handler(event, context):
    # Get credentials
    creds = get_db_credentials()
    
    # Connection string format for Autonomous DB
    dsn = """(description= 
        (retry_count=20)(retry_delay=3)
        (address=(protocol=tcps)(port=1522)
        (host=adb.me-dubai-1.oraclecloud.com))
        (connect_data=(service_name=xxx_atp_low.adb.oraclecloud.com))
        (security=(ssl_server_dn_match=yes)))"""
    
    connection = cx_Oracle.connect(
        user=creds['username'],
        password=creds['password'],
        dsn=dsn,
        encoding="UTF-8"
    )
    
    cursor = connection.cursor()
    cursor.execute("SELECT * FROM orders WHERE order_date = TRUNC(SYSDATE)")
    
    results = []
    for row in cursor:
        results.append({
            'order_id': row[0],
            'customer_id': row[1],
            'amount': float(row[2])
        })
    
    cursor.close()
    connection.close()
    
    return {
        'statusCode': 200,
        'body': json.dumps(results)
    }

		

For containerized applications on EKS, use a connection pool:

python

			
# db_pool.py
import cx_Oracle
import os
class OCIDatabasePool:
    _pool = None
    
    @classmethod
    def get_pool(cls):
        if cls._pool is None:
            cls._pool = cx_Oracle.SessionPool(
                user=os.environ['OCI_DB_USER'],
                password=os.environ['OCI_DB_PASSWORD'],
                dsn=os.environ['OCI_DB_DSN'],
                min=2,
                max=10,
                increment=1,
                encoding="UTF-8",
                threaded=True,
                getmode=cx_Oracle.SPOOL_ATTRVAL_WAIT
            )
        return cls._pool
    
    @classmethod
    def get_connection(cls):
        return cls.get_pool().acquire()
    
    @classmethod
    def release_connection(cls, connection):
        cls.get_pool().release(connection)

		

Kubernetes deployment for the application:

yaml

			
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: 123456789.dkr.ecr.us-east-1.amazonaws.com/order-service:v1.0
        ports:
        - containerPort: 8080
        env:
        - name: OCI_DB_USER
          valueFrom:
            secretKeyRef:
              name: oci-db-credentials
              key: username
        - name: OCI_DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: oci-db-credentials
              key: password
        - name: OCI_DB_DSN
          valueFrom:
            configMapKeyRef:
              name: oci-db-config
              key: dsn
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

		

Data Synchronization Between Clouds

Real multi-cloud architectures need data flowing between clouds. Here are practical patterns:

Pattern 1: Event-Driven Sync with Kafka

Use a managed Kafka service as the bridge:

python

			
# AWS Lambda producer - sends events to Kafka
from kafka import KafkaProducer
import json
producer = KafkaProducer(
    bootstrap_servers=['kafka-broker-1:9092', 'kafka-broker-2:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    security_protocol='SASL_SSL',
    sasl_mechanism='PLAIN',
    sasl_plain_username=os.environ['KAFKA_USER'],
    sasl_plain_password=os.environ['KAFKA_PASSWORD']
)
def handler(event, context):
    # Process order and send to Kafka for OCI consumption
    order_data = process_order(event)
    
    producer.send(
        'orders-topic',
        key=str(order_data['order_id']).encode(),
        value=order_data
    )
    producer.flush()
    
    return {'statusCode': 200}

		

OCI side consumer using OCI Functions:

python

			
# OCI Function consumer
import io
import json
import logging
import cx_Oracle
from kafka import KafkaConsumer
def handler(ctx, data: io.BytesIO = None):
    consumer = KafkaConsumer(
        'orders-topic',
        bootstrap_servers=['kafka-broker-1:9092'],
        auto_offset_reset='earliest',
        enable_auto_commit=True,
        group_id='oci-order-processor',
        value_deserializer=lambda x: json.loads(x.decode('utf-8'))
    )
    
    connection = get_adb_connection()
    cursor = connection.cursor()
    
    for message in consumer:
        order = message.value
        
        cursor.execute("""
            MERGE INTO orders o
            USING (SELECT :order_id AS order_id FROM dual) src
            ON (o.order_id = src.order_id)
            WHEN MATCHED THEN
                UPDATE SET amount = :amount, status = :status, updated_at = SYSDATE
            WHEN NOT MATCHED THEN
                INSERT (order_id, customer_id, amount, status, created_at)
                VALUES (:order_id, :customer_id, :amount, :status, SYSDATE)
        """, order)
        
        connection.commit()
    
    cursor.close()
    connection.close()

		

Pattern 2: Scheduled Batch Sync

For less time-sensitive data, batch synchronization is simpler and more cost-effective:

python

			
# AWS Step Functions state machine for batch sync
{
  "Comment": "Sync data from AWS to OCI",
  "StartAt": "ExtractFromAWS",
  "States": {
    "ExtractFromAWS": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:extract-data",
      "Next": "UploadToS3"
    },
    "UploadToS3": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:upload-to-s3",
      "Next": "CopyToOCI"
    },
    "CopyToOCI": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:copy-to-oci-bucket",
      "Next": "LoadToADB"
    },
    "LoadToADB": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:load-to-adb",
      "End": true
    }
  }
}

		

The Lambda function to copy data to OCI Object Storage:

python

			
# copy_to_oci.py
import boto3
import oci
import os
def handler(event, context):
    # Get file from S3
    s3 = boto3.client('s3')
    s3_object = s3.get_object(
        Bucket=event['bucket'],
        Key=event['key']
    )
    file_content = s3_object['Body'].read()
    
    # Upload to OCI Object Storage
    config = oci.config.from_file()
    object_storage = oci.object_storage.ObjectStorageClient(config)
    
    namespace = object_storage.get_namespace().data
    
    object_storage.put_object(
        namespace_name=namespace,
        bucket_name="data-sync-bucket",
        object_name=event['key'],
        put_object_body=file_content
    )
    
    return {
        'oci_bucket': 'data-sync-bucket',
        'object_name': event['key']
    }

		

Load into Autonomous Database using DBMS_CLOUD:

sql

			
-- Create credential for OCI Object Storage access
BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'OCI_CRED',
    username        => 'your_oci_username',
    password        => 'your_auth_token'
  );
END;
/
-- Load data from Object Storage
BEGIN
  DBMS_CLOUD.COPY_DATA(
    table_name      => 'ORDERS_STAGING',
    credential_name => 'OCI_CRED',
    file_uri_list   => 'https://objectstorage.me-dubai-1.oraclecloud.com/n/namespace/b/data-sync-bucket/o/orders_*.csv',
    format          => JSON_OBJECT(
      'type' VALUE 'CSV',
      'skipheaders' VALUE '1',
      'dateformat' VALUE 'YYYY-MM-DD'
    )
  );
END;
/
-- Merge staging into production
MERGE INTO orders o
USING orders_staging s
ON (o.order_id = s.order_id)
WHEN MATCHED THEN
  UPDATE SET o.amount = s.amount, o.status = s.status
WHEN NOT MATCHED THEN
  INSERT (order_id, customer_id, amount, status)
  VALUES (s.order_id, s.customer_id, s.amount, s.status);

		

Identity Federation

Managing identities across clouds is a headache unless you set up proper federation. Here’s how to enable SSO between AWS and OCI using a common identity provider.

Using Azure AD as Common IdP (Yes, a Third Cloud)

This is actually quite common. Many enterprises use Azure AD for identity even if their workloads run elsewhere.

Configure OCI to Trust Azure AD:

bash

			
# Create Identity Provider in OCI
oci iam identity-provider create-saml2-identity-provider \
  --compartment-id $TENANCY_ID \
  --name "AzureAD-Federation" \
  --description "Federation with Azure AD" \
  --product-type "IDCS" \
  --metadata-url "https://login.microsoftonline.com/$TENANT_ID/federationmetadata/2007-06/federationmetadata.xml"

		

Configure AWS to Trust Azure AD:

bash

			
# Create SAML provider in AWS
aws iam create-saml-provider \
  --saml-metadata-document file://azure-ad-metadata.xml \
  --name AzureAD-Federation
# Create role for federated users
aws iam create-role \
  --role-name AzureAD-Admins \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Federated": "arn:aws:iam::123456789:saml-provider/AzureAD-Federation"},
      "Action": "sts:AssumeRoleWithSAML",
      "Condition": {
        "StringEquals": {
          "SAML:aud": "https://signin.aws.amazon.com/saml"
        }
      }
    }]
  }'

		

Now your team can use the same Azure AD credentials to access both clouds.

Monitoring Across Clouds

You need unified observability. Here’s a practical approach using Grafana as the common dashboard:

yaml

			
# docker-compose.yml for centralized Grafana
version: '3.8'
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=secure_password
      - GF_INSTALL_PLUGINS=oci-metrics-datasource
volumes:
  grafana-data:

		

Configure data sources:

yaml

			
# provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: AWS-CloudWatch
    type: cloudwatch
    access: proxy
    jsonData:
      authType: keys
      defaultRegion: us-east-1
    secureJsonData:
      accessKey: ${AWS_ACCESS_KEY}
      secretKey: ${AWS_SECRET_KEY}
  
  - name: OCI-Monitoring
    type: oci-metrics-datasource
    access: proxy
    jsonData:
      tenancyOCID: ${OCI_TENANCY_OCID}
      userOCID: ${OCI_USER_OCID}
      region: me-dubai-1
    secureJsonData:
      privateKey: ${OCI_PRIVATE_KEY}

		

Create a unified dashboard that shows both clouds:

json

			
{
  "title": "Multi-Cloud Overview",
  "panels": [
    {
      "title": "AWS EKS CPU Utilization",
      "datasource": "AWS-CloudWatch",
      "targets": [{
        "namespace": "AWS/EKS",
        "metricName": "node_cpu_utilization",
        "dimensions": {"ClusterName": "production"}
      }]
    },
    {
      "title": "OCI Autonomous DB Sessions",
      "datasource": "OCI-Monitoring",
      "targets": [{
        "namespace": "oci_autonomous_database",
        "metric": "CurrentOpenSessionCount",
        "resourceGroup": "production-adb"
      }]
    },
    {
      "title": "Cross-Cloud Latency",
      "datasource": "Prometheus",
      "targets": [{
        "expr": "histogram_quantile(0.95, rate(cross_cloud_request_duration_seconds_bucket[5m]))"
      }]
    }
  ]
}

		

Cost Management

Multi-cloud cost visibility is challenging. Here’s a practical approach:

python

			
# cost_aggregator.py
import boto3
import oci
from datetime import datetime, timedelta
def get_aws_costs(start_date, end_date):
    client = boto3.client('ce')
    response = client.get_cost_and_usage(
        TimePeriod={
            'Start': start_date.strftime('%Y-%m-%d'),
            'End': end_date.strftime('%Y-%m-%d')
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )
    return response['ResultsByTime']
def get_oci_costs(start_date, end_date):
    config = oci.config.from_file()
    usage_api = oci.usage_api.UsageapiClient(config)
    
    response = usage_api.request_summarized_usages(
        request_summarized_usages_details=oci.usage_api.models.RequestSummarizedUsagesDetails(
            tenant_id=config['tenancy'],
            time_usage_started=start_date,
            time_usage_ended=end_date,
            granularity="DAILY",
            group_by=["service"]
        )
    )
    return response.data.items
def generate_report():
    end_date = datetime.now()
    start_date = end_date - timedelta(days=30)
    
    aws_costs = get_aws_costs(start_date, end_date)
    oci_costs = get_oci_costs(start_date, end_date)
    
    total_aws = sum(float(day['Total']['UnblendedCost']['Amount']) for day in aws_costs)
    total_oci = sum(item.computed_amount for item in oci_costs)
    
    print(f"30-Day Multi-Cloud Cost Summary")
    print(f"{'='*40}")
    print(f"AWS Total: ${total_aws:,.2f}")
    print(f"OCI Total: ${total_oci:,.2f}")
    print(f"Combined Total: ${total_aws + total_oci:,.2f}")

		

Lessons Learned

After running multi-cloud architectures for several years, here’s what I’ve learned:

Network is everything. Invest in proper connectivity upfront. The $500/month you save on VPN versus dedicated connectivity will cost you thousands in debugging performance issues.

Pick one cloud for each workload type. Don’t run the same thing in both clouds. Use OCI for Oracle databases, AWS for its unique services. Avoid the temptation to replicate everything everywhere.

Standardize your tooling. Terraform works on both clouds. Use it. Same for monitoring, logging, and CI/CD. The more consistent your tooling, the less your team has to context-switch.

Document your data flows. Know exactly what data goes where and why. This will save you during security audits and incident response.

Test cross-cloud failures. What happens when the VPN goes down? Can your application degrade gracefully? Find out before your customers do.

Conclusion

Multi-cloud between OCI and AWS isn’t simple, but it’s absolutely achievable. The key is having clear reasons for using each cloud, solid networking fundamentals, and consistent operational practices.

Start small. Connect one application to one database across clouds. Get that working reliably before expanding. Build your team’s confidence and expertise incrementally.

The organizations that succeed with multi-cloud are the ones that treat it as an architectural choice, not a checkbox. They know exactly why they need both clouds and have designed their systems accordingly.

Regards,
Osama

Building a Real-Time Data Enrichment & Inference Pipeline on AWS Using Kinesis, Lambda, DynamoDB, and SageMaker

Posted on November 25, 2025 by Osama Mustafa in AWS, Cloud

Modern cloud applications increasingly depend on real-time processing, especially when dealing with fraud detection, personalization, IoT telemetry, or operational monitoring.
In this post, we’ll build a fully functional AWS pipeline that:

Streams events using Amazon Kinesis
Enriches and transforms them via AWS Lambda
Stores real-time feature data in Amazon DynamoDB
Performs machine-learning inference using a SageMaker Endpoint

1. Architecture Overview

2. Step-By-Step Pipeline Build

2.1. Create a Kinesis Data Stream

aws kinesis create-stream \
  --stream-name RealtimeEvents \
  --shard-count 2 \
  --region us-east-1

This stream will accept incoming events from your apps, IoT devices, or microservices.

2.2. DynamoDB Table for Real-Time Features

aws dynamodb create-table \
  --table-name UserFeatureStore \
  --attribute-definitions AttributeName=userId,AttributeType=S \
  --key-schema AttributeName=userId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

This table holds live user features, updated every time an event arrives.

2.3. Lambda Function (Real-Time Data Enrichment)

This Lambda:

Reads events from Kinesis
Computes simple features (e.g., last event time, rolling count)
Saves enriched data to DynamoDB

import json
import boto3
from datetime import datetime, timedelta

ddb = boto3.resource("dynamodb")
table = ddb.Table("UserFeatureStore")

def lambda_handler(event, context):

    for record in event["Records"]:
        payload = json.loads(record["kinesis"]["data"])

        user = payload["userId"]
        metric = payload["metric"]
        ts = datetime.fromisoformat(payload["timestamp"])

        # Fetch old features
        old = table.get_item(Key={"userId": user}).get("Item", {})

        last_ts = old.get("lastTimestamp")
        count = old.get("count", 0)

        # Update rolling 5-minute count
        if last_ts:
            prev_ts = datetime.fromisoformat(last_ts)
            if ts - prev_ts < timedelta(minutes=5):
                count += 1
            else:
                count = 1
        else:
            count = 1

        # Save new enriched features
        table.put_item(Item={
            "userId": user,
            "lastTimestamp": ts.isoformat(),
            "count": count,
            "lastMetric": metric
        })

    return {"status": "ok"}

Attach the Lambda to the Kinesis stream.

2.4. Creating a SageMaker Endpoint for Inference

Train your model offline, then deploy it:

aws sagemaker create-endpoint-config \
  --endpoint-config-name RealtimeInferenceConfig \
  --production-variants VariantName=AllInOne,ModelName=MyInferenceModel,InitialInstanceCount=1,InstanceType=ml.m5.large

aws sagemaker create-endpoint \
  --endpoint-name RealtimeInference \
  --endpoint-config-name RealtimeInferenceConfig

2.5. API Layer Performing Live Inference

Your application now requests predictions like this:

import boto3
import json

runtime = boto3.client("sagemaker-runtime")
ddb = boto3.resource("dynamodb").Table("UserFeatureStore")

def predict(user_id, extra_input):

    user_features = ddb.get_item(Key={"userId": user_id}).get("Item")

    payload = {
        "userId": user_id,
        "features": user_features,
        "input": extra_input
    }

    response = runtime.invoke_endpoint(
        EndpointName="RealtimeInference",
        ContentType="application/json",
        Body=json.dumps(payload)
    )

    return json.loads(response["Body"].read())

This combines live enriched features + model inference for maximum accuracy.

3. Production Considerations

Performance

Enable Lambda concurrency
Use DynamoDB DAX caching
Use Kinesis Enhanced Fan-Out for high throughput

Security

Use IAM roles with least privilege
Encrypt Kinesis, Lambda, DynamoDB, and SageMaker with KMS

Monitoring

CloudWatch Metrics
CloudWatch Logs Insights queries
DynamoDB capacity alarms
SageMaker Model error monitoring

Cost Optimization

Use PAY_PER_REQUEST DynamoDB
Use Lambda Power Tuning
Scale SageMaker endpoints with autoscaling

Implementing a Real-Time Anomaly Detection Pipeline on OCI Using Streaming Data, Oracle Autonomous Database & ML

Posted on November 22, 2025November 22, 2025 by Osama Mustafa in Cloud, OCI

Detecting unusual patterns in real time is critical to preventing outages, catching fraud, ensuring SLA compliance, and maintaining high-quality user experiences.
In this post, we build a real working pipeline on OCI that:

Ingests streaming data
Computes features in near-real time
Stores results in Autonomous Database
Runs anomaly detection logic
Sends alerts and exposes dashboards

This guide contains every technical step, including:
Streaming → Function → Autonomous DB → Anomaly Logic → Notifications → Dashboards

1. Architecture Overview

Components Used

OCI Streaming
OCI Functions
Oracle Autonomous Database
DBMS_SCHEDULER for anomaly detection job
OCI Notifications
Oracle Analytics Cloud / Grafana

2. Step-by-Step Implementation

2.1 Create OCI Streaming Stream

oci streaming stream create \
  --compartment-id $COMPARTMENT_OCID \
  --display-name "anomaly-events-stream" \
  --partitions 3

2.2 Autonomous Database Table

CREATE TABLE raw_events (
  event_id       VARCHAR2(50),
  event_time     TIMESTAMP,
  metric_value   NUMBER,
  feature1       NUMBER,
  feature2       NUMBER,
  processed_flag CHAR(1) DEFAULT 'N',
  anomaly_flag   CHAR(1) DEFAULT 'N',
  CONSTRAINT pk_raw_events PRIMARY KEY(event_id)
);

2.3 OCI Function – Feature Extraction

func.py:

import oci
import cx_Oracle
import json
from datetime import datetime

def handler(ctx, data: bytes=None):
    event = json.loads(data.decode('utf-8'))

    evt_id = event['id']
    evt_time = datetime.fromisoformat(event['time'])
    value = event['metric']

    # DB Connection
    conn = cx_Oracle.connect(user='USER', password='PWD', dsn='dsn')
    cur = conn.cursor()

    # Fetch previous value if exists
    cur.execute("SELECT metric_value FROM raw_events WHERE event_id=:1", (evt_id,))
    prev = cur.fetchone()
    prev_val = prev[0] if prev else 1.0

    # Compute features
    feature1 = value - prev_val
    feature2 = value / prev_val

    # Insert new event
    cur.execute("""
        INSERT INTO raw_events(event_id, event_time, metric_value, feature1, feature2)
        VALUES(:1, :2, :3, :4, :5)
    """, (evt_id, evt_time, value, feature1, feature2))

    conn.commit()
    cur.close()
    conn.close()

    return "ok"

Deploy the function and attach the streaming trigger.

2.4 Anomaly Detection Job (DBMS_SCHEDULER)

BEGIN
  FOR rec IN (
    SELECT event_id, feature1
    FROM raw_events
    WHERE processed_flag = 'N'
  ) LOOP
    DECLARE
      meanv NUMBER;
      stdv  NUMBER;
      zscore NUMBER;
    BEGIN
      SELECT AVG(feature1), STDDEV(feature1) INTO meanv, stdv FROM raw_events;

      zscore := (rec.feature1 - meanv) / NULLIF(stdv, 0);

      IF ABS(zscore) > 3 THEN
        UPDATE raw_events SET anomaly_flag='Y' WHERE event_id=rec.event_id;
      END IF;

      UPDATE raw_events SET processed_flag='Y' WHERE event_id=rec.event_id;
    END;
  END LOOP;
END;

Schedule this to run every 2 minutes:

BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
    job_name        => 'ANOMALY_JOB',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN anomaly_detection_proc; END;',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=2;',
    enabled         => TRUE
  );
END;

2.5 Notifications

oci ons topic create \
  --compartment-id $COMPARTMENT_OCID \
  --name "AnomalyAlerts"

In the DB, add a trigger:

CREATE OR REPLACE TRIGGER notify_anomaly
AFTER UPDATE ON raw_events
FOR EACH ROW
WHEN (NEW.anomaly_flag='Y' AND OLD.anomaly_flag='N')
BEGIN
  DBMS_OUTPUT.PUT_LINE('Anomaly detected for event ' || :NEW.event_id);
END;
/

2.6 Dashboarding

You may use:

Oracle Analytics Cloud (OAC)
Grafana + ADW Integration
Any BI tool with SQL

Example Query:

SELECT event_time, metric_value, anomaly_flag 
FROM raw_events
ORDER BY event_time;

2. Terraform + OCI CLI Script Bundle

Terraform – Streaming + Function + Policies

resource "oci_streaming_stream" "anomaly" {
  name           = "anomaly-events-stream"
  partitions     = 3
  compartment_id = var.compartment_id
}

resource "oci_functions_application" "anomaly_app" {
  compartment_id = var.compartment_id
  display_name   = "anomaly-function-app"
  subnet_ids     = var.subnets
}

Terraform Notification Topic

resource "oci_ons_notification_topic" "anomaly" {
  compartment_id = var.compartment_id
  name           = "AnomalyAlerts"
}

CLI Insert Test Events

oci streaming stream message put \
  --stream-id $STREAM_OCID \
  --messages '[{"key":"1","value":"{\"id\":\"1\",\"time\":\"2025-01-01T10:00:00\",\"metric\":58}"}]'

Deploying Real-Time Feature Store on Amazon SageMaker Feature Store with Amazon Kinesis Data Streams & Amazon DynamoDB for Low-Latency ML Inference

Posted on November 10, 2025 by Osama Mustafa in AWS, Cloud

Modern ML inference often depends on up-to-date features (customer behaviour, session counts, recent events) that need to be available in low-latency operations. In this article you’ll learn how to build a real-time feature store on AWS using:

Amazon Kinesis Data Streams for streaming events
AWS Lambda for processing and feature computation
Amazon DynamoDB (or SageMaker Feature Store) for storage of feature vectors
Amazon SageMaker Endpoint for low-latency inference
You’ll see end-to-end code snippets and architecture guidance so you can implement this in your environment.

1. Architecture Overview

The pipeline works like this:

Front-end/app produces events (e.g., user click, transaction) → published to Kinesis.
A Lambda function consumes from Kinesis, computes derived features (for example: rolling window counts, recency, session features).
The Lambda writes/updates these features into a DynamoDB table (or directly into SageMaker Feature Store).
When a request arrives for inference, the application fetches the current feature set from DynamoDB (or Feature Store) and calls a SageMaker endpoint.
Optionally, after inference you can stream feedback events for model refinement.

This architecture provides real-time feature freshness and low-latencyinference.

2. Setup & Implementation

2.1 Create the Kinesis data stream

aws kinesis create-stream \
  --stream-name UserEventsStream \
  --shard-count 2 \
  --region us-east-1

2.2 Create DynamoDB table for features

aws dynamodb create-table \
  --table-name RealTimeFeatures \
  --attribute-definitions AttributeName=userId,AttributeType=S \
  --key-schema AttributeName=userId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1

2.3 Lambda function to compute features

Here is a Python snippet (using boto3) which will be triggered by Kinesis:

import json
import boto3
from datetime import datetime, timedelta

dynamo = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamo.Table('RealTimeFeatures')

def lambda_handler(event, context):
    for record in event['Records']:
        payload = json.loads(record['kinesis']['data'])
        user_id = payload['userId']
        event_type = payload['eventType']
        ts = datetime.fromisoformat(payload['timestamp'])

        # Fetch current features
        resp = table.get_item(Key={'userId': user_id})
        item = resp.get('Item', {})
        
        # Derive features: e.g., event_count_last_5min, last_event_type
        last_update = item.get('lastUpdate', ts.isoformat())
        count_5min = item.get('count5min', 0)
        then = datetime.fromisoformat(last_update)
        if ts - then < timedelta(minutes=5):
            count_5min += 1
        else:
            count_5min = 1
        
        # Update feature item
        new_item = {
            'userId': user_id,
            'lastEventType': event_type,
            'count5min': count_5min,
            'lastUpdate': ts.isoformat()
        }
        table.put_item(Item=new_item)
    return {'statusCode': 200}

2.4 Deploy and connect Lambda to Kinesis

Create Lambda function in AWS console or via CLI.
Add Kinesis stream UserEventsStream as event source with batch size and start position = TRIM_HORIZON.
Assign IAM role allowing kinesis:DescribeStream, kinesis:GetRecords, dynamodb:PutItem, etc.

2.5 Prepare SageMaker endpoint for inference

Train model offline (outside scope here) with features stored in training dataset matching real-time features.
Deploy model as endpoint, e.g., arn:aws:sagemaker:us-east-1:123456789012:endpoint/RealtimeModel.
In your application code call endpoint by fetching features from DynamoDB then invoking endpoint:

import boto3
sagemaker = boto3.client('sagemaker-runtime', region_name='us-east-1')
dynamo = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamo.Table('RealTimeFeatures')

def get_prediction(user_id, input_payload):
    resp = table.get_item(Key={'userId': user_id})
    features = resp.get('Item')
    payload = {
        'features': features,
        'input': input_payload
    }
    response = sagemaker.invoke_endpoint(
        EndpointName='RealtimeModel',
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    result = json.loads(response['Body'].read().decode())
    return result

Conclusion

In this blog post you learned how to build a real-time feature store on AWS: streaming event ingestion with Kinesis, real-time feature computation with Lambda, storage in DynamoDB, and serving via SageMaker. You got specific code examples and operational considerations for production readiness. With this setup, you’re well-positioned to deliver low-latency, ML-powered applications.

Enjoy the cloud
Osama

Automating Cost-Governance Workflows in Oracle Cloud Infrastructure (OCI) with APIs & Infrastructure as Code

Posted on October 24, 2025 by Osama Mustafa in Cloud, OCI

Introduction

Cloud cost management isn’t just about checking invoices once a month — it’s about embedding automation, governance, and insights into your infrastructure so that your engineering teams make cost-aware decisions in real time. With OCI, you have native tools (Cost Analysis, Usage APIs, Budgets, etc.) and infrastructure-as-code (IaC) tooling that can help turn cost governance from an after-thought into a proactive part of your DevOps workflow.

In this article you’ll learn how to:

Extract usage and cost data via the OCI Usage API / Cost Reports.
Define IaC workflows (e.g., with Terraform) that enforce budget/usage guardrails.
Build a simple example where you automatically tag resources, monitor spend by tag, and alert/correct when thresholds are exceeded.
Discuss best practices, pitfalls, and governance recommendations for embedding FinOps into OCI operations.

1. Understanding OCI Cost & Usage Data

What data is available?

OCI provides several cost/usage-data mechanisms:

The Cost Analysis tool in the console allows you to view trends by service, compartment, tag, etc. Oracle Docs+1
The Usage/Cost Reports (CSV format) which you can download or programmatically access via the Usage API. Oracle Docs+1
The Usage API (CLI/SDK) to query usage-and-cost programmatically. Oracle Docs+1

Why this matters

By surfacing cost data at a resource, compartment, or tag level, teams can answer questions like:

“Which tag values are consuming cost disproportionately?”
“Which compartments have heavy spend growth month-over-month?”
“Which services (Compute, Storage, Database, etc.) are the highest spenders and require optimization?”

Example: Downloading a cost report via CLI

Here’s a Python/CLI snippet that shows how to download a cost-report CSV from your tenancy:

oci os object get \
  --namespace-name bling \
  --bucket-name <your-tenancy-OCID> \
  --name reports/usage-csv/<report_name>.csv.gz \
  --file local_report.csv.gz

import oci
config = oci.config.from_file("~/.oci/config", "DEFAULT")
os_client = oci.object_storage.ObjectStorageClient(config)
namespace = "bling"
bucket = "<your-tenancy-OCID>"
object_name = "reports/usage-csv/2025-10-19-report-00001.csv.gz"

resp = os_client.get_object(namespace, bucket, object_name)
with open("report-2025-10-19.csv.gz", "wb") as f:
    for chunk in resp.data.raw.stream(1024*1024, decode_content=False):
        f.write(chunk)

2. Defining Cost-Governance Workflows with IaC

Once you have data flowing in, you can enforce guardrails and automate actions. Here’s one example pattern.

a) Enforce tagging rules

Ensure that every resource created in a compartment has a cost_center tag (for example). You can do this via policy + IaC.

# Example Terraform policy for tagging requirement
resource "oci_identity_tag_namespace" "governance" {
  compartment_id = var.compartment_id
  display_name   = "governance_tags"
  is_retired     = false
}

resource "oci_identity_tag_definition" "cost_center" {
  compartment_id = var.compartment_id
  tag_namespace_id = oci_identity_tag_namespace.governance.id
  name            = "cost_center"
  description     = "Cost Center code for FinOps tracking"
  is_retired      = false
}

You can then add an IAM policy that prevents creation of resources if the tag isn’t applied (or fails to meet allowed values). For example:

Allow group ComputeAdmins to manage instance-family in compartment Prod
  where request.operation = “CreateInstance”
  and request.resource.tag.cost_center is not null

b) Monitor vs budget

Use the Usage API or Cost Reports to pull monthly spend per tag, then compare against defined budgets. If thresholds are exceeded, trigger an alert or remediation.

Here’s an example Python pseudo-code:

from datetime import datetime, timedelta
import oci

config = oci.config.from_file()
usage_client = oci.usage_api.UsageapiClient(config)

today = datetime.utcnow()
start = today.replace(day=1)
end = today

req = oci.usage_api.models.RequestSummarizedUsagesDetails(
    tenant_id = config["tenancy"],
    time_usage_started = start,
    time_usage_ended   = end,
    granularity        = "DAILY",
    group_by           = ["tag.cost_center"]
)

resp = usage_client.request_summarized_usages(req)
for item in resp.data.items:
    tag_value = item.tag_map.get("cost_center", "untagged")
    cost     = float(item.computed_amount or 0)
    print(f"Cost for cost_center={tag_value}: {cost}")

    if cost > budget_for(tag_value):
        send_alert(tag_value, cost)
        take_remediation(tag_value)

c) Automated remediation

Remediation could mean:

Auto-shut down non-production instances in compartments after hours.
Resize or terminate idle resources.
Notify owners of over-spend via email/Slack.

Terraform, OCI Functions and Event-Service can help orchestrate that. For example, set up an Event when “cost by compartment exceeds X” → invoke Function → tag resources with “cost_alerted” → optional shutdown.

3. Putting It All Together

Here is a step-by-step scenario:

Define budget categories – e.g., cost_center codes: CC-101, CC-202, CC-303.
Tag resources on creation – via policy/IaC ensure all resources include cost_center tag with one of those codes.
Collect cost data – using Usage API daily, group by tag.cost_center.
Evaluate current spend vs budget – for each code, compare cumulative cost for current month against budget.
If over budget – then:
- send an alert to the team (via SNS, email, Slack)
- optionally trigger remediation: e.g., stop non-critical compute in that cost center’s compartments.
Dashboard & visibility – load cost data into a BI tool (could be OCI Analytics Cloud or Oracle Analytics) with trends, forecasts, anomaly detection. Use the “Show cost” in OCI Ops Insights to view usage & forecast cost. Oracle Docs
Continuous improvement – right-size instances, pause dev/test at night, switch to cheaper shapes or reserved/commit models (depending on your discount model). See OCI best practice guide for optimizing cost. Oracle Docs

Example snippet – alerting logic in CLI

# example command to get summarized usage for last 7 days
oci usage-api request-summarized-usages \
  --tenant-id $TENANCY_OCID \
  --time-usage-started $(date -u -d '-7 days' +%Y-%m-%dT00:00:00Z) \
  --time-usage-ended   $(date -u +%Y-%m-%dT00:00:00Z) \
  --granularity DAILY \
  --group-by "tag.cost_center" \
  --query "data.items[?tagMap.cost_center=='CC-101'].computedAmount" \
  --raw-output

Enjoy the OCI
Osama

Building a Real-Time Recommendation Engine on Oracle Cloud Infrastructure (OCI) Using Generative AI & Streaming

Posted on October 22, 2025 by Osama Mustafa in Cloud, OCI

Introduction

In many modern applications — e-commerce, media platforms, SaaS services — providing real-time personalized recommendations is a key differentiator. With OCI’s streaming, AI/ML and serverless capabilities you can build a recommendation engine that:

Ingests user events (clicks, views, purchases) in real time
Applies a generative-AI model (or fine-tuned model) to generate suggestions
Stores, serves, and updates recommendations frequently
Enables feedback loop to refine model based on real usage

In this article you’ll learn how to:

Set up a streaming pipeline using OCI Streaming Service to ingest user events.
Use OCI Data Science or OCI AI Services + a generative model (e.g., GPT-style) to produce recommendation outputs.
Build a serving layer to deliver recommendations (via OCI Functions + API Gateway).
Create the feedback loop — capturing user interactions, updating model or embeddings, automating retraining.
Walk through code snippets, architectural decisions, best practices and pitfalls.

1. Architecture Overview

Here’s a high-level architecture for our recommendation engine:

Event Ingestion: User activities → publish to OCI Streaming (Kafka-compatible)
Processing Layer: A consumer application (OCI Functions or Data Flow) reads events, preprocesses, enriches with user/profile/context data (from Autonomous DB or NoSQL).
Model Layer: A generative model (e.g., fine-tuned GPT or embedding-based recommender) inside OCI Data Science. It takes context + user history → produces N recommendations.
Serving Layer: OCI API Gateway + OCI Functions deliver recommendations to front-end or mobile apps.
Feedback Loop: User clicks or ignores recommendations → events fed back into streaming topic → periodic retraining/refinement of model or embedding space.
Storage / Feature Store: Use Autonomous NoSQL DB or Autonomous Database for storing user profiles, item embeddings, transaction history.

2. Setting Up Streaming Ingestion

Create an OCI Streaming topic

oci streaming stream create \
  --compartment-id $COMPARTMENT_OCID \
  --display-name "user-event-stream" \
  --partitions 4

Produce events (example with Python)

import oci
from oci.streaming import StreamClient
from oci.streaming.models import PutMessagesDetails, Message

config = oci.config.from_file()
stream_client = StreamClient(config)
stream_id = "<your_stream_OCID>"

def send_event(user_id, item_id, event_type, timestamp):
    msg = Message(value=f"{user_id},{item_id},{event_type},{timestamp}")
    resp = stream_client.put_messages(
        put_messages_details=PutMessagesDetails(
            stream_id=stream_id,
            messages=[msg]
        )
    )
    return resp

# Example
send_event("U123", "I456", "view", "2025-10-19T10:15:00Z")

3. Model Layer: Generative/Embedding-Based Recommendations

Option A: Embedding + similarity lookup

We pre-compute embeddings for users and items (e.g., using a transformer or collaborative model) and store them in a vector database (or NoSQL). When a new event arrives, we update the user embedding (incrementally) and compute top-K similar items.

Option B: Fine-tuned generative model

We fine-tune a GPT-style model on historical user → recommendation sequences so that given “User U123 last 5 items: I234, I456, I890… context: browsing category Sports” we get suggestions like “I333, I777, I222”.

Example snippet using OCI Data Science and Python

import oci
# assume model endpoint is deployed
from some_sdk import RecommendationModelClient  

config = oci.config.from_file()
model_client = RecommendationModelClient(config)
endpoint = "<model_endpoint_url>"

def get_recommendations(user_id, recent_items, context, top_k=5):
    prompt = f"""User: {user_id}
RecentItems: {','.join(recent_items)}
Context: {context}
Provide {top_k} item IDs with reasons:"""
    response = model_client.predict(endpoint, prompt)
    recommended = response['recommendations']
    return recommended

# example
recs = get_recommendations("U123", ["I234","I456","I890"], "Looking for running shoes", 5)
print(recs)

Model deployment

Train/fine-tune in OCI Data Science environment
Deploy as a real-time endpoint (OCI Data Science Model Deployment)
Or optionally use OCI Functions for low-latency, light-weight inference

4. Serving Layer & Feedback Loop

Serving via API Gateway + Functions

Create an OCI Function getRecommendations that takes user_id & context and returns recommendations by calling the model endpoint or embedding lookup
Expose via OCI API Gateway for external apps

Feedback capture

After the user sees recommendations and either clicks, ignores or purchases, capture that as event rec_click, rec_ignore, purchase and publish it back to the streaming topic
Use this feedback to:
- Incrementally update user embedding
- Record reinforcement signal for later batch retraining

Scheduled retraining / embedding update

Use OCI Data Science scheduled jobs or Data Flow to run nightly or weekly batch jobs: aggregate events, update embeddings, fine-tune model
Example pseudo-code:

from datetime import datetime, timedelta
import pandas as pd
# fetch events last 7 days
events = load_events(start=datetime.utcnow()-timedelta(days=7))
# update embeddings, retrain model

Conclusion

Building a real-time recommendation engine on OCI, combining streaming ingestion, generative AI or embedding-based models, and serverless serving, enables you to deliver personalized experiences at scale. By capturing user behaviour in real time, serving timely recommendations, and closing the feedback loop, you shift from static “top N” lists to dynamic, context-aware suggestions. With careful architecture, you can deliver high performance, relevance, and scalability.

Power of the OCI AI
Enjoy
Osama