Author: Osama Mustafa

Osama is considered one of the leaders in cloud technology, DevOps, and databases in the Middle East. I have more than ten years of experience in the industry and hold 4x AWS, 4x Azure, and 6x OCI certifications, along with database certifications from multiple providers. In addition to my experience with Oracle Database and Oracle products such as middleware, OID, OAM, and OIM, I have gained substantial knowledge of other databases. Currently, I architect and implement cloud and DevOps solutions. I also provide solutions that help companies implement them and follow best practices.

OCI GoldenGate: Building a Real-Time CDC Pipeline from On-Premises Oracle Database to Autonomous Database

Posted on July 10, 2026 by Osama Mustafa in Uncategorized

Database migration projects fail in predictable ways. The team runs an export, copies the data, imports it, and then discovers the source database kept taking writes during the transfer. The delta needs to be applied. The delta takes longer than expected. The maintenance window closes. The project slips. The next attempt includes a longer maintenance window, which the business rejects.

Change Data Capture solves this at the architecture level. Instead of taking a point-in-time copy, you stream every committed transaction from the source database to the target in real time. The target stays current continuously. When you are ready to cut over, the gap between source and target is seconds, not hours. You flip the connection string, drain in-flight requests, and the migration is done.

OCI GoldenGate is Oracle’s managed CDC and replication service. It runs the GoldenGate engine inside OCI as a fully managed deployment, handles the Extract process on the source side, moves transactions across the network, and applies them to the target using a Replicat process. In this post I will walk through configuring a full CDC pipeline from an on-premises Oracle Database 19c to OCI Autonomous Database using Terraform for the infrastructure layer and the GoldenGate REST API for pipeline configuration.

Architecture

The pipeline has three components running in sequence.

The Extract process connects to the source Oracle Database, reads the redo logs, and captures committed DML and DDL operations. It writes these to trail files inside the GoldenGate deployment.

The Distribution Path moves trail files from the source-side Extract to the OCI GoldenGate deployment over an encrypted connection. In a hybrid setup where the source database is on-premises, this path crosses the network via FastConnect or IPSec VPN.

The Replicat process reads trail files inside OCI GoldenGate and applies the transactions to Autonomous Database using the native OCI Autonomous Database connection.
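
To make the hand-off between the three components concrete, here is a minimal in-memory sketch in Python. It is not GoldenGate code: the names `Change`, `extract`, and `replicat` are hypothetical, a plain list stands in for the trail file, and a nested dict stands in for the target database. It only illustrates two properties described above, that uncommitted work never reaches the trail and that changes are applied in commit order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    scn: int          # commit sequence number (stand-in for the Oracle SCN)
    table: str
    op: str           # "INSERT", "UPDATE" or "DELETE"
    key: int
    row: dict
    committed: bool


def extract(redo_records):
    """Keep only committed changes, ordered by SCN (this list plays the trail)."""
    return sorted((r for r in redo_records if r.committed), key=lambda r: r.scn)


def replicat(trail, target):
    """Apply trail records to the target, modelled as {table: {key: row}}."""
    for change in trail:
        rows = target.setdefault(change.table, {})
        if change.op == "DELETE":
            rows.pop(change.key, None)
        else:
            # INSERT and UPDATE both leave the latest row image in place
            rows[change.key] = change.row
    return target


# Redo records arrive out of order and include one uncommitted insert
redo = [
    Change(102, "hr.employees", "UPDATE", 1, {"name": "Ann", "dept": 20}, True),
    Change(101, "hr.employees", "INSERT", 1, {"name": "Ann", "dept": 10}, True),
    Change(103, "hr.employees", "INSERT", 2, {"name": "Bob", "dept": 10}, False),
]

target = replicat(extract(redo), {})
print(target)  # {'hr.employees': {1: {'name': 'Ann', 'dept': 20}}}
```

The uncommitted insert for key 2 never appears on the target, and the update at SCN 102 is applied after the insert at SCN 101 even though it arrived first.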

Prerequisites

Before starting you need the following in place.

On the source Oracle Database: supplemental logging must be enabled at the database level and at the table level for all tables being replicated. The GoldenGate Extract user needs SELECT on the tables, access to V$LOG and V$LOGFILE, and EXECUTE on DBMS_FLASHBACK.

On the network side: a FastConnect private peering or IPSec VPN between your on-premises network and your OCI VCN. The GoldenGate deployment sits in a private subnet and reaches the source database over this private connection.

On OCI: an Autonomous Database instance already provisioned, with the wallet downloaded and the admin password stored in OCI Vault.

Step 1: Source Database Preparation

Connect to the source Oracle Database as SYSDBA and run:

			
-- Enable supplemental logging at the database level
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (UNIQUE) COLUMNS;
-- Verify supplemental logging is active
SELECT SUPPLEMENTAL_LOG_DATA_MIN,
       SUPPLEMENTAL_LOG_DATA_PK,
       SUPPLEMENTAL_LOG_DATA_UI
FROM V$DATABASE;
-- Create the GoldenGate capture user
CREATE USER ggadmin IDENTIFIED BY "YourStrongPassword123!";
GRANT CREATE SESSION TO ggadmin;
GRANT SELECT ANY DICTIONARY TO ggadmin;
GRANT SELECT ANY TABLE TO ggadmin;
GRANT FLASHBACK ANY TABLE TO ggadmin;
GRANT EXECUTE ON DBMS_FLASHBACK TO ggadmin;
GRANT SELECT ON SYS.V_$DATABASE TO ggadmin;
GRANT SELECT ON SYS.V_$LOG TO ggadmin;
GRANT SELECT ON SYS.V_$LOGFILE TO ggadmin;
GRANT SELECT ON SYS.V_$ARCHIVED_LOG TO ggadmin;
GRANT SELECT ON SYS.V_$LOG_HISTORY TO ggadmin;
GRANT SELECT ON SYS.V_$TRANSACTION TO ggadmin;
-- Enable supplemental logging on specific tables
-- Run this for each table you want to replicate
ALTER TABLE hr.employees ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER TABLE hr.departments ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER TABLE orders.order_header ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER TABLE orders.order_lines ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
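
Because the table-level statement has to be repeated for every replicated table, it can help to generate the statements from a list instead of typing them by hand. The following is a small sketch in Python; the function name `supplemental_log_statements` is my own, and the table list simply mirrors the four tables used above. It only prints SQL text and does not connect to a database.

```python
def supplemental_log_statements(tables):
    """Return one ALTER TABLE statement per schema-qualified table name."""
    statements = []
    for name in tables:
        schema, _, table = name.partition(".")
        if not schema or not table:
            raise ValueError(f"expected schema.table, got {name!r}")
        statements.append(
            f"ALTER TABLE {schema}.{table} ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;"
        )
    return statements


replicated_tables = [
    "hr.employees",
    "hr.departments",
    "orders.order_header",
    "orders.order_lines",
]

for statement in supplemental_log_statements(replicated_tables):
    print(statement)
```

Review the generated statements before running them as a privileged user on the source database.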

		


OCI Network Firewall: Building a Centralized Inspection Architecture with Terraform

Security Lists and Network Security Groups handle stateful packet filtering at the subnet and VNIC level. They are the right tool for controlling which ports and protocols reach a resource. What they cannot do is inspect the content of traffic, detect threats based on application-layer signatures, block specific URLs or FQDNs, or apply SSL inspection to decrypt and re-encrypt traffic in flight. That requires a different layer entirely.

OCI Network Firewall is Oracle’s managed next-generation firewall service, built on Palo Alto Networks technology and integrated natively into VCN routing. It supports application-layer inspection, IDPS (Intrusion Detection and Prevention), URL filtering, FQDN-based rules, and TLS inspection. Unlike a third-party firewall appliance you would deploy on a compute instance, OCI Network Firewall is a fully managed service: Oracle handles the underlying infrastructure, HA, and scaling. You manage the policy.

In this post I will walk through designing a hub-and-spoke inspection architecture, deploying the firewall and its policy using Terraform, configuring IDPS and URL filtering rules, and validating traffic flow with the firewall logs in OCI Logging.

Architecture: Hub-and-Spoke with Centralized Inspection

The standard pattern for OCI Network Firewall in multi-VCN environments is centralized inspection through a hub VCN. All spoke VCNs route traffic through the hub, and the firewall sits in the hub inspecting both north-south (internet-bound) and east-west (spoke-to-spoke) traffic.

			
Internet Gateway
      |
  [OCI Network Firewall] (hub VCN - firewall subnet)
      |
  Dynamic Routing Gateway
    /     \
Spoke VCN 1   Spoke VCN 2
(app tier)    (data tier)

		

Traffic routing in this architecture uses a combination of DRG route tables and VCN ingress/egress route tables to steer all flows through the firewall subnet before they reach their destination. This is the most important concept to get right: the firewall only inspects traffic that is routed through it. Misconfigured route tables mean packets bypass the firewall entirely with no error or warning.
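
The silent-bypass risk comes from how route lookup works: the most specific matching rule wins. The sketch below models that lookup in Python with the standard `ipaddress` module. The route tables, the target labels (`drg`, `local-peering-gateway`), and the addresses are hypothetical and chosen only to show how one extra, more specific rule takes traffic away from the inspection path.

```python
import ipaddress


def next_hop(route_table, destination_ip):
    """Return the target of the most specific matching rule, or None."""
    ip = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in route_table:
        network = ipaddress.ip_network(cidr)
        if ip in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    # Longest prefix (largest prefixlen) wins
    return max(matches, key=lambda match: match[0])[1]


# Hypothetical spoke route tables as (destination CIDR, target) pairs
steered = [("0.0.0.0/0", "drg")]
leaky = [("0.0.0.0/0", "drg"), ("10.2.0.0/16", "local-peering-gateway")]

print(next_hop(steered, "10.2.0.15"))  # drg
print(next_hop(leaky, "10.2.0.15"))    # local-peering-gateway
```

In the second table nothing is reported as wrong. The traffic simply takes the more specific route and never reaches the firewall.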

Step 1: Hub VCN and Firewall Subnet

hcl

			
resource "oci_core_vcn" "hub_vcn" {
  compartment_id = var.compartment_id
  cidr_blocks    = ["192.168.0.0/16"]
  display_name   = "hub-inspection-vcn"
  dns_label      = "hubvcn"
}
# Firewall subnet - the firewall VNIC lives here
resource "oci_core_subnet" "firewall_subnet" {
  compartment_id             = var.compartment_id
  vcn_id                     = oci_core_vcn.hub_vcn.id
  cidr_block                 = "192.168.1.0/24"
  display_name               = "firewall-subnet"
  dns_label                  = "fwsubnet"
  prohibit_public_ip_on_vnic = true
  route_table_id             = oci_core_route_table.firewall_subnet_rt.id
  security_list_ids          = [oci_core_security_list.firewall_sl.id]
}
# Internet Gateway for north-south traffic
resource "oci_core_internet_gateway" "hub_igw" {
  compartment_id = var.compartment_id
  vcn_id         = oci_core_vcn.hub_vcn.id
  display_name   = "hub-internet-gateway"
  enabled        = true
}
# DRG for spoke VCN attachment
resource "oci_core_drg" "hub_drg" {
  compartment_id = var.compartment_id
  display_name   = "hub-drg"
}
resource "oci_core_drg_attachment" "hub_vcn_attachment" {
  drg_id       = oci_core_drg.hub_drg.id
  display_name = "hub-vcn-attachment"
  network_details {
    id   = oci_core_vcn.hub_vcn.id
    type = "VCN"
  }
}

		

VNICs in the firewall subnet must not have public IPs, which is why prohibit_public_ip_on_vnic is set to true. The firewall receives traffic through routing, not through a public endpoint.
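
Before applying the network layout, a quick address-plan check can catch two common mistakes: a firewall subnet that does not sit inside the hub VCN, and VCN ranges that overlap each other. This is a minimal sketch using Python's `ipaddress` module. The function name `validate_layout` and the spoke CIDRs are assumptions for illustration; only the hub and firewall subnet ranges come from the Terraform above.

```python
import ipaddress
from itertools import combinations


def validate_layout(hub_cidr, firewall_subnet_cidr, spoke_cidrs):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    hub = ipaddress.ip_network(hub_cidr)
    firewall = ipaddress.ip_network(firewall_subnet_cidr)
    if not firewall.subnet_of(hub):
        problems.append(f"firewall subnet {firewall} is outside hub VCN {hub}")

    networks = {"hub": hub}
    networks.update(
        {name: ipaddress.ip_network(cidr) for name, cidr in spoke_cidrs.items()}
    )
    for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} overlaps {name_b}")
    return problems


# Hub and firewall subnet from this step; spoke ranges are hypothetical
print(validate_layout(
    "192.168.0.0/16",
    "192.168.1.0/24",
    {"spoke1": "10.1.0.0/16", "spoke2": "10.2.0.0/16"},
))  # []
```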

Step 2: Firewall Policy

The policy is the heart of the firewall. It contains address lists, URL lists, application lists, and the ordered set of security rules. All of these are defined as child resources of the policy and are applied when the policy is attached to a firewall instance.

hcl

			
resource "oci_network_firewall_network_firewall_policy" "production_policy" {
  compartment_id = var.compartment_id
  display_name   = "production-inspection-policy"
}
# IP address list for trusted internal RFC1918 ranges
resource "oci_network_firewall_network_firewall_policy_address_list" "internal_ranges" {
  name                       = "internal-rfc1918"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  type                       = "IP"
  addresses = [
    "10.0.0.0/8",
    "172.16.0.0/12",
    "192.168.0.0/16"
  ]
}
# FQDN list for allowed outbound SaaS destinations
resource "oci_network_firewall_network_firewall_policy_address_list" "allowed_saas" {
  name                       = "allowed-saas-fqdns"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  type                       = "FQDN"
  addresses = [
    "*.oracle.com",
    "*.oraclecloud.com",
    "*.github.com",
    "registry-1.docker.io",
    "auth.docker.io",
    "production.cloudflare.docker.com"
  ]
}
# URL list for blocked categories
resource "oci_network_firewall_network_firewall_policy_url_list" "blocked_urls" {
  name                       = "blocked-url-categories"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  urls {
    pattern = "*.pastebin.com"
    type    = "SIMPLE"
  }
  urls {
    pattern = "*.ngrok.io"
    type    = "SIMPLE"
  }
  urls {
    pattern = "*.ngrok-free.app"
    type    = "SIMPLE"
  }
}
# Application group for web traffic; each entry in apps must be the name of an application resource defined in this policy
resource "oci_network_firewall_network_firewall_policy_application_group" "web_apps" {
  name                       = "web-traffic"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  apps = ["HTTP", "HTTPS", "SSL"]
}

		

Step 3: Security Rules

Rules are evaluated in order. The first matching rule wins. Structure your rules from most specific to most general and always end with an explicit deny-all for traffic that does not match any allow rule.

hcl

			
# Rule 1: Inspect spoke-to-spoke east-west traffic between known internal ranges.
# The inspection argument only takes effect when action is INSPECT.
resource "oci_network_firewall_network_firewall_policy_security_rule" "allow_east_west" {
  name                       = "allow-internal-east-west"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  action                     = "INSPECT"
  inspection                 = "INTRUSION_DETECTION"
  condition {
    source_address      = [oci_network_firewall_network_firewall_policy_address_list.internal_ranges.name]
    destination_address = [oci_network_firewall_network_firewall_policy_address_list.internal_ranges.name]
  }
}
# Rule 2: Inspect outbound HTTPS to approved SaaS FQDNs with IPS enabled
resource "oci_network_firewall_network_firewall_policy_security_rule" "allow_saas_egress" {
  name                       = "allow-saas-egress"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  action                     = "INSPECT"
  inspection                 = "INTRUSION_PREVENTION"
  condition {
    source_address      = [oci_network_firewall_network_firewall_policy_address_list.internal_ranges.name]
    destination_address = [oci_network_firewall_network_firewall_policy_address_list.allowed_saas.name]
    application         = [oci_network_firewall_network_firewall_policy_application_group.web_apps.name]
  }
  # Order is expressed relative to another rule rather than as a numeric priority
  position {
    after_rule = oci_network_firewall_network_firewall_policy_security_rule.allow_east_west.name
  }
}
# Rule 3: Block access to known bad URL patterns
resource "oci_network_firewall_network_firewall_policy_security_rule" "block_bad_urls" {
  name                       = "block-prohibited-urls"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  action                     = "DROP"
  condition {
    url = [oci_network_firewall_network_firewall_policy_url_list.blocked_urls.name]
  }
  position {
    after_rule = oci_network_firewall_network_firewall_policy_security_rule.allow_saas_egress.name
  }
}
# Rule 4: Explicit deny-all as the last rule
resource "oci_network_firewall_network_firewall_policy_security_rule" "deny_all" {
  name                       = "deny-all-unmatched"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  action                     = "DROP"
  condition {}
  position {
    after_rule = oci_network_firewall_network_firewall_policy_security_rule.block_bad_urls.name
  }
}



The inspection field sets the IDPS mode and applies only to rules whose action is INSPECT. INTRUSION_DETECTION logs threats without blocking. INTRUSION_PREVENTION blocks them. Use detection first in a new environment and move to prevention after you have validated that no false positives are hitting legitimate traffic.
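
The first-match behaviour described at the start of this step is easy to get wrong, so here is a simplified model of it in Python. It matches on source and destination CIDRs only and ignores applications, URLs, and inspection modes. The rule dictionaries, the `ALLOW` and `DROP` labels, and the fallback for unmatched traffic are assumptions made for the sketch, not the firewall's actual evaluation engine.

```python
import ipaddress

INTERNAL = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]


def in_any(ip, cidrs):
    """True when cidrs is None (match anything) or ip falls inside one of them."""
    if cidrs is None:
        return True
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


def evaluate(rules, src_ip, dst_ip):
    """Walk the ordered rules and return (name, action) of the first match."""
    for rule in rules:
        if in_any(src_ip, rule["src"]) and in_any(dst_ip, rule["dst"]):
            return rule["name"], rule["action"]
    return None, "DROP"  # assumption: traffic that matches nothing is dropped


rules = [
    {"name": "allow-internal-east-west", "src": INTERNAL, "dst": INTERNAL, "action": "ALLOW"},
    {"name": "deny-all-unmatched", "src": None, "dst": None, "action": "DROP"},
]

print(evaluate(rules, "10.1.0.5", "10.2.0.9"))  # ('allow-internal-east-west', 'ALLOW')
print(evaluate(rules, "10.1.0.5", "8.8.8.8"))   # ('deny-all-unmatched', 'DROP')

# Same rules in the wrong order: the catch-all shadows everything after it
print(evaluate(list(reversed(rules)), "10.1.0.5", "10.2.0.9"))  # ('deny-all-unmatched', 'DROP')
```

The last call shows why the catch-all rule belongs at the end: once it sits first, no later rule is ever reached.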

Step 4: Deploy the Firewall Instance

hcl

			
# Availability domains in the tenancy; the firewall is placed in the first one.
# var.tenancy_ocid is assumed to be defined alongside var.compartment_id.
data "oci_identity_availability_domains" "ads" {
  compartment_id = var.tenancy_ocid
}
resource "oci_network_firewall_network_firewall" "hub_firewall" {
  compartment_id             = var.compartment_id
  display_name               = "hub-production-firewall"
  subnet_id                  = oci_core_subnet.firewall_subnet.id
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  availability_domain        = data.oci_identity_availability_domains.ads.availability_domains[0].name
  defined_tags = {
    "Operations.Environment" = "production"
    "Operations.ManagedBy"   = "terraform"
  }
}
# Retrieve the firewall's private IP - needed for route table configuration.
# The firewall resource exposes its address as ipv4address (no underscore).
data "oci_core_private_ips" "firewall_ip" {
  subnet_id  = oci_core_subnet.firewall_subnet.id
  ip_address = oci_network_firewall_network_firewall.hub_firewall.ipv4address
}

		

The firewall takes 10 to 15 minutes to provision on first deployment. The Terraform apply will wait. Do not cancel it.

Step 5: Route Tables for Traffic Steering

This is where most implementations go wrong. Three separate route tables are required to steer traffic correctly through the firewall.

The first route table is for the Internet Gateway ingress: inbound traffic from the internet destined for a spoke VCN must be routed to the firewall before reaching the DRG.

hcl

			
# IGW ingress route table - applied to the Internet Gateway
resource "oci_core_route_table" "igw_ingress_rt" {
  compartment_id = var.compartment_id
  vcn_id         = oci_core_vcn.hub_vcn.id
  display_name   = "igw-ingress-route-table"
  route_rules {
    destination       = "10.0.0.0/8"
    destination_type  = "CIDR_BLOCK"
    network_entity_id = data.oci_core_private_ips.firewall_ip.private_ips[0].id
  }
  route_rules {
    destination       = "172.16.0.0/12"
    destination_type  = "CIDR_BLOCK"
    network_entity_id = data.oci_core_private_ips.firewall_ip.private_ips[0].id
  }
}
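# The ingress route table is associated with the Internet Gateway through the
# gateway's own route_table_id argument; there is no separate association resource.
# Setting route_table_id = oci_core_route_table.igw_ingress_rt.id directly on
# oci_core_internet_gateway.hub_igw would form a dependency cycle with
# firewall_subnet_rt in this layout, so apply the association in a second step.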
resource "oci_core_route_table" "firewall_subnet_rt" {
  compartment_id = var.compartment_id
  vcn_id         = oci_core_vcn.hub_vcn.id
  display_name   = "firewall-subnet-route-table"
  # Outbound internet traffic from the firewall goes to IGW
  route_rules {
    destination       = "0.0.0.0/0"
    destination_type  = "CIDR_BLOCK"
    network_entity_id = oci_core_internet_gateway.hub_igw.id
  }
  # Return traffic to spokes goes through DRG
  route_rules {
    destination       = "10.0.0.0/8"
    destination_type  = "CIDR_BLOCK"
    network_entity_id = oci_core_drg.hub_drg.id
  }
}
# DRG route table - forces all spoke traffic toward the firewall
resource "oci_core_drg_route_table" "drg_spoke_rt" {
  drg_id                           = oci_core_drg.hub_drg.id
  display_name                     = "drg-spoke-inspection-rt"
  is_ecmp_enabled                  = false
  import_drg_route_distribution_id = oci_core_drg_route_distribution.hub_distribution.id
}
resource "oci_core_drg_route_table_route_rule" "default_to_firewall" {
  drg_route_table_id         = oci_core_drg_route_table.drg_spoke_rt.id
  destination                = "0.0.0.0/0"
  destination_type           = "CIDR_BLOCK"
  next_hop_drg_attachment_id = oci_core_drg_attachment.hub_vcn_attachment.id
}

		

The DRG route table sends all traffic from spokes into the hub VCN attachment. The hub VCN’s route tables then redirect that traffic to the firewall’s private IP before it exits to the internet or to another spoke.
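
A useful sanity check on the return path is to confirm that every spoke range is covered by a route that points back to the DRG. The sketch below does that with Python's `ipaddress` module. The function name `uncovered` and the spoke CIDRs are hypothetical; the single `10.0.0.0/8` entry reflects the DRG rule in the firewall subnet route table shown above.

```python
import ipaddress


def uncovered(required_cidrs, route_destinations):
    """Return the required CIDRs that no route destination fully contains."""
    routes = [ipaddress.ip_network(dest) for dest in route_destinations]
    missing = []
    for cidr in required_cidrs:
        network = ipaddress.ip_network(cidr)
        if not any(network.subnet_of(route) for route in routes):
            missing.append(cidr)
    return missing


# Hypothetical spoke ranges, one of them outside 10.0.0.0/8
spoke_cidrs = ["10.1.0.0/16", "172.16.5.0/24"]

# Destinations that the firewall subnet route table sends to the DRG
routes_to_drg = ["10.0.0.0/8"]

print(uncovered(spoke_cidrs, routes_to_drg))  # ['172.16.5.0/24']
```

With these inputs, return traffic for the `172.16.5.0/24` spoke has no DRG route and would follow the `0.0.0.0/0` rule to the Internet Gateway instead. If your spokes use more than one private range, each range needs its own rule toward the DRG.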

Step 6: Attaching Spoke VCNs

Each spoke VCN attaches to the DRG and assigns the inspection route table to that attachment so traffic from the spoke is steered through the firewall.

hcl

			
resource "oci_core_drg_attachment" "spoke1_attachment" {
  drg_id       = oci_core_drg.hub_drg.id
  display_name = "spoke1-vcn-attachment"
  network_details {
    id   = var.spoke1_vcn_id
    type = "VCN"
  }
  drg_route_table_id = oci_core_drg_route_table.drg_spoke_rt.id
}
resource "oci_core_drg_attachment" "spoke2_attachment" {
  drg_id       = oci_core_drg.hub_drg.id
  display_name = "spoke2-vcn-attachment"
  network_details {
    id   = var.spoke2_vcn_id
    type = "VCN"
  }
  drg_route_table_id = oci_core_drg_route_table.drg_spoke_rt.id
}

		

Each spoke VCN also needs a rule in its subnet route table that sends the default route to the DRG:

hcl

			
# This goes in each spoke VCN's subnet route table
route_rules {
  destination       = "0.0.0.0/0"
  destination_type  = "CIDR_BLOCK"
  network_entity_id = oci_core_drg.hub_drg.id
}

		

Step 7: Enable Firewall Logging

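Without logs you cannot verify the firewall is working, investigate blocked traffic, or tune your rules. OCI Network Firewall supports three log types: Traffic logs, Threat logs, and Tunnel Inspect logs. The configuration below enables the Traffic and Threat logs, which cover rule decisions and IDPS detections. Add a third log resource for tunnel inspection if you use that feature.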

hcl

			
resource "oci_logging_log_group" "firewall_log_group" {
  compartment_id = var.compartment_id
  display_name   = "network-firewall-logs"
}
# Category and service values follow the names the Logging service uses for
# Network Firewall; confirm them against the current service log reference.
resource "oci_logging_log" "firewall_traffic_log" {
  display_name = "firewall-traffic"
  log_group_id = oci_logging_log_group.firewall_log_group.id
  log_type     = "SERVICE"
  configuration {
    source {
      category    = "trafficlog"
      resource    = oci_network_firewall_network_firewall.hub_firewall.id
      service     = "ocinetworkfirewall"
      source_type = "OCISERVICE"
    }
    compartment_id = var.compartment_id
  }
  retention_duration = 60
  is_enabled         = true
}
resource "oci_logging_log" "firewall_threat_log" {
  display_name = "firewall-threats"
  log_group_id = oci_logging_log_group.firewall_log_group.id
  log_type     = "SERVICE"
  configuration {
    source {
      category    = "threatlog"
      resource    = oci_network_firewall_network_firewall.hub_firewall.id
      service     = "ocinetworkfirewall"
      source_type = "OCISERVICE"
    }
    compartment_id = var.compartment_id
  }
  retention_duration = 60
  is_enabled         = true
}

		

Validating the Configuration

Once deployed, validate traffic routing before declaring the rollout complete.

First, verify a connection from a spoke instance to an allowed destination reaches it without being dropped:

bash

			
# From a compute instance in spoke VCN 1
curl -v https://objectstorage.me-jeddah-1.oraclecloud.com/healthcheck

Then check the firewall traffic log in OCI Logging to confirm the connection was seen and allowed:

bash

			
oci logging-search search-logs \
  --search-query 'search "ocid1.compartment.oc1..yourcompartmentocid/oci-network-firewall/firewall-traffic" | where data.action = '"'"'ALLOW'"'"'' \
  --time-start "$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --time-end "$(date -u +%Y-%m-%dT%H:%M:%SZ)"

Test the block rule by attempting to reach a prohibited URL:

bash

curl -v https://pastebin.com

This should time out. Because the drop comes from a security rule rather than an IDPS signature, look for it in the traffic log: there should be an entry with action DROP and a destination matching your blocked URL list.

To test IDPS, you can use a safe test signature. The EICAR test string sent over HTTP triggers a detection event without any real malicious activity:

bash

			
# The URL is single-quoted so the shell does not interpret !, (, ) or *
curl -v 'http://any-http-server/path?test=X5O!P%25%40AP%5B4%5CPZX54(P%5E)7CC)7%7D%24EICAR-STANDARD-ANTIVIRUS-TEST-FILE!%24H%2BH*'

If IDPS is in detection mode, the threat log will show the signature hit. If it is in prevention mode, the connection will be dropped immediately.
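
The EICAR string contains characters such as `%`, `\`, and `$` that are easy to mangle when encoding by hand. A short Python sketch using only the standard library can produce the query-string form reliably. The host `any-http-server` is the same placeholder used above, and the variable names are my own.

```python
from urllib.parse import quote, unquote

# The standard 68-character EICAR test string (raw string keeps the backslash)
EICAR = r"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"

# safe="" percent-encodes every reserved character, including "/" and "%"
encoded = quote(EICAR, safe="")
url = f"http://any-http-server/path?test={encoded}"

print(url)

# Round-trip check: decoding must give back the exact original string
assert unquote(encoded) == EICAR
```

Encoding everything is more aggressive than strictly required, but it leaves nothing for the shell or an intermediate proxy to reinterpret.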

Policy Updates Without Downtime

One advantage of OCI Network Firewall’s policy model is that the policy is a separate resource from the firewall instance. You can build a new policy, validate it, and then update the firewall’s network_firewall_policy_id to point to the new policy. The firewall applies the new policy without a service interruption. This makes iterative rule changes safe to deploy during business hours.

hcl

			
# Create new policy version
resource "oci_network_firewall_network_firewall_policy" "production_policy_v2" {
  compartment_id = var.compartment_id
  display_name   = "production-inspection-policy-v2"
}
# After validating the new policy, update the firewall to reference it
resource "oci_network_firewall_network_firewall" "hub_firewall" {
  # ...existing config...
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy_v2.id
}

		

In a GitOps pipeline, this means policy changes go through a PR, get reviewed, apply to a staging firewall first, and then the same policy OCID is referenced in the production firewall config. No manual steps, no maintenance window required for rule changes.

Where OCI Network Firewall Fits in Your Security Stack

OCI Network Firewall is not a replacement for Security Lists, NSGs, or WAF. Each operates at a different layer and serves a different purpose.

Security Lists and NSGs handle stateful L3/L4 filtering close to the resource. They are fast, simple, and the right tool for controlling which ports reach a compute instance or database.

OCI WAF sits at the edge handling HTTP/HTTPS application-layer attacks: SQL injection, XSS, bot detection, and rate limiting against your public endpoints.

OCI Network Firewall sits between those two layers handling deep packet inspection, IDPS, URL filtering, and FQDN-based egress control across all traffic flowing through your VCN topology.

Running all three in the same environment is not redundancy for its own sake. Each catches a class of threat the others cannot. Security Lists catch misconfigured port exposure. WAF catches web application exploits at the edge. Network Firewall catches lateral movement, data exfiltration attempts, and known malware signatures in traffic that has already passed the perimeter.

Regards,
Osama


OCI DevOps: Building a Production CI/CD Pipeline with Terraform

Most teams running workloads on OCI manage their deployments through a mix of external tools: GitHub Actions pushing to OKE, Jenkins deploying to compute instances, manual Terraform runs triggered from a developer’s laptop. This works until it does not. The audit trail is scattered, secrets flow through CI runners that may not be in your VCN, and there is no native integration between the deployment tooling and the OCI IAM model that controls the infrastructure.

OCI DevOps is Oracle’s native CI/CD service. It covers source code mirroring, build pipelines, artifact management, and deployment pipelines to OKE, compute instances, Functions, and other targets. Everything runs inside your tenancy, authenticates through IAM Dynamic Groups and policies, and integrates natively with OCI Vault for secrets, OCI Container Registry for images, and OCI Artifact Registry for generic artifacts.

In this post I will build a complete pipeline from source code mirror through build, test, image push, and deployment to an OKE cluster, using Terraform for all infrastructure and a real application for the pipeline to deploy.

Service Architecture

OCI DevOps has five main components that work together.

The Project is the top-level container. It groups all related resources: code repositories, build pipelines, deployment pipelines, and environments.

Code Repositories mirror external Git repositories (GitHub, GitLab, Bitbucket) or host code natively inside OCI. Mirroring syncs on a schedule or on webhook trigger.

Build Pipelines execute build stages: managed build (runs your build spec on Oracle-managed runners), deliver artifact (pushes to Container Registry or Artifact Registry), and trigger deployment.

Artifact Registry stores generic versioned artifacts: Helm charts, Terraform modules, JAR files, and deployment manifests.

Deployment Pipelines run the actual deployment to a target environment. They support blue-green, canary, and rolling deployment strategies with built-in approval gates.

Step 1: IAM Setup

OCI DevOps needs a Dynamic Group that matches the build and deployment pipeline resources, and a policy that grants them the permissions to do their work.

hcl

			
resource "oci_identity_dynamic_group" "devops_build_dg" {
  compartment_id = var.tenancy_ocid
  name           = "devops-build-pipelines"
  description    = "Dynamic group for OCI DevOps build pipeline runners"
  matching_rule  = "All {resource.type = 'devopsbuildpipeline', resource.compartment.id = '${var.compartment_id}'}"
}
resource "oci_identity_dynamic_group" "devops_deploy_dg" {
  compartment_id = var.tenancy_ocid
  name           = "devops-deploy-pipelines"
  description    = "Dynamic group for OCI DevOps deployment pipelines"
  matching_rule  = "All {resource.type = 'devopsdeploypipeline', resource.compartment.id = '${var.compartment_id}'}"
}
resource "oci_identity_policy" "devops_policy" {
  compartment_id = var.compartment_id
  name           = "devops-pipeline-policy"
  description    = "Permissions for OCI DevOps build and deploy pipelines"
  statements = [
    # Build pipelines need to read secrets and push to container registry
    "Allow dynamic-group devops-build-pipelines to manage repos in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-build-pipelines to read secret-family in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-build-pipelines to manage artifacts in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-build-pipelines to manage devops-family in compartment id ${var.compartment_id}",
    # Deploy pipelines need to manage OKE workloads and read artifacts
    "Allow dynamic-group devops-deploy-pipelines to manage cluster-family in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-deploy-pipelines to use artifacts in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-deploy-pipelines to manage devops-family in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-deploy-pipelines to read secret-family in compartment id ${var.compartment_id}"
  ]
}
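
Policy statements are plain strings, so a typo in a group name or verb only surfaces when the pipeline fails with an authorization error. One way to reduce that risk is to render the statements from a small table and validate the verbs first. The sketch below is a hypothetical helper in Python, not part of any OCI SDK; the grants mirror the statements in the policy above, and the compartment OCID is a placeholder.

```python
VALID_VERBS = {"inspect", "read", "use", "manage"}

# Dynamic group name -> list of (verb, resource type) grants
GRANTS = {
    "devops-build-pipelines": [
        ("manage", "repos"),
        ("read", "secret-family"),
        ("manage", "artifacts"),
        ("manage", "devops-family"),
    ],
    "devops-deploy-pipelines": [
        ("manage", "cluster-family"),
        ("use", "artifacts"),
        ("manage", "devops-family"),
        ("read", "secret-family"),
    ],
}


def policy_statements(grants, compartment_id):
    """Render IAM policy statements, rejecting unknown verbs early."""
    statements = []
    for group, rules in grants.items():
        for verb, resource in rules:
            if verb not in VALID_VERBS:
                raise ValueError(f"unknown verb {verb!r} for {group}")
            statements.append(
                f"Allow dynamic-group {group} to {verb} {resource} "
                f"in compartment id {compartment_id}"
            )
    return statements


for line in policy_statements(GRANTS, "ocid1.compartment.oc1..example"):
    print(line)
```

The rendered list can be pasted into the `statements` argument or fed to Terraform as a variable, which keeps the group names defined in one place.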

		

Step 2: Create the DevOps Project

hcl

			
resource "oci_devops_project" "orders_api_project" {
  compartment_id = var.compartment_id
  name           = "orders-api"
  description    = "CI/CD pipeline for the orders API service"
  notification_config {
    topic_id = oci_ons_notification_topic.devops_alerts.id
  }
  defined_tags = {
    "Operations.Environment" = "production"
    "Operations.ManagedBy"   = "terraform"
  }
}
resource "oci_ons_notification_topic" "devops_alerts" {
  compartment_id = var.compartment_id
  name           = "devops-pipeline-alerts"
  description    = "Notifications for DevOps pipeline events"
}
resource "oci_ons_subscription" "devops_email" {
  compartment_id = var.compartment_id
  topic_id       = oci_ons_notification_topic.devops_alerts.id
  protocol       = "EMAIL"
  endpoint       = var.devops_alert_email
}

		

Step 3: Mirror the GitHub Repository

OCI DevOps can mirror a GitHub repository and trigger a build pipeline on push events. The mirror keeps a copy of the source inside OCI so builds do not depend on external connectivity to GitHub at build time.

hcl

			
resource "oci_devops_repository" "orders_api_repo" {
  project_id      = oci_devops_project.orders_api_project.id
  name            = "orders-api"
  description     = "Mirror of GitHub orders-api repository"
  repository_type = "MIRRORED"
  default_branch  = "main"
  mirror_repository_config {
    repository_url    = "https://github.com/your-org/orders-api.git"
    connector_id      = oci_devops_connection.github_connection.id
    trigger_schedule {
      schedule_type   = "CUSTOM"
      # Custom schedules use an RFC 5545 recurrence rule: sync every 6 hours
      custom_schedule = "FREQ=HOURLY;INTERVAL=6"
    }
  }
}
resource "oci_devops_connection" "github_connection" {
  project_id      = oci_devops_project.orders_api_project.id
  display_name    = "github-connection"
  connection_type = "GITHUB_ACCESS_TOKEN"
  description     = "Connection to GitHub using PAT stored in OCI Vault"
  access_token = oci_vault_secret.github_pat.id
}
resource "oci_vault_secret" "github_pat" {
  compartment_id = var.compartment_id
  vault_id       = var.vault_id
  key_id         = var.vault_key_id
  secret_name    = "github-pat-devops"
  secret_content {
    content_type = "BASE64"
    content      = base64encode(var.github_personal_access_token)
  }
}

		

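The connection reads the GitHub PAT from OCI Vault at runtime through the Dynamic Group policy, so the token never sits in an environment variable on a CI runner. In this example the token still enters Terraform through var.github_personal_access_token and is recorded in state, so mark that variable as sensitive and protect the state file, or create the secret outside Terraform and pass in only its OCID.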

Step 4: Build Spec

The build spec is a YAML file committed to your repository at build_spec.yaml. It defines the steps the managed build runner executes.

yaml

			
version: 0.1
component: build
timeoutInSeconds: 1800
env:
  exportedVariables:
    - BUILDRUN_HASH
    # Assumption: listed so that later steps and outputArtifacts can resolve it
    - IMAGE_TAG
steps:
  - type: Command
    name: Set build hash
    command: |
      export BUILDRUN_HASH=$(echo ${OCI_BUILD_RUN_ID} | tail -c 8)
      echo "BUILDRUN_HASH: ${BUILDRUN_HASH}"
  - type: Command
    name: Install dependencies
    command: |
      cd orders-api
      pip install -r requirements.txt --quiet
  - type: Command
    name: Run unit tests
    command: |
      cd orders-api
      python -m pytest tests/unit/ -v --tb=short --junitxml=test-results.xml
      if [ $? -ne 0 ]; then
        echo "Unit tests failed. Aborting build."
        exit 1
      fi
  - type: Command
    name: Run security scan
    command: |
      pip install bandit --quiet
      cd orders-api
      bandit -r src/ -f json -o bandit-report.json -lll
      if [ $? -eq 1 ]; then
        echo "High severity security issues found. Aborting build."
        exit 1
      fi
  - type: Command
    name: Build container image
    command: |
      cd orders-api
      IMAGE_TAG="${CONTAINER_REGISTRY}/${NAMESPACE}/orders-api:${BUILDRUN_HASH}"
      docker build -t orders-api:latest -t ${IMAGE_TAG} .
      echo "IMAGE_TAG=${IMAGE_TAG}" >> ${OCI_PRIMARY_SOURCE_DIR}/build_output.env
  - type: Command
    name: Push image to OCI Container Registry
    command: |
      docker push ${IMAGE_TAG}
outputArtifacts:
  - name: orders-api-image
    type: DOCKER_IMAGE
    location: ${IMAGE_TAG}
  - name: kubernetes-manifests
    type: BINARY
    location: ${OCI_PRIMARY_SOURCE_DIR}/orders-api/k8s/

		

The security scan step uses Bandit to flag high-severity Python security issues and fails the build if any are found. This happens before the image is built, not after.
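The same gate can be applied to the JSON report itself, which is handy when you want the findings printed in the build log. The sketch below assumes the report follows Bandit's JSON layout, with a results list whose entries carry issue_severity. The function names and the sample findings are made up for illustration.

```python
import json
from typing import Any


def high_severity_findings(report: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the findings rated HIGH in a parsed Bandit JSON report."""
    return [r for r in report.get("results", []) if r.get("issue_severity") == "HIGH"]


def gate(report_json: str) -> int:
    """Return a shell-style exit code: 1 if any HIGH finding exists, else 0."""
    findings = high_severity_findings(json.loads(report_json))
    for f in findings:
        print(f"{f.get('filename')}:{f.get('line_number')} "
              f"{f.get('test_id')} {f.get('issue_text')}")
    return 1 if findings else 0


if __name__ == "__main__":
    # Hand-written sample in the shape of a Bandit report, not real scan data.
    sample = json.dumps({"results": [
        {"filename": "src/db.py", "line_number": 12, "test_id": "B608",
         "issue_severity": "MEDIUM", "issue_text": "Possible SQL injection"},
        {"filename": "src/run.py", "line_number": 4, "test_id": "B602",
         "issue_severity": "HIGH", "issue_text": "subprocess call with shell=True"},
    ]})
    print("exit code:", gate(sample))
```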

Step 5: Build Pipeline

hcl

			
resource "oci_devops_build_pipeline" "orders_api_build" {
  project_id   = oci_devops_project.orders_api_project.id
  display_name = "orders-api-build"
  description  = "Build, test, scan, and push the orders API container image"
  build_pipeline_parameters {
    items {
      name          = "CONTAINER_REGISTRY"
      default_value = "${var.oci_region_key}.ocir.io"
      description   = "OCI Container Registry endpoint"
    }
    items {
      name          = "NAMESPACE"
      default_value = var.tenancy_namespace
      description   = "OCI tenancy namespace for Container Registry"
    }
  }
}
# Stage 1: Managed Build
resource "oci_devops_build_pipeline_stage" "managed_build" {
  build_pipeline_id = oci_devops_build_pipeline.orders_api_build.id
  display_name      = "managed-build"
  description       = "Execute build spec on managed runner"
  build_pipeline_stage_type = "BUILD"
  build_spec_file                    = "build_spec.yaml"
  stage_execution_timeout_in_seconds = 1800
  image                              = "OL7_X86_64_STANDARD_10"
  build_source_collection {
    items {
      connection_type = "DEVOPS_CODE_REPOSITORY"
      repository_id   = oci_devops_repository.orders_api_repo.id
      name            = "orders-api"
      branch          = "main"
      repository_url  = oci_devops_repository.orders_api_repo.http_url
    }
  }
  build_pipeline_stage_predecessor_collection {
    items {
      id = oci_devops_build_pipeline.orders_api_build.id
    }
  }
}
# Stage 2: Deliver Artifact to Container Registry
resource "oci_devops_build_pipeline_stage" "deliver_artifact" {
  build_pipeline_id = oci_devops_build_pipeline.orders_api_build.id
  display_name      = "deliver-artifact"
  description       = "Push built image to OCI Container Registry"
  build_pipeline_stage_type = "DELIVER_ARTIFACT"
  deliver_artifact_collection {
    items {
      artifact_name = "orders-api-image"
      artifact_id   = oci_devops_deploy_artifact.orders_api_image.id
    }
    items {
      artifact_name = "kubernetes-manifests"
      artifact_id   = oci_devops_deploy_artifact.k8s_manifests.id
    }
  }
  build_pipeline_stage_predecessor_collection {
    items {
      id = oci_devops_build_pipeline_stage.managed_build.id
    }
  }
}
# Stage 3: Trigger Deployment Pipeline
resource "oci_devops_build_pipeline_stage" "trigger_deploy" {
  build_pipeline_id = oci_devops_build_pipeline.orders_api_build.id
  display_name      = "trigger-deployment"
  description       = "Trigger the deployment pipeline on successful build"
  build_pipeline_stage_type = "TRIGGER_DEPLOYMENT_PIPELINE"
  deploy_pipeline_id        = oci_devops_deploy_pipeline.orders_api_deploy.id
  is_pass_all_parameters_enabled = true
  build_pipeline_stage_predecessor_collection {
    items {
      id = oci_devops_build_pipeline_stage.deliver_artifact.id
    }
  }
}

		

Step 6: Artifact Registry

hcl

			
resource "oci_artifacts_repository" "k8s_manifests_repo" {
  compartment_id  = var.compartment_id
  display_name    = "orders-api-manifests"
  description     = "Kubernetes deployment manifests for orders API"
  is_immutable    = false
  repository_type = "GENERIC"
}
resource "oci_devops_deploy_artifact" "orders_api_image" {
  project_id             = oci_devops_project.orders_api_project.id
  display_name           = "orders-api-container-image"
  argument_substitution_mode = "SUBSTITUTE_PLACEHOLDERS"
  deploy_artifact_type   = "DOCKER_IMAGE"
  deploy_artifact_source {
    deploy_artifact_source_type = "OCIR"
    image_uri    = "${var.oci_region_key}.ocir.io/${var.tenancy_namespace}/orders-api:$${BUILDRUN_HASH}"
    image_digest = " "
  }
}
resource "oci_devops_deploy_artifact" "k8s_manifests" {
  project_id             = oci_devops_project.orders_api_project.id
  display_name           = "orders-api-k8s-manifests"
  argument_substitution_mode = "SUBSTITUTE_PLACEHOLDERS"
  deploy_artifact_type   = "KUBERNETES_MANIFEST"
  deploy_artifact_source {
    deploy_artifact_source_type = "GENERIC_ARTIFACT"
    repository_id  = oci_artifacts_repository.k8s_manifests_repo.id
    deploy_artifact_path    = "k8s/deployment.yaml"
    deploy_artifact_version = "$${BUILDRUN_HASH}"
  }
}

		

Step 7: Deployment Environment and Pipeline

The deployment pipeline targets the OKE cluster. Define the environment first, then the pipeline stages.

hcl

			
resource "oci_devops_deploy_environment" "oke_prod" {
  project_id              = oci_devops_project.orders_api_project.id
  display_name            = "oke-production"
  description             = "Production OKE cluster"
  deploy_environment_type = "OKE_CLUSTER"
  cluster_id              = var.oke_cluster_id
}
resource "oci_devops_deploy_pipeline" "orders_api_deploy" {
  project_id   = oci_devops_project.orders_api_project.id
  display_name = "orders-api-deploy"
  description  = "Blue-green deployment of orders API to production OKE"
  deploy_pipeline_parameters {
    items {
      name          = "NAMESPACE"
      default_value = "orders"
      description   = "Kubernetes namespace for the deployment"
    }
    items {
      name          = "IMAGE_TAG"
      default_value = "latest"
      description   = "Container image tag to deploy"
    }
  }
}
# Stage 1: Approval gate before production deployment
resource "oci_devops_deploy_stage" "approval_gate" {
  deploy_pipeline_id = oci_devops_deploy_pipeline.orders_api_deploy.id
  display_name       = "production-approval"
  description        = "Manual approval required before deploying to production"
  deploy_stage_type               = "MANUAL_APPROVAL"
  approval_policy {
    approval_policy_type         = "COUNT_BASED_APPROVAL"
    number_of_approvals_required = 1
  }
  deploy_stage_predecessor_collection {
    items {
      id = oci_devops_deploy_pipeline.orders_api_deploy.id
    }
  }
}
# Stage 2: Blue-green deploy to OKE
resource "oci_devops_deploy_stage" "oke_blue_green_deploy" {
  deploy_pipeline_id = oci_devops_deploy_pipeline.orders_api_deploy.id
  display_name       = "oke-blue-green-deploy"
  description        = "Deploy new version to green environment"
  deploy_stage_type = "OKE_BLUE_GREEN_DEPLOYMENT"
  # These arguments sit directly on the stage resource, not in a wrapper block
  kubernetes_manifest_deploy_artifact_ids = [
    oci_devops_deploy_artifact.k8s_manifests.id
  ]
  oke_cluster_deploy_environment_id = oci_devops_deploy_environment.oke_prod.id
  blue_green_strategy {
    strategy_type = "NGINX_INGRESS_STRATEGY"
    namespace_a   = "orders-blue"
    namespace_b   = "orders-green"
    ingress_name  = "orders-api-ingress"
  }
  deploy_stage_predecessor_collection {
    items {
      id = oci_devops_deploy_stage.approval_gate.id
    }
  }
}
# Stage 3: Traffic shift after successful deployment validation
resource "oci_devops_deploy_stage" "traffic_shift" {
  deploy_pipeline_id = oci_devops_deploy_pipeline.orders_api_deploy.id
  display_name       = "shift-traffic-to-green"
  description        = "Shift 100% of traffic to the newly deployed green environment"
  deploy_stage_type = "OKE_BLUE_GREEN_TRAFFIC_SHIFT"
  oke_blue_green_deploy_stage_id = oci_devops_deploy_stage.oke_blue_green_deploy.id
  deploy_stage_predecessor_collection {
    items {
      id = oci_devops_deploy_stage.oke_blue_green_deploy.id
    }
  }
}

		

Step 8: Trigger on Code Push

The trigger watches the mirrored repository and fires the build pipeline when a push lands on the main branch.

hcl

			
resource "oci_devops_trigger" "main_branch_push" {
  project_id     = oci_devops_project.orders_api_project.id
  display_name   = "main-branch-push-trigger"
  description    = "Trigger build pipeline on every push to main"
  trigger_source = "DEVOPS_CODE_REPOSITORY"
  repository_id  = oci_devops_repository.orders_api_repo.id
  actions {
    type        = "TRIGGER_BUILD_PIPELINE"
    build_pipeline_id = oci_devops_build_pipeline.orders_api_build.id
    filter {
      trigger_source = "DEVOPS_CODE_REPOSITORY"
      events         = ["PUSH"]
      include {
        head_ref = "main"
      }
      exclude {
        file_filter {
          file_paths = ["docs/*", "*.md", ".github/*"]
        }
      }
    }
  }
}

		

The exclude block prevents documentation-only changes from triggering a full build and deploy. Pushing a README update does not kick off the pipeline.
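To reason about which pushes would start a build, the filter can be approximated locally. This is a sketch only: it uses Python's fnmatch for glob matching, which may not match OCI's own matching rules exactly, and the function names are hypothetical.

```python
from fnmatch import fnmatch

# Same patterns as the exclude block in the trigger.
EXCLUDED = ["docs/*", "*.md", ".github/*"]


def is_excluded(path: str, patterns: list[str] = EXCLUDED) -> bool:
    """True when the path matches at least one excluded pattern."""
    return any(fnmatch(path, p) for p in patterns)


def should_trigger(changed_files: list[str], patterns: list[str] = EXCLUDED) -> bool:
    """Trigger only if at least one changed file falls outside the excluded patterns."""
    return any(not is_excluded(f, patterns) for f in changed_files)


if __name__ == "__main__":
    print(should_trigger(["README.md", "docs/setup.md"]))   # documentation only
    print(should_trigger(["README.md", "src/handlers.py"]))  # code changed as well
```

A push that mixes documentation and code still triggers the pipeline, because one changed file outside the excluded paths is enough.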

Step 9: Verifying the Pipeline

Once Terraform applies, validate the end-to-end flow.

Check mirror sync status:

bash

			
oci devops repository get \
  --repository-id <your-repo-ocid> \
  --query 'data.{name:name, "mirror-status":"mirror-repository-config"}' \
  --output table

Manually trigger a build to test without waiting for a push:

bash

			
oci devops build-run create \
  --build-pipeline-id <your-build-pipeline-ocid> \
  --display-name "manual-validation-run" \
  --build-run-arguments '{"items": [{"name": "IMAGE_TAG", "value": "validation-test"}]}'

Watch the build run progress:

bash

			
oci devops build-run get \
  --build-run-id <build-run-ocid> \
  --query 'data.{status:"lifecycle-state", phase:"build-run-progress"."build-pipeline-stage-run-progress"}' \
  --output table

List deployment history to confirm deployments are being tracked:

bash

			
oci devops deployment list \
  --project-id <project-ocid> \
  --sort-by timeCreated \
  --sort-order DESC \
  --limit 10 \
  --query 'data.items[*].{name:"display-name", status:"lifecycle-state", time:"time-created"}' \
  --output table

		

Rollback

If a deployment introduces a regression, the blue-green strategy keeps rollback fast: traffic keeps flowing to the old environment until the traffic shift stage completes. If you catch the issue before the shift, reject the pending approval from the console or CLI. The approve command acts on a manual approval stage, so this assumes an approval stage sits immediately before the traffic shift and that its OCID is the one you pass as the stage ID:

bash

oci devops deployment approve \
  --deployment-id <deployment-ocid> \
  --deploy-stage-id <approval-stage-ocid> \
  --reason "Rolling back: latency regression detected in green environment" \
  --action REJECT

The green environment never receives production traffic, the blue environment continues serving it, and the deployment is marked as failed with the reason recorded in the audit log.

Where This Fits in a Real Team

The value of OCI DevOps over an external CI/CD tool is not raw feature count. GitHub Actions or GitLab CI have richer marketplace ecosystems. The value is native IAM integration and residency inside your tenancy.

Build runners authenticate to OCI Vault, Container Registry, and Artifact Registry using the Dynamic Group policy with no credentials stored on a third-party platform. Every build and deployment is recorded in OCI Audit with the OCID of the pipeline that ran it. Deployment approvals are logged against the OCI user who approved or rejected them. For regulated environments where you need to prove that every production change was approved by a named human identity and executed by an automated system with least-privilege credentials, OCI DevOps gives you that audit trail natively.

For teams already running everything inside OCI, it is the most operationally coherent choice.

Regards,
Osama

OCI Instance Pools and Autoscaling: Building a Production-Grade Compute Scaling Architecture with Terraform

Vertical scaling on OCI is straightforward: stop the instance, change the shape, start it again. It works but it does not solve the problem you face at 9am on a Monday when traffic doubles in ten minutes and you need twenty more instances, not one bigger one. That is horizontal scaling, and doing it properly on OCI requires understanding how Instance Configurations, Instance Pools, and Autoscaling Configurations work together.

Most teams get to instance pools quickly. They read the docs, create a pool with a fixed size, and think they are done. What they miss is the autoscaling layer on top, the load balancer backend set attachment that makes the pool actually serve traffic, the health check configuration that removes unhealthy instances before they receive requests, and the custom metric path that scales on application-level signals instead of just CPU.

This post covers all of it: the full Terraform implementation of a production autoscaling group behind a load balancer, health checks, scaling policies using both metric-based and schedule-based triggers, and custom metric publishing so you can scale on queue depth or request latency instead of raw CPU utilization.

How the Components Fit Together

Before writing any Terraform, the relationship between the three core resources matters.

An Instance Configuration is a template. It defines the compute shape, the OS image, the boot volume size, the VCN subnet placement, the cloud-init script, and any attached block volumes. The Instance Configuration itself does not run anything. It is a snapshot of how an instance should be created.

An Instance Pool uses that template to create and manage a group of identically configured instances. The pool maintains a target size, handles replacements when an instance becomes unhealthy, and integrates with the OCI Load Balancer to register and deregister instances automatically as they join or leave the pool.

An Autoscaling Configuration sits on top of the pool and adjusts the target size based on rules you define. It can scale out when CPU exceeds a threshold, scale in when it drops, and follow a fixed schedule for predictable load patterns.

Step 1: Instance Configuration

The cloud-init script inside the instance configuration is where you install your application, configure the OCI Monitoring agent for custom metrics, and register the instance with your configuration management system. Keep it idempotent.

hcl

			
data "oci_core_images" "ol8_image" {
  compartment_id           = var.compartment_id
  operating_system         = "Oracle Linux"
  operating_system_version = "8"
  shape                    = "VM.Standard.E4.Flex"
  sort_by                  = "TIMECREATED"
  sort_order               = "DESC"
  filter {
    name   = "display_name"
    values = ["^.*Oracle-Linux-8.*$"]
    regex  = true
  }
}
resource "oci_core_instance_configuration" "app_instance_config" {
  compartment_id = var.compartment_id
  display_name   = "orders-api-instance-config-v${var.app_version}"
  instance_details {
    instance_type = "compute"
    launch_details {
      compartment_id = var.compartment_id
      display_name   = "orders-api-node"
      shape          = "VM.Standard.E4.Flex"
      shape_config {
        ocpus         = 2
        memory_in_gbs = 16
      }
      source_details {
        source_type             = "image"
        image_id                = data.oci_core_images.ol8_image.images[0].id
        boot_volume_size_in_gbs = 50
      }
      create_vnic_details {
        subnet_id             = var.app_subnet_id
        assign_public_ip      = false
        nsg_ids               = [var.app_nsg_id]
        hostname_label_prefix = "orders-api"
      }
      metadata = {
        ssh_authorized_keys = var.ssh_public_key
        user_data           = base64encode(templatefile("${path.module}/templates/cloud-init.yaml", {
          app_version        = var.app_version
          compartment_id     = var.compartment_id
          region             = var.region
          monitoring_enabled = "true"
        }))
      }
      defined_tags = {
        "Operations.Environment" = "production"
        "Operations.Application" = "orders-api"
        "Operations.ManagedBy"   = "terraform"
      }
    }
  }
}

		

The cloud-init template at templates/cloud-init.yaml:

yaml

			
#cloud-config
runcmd:
  # Install OCI Unified Monitoring Agent for custom metrics
  - dnf install -y oracle-cloud-agent
  - systemctl enable oracle-cloud-agent
  - systemctl start oracle-cloud-agent
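  # Install the application; create the config directory and service user the later steps rely on
  - mkdir -p /opt/orders-api /etc/orders-api
  - id app >/dev/null 2>&1 || useradd --system --no-create-home app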
  - dnf install -y python3.11 python3.11-pip
  - pip3.11 install orders-api==${app_version}
  # Configure the application
  - |
    cat > /etc/orders-api/config.yaml <<EOF
    environment: production
    compartment_id: ${compartment_id}
    region: ${region}
    metrics_namespace: custom_orders_api
    EOF
  # Start the application
  - systemctl enable orders-api
  - systemctl start orders-api
write_files:
  - path: /etc/systemd/system/orders-api.service
    content: |
      [Unit]
      Description=Orders API Service
      After=network.target
      [Service]
      Type=simple
      User=app
      ExecStart=/usr/local/bin/orders-api serve
      Restart=always
      RestartSec=5
      Environment=CONFIG_FILE=/etc/orders-api/config.yaml
      [Install]
      WantedBy=multi-user.target

		

Step 2: Instance Pool

hcl

			
resource "oci_core_instance_pool" "orders_api_pool" {
  compartment_id            = var.compartment_id
  instance_configuration_id = oci_core_instance_configuration.app_instance_config.id
  display_name              = "orders-api-pool"
  size                      = 2
  placement_configurations {
    availability_domain = data.oci_identity_availability_domains.ads.availability_domains[0].name
    primary_subnet_id   = var.app_subnet_id
    fault_domains       = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]
  }
  placement_configurations {
    availability_domain = data.oci_identity_availability_domains.ads.availability_domains[1].name
    primary_subnet_id   = var.app_subnet_id_ad2
    fault_domains       = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]
  }
  load_balancers {
    backend_set_name = oci_load_balancer_backend_set.orders_api_backend.name
    load_balancer_id = oci_load_balancer.orders_lb.id
    port             = 8080
    vnic_selection   = "PrimaryVnic"
  }
  defined_tags = {
    "Operations.Environment" = "production"
    "Operations.Application" = "orders-api"
  }
}

		

The pool uses two placement configurations across two availability domains, with all three fault domains listed in each. This spreads instances across the physical failure domains within each AD, so a single hardware failure affecting one fault domain takes out roughly one third of the capacity in that AD rather than all of it.
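The arithmetic behind that claim can be sketched in a few lines. The example below distributes a pool round-robin over every AD and fault domain pair and reports the share of capacity lost if the fullest fault domain fails. Real placement is decided by OCI and is best-effort, so treat this as an illustration of the numbers only; the function names are hypothetical.

```python
from collections import Counter


def spread(size: int, ads: list[str], fault_domains: list[str]) -> Counter:
    """Round-robin `size` instances over every (AD, fault domain) pair."""
    # Alternate ADs first so small pools still land in both availability domains.
    slots = [(ad, fd) for fd in fault_domains for ad in ads]
    return Counter(slots[i % len(slots)] for i in range(size))


def worst_case_loss(placement: Counter) -> float:
    """Fraction of total capacity lost if the fullest fault domain fails."""
    total = sum(placement.values())
    return max(placement.values()) / total if total else 0.0


if __name__ == "__main__":
    ads = ["AD-1", "AD-2"]
    fds = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]
    for size in (2, 6, 20):
        placement = spread(size, ads, fds)
        print(size, f"{worst_case_loss(placement):.0%}")
```

With only two instances, one fault domain failure removes half the pool, which is a good argument for a minimum size above two in production.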

The load_balancers block registers the pool with the load balancer backend set automatically. When the pool adds an instance, OCI registers it with the backend set. When it removes one, OCI deregisters it before terminating the instance so it drains connections cleanly.

Step 3: Load Balancer and Health Check

hcl

			
resource "oci_load_balancer" "orders_lb" {
  compartment_id             = var.compartment_id
  display_name               = "orders-api-lb"
  shape                      = "flexible"
  subnet_ids                 = [var.public_subnet_id]
  is_private                 = false
  shape_details {
    minimum_bandwidth_in_mbps = 10
    maximum_bandwidth_in_mbps = 400
  }
  defined_tags = {
    "Operations.Environment" = "production"
    "Operations.Application" = "orders-api"
  }
}
resource "oci_load_balancer_backend_set" "orders_api_backend" {
  name             = "orders-api-backend-set"
  load_balancer_id = oci_load_balancer.orders_lb.id
  policy           = "LEAST_CONNECTIONS"
  health_checker {
    protocol            = "HTTP"
    port                = 8080
    url_path            = "/health"
    interval_ms         = 10000
    timeout_in_millis   = 3000
    retries             = 3
    return_code         = 200
    response_body_regex = ".*\"status\":\"healthy\".*"
  }
  session_persistence_configuration {
    cookie_name      = "orders_session"
    disable_fallback = false
  }
}
resource "oci_load_balancer_listener" "orders_https" {
  load_balancer_id         = oci_load_balancer.orders_lb.id
  name                     = "orders-https-listener"
  default_backend_set_name = oci_load_balancer_backend_set.orders_api_backend.name
  port                     = 443
  protocol                 = "HTTP"
  ssl_configuration {
    certificate_name        = oci_load_balancer_certificate.orders_cert.certificate_name
    verify_peer_certificate = false
    protocols               = ["TLSv1.2", "TLSv1.3"]
    cipher_suite_name       = "oci-wider-compatible-ssl-cipher-suite-v1"
  }
  connection_configuration {
    idle_timeout_in_seconds = 60
  }
}

		

The health checker uses response_body_regex to validate the response body, not just the HTTP status code. Your /health endpoint should return a JSON payload that confirms the application is ready to serve traffic, not just that the process is running. A process can be alive but unable to connect to the database, which makes it unhealthy from a request-serving perspective even though it returns 200.
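A quick way to check an endpoint against this rule before deploying is to apply the same status and regex test locally. The sketch below mirrors the health_checker settings. The helper names and the payload fields other than status are assumptions.

```python
import json
import re

# Same pattern as response_body_regex in the backend set.
HEALTHY_BODY = re.compile(r'.*"status":"healthy".*')


def passes_health_check(status_code: int, body: str) -> bool:
    """Healthy only when the status is 200 and the body matches the regex."""
    return status_code == 200 and HEALTHY_BODY.search(body) is not None


def render_health(db_ok: bool) -> str:
    """Build a compact JSON body; the regex does not allow a space after the colon."""
    payload = {"status": "healthy" if db_ok else "degraded", "database": db_ok}
    return json.dumps(payload, separators=(",", ":"))


if __name__ == "__main__":
    print(passes_health_check(200, render_health(True)))
    print(passes_health_check(200, render_health(False)))
    print(passes_health_check(200, json.dumps({"status": "healthy"})))
```

The last call fails because json.dumps puts a space after the colon by default. Either emit compact JSON as render_health does, or loosen the regex to tolerate whitespace.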

Step 4: Metric-Based Autoscaling

The default metric-based autoscaling policy uses CPU utilization. This works for CPU-bound workloads but misses the mark for I/O-bound services where CPU stays low while request queues build up.

hcl

			
resource "oci_autoscaling_auto_scaling_configuration" "orders_api_asc" {
  compartment_id       = var.compartment_id
  display_name         = "orders-api-autoscaling"
  is_enabled           = true
  cool_down_in_seconds = 300
  auto_scaling_resources {
    id   = oci_core_instance_pool.orders_api_pool.id
    type = "instancePool"
  }
  policies {
    display_name = "cpu-scale-out"
    policy_type  = "threshold"
    capacity {
      initial = 2
      min     = 2
      max     = 20
    }
    rules {
      display_name = "scale-out-on-high-cpu"
      action {
        type  = "CHANGE_COUNT_BY"
        value = 2
      }
      metric {
        metric_type = "CPU_UTILIZATION"
        threshold {
          operator = "GT"
          value    = 75
        }
      }
    }
    rules {
      display_name = "scale-in-on-low-cpu"
      action {
        type  = "CHANGE_COUNT_BY"
        value = -1
      }
      metric {
        metric_type = "CPU_UTILIZATION"
        threshold {
          operator = "LT"
          value    = 25
        }
      }
    }
  }
}

		

The cool_down_in_seconds = 300 prevents the autoscaler from firing again within five minutes of the last scaling action. Without this, a sudden traffic spike triggers a scale-out, the new instances come online, CPU drops, a scale-in fires immediately, the instances terminate, CPU climbs again, and you get an oscillation loop. Five minutes gives new instances time to warm up and take on load before the next evaluation.

Scale out by 2, scale in by 1. Always scale out faster than you scale in. The cost of having one extra instance for a few minutes is trivial compared to the cost of serving degraded traffic because you removed capacity too aggressively.
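The interaction between the thresholds, the asymmetric step sizes, and the cooldown is easier to see in a small simulation. This is a simplified model of the policy above, not the autoscaling service's actual algorithm. PoolState and evaluate are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    size: int
    last_action_at: float = float("-inf")  # seconds; -inf means "never scaled"


def evaluate(state: PoolState, cpu_pct: float, now: float,
             cooldown_s: int = 300, min_size: int = 2, max_size: int = 20) -> int:
    """Run one evaluation, update the state in place, and return the pool size."""
    if now - state.last_action_at < cooldown_s:
        return state.size  # still cooling down: the metric is ignored
    if cpu_pct > 75:
        target = min(state.size + 2, max_size)  # scale out by 2
    elif cpu_pct < 25:
        target = max(state.size - 1, min_size)  # scale in by 1
    else:
        target = state.size
    if target != state.size:
        state.size, state.last_action_at = target, now
    return state.size


if __name__ == "__main__":
    pool = PoolState(size=2)
    # (seconds since start, average CPU %)
    for now, cpu in [(0, 90), (60, 10), (300, 10), (600, 10), (900, 10)]:
        print(now, cpu, evaluate(pool, cpu, now))
```

The low CPU reading at 60 seconds is ignored because the scale-out at time 0 is still inside the cooldown, which is exactly the oscillation the setting prevents.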

Step 5: Custom Metric Autoscaling

CPU-based scaling is not enough for most production services. A better signal is often active request queue depth or response latency percentile. If your application publishes custom metrics to OCI Monitoring, you can scale on those instead.

Here is how the application publishes a custom metric from Python:

python

			
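from datetime import datetime, timezone

import oci


def publish_queue_depth_metric(queue_depth: int, compartment_id: str):
    # from_file() reads ~/.oci/config; on a pool instance an instance principal
    # signer avoids storing API keys on the host.
    config = oci.config.from_file()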
    monitoring_client = oci.monitoring.MonitoringClient(
        config,
        service_endpoint="https://telemetry-ingestion.{}.oraclecloud.com".format(config["region"])
    )
    metric_data = oci.monitoring.models.PostMetricDataDetails(
        metric_data=[
            oci.monitoring.models.MetricDataDetails(
                namespace="custom_orders_api",
                compartment_id=compartment_id,
                name="RequestQueueDepth",
                dimensions={
                    "environment": "production",
                    "application": "orders-api"
                },
                datapoints=[
                    oci.monitoring.models.Datapoint(
                        timestamp=datetime.now(timezone.utc),
                        value=float(queue_depth)
                    )
                ],
                metadata={
                    "unit": "count",
                    "displayName": "Request Queue Depth"
                }
            )
        ]
    )
    response = monitoring_client.post_metric_data(
        post_metric_data_details=metric_data
    )
    return response.status

		

Call this function every 60 seconds from a background thread in your application. Once the metric appears in OCI Monitoring under the custom_orders_api namespace, you can create an autoscaling rule against it.
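One way to run the publisher on a fixed interval is a daemon thread that waits on an event, so it can be stopped cleanly at shutdown. This is a minimal sketch: start_periodic is a hypothetical helper, and the task passed in would be a wrapper around publish_queue_depth_metric.

```python
import threading
from typing import Callable


def start_periodic(task: Callable[[], None], interval_s: float) -> threading.Event:
    """Run `task` every `interval_s` seconds on a daemon thread.

    Returns an Event; call .set() on it to stop the loop.
    """
    stop = threading.Event()

    def _loop() -> None:
        while not stop.is_set():
            try:
                task()
            except Exception as exc:
                # A failed publish should not kill the loop; log and retry next tick.
                print(f"metric publish failed: {exc}")
            stop.wait(interval_s)  # returns early when stop is set

    threading.Thread(target=_loop, name="metric-publisher", daemon=True).start()
    return stop


if __name__ == "__main__":
    import time

    stop = start_periodic(lambda: print("publishing queue depth"), interval_s=1.0)
    time.sleep(3.5)
    stop.set()
```

Using stop.wait instead of time.sleep means shutdown does not have to wait out the remainder of a 60 second interval.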

OCI’s native autoscaling configuration only supports CPU_UTILIZATION and MEMORY_UTILIZATION as built-in metric types. To scale on a custom metric you pair OCI Monitoring alarms with an OCI Functions trigger that calls the Instance Pool resize API directly.

hcl

			
resource "oci_monitoring_alarm" "queue_depth_high" {
  compartment_id        = var.compartment_id
  display_name          = "orders-api-queue-depth-high"
  is_enabled            = true
  metric_compartment_id = var.compartment_id
  namespace             = "custom_orders_api"
  query                 = "RequestQueueDepth[1m]{environment = 'production'}.mean() > 500"
  severity              = "WARNING"
  pending_duration      = "PT2M"
  destinations          = [oci_ons_notification_topic.scaling_topic.id]
  body                  = "Queue depth exceeded 500 for 2 minutes. Scaling out instance pool."
}
resource "oci_monitoring_alarm" "queue_depth_low" {
  compartment_id        = var.compartment_id
  display_name          = "orders-api-queue-depth-low"
  is_enabled            = true
  metric_compartment_id = var.compartment_id
  namespace             = "custom_orders_api"
  query                 = "RequestQueueDepth[5m]{environment = 'production'}.mean() < 100"
  severity              = "INFO"
  pending_duration      = "PT10M"
  destinations          = [oci_ons_notification_topic.scaling_topic.id]
  body                  = "Queue depth below 100 for 10 minutes. Scaling in instance pool."
}
resource "oci_ons_notification_topic" "scaling_topic" {
  compartment_id = var.compartment_id
  name           = "orders-api-scaling-events"
  description    = "Triggers custom metric scaling function"
}
resource "oci_ons_subscription" "scaling_function_sub" {
  compartment_id = var.compartment_id
  topic_id       = oci_ons_notification_topic.scaling_topic.id
  protocol       = "ORACLE_FUNCTIONS"
  endpoint       = oci_functions_function.pool_scaler.id
}

		

The OCI Function that handles the scaling action:

python

			
import io
import json
import oci
import logging
logger = logging.getLogger()
POOL_ID       = "ocid1.instancepool.oc1..."
MIN_SIZE      = 2
MAX_SIZE      = 20
SCALE_OUT_BY  = 2
SCALE_IN_BY   = 1
def handler(ctx, data: io.BytesIO = None):
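    try:
        body = json.loads(data.getvalue())
        alarm_body = body.get("body", "")
        logger.info(f"Received alarm notification: {alarm_body}")
    except Exception as ex:
        logger.error(f"Failed to parse notification: {ex}")
        return
    # Alarms also notify when they clear; act only on the transition into firing.
    if body.get("type", "OK_TO_FIRING") != "OK_TO_FIRING":
        logger.info(f"Ignoring notification of type {body.get('type')}")
        return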
    signer = oci.auth.signers.get_resource_principals_signer()
    compute_mgmt = oci.core.ComputeManagementClient(config={}, signer=signer)
    pool = compute_mgmt.get_instance_pool(POOL_ID).data
    current_size = pool.size
    if "Scaling out" in alarm_body:
        new_size = min(current_size + SCALE_OUT_BY, MAX_SIZE)
        action = "scale-out"
    elif "Scaling in" in alarm_body:
        new_size = max(current_size - SCALE_IN_BY, MIN_SIZE)
        action = "scale-in"
    else:
        logger.info("Unrecognized alarm body, no action taken")
        return
    if new_size == current_size:
        logger.info(f"Already at {'max' if action == 'scale-out' else 'min'} size ({current_size}), no action")
        return
    update_details = oci.core.models.UpdateInstancePoolDetails(size=new_size)
    compute_mgmt.update_instance_pool(POOL_ID, update_details)
    logger.info(f"Pool resize triggered: {current_size} to {new_size} ({action})")

		

The function uses get_resource_principals_signer() to authenticate with the Dynamic Group policy. No credentials are stored in the function configuration.
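The sizing decision is worth keeping as a pure function so it can be unit tested without calling OCI. The sketch below restates the handler's logic in that form. plan_resize is a hypothetical name, and the type field check assumes alarm notifications carry values such as OK_TO_FIRING and FIRING_TO_OK.

```python
from typing import Optional

MIN_SIZE, MAX_SIZE = 2, 20
SCALE_OUT_BY, SCALE_IN_BY = 2, 1


def plan_resize(notification: dict, current_size: int) -> Optional[int]:
    """Return the new pool size, or None when nothing should change."""
    # Assumption: only a transition into the firing state should resize the pool;
    # the message sent when an alarm clears repeats the same body text.
    if notification.get("type", "OK_TO_FIRING") != "OK_TO_FIRING":
        return None
    body = notification.get("body", "")
    if "Scaling out" in body:
        new_size = min(current_size + SCALE_OUT_BY, MAX_SIZE)
    elif "Scaling in" in body:
        new_size = max(current_size - SCALE_IN_BY, MIN_SIZE)
    else:
        return None
    return None if new_size == current_size else new_size


if __name__ == "__main__":
    firing = {"type": "OK_TO_FIRING",
              "body": "Queue depth exceeded 500 for 2 minutes. Scaling out instance pool."}
    print(plan_resize(firing, 19))   # clamped at MAX_SIZE
    print(plan_resize(firing, 20))   # already at the maximum
```

Matching on the body text works but is brittle. Routing the two alarms to separate topics or functions is a sturdier design once the basic flow is proven.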

Step 6: Schedule-Based Scaling

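For workloads with predictable patterns, schedule-based policies set the pool's baseline ahead of demand. Business-hours applications can scale up before the working day starts and scale down after it ends, reducing idle capacity costs at night. An instance pool accepts only one autoscaling configuration, and a configuration holds either a single threshold policy or a set of scheduled policies, so the configuration below takes the place of the CPU-based one from Step 4 rather than running beside it. The alarm-driven function from Step 5 can still resize the pool between schedule points.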

hcl

			
resource "oci_autoscaling_auto_scaling_configuration" "orders_api_schedule" {
  compartment_id       = var.compartment_id
  display_name         = "orders-api-schedule-scaling"
  is_enabled           = true
  cool_down_in_seconds = 300
  auto_scaling_resources {
    id   = oci_core_instance_pool.orders_api_pool.id
    type = "instancePool"
  }
  policies {
    display_name = "business-hours-scale-up"
    policy_type  = "scheduled"
    execution_schedule {
      # Quartz-style cron: second minute hour day-of-month month day-of-week year
      expression = "0 0 4 ? * SUN-THU *" # 04:00 UTC = 07:00 in Riyadh (UTC+3)
      timezone   = "UTC"
      type       = "cron"
    }
    capacity {
      initial = 6
      min     = 6
      max     = 20
    }
  }
  policies {
    display_name = "after-hours-scale-down"
    policy_type  = "scheduled"
    execution_schedule {
      expression = "0 0 17 ? * SUN-THU *" # 17:00 UTC = 20:00 in Riyadh
      timezone   = "UTC"
      type       = "cron"
    }
    capacity {
      initial = 2
      min     = 2
      max     = 20
    }
  }
}

The autoscaling service expects Quartz-style cron expressions with a seconds field and evaluates them in UTC, so local times have to be converted. The expression 0 0 4 ? * SUN-THU * fires at 04:00 UTC, which is 07:00 in Riyadh (UTC+3, no daylight saving), Sunday through Thursday, covering the standard working week in the Gulf region. At 20:00 Riyadh time (17:00 UTC) the pool scales back to the minimum. The max remains at 20 in both schedules so the pool keeps headroom to grow beyond the scheduled minimum during peak periods.
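The combined effect of the two schedules can be written as a function of local time, which is a convenient way to sanity-check the intended baseline before applying it. This is a sketch of the intent, not of how the service evaluates schedules. scheduled_minimum is a hypothetical name and the timestamps are taken as Riyadh local time.

```python
from datetime import datetime

# Python's weekday(): Monday=0 ... Sunday=6, so Sunday through Thursday is:
WORKDAYS = {6, 0, 1, 2, 3}


def scheduled_minimum(local_time: datetime) -> int:
    """Minimum pool size implied by the two schedules at a given local time."""
    if local_time.weekday() in WORKDAYS and 7 <= local_time.hour < 20:
        return 6  # business-hours-scale-up is in effect
    return 2      # after-hours-scale-down is in effect


if __name__ == "__main__":
    for ts in (datetime(2026, 7, 12, 8, 0),    # Sunday morning
               datetime(2026, 7, 12, 21, 0),   # Sunday evening
               datetime(2026, 7, 10, 8, 0)):   # Friday morning
        print(ts.strftime("%a %H:%M"), scheduled_minimum(ts))
```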

Step 7: Rolling Instance Configuration Update

When you need to deploy a new application version, you update the Instance Configuration and then replace pool instances without taking the pool offline.

hcl

			
# Create new instance configuration with updated app version
resource "oci_core_instance_configuration" "app_instance_config_v2" {
  compartment_id = var.compartment_id
  display_name   = "orders-api-instance-config-v${var.new_app_version}"
  # Same configuration as v1 with updated user_data referencing new_app_version
  instance_details {
    instance_type = "compute"
    launch_details {
      # ... identical to v1 except user_data references new_app_version
    }
  }
}
# Update the pool to use the new configuration
resource "oci_core_instance_pool" "orders_api_pool" {
  instance_configuration_id = oci_core_instance_configuration.app_instance_config_v2.id
  # ... rest of pool config unchanged
}

		

Updating instance_configuration_id on the pool does not immediately replace running instances. Existing instances continue running with the old configuration. New instances added by scaling or manual pool resize use the new configuration. To replace all existing instances with the new version, trigger a rolling replacement using the OCI CLI:

bash

			
oci compute-management instance-pool-instance attach \
  --instance-pool-id <pool-ocid> \
  --instance-id <instance-ocid>
# Or use the softreset action to trigger a rolling replace
oci compute-management instance-pool softreset \
  --instance-pool-id <pool-ocid>

		

The softreset action replaces instances one at a time, waiting for each new instance to pass the load balancer health check before terminating the next old instance. Zero downtime rolling deploy without any orchestration tooling.

Validating the Autoscaling Behavior

Generate artificial CPU load to test the scale-out policy:

bash

			
# SSH into one of the pool instances via OCI Bastion
# Then stress the CPU
stress-ng --cpu 2 --cpu-load 90 --timeout 600

Watch the pool size change in real time:

bash

			
watch -n 10 'oci compute-management instance-pool get \
  --instance-pool-id <pool-ocid> \
  --query "data.{size:size, state:\"lifecycle-state\"}" \
  --output table'

List all instances currently in the pool with their health status:

bash

			
oci compute-management instance-pool-instance list \
  --instance-pool-id <pool-ocid> \
  --query 'data[*].{id:"id", state:"state", ad:"availability-domain", "fault-domain":"fault-domain"}' \
  --output table

List the autoscaling configurations in the compartment and confirm that each one is enabled:

bash

			
oci autoscaling auto-scaling-configuration list \
  --compartment-id <compartment-ocid> \
  --query 'data[*].{name:"display-name", enabled:"is-enabled"}' \
  --output table

Operational Notes

A few things that matter in production but are easy to miss.

The load balancer backend set policy is set to LEAST_CONNECTIONS. This distributes new connections to the instance with the fewest active connections rather than round-robin. For APIs with variable request duration, this prevents a slow request on one instance from causing it to accumulate a backlog while other instances are idle.
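The selection rule can be sketched in a few lines of Python. This is a toy model of least-connections routing, not the load balancer's implementation; the backend names and connection counts are made up for illustration.

```python
def pick_least_connections(active):
    """Return the backend with the fewest active connections (ties: first listed)."""
    return min(active, key=active.get)


def assign(active, new_connections):
    """Route new connections one at a time and return the resulting counts."""
    counts = dict(active)
    for _ in range(new_connections):
        counts[pick_least_connections(counts)] += 1
    return counts


# node-a is stuck on slow requests and already holds 12 connections.
before = {"node-a": 12, "node-b": 3, "node-c": 3}
print(assign(before, 4))
```

All four new connections go to node-b and node-c, so the busy node is left to drain. Round-robin would have sent at least one of them to node-a.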

Instance Configuration versioning in Terraform requires care. Step 7 declares the new version as a separate resource (app_instance_config_v2), and the display name carries the version number (orders-api-instance-config-v${var.app_version}), so Terraform creates a new configuration rather than modifying the existing one. This preserves the old configuration so you can roll back by pointing the pool back to the previous configuration OCID if the new version has problems.

The minimum pool size of 2 placed across two availability domains means you always have at least one instance in each AD. A complete outage of one availability domain still leaves the pool functional. Set your minimum to at least 2 and spread placement across ADs for any production workload.
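As a rough illustration of why the spread matters, the sketch below distributes instances evenly across availability domains and counts what is left after one AD fails. The even round-robin placement is an assumption made for the example; it is not a description of how the pool actually places instances.

```python
def spread_across_ads(size, ads):
    """Distribute `size` instances across the given ADs, one at a time in turn."""
    counts = {ad: 0 for ad in ads}
    for i in range(size):
        counts[ads[i % len(ads)]] += 1
    return counts


def survivors_after_outage(counts, failed_ad):
    """Count the instances that remain when one AD is lost."""
    return sum(n for ad, n in counts.items() if ad != failed_ad)


placement = spread_across_ads(2, ["AD-1", "AD-2"])
print(placement, survivors_after_outage(placement, "AD-1"))
```

With a minimum of 2 across two ADs, one instance survives the loss of either AD. With a minimum of 1, an outage in the wrong AD leaves nothing serving traffic until a replacement launches.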

Regards,
Osama



OCI GoldenGate: Building a Real-Time CDC Pipeline from On-Premises Oracle Database to Autonomous Database

Architecture

The pipeline has three components running in sequence.

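The Extract process connects to the source Oracle Database, reads the redo logs, and captures committed DML and DDL operations. It writes these to trail files inside the GoldenGate deployment.

The Distribution Path carries the captured changes from the on-premises network into OCI over an encrypted link, using FastConnect or a site-to-site VPN.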

The Replicat process reads trail files inside OCI GoldenGate and applies the transactions to Autonomous Database using the native OCI Autonomous Database connection.

			
On-Premises Oracle DB 19c
         |
    [GoldenGate Extract]  (reads redo logs)
         |
    [Distribution Path]   (encrypted, over FastConnect or VPN)
         |
    OCI GoldenGate Deployment
         |
    [Replicat Process]    (applies transactions)
         |
    OCI Autonomous Database

		

Prerequisites

Before starting you need the following in place.

On the source Oracle Database: supplemental logging must be enabled at the database level and at the table level for all tables being replicated. The GoldenGate Extract user needs SELECT on the tables, access to V$LOG and V$LOGFILE, and EXECUTE on DBMS_FLASHBACK.

On OCI: an Autonomous Database instance already provisioned, with the wallet downloaded and the admin password stored in OCI Vault.

Step 1: Source Database Preparation

Connect to the source Oracle Database as SYSDBA and run:

sql

			
-- Enable supplemental logging at the database level
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (UNIQUE) COLUMNS;
-- Verify supplemental logging is active
SELECT SUPPLEMENTAL_LOG_DATA_MIN,
       SUPPLEMENTAL_LOG_DATA_PK,
       SUPPLEMENTAL_LOG_DATA_UI
FROM V$DATABASE;
-- Create the GoldenGate capture user
CREATE USER ggadmin IDENTIFIED BY "YourStrongPassword123!";
GRANT CREATE SESSION TO ggadmin;
GRANT SELECT ANY DICTIONARY TO ggadmin;
GRANT SELECT ANY TABLE TO ggadmin;
GRANT FLASHBACK ANY TABLE TO ggadmin;
GRANT EXECUTE ON DBMS_FLASHBACK TO ggadmin;
GRANT SELECT ON SYS.V_$DATABASE TO ggadmin;
GRANT SELECT ON SYS.V_$LOG TO ggadmin;
GRANT SELECT ON SYS.V_$LOGFILE TO ggadmin;
GRANT SELECT ON SYS.V_$ARCHIVED_LOG TO ggadmin;
GRANT SELECT ON SYS.V_$LOG_HISTORY TO ggadmin;
GRANT SELECT ON SYS.V_$TRANSACTION TO ggadmin;
-- Enable supplemental logging on specific tables
-- Run this for each table you want to replicate
ALTER TABLE hr.employees ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER TABLE hr.departments ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER TABLE orders.order_header ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER TABLE orders.order_lines ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

		

The SUPPLEMENTAL LOG DATA (ALL) COLUMNS clause on each table is important. Without it, GoldenGate can only capture the changed column values, not the before-image values needed to reconstruct the full row state on the target for UPDATE and DELETE operations.
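When the table list is long, the per-table statements can be generated rather than typed. The following is a minimal sketch; the function name and the identifier check are my own additions for illustration, and the output should be reviewed before it is run against the source.

```python
import re

# Conservative check for simple unquoted Oracle identifiers.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_$#]*$")


def supplemental_log_statements(tables):
    """Build one ALTER TABLE statement per (schema, table) pair."""
    statements = []
    for schema, table in tables:
        for name in (schema, table):
            if not _IDENTIFIER.match(name):
                raise ValueError(f"unexpected identifier: {name!r}")
        statements.append(
            f"ALTER TABLE {schema}.{table} ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;"
        )
    return statements


for statement in supplemental_log_statements(
    [("hr", "employees"), ("orders", "order_lines")]
):
    print(statement)
```

Rejecting anything that is not a plain identifier keeps the generated SQL safe to paste into a script even if the table list comes from a file.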

Step 2: IAM Setup for OCI GoldenGate

			
resource "oci_identity_dynamic_group" "goldengate_dg" {
  compartment_id = var.tenancy_ocid
  name           = "goldengate-deployment-dg"
  description    = "Dynamic group for OCI GoldenGate deployments"
  matching_rule  = "All {resource.type = 'goldengatedeployment', resource.compartment.id = '${var.compartment_id}'}"
}
resource "oci_identity_policy" "goldengate_policy" {
  compartment_id = var.compartment_id
  name           = "goldengate-service-policy"
  description    = "Permissions for OCI GoldenGate to access required services"
  statements = [
    "Allow dynamic-group goldengate-deployment-dg to manage secret-family in compartment id ${var.compartment_id}",
    "Allow dynamic-group goldengate-deployment-dg to manage autonomous-database-family in compartment id ${var.compartment_id}",
    "Allow dynamic-group goldengate-deployment-dg to manage buckets in compartment id ${var.compartment_id}",
    "Allow dynamic-group goldengate-deployment-dg to manage objects in compartment id ${var.compartment_id}",
    "Allow service goldengate to manage autonomous-database-family in compartment id ${var.compartment_id}",
    "Allow service goldengate to read secret-family in compartment id ${var.compartment_id}"
  ]
}

		

Step 3: Store Credentials in OCI Vault

The GoldenGate deployment retrieves source and target database credentials from OCI Vault at runtime. Never pass credentials as plain text in deployment parameters.

			
resource "oci_vault_secret" "source_db_password" {
  compartment_id = var.compartment_id
  vault_id       = var.vault_id
  key_id         = var.vault_key_id
  secret_name    = "goldengate-source-db-password"
  secret_content {
    content_type = "BASE64"
    content      = base64encode(var.source_db_password)
  }
  defined_tags = {
    "Operations.ManagedBy"   = "terraform"
    "Operations.Environment" = "production"
  }
}
resource "oci_vault_secret" "target_adb_password" {
  compartment_id = var.compartment_id
  vault_id       = var.vault_id
  key_id         = var.vault_key_id
  secret_name    = "goldengate-target-adb-password"
  secret_content {
    content_type = "BASE64"
    content      = base64encode(var.target_adb_admin_password)
  }
}
resource "oci_vault_secret" "goldengate_admin_password" {
  compartment_id = var.compartment_id
  vault_id       = var.vault_id
  key_id         = var.vault_key_id
  secret_name    = "goldengate-admin-password"
  secret_content {
    content_type = "BASE64"
    content      = base64encode(var.goldengate_admin_password)
  }
}

		

Step 4: Deploy the OCI GoldenGate Instance

			
resource "oci_golden_gate_deployment" "cdc_deployment" {
  compartment_id          = var.compartment_id
  display_name            = "prod-cdc-pipeline"
  description             = "Real-time CDC from on-premises Oracle 19c to Autonomous Database"
  deployment_type         = "OGG"
  license_model           = "LICENSE_INCLUDED"
  subnet_id               = var.private_subnet_id
  is_auto_scaling_enabled = true
  cpu_core_count          = 2
  fqdn                    = "goldengate.internal.example.com"
  ogg_data {
    admin_username    = "oggadmin"
    admin_password_secret_id = oci_vault_secret.goldengate_admin_password.id
    deployment_name   = "cdc-prod"
  }
  maintenance_window {
    day        = "SUNDAY"
    start_hour = 2
  }
  defined_tags = {
    "Operations.Environment" = "production"
    "Operations.ManagedBy"   = "terraform"
  }
}
output "goldengate_console_url" {
  value       = oci_golden_gate_deployment.cdc_deployment.deployment_url
  description = "URL to access the GoldenGate administration console"
}

		

The is_auto_scaling_enabled = true flag allows GoldenGate to scale its CPU automatically under load. Set this to true for production deployments where replication lag during peak transaction periods is a concern.

The deployment takes 10 to 15 minutes to provision. Once it is in the ACTIVE lifecycle state, proceed to create the connections.
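If the next step is scripted, it helps to wait for the lifecycle state instead of sleeping for a fixed time. Here is a minimal polling sketch. The `get_state` callable is hypothetical: it stands in for whatever you use to read the deployment's lifecycle state, such as an SDK or CLI call.

```python
import time


def wait_for_state(get_state, target="ACTIVE", failed=("FAILED",),
                   max_checks=60, interval_s=30, sleep=time.sleep):
    """Poll get_state() until it returns `target`.

    Raises RuntimeError on a failed state and TimeoutError when the
    checks run out (60 checks x 30 s is 30 minutes by default).
    """
    for _ in range(max_checks):
        state = get_state()
        if state == target:
            return state
        if state in failed:
            raise RuntimeError(f"deployment entered state {state}")
        sleep(interval_s)
    raise TimeoutError(f"state did not reach {target} after {max_checks} checks")


# Demonstration with canned states and no real waiting.
states = iter(["CREATING", "CREATING", "ACTIVE"])
print(wait_for_state(lambda: next(states), sleep=lambda _: None))
```

Passing `sleep` as a parameter keeps the function testable without real delays.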

Step 5: Create the Source and Target Connections

OCI GoldenGate uses Connection resources that abstract the database credentials and connectivity details away from the pipeline configuration.

			
# Source: On-premises Oracle Database 19c
resource "oci_golden_gate_connection" "source_oracle_db" {
  compartment_id  = var.compartment_id
  display_name    = "source-oracle-19c"
  description     = "On-premises Oracle Database 19c source for CDC"
  connection_type = "ORACLE"
  technology_type = "ORACLE_DATABASE"
  username           = "ggadmin"
  password_secret_id = oci_vault_secret.source_db_password.id
  # Connection string using TNS format for on-premises database
  connection_string = "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=${var.source_db_host})(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=${var.source_db_service_name})))"
  # Subnet for private connectivity to on-premises network
  subnet_id = var.private_subnet_id
  nsg_ids = [var.goldengate_nsg_id]
  defined_tags = {
    "Operations.Environment" = "production"
    "Operations.ManagedBy"   = "terraform"
  }
}
# Target: OCI Autonomous Database
resource "oci_golden_gate_connection" "target_autonomous_db" {
  compartment_id  = var.compartment_id
  display_name    = "target-autonomous-db"
  description     = "OCI Autonomous Database target"
  connection_type = "ORACLE"
  technology_type = "ORACLE_AUTONOMOUS_DATABASE"
  username              = "admin"
  password_secret_id    = oci_vault_secret.target_adb_password.id
  database_id           = var.autonomous_database_id
  # Wallet for mTLS connection to Autonomous Database
  # (adb_wallet_secret is not declared in Step 3; define it alongside the other secrets)
  wallet_secret_id = oci_vault_secret.adb_wallet_secret.id
  subnet_id = var.private_subnet_id
  nsg_ids   = [var.goldengate_nsg_id]
}
# Assignment: no Terraform resource is declared here.
# Terraform manages the connection resources; they are attached to the deployment
# afterwards through the GoldenGate console or the REST API.

		

After the connections are created in Terraform, assign them to the deployment through the GoldenGate administration console or the REST API:

			
# Get the deployment console URL
CONSOLE_URL=$(terraform output -raw goldengate_console_url)
# Authenticate against the GoldenGate admin API
TOKEN=$(curl -s -X POST "${CONSOLE_URL}/services/v2/authtokens" \
  -H "Content-Type: application/json" \
  -d '{"username":"oggadmin","password":"'${GOLDENGATE_ADMIN_PASSWORD}'"}' \
  | jq -r '.authToken')
echo "GoldenGate auth token obtained: ${TOKEN:0:20}..."

		

Step 6: Configure the Extract Process

The Extract process connects to the source database and reads the redo logs. Configure it through the GoldenGate REST API.

			
# Create the Extract process via REST API
curl -s -X POST "${CONSOLE_URL}/services/v2/extracts" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "EXTSRC",
    "type": "INTEGRATED",
    "credentials": "source-oracle-19c",
    "beginTime": "NOW",
    "integrated": {
      "threadCount": 4,
      "maxSGASize": 1024,
      "logRetentionHours": 24
    }
  }'
# Add the source tables to the Extract
curl -s -X POST "${CONSOLE_URL}/services/v2/extracts/EXTSRC/tables" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "tables": [
      {"schema": "HR", "table": "EMPLOYEES"},
      {"schema": "HR", "table": "DEPARTMENTS"},
      {"schema": "ORDERS", "table": "ORDER_HEADER"},
      {"schema": "ORDERS", "table": "ORDER_LINES"}
    ]
  }'
# Add a trail file for the Extract output
curl -s -X POST "${CONSOLE_URL}/services/v2/extracts/EXTSRC/trailfiles" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "trailName": "lt",
    "trailSize": 500
  }'

		

Step 7: Configure the Replicat Process

The Replicat reads trail files and applies transactions to Autonomous Database. Use the PARALLEL Replicat type for better throughput on high-volume schemas.

			
# Create the Replicat process
curl -s -X POST "${CONSOLE_URL}/services/v2/replicats" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "REPADB",
    "type": "PARALLEL",
    "credentials": "target-autonomous-db",
    "trailFile": "lt",
    "parallel": {
      "threads": 4,
      "commitBatchSize": 1000,
      "maxTransactions": 100
    },
    "conflictResolution": {
      "updateMissing": "DISCARD",
      "deleteMissing": "DISCARD",
      "insertDuplicate": "UPDATE"
    }
  }'
# Map source tables to target tables
curl -s -X POST "${CONSOLE_URL}/services/v2/replicats/REPADB/mappings" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "mappings": [
      {
        "source": {"schema": "HR", "table": "EMPLOYEES"},
        "target": {"schema": "HR", "table": "EMPLOYEES"}
      },
      {
        "source": {"schema": "HR", "table": "DEPARTMENTS"},
        "target": {"schema": "HR", "table": "DEPARTMENTS"}
      },
      {
        "source": {"schema": "ORDERS", "table": "ORDER_HEADER"},
        "target": {"schema": "ORDERS", "table": "ORDER_HEADER"}
      },
      {
        "source": {"schema": "ORDERS", "table": "ORDER_LINES"},
        "target": {"schema": "ORDERS", "table": "ORDER_LINES"}
      }
    ]
  }'

		

The conflictResolution settings define what happens when the Replicat encounters a conflict between the incoming transaction and the current state of the target. insertDuplicate: UPDATE means if a row being inserted already exists on the target, update it instead of raising an error. This handles the case where your initial load and the CDC stream overlap and the same row arrives twice.
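The effect of these three settings can be shown with a toy model in Python, where the target table is a dictionary keyed by primary key. This is only a sketch of the rules as described above, not of how Replicat implements them.

```python
def apply_change(target, op, key, row=None):
    """Apply one change to an in-memory 'table' and report what happened.

    Mirrors insertDuplicate=UPDATE, updateMissing=DISCARD, deleteMissing=DISCARD.
    """
    exists = key in target
    if op == "INSERT":
        target[key] = row  # a duplicate insert overwrites the existing row
        return "updated" if exists else "inserted"
    if op == "UPDATE":
        if not exists:
            return "discarded"
        target[key] = row
        return "updated"
    if op == "DELETE":
        if not exists:
            return "discarded"
        del target[key]
        return "deleted"
    raise ValueError(f"unknown operation: {op}")


# Row 101 is already present from the initial load, then arrives again via CDC.
target = {101: {"name": "Ada"}}
print(apply_change(target, "INSERT", 101, {"name": "Ada L."}))  # prints "updated"
```

Discarding updates and deletes for missing rows keeps the apply process running, at the cost of silently skipping changes. It is worth reviewing the discard counts after the overlap period has passed.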

Step 8: Initial Load Before Starting CDC

Before starting the Extract and Replicat, you need to populate the target tables with the current data from the source. Do this with a coordinated approach that captures a consistent SCN from the source:

			
-- On the source database, capture the current SCN
-- This will be used as the starting point for the Extract
SELECT CURRENT_SCN FROM V$DATABASE;
-- Note this SCN: for example, 8573921
# Export data consistent as of this SCN using Data Pump (run from the shell)
expdp system/password@source_db \
  DIRECTORY=dp_dir \
  DUMPFILE=initial_load.dmp \
  LOGFILE=initial_load.log \
  SCHEMAS=HR,ORDERS \
  FLASHBACK_SCN=8573921
# Import into Autonomous Database (run from the shell)
impdp admin/password@adb_high \
  DIRECTORY=data_pump_dir \
  DUMPFILE=initial_load.dmp \
  LOGFILE=initial_load_import.log \
  REMAP_SCHEMA=HR:HR \
  REMAP_SCHEMA=ORDERS:ORDERS \
  TABLE_EXISTS_ACTION=REPLACE

		

Once the import completes, configure the Extract to start from the SCN you captured:

			
# Start the Extract from the SCN captured before the Data Pump export
curl -s -X POST "${CONSOLE_URL}/services/v2/extracts/EXTSRC/start" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "startPosition": {
      "type": "SCN",
      "scn": "8573921"
    }
  }'
# Start the Replicat
curl -s -X POST "${CONSOLE_URL}/services/v2/replicats/REPADB/start" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{}'

		

GoldenGate will begin reading from SCN 8573921 and apply every committed transaction that occurred after the Data Pump export snapshot. How quickly the target catches up depends on the transaction volume accumulated since the export; for a recent export it is a matter of minutes.
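The reason the SCN matters can be seen in a small model. The export contains everything committed up to the snapshot SCN, and the change stream supplies everything after it, so the two fit together without a gap. The sketch below assumes each change carries its commit SCN; the record layout and values are invented for illustration.

```python
def changes_after_snapshot(changes, snapshot_scn):
    """Keep only changes committed after the export snapshot, in SCN order."""
    newer = [change for change in changes if change["scn"] > snapshot_scn]
    return sorted(newer, key=lambda change: change["scn"])


captured = [
    {"scn": 8573950, "op": "UPDATE", "table": "HR.EMPLOYEES"},
    {"scn": 8573900, "op": "INSERT", "table": "ORDERS.ORDER_HEADER"},  # already in the export
    {"scn": 8573930, "op": "INSERT", "table": "ORDERS.ORDER_LINES"},
]
for change in changes_after_snapshot(captured, snapshot_scn=8573921):
    print(change["scn"], change["op"], change["table"])
```

Starting the Extract from a later SCN than the export would lose the changes in between. Starting earlier only replays rows the import already holds.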

Step 9: Monitoring the Pipeline

Check Extract and Replicat status and lag:

			
# Get all process statuses
curl -s "${CONSOLE_URL}/services/v2/extracts/EXTSRC" \
  -H "Authorization: Bearer ${TOKEN}" \
  | jq '{
      status: .status,
      lag: .lagSeconds,
      extractedSCN: .currentSCN,
      lastCheckpoint: .lastCheckpointTime
    }'
curl -s "${CONSOLE_URL}/services/v2/replicats/REPADB" \
  -H "Authorization: Bearer ${TOKEN}" \
  | jq '{
      status: .status,
      lag: .lagSeconds,
      appliedSCN: .currentSCN,
      appliedTransactions: .totalTransactions
    }'

		

A healthy pipeline shows lag in single-digit seconds for low to moderate transaction volumes. If lag starts growing, check the Replicat thread count, the commit batch size, and whether the Autonomous Database OCPUs are being throttled.
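A single lag reading says little; the trend is what matters. The sketch below classifies a short series of readings. The 10-second threshold reflects the "single-digit seconds" guideline above, while the three-sample window and the function name are assumptions chosen for the example.

```python
def lag_status(samples, healthy_below=10.0, window=3):
    """Classify lag readings in seconds, oldest first.

    Returns "growing" when the last `window` readings strictly increase,
    otherwise "healthy" or "elevated" based on the latest reading.
    """
    if not samples:
        raise ValueError("at least one lag sample is required")
    recent = samples[-window:]
    growing = len(recent) >= window and all(
        later > earlier for earlier, later in zip(recent, recent[1:])
    )
    if growing:
        return "growing"
    return "healthy" if samples[-1] < healthy_below else "elevated"


print(lag_status([2, 3, 2]))
print(lag_status([4, 9, 15]))
```

Treating a steady climb as its own state lets you start investigating before the lag crosses an alarm threshold.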

For production monitoring, create an OCI alarm against the GoldenGate service metrics:

			
resource "oci_monitoring_alarm" "goldengate_lag_alarm" {
  compartment_id        = var.compartment_id
  display_name          = "goldengate-high-lag"
  is_enabled            = true
  metric_compartment_id = var.compartment_id
  namespace             = "oci_goldengate"
  query                 = "ReplicatLag[5m]{deploymentName = 'cdc-prod'}.mean() > 60"
  severity              = "CRITICAL"
  pending_duration      = "PT5M"
  destinations          = [var.ops_notification_topic_id]
  body                  = "GoldenGate Replicat lag has exceeded 60 seconds for 5 minutes. Investigate pipeline health immediately."
}
resource "oci_monitoring_alarm" "goldengate_process_down" {
  compartment_id        = var.compartment_id
  display_name          = "goldengate-process-not-running"
  is_enabled            = true
  metric_compartment_id = var.compartment_id
  namespace             = "oci_goldengate"
  query                 = "ExtractStatus[1m]{deploymentName = 'cdc-prod'}.mean() < 1"
  severity              = "CRITICAL"
  pending_duration      = "PT2M"
  destinations          = [var.ops_notification_topic_id]
  body                  = "GoldenGate Extract process is not running. Replication has stopped."
}

		

Step 10: Cutover Procedure

When you are ready to cut over the application from the source to the target database, follow this sequence.

First, verify the Replicat lag is below five seconds and stable:

			
curl -s "${CONSOLE_URL}/services/v2/replicats/REPADB" \
  -H "Authorization: Bearer ${TOKEN}" \
  | jq '.lagSeconds'

Stop the application from writing to the source database. This can be done at the load balancer level by removing backend instances from the pool, or by switching the application to read-only mode if it supports it.

Wait for the Replicat lag to reach zero and confirm the applied SCN matches the source current SCN:

			
# Source current SCN
sqlplus -S system/password@source_db <<EOF
SELECT CURRENT_SCN FROM V\$DATABASE;
EOF
# Replicat applied SCN
curl -s "${CONSOLE_URL}/services/v2/replicats/REPADB" \
  -H "Authorization: Bearer ${TOKEN}" \
  | jq '.currentSCN'

		

When the SCNs match, the target is fully caught up. Update your application connection string to point to the Autonomous Database, bring the application back online, and stop the GoldenGate processes.
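The readiness check can be sketched as a small predicate plus a polling loop. It assumes the source SCN is captured once, after application writes have stopped, and treats the target as caught up when the applied SCN has reached that value. `read_replicat` is a hypothetical callable returning the current lag and applied SCN, and the SCN values are invented.

```python
import time


def ready_for_cutover(lag_seconds, source_scn, applied_scn):
    """True once lag is zero and the Replicat has applied up to the source SCN."""
    return lag_seconds == 0 and applied_scn >= source_scn


def wait_for_cutover(read_replicat, source_scn, max_checks=60,
                     interval_s=5, sleep=time.sleep):
    """Poll read_replicat() until the target has caught up, or give up."""
    for _ in range(max_checks):
        lag_seconds, applied_scn = read_replicat()
        if ready_for_cutover(lag_seconds, source_scn, applied_scn):
            return True
        sleep(interval_s)
    return False


# Demonstration with canned readings and no real waiting.
readings = iter([(4, 8600100), (1, 8600180), (0, 8600200)])
print(wait_for_cutover(lambda: next(readings), source_scn=8600200,
                       sleep=lambda _: None))
```

Returning False instead of raising lets the calling script decide whether to extend the window or abort the cutover.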

			
# Stop Extract and Replicat after successful cutover
curl -s -X POST "${CONSOLE_URL}/services/v2/extracts/EXTSRC/stop" \
  -H "Authorization: Bearer ${TOKEN}"
curl -s -X POST "${CONSOLE_URL}/services/v2/replicats/REPADB/stop" \
  -H "Authorization: Bearer ${TOKEN}"

		

When GoldenGate Is the Right Tool

GoldenGate adds operational complexity that a simple Data Pump export and import does not. It is the right tool when: the source database is too large for a maintenance window export to complete within business constraints, the application cannot tolerate more than a few seconds of downtime, or you need bidirectional replication where both source and target accept writes simultaneously during the transition.

For databases under a few hundred gigabytes with a maintenance window of a few hours, Data Pump with a brief application outage is simpler and carries less operational risk. For terabyte-scale Oracle databases with SLAs that do not permit extended downtime, GoldenGate CDC is the correct architecture.
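A back-of-the-envelope check makes the trade-off concrete. The sketch below estimates whether an export and import fit inside a maintenance window. The throughput figure and the safety factor are placeholders that you would replace with numbers measured in your own environment.

```python
def export_fits_window(db_size_gb, throughput_gb_per_hour, window_hours,
                       safety_factor=2.0):
    """Return (fits, estimated_hours) for a Data Pump style migration.

    safety_factor pads the estimate to cover the import, validation and retries.
    """
    estimated_hours = db_size_gb / throughput_gb_per_hour * safety_factor
    return estimated_hours <= window_hours, estimated_hours


# Hypothetical numbers: 200 GB/hour sustained throughput, 4-hour window.
print(export_fits_window(300, 200, 4))
print(export_fits_window(5000, 200, 4))
```

Under these assumed numbers a 300 GB database fits comfortably, while a 5 TB database needs about 50 hours, which is the situation where CDC earns its complexity.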

Regards,
Osama

OCI Instance Pools and Autoscaling: Building a Production-Grade Compute Scaling Architecture with Terraform

Posted on June 22, 2026 by Osama Mustafa in Uncategorized

How the Components Fit Together

Before writing any Terraform, the relationship between the three core resources matters.

Step 1: Instance Configuration

			
data "oci_core_images" "ol8_image" {
  compartment_id           = var.compartment_id
  operating_system         = "Oracle Linux"
  operating_system_version = "8"
  shape                    = "VM.Standard.E4.Flex"
  sort_by                  = "TIMECREATED"
  sort_order               = "DESC"
  filter {
    name   = "display_name"
    values = ["^.*Oracle-Linux-8.*$"]
    regex  = true
  }
}
resource "oci_core_instance_configuration" "app_instance_config" {
  compartment_id = var.compartment_id
  display_name   = "orders-api-instance-config-v${var.app_version}"
  instance_details {
    instance_type = "compute"
    launch_details {
      compartment_id = var.compartment_id
      display_name   = "orders-api-node"
      shape          = "VM.Standard.E4.Flex"
      shape_config {
        ocpus         = 2
        memory_in_gbs = 16
      }
      source_details {
        source_type             = "image"
        image_id                = data.oci_core_images.ol8_image.images[0].id
        boot_volume_size_in_gbs = 50
      }
      create_vnic_details {
        subnet_id             = var.app_subnet_id
        assign_public_ip      = false
        nsg_ids               = [var.app_nsg_id]
        hostname_label_prefix = "orders-api"
      }
      metadata = {
        ssh_authorized_keys = var.ssh_public_key
        user_data           = base64encode(templatefile("${path.module}/templates/cloud-init.yaml", {
          app_version        = var.app_version
          compartment_id     = var.compartment_id
          region             = var.region
          monitoring_enabled = "true"
        }))
      }
      defined_tags = {
        "Operations.Environment" = "production"
        "Operations.Application" = "orders-api"
        "Operations.ManagedBy"   = "terraform"
      }
    }
  }
}

		

The cloud-init template at templates/cloud-init.yaml:

			
#cloud-config
runcmd:
  # Install OCI Unified Monitoring Agent for custom metrics
  - dnf install -y oracle-cloud-agent
  - systemctl enable oracle-cloud-agent
  - systemctl start oracle-cloud-agent
  # Install the application
  - mkdir -p /opt/orders-api /etc/orders-api
  - dnf install -y python3.11 python3.11-pip
  - pip3.11 install orders-api==${app_version}
  # Configure the application
  - |
    cat > /etc/orders-api/config.yaml <<EOF
    environment: production
    compartment_id: ${compartment_id}
    region: ${region}
    metrics_namespace: custom_orders_api
    EOF
  # Start the application
  - systemctl enable orders-api
  - systemctl start orders-api
write_files:
  - path: /etc/systemd/system/orders-api.service
    content: |
      [Unit]
      Description=Orders API Service
      After=network.target
      [Service]
      Type=simple
      User=app
      ExecStart=/usr/local/bin/orders-api serve
      Restart=always
      RestartSec=5
      Environment=CONFIG_FILE=/etc/orders-api/config.yaml
      [Install]
      WantedBy=multi-user.target

		

Step 2: Instance Pool

			
resource "oci_core_instance_pool" "orders_api_pool" {
  compartment_id            = var.compartment_id
  instance_configuration_id = oci_core_instance_configuration.app_instance_config.id
  display_name              = "orders-api-pool"
  size                      = 2
  placement_configurations {
    availability_domain = data.oci_identity_availability_domains.ads.availability_domains[0].name
    primary_subnet_id   = var.app_subnet_id
    fault_domains       = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]
  }
  placement_configurations {
    availability_domain = data.oci_identity_availability_domains.ads.availability_domains[1].name
    primary_subnet_id   = var.app_subnet_id_ad2
    fault_domains       = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]
  }
  load_balancers {
    backend_set_name = oci_load_balancer_backend_set.orders_api_backend.name
    load_balancer_id = oci_load_balancer.orders_lb.id
    port             = 8080
    vnic_selection   = "PrimaryVnic"
  }
  defined_tags = {
    "Operations.Environment" = "production"
    "Operations.Application" = "orders-api"
  }
}

		

Step 3: Load Balancer and Health Check

			
resource "oci_load_balancer" "orders_lb" {
  compartment_id             = var.compartment_id
  display_name               = "orders-api-lb"
  shape                      = "flexible"
  subnet_ids                 = [var.public_subnet_id]
  is_private                 = false
  shape_details {
    minimum_bandwidth_in_mbps = 10
    maximum_bandwidth_in_mbps = 400
  }
  defined_tags = {
    "Operations.Environment" = "production"
    "Operations.Application" = "orders-api"
  }
}
resource "oci_load_balancer_backend_set" "orders_api_backend" {
  name             = "orders-api-backend-set"
  load_balancer_id = oci_load_balancer.orders_lb.id
  policy           = "LEAST_CONNECTIONS"
  health_checker {
    protocol            = "HTTP"
    port                = 8080
    url_path            = "/health"
    interval_ms         = 10000
    timeout_in_millis   = 3000
    retries             = 3
    return_code         = 200
    response_body_regex = ".*\"status\":\"healthy\".*"
  }
  session_persistence_configuration {
    cookie_name      = "orders_session"
    disable_fallback = false
  }
}
resource "oci_load_balancer_listener" "orders_https" {
  load_balancer_id         = oci_load_balancer.orders_lb.id
  name                     = "orders-https-listener"
  default_backend_set_name = oci_load_balancer_backend_set.orders_api_backend.name
  port                     = 443
  protocol                 = "HTTP"
  ssl_configuration {
    certificate_name        = oci_load_balancer_certificate.orders_cert.certificate_name
    verify_peer_certificate = false
    protocols               = ["TLSv1.2", "TLSv1.3"]
    cipher_suite_name       = "oci-wider-compatible-ssl-cipher-suite-v1"
  }
  connection_configuration {
    idle_timeout_in_seconds = 60
  }
}

		

Step 4: Metric-Based Autoscaling

The default metric-based autoscaling policy uses CPU utilization. This works for CPU-bound workloads but misses the mark for I/O-bound services where CPU stays low while request queues build up.

			
resource "oci_autoscaling_auto_scaling_configuration" "orders_api_asc" {
  compartment_id       = var.compartment_id
  display_name         = "orders-api-autoscaling"
  is_enabled           = true
  cool_down_in_seconds = 300
  auto_scaling_resources {
    id   = oci_core_instance_pool.orders_api_pool.id
    type = "instancePool"
  }
  policies {
    display_name = "cpu-scale-out"
    policy_type  = "threshold"
    capacity {
      initial = 2
      min     = 2
      max     = 20
    }
    rules {
      display_name = "scale-out-on-high-cpu"
      action {
        type  = "CHANGE_COUNT_BY"
        value = 2
      }
      metric {
        metric_type = "CPU_UTILIZATION"
        threshold {
          operator = "GT"
          value    = 75
        }
      }
    }
    rules {
      display_name = "scale-in-on-low-cpu"
      action {
        type  = "CHANGE_COUNT_BY"
        value = -1
      }
      metric {
        metric_type = "CPU_UTILIZATION"
        threshold {
          operator = "LT"
          value    = 25
        }
      }
    }
  }
}
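The two threshold rules reduce to a small decision function. The sketch below mirrors the values in the policy (scale out by 2 above 75% CPU, scale in by 1 below 25%, bounded between 2 and 20). It ignores the cooldown period and how the service aggregates CPU across the pool, so treat it as a way to reason about the rules rather than a model of the service.

```python
def next_pool_size(current, cpu_percent, minimum=2, maximum=20):
    """Return the pool size after one evaluation of the threshold rules."""
    if cpu_percent > 75:
        desired = current + 2  # scale-out-on-high-cpu
    elif cpu_percent < 25:
        desired = current - 1  # scale-in-on-low-cpu
    else:
        desired = current      # between thresholds: no change
    return max(minimum, min(maximum, desired))


print(next_pool_size(2, 90))
print(next_pool_size(19, 90))
print(next_pool_size(2, 10))
```

Scaling out by 2 and in by 1 is deliberately asymmetric: the pool reacts quickly to load and gives capacity back slowly.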

		

Step 5: Custom Metric Autoscaling

Here is how the application publishes a custom metric from Python:

python

			
import oci
from datetime import datetime, timezone
def publish_queue_depth_metric(queue_depth: int, compartment_id: str):
    # Post one RequestQueueDepth datapoint to OCI Monitoring and return the HTTP status
    config = oci.config.from_file()
    monitoring_client = oci.monitoring.MonitoringClient(
        config,
        service_endpoint="https://telemetry-ingestion.{}.oraclecloud.com".format(config["region"])
    )
    metric_data = oci.monitoring.models.PostMetricDataDetails(
        metric_data=[
            oci.monitoring.models.MetricDataDetails(
                namespace="custom_orders_api",
                compartment_id=compartment_id,
                name="RequestQueueDepth",
                dimensions={
                    "environment": "production",
                    "application": "orders-api"
                },
                datapoints=[
                    oci.monitoring.models.Datapoint(
                        timestamp=datetime.now(timezone.utc),
                        value=float(queue_depth)
                    )
                ],
                metadata={
                    "unit": "count",
                    "displayName": "Request Queue Depth"
                }
            )
        ]
    )
    response = monitoring_client.post_metric_data(
        post_metric_data_details=metric_data
    )
    return response.status

		

			
resource "oci_monitoring_alarm" "queue_depth_high" {
  compartment_id        = var.compartment_id
  display_name          = "orders-api-queue-depth-high"
  is_enabled            = true
  metric_compartment_id = var.compartment_id
  namespace             = "custom_orders_api"
  query                 = "RequestQueueDepth[1m]{environment = 'production'}.mean() > 500"
  severity              = "WARNING"
  pending_duration      = "PT2M"
  destinations          = [oci_ons_notification_topic.scaling_topic.id]
  body                  = "Queue depth exceeded 500 for 2 minutes. Scaling out instance pool."
}
resource "oci_monitoring_alarm" "queue_depth_low" {
  compartment_id        = var.compartment_id
  display_name          = "orders-api-queue-depth-low"
  is_enabled            = true
  metric_compartment_id = var.compartment_id
  namespace             = "custom_orders_api"
  query                 = "RequestQueueDepth[5m]{environment = 'production'}.mean() < 100"
  severity              = "INFO"
  pending_duration      = "PT10M"
  destinations          = [oci_ons_notification_topic.scaling_topic.id]
  body                  = "Queue depth below 100 for 10 minutes. Scaling in instance pool."
}
resource "oci_ons_notification_topic" "scaling_topic" {
  compartment_id = var.compartment_id
  name           = "orders-api-scaling-events"
  description    = "Triggers custom metric scaling function"
}
resource "oci_ons_subscription" "scaling_function_sub" {
  compartment_id = var.compartment_id
  topic_id       = oci_ons_notification_topic.scaling_topic.id
  protocol       = "ORACLE_FUNCTIONS"
  endpoint       = oci_functions_function.pool_scaler.id
}
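
To see how a trigger rule and pending_duration interact, here is a rough simulation. It is not OCI's evaluation engine: it assumes one aggregated value per evaluation interval and treats the alarm as firing once the condition has held for the required number of consecutive intervals. The function name and the sample values are made up for illustration.

```python
from typing import Callable, Optional, Sequence


def first_firing_index(
    values: Sequence[float],
    breaches: Callable[[float], bool],
    pending_intervals: int,
) -> Optional[int]:
    """Return the index at which a simulated alarm would start firing.

    values: one aggregated metric value per evaluation interval (assumption).
    breaches: predicate for the trigger rule, e.g. lambda v: v > 500.
    pending_intervals: consecutive breaching intervals required before firing.
    """
    consecutive = 0
    for index, value in enumerate(values):
        # Any non-breaching interval resets the streak.
        consecutive = consecutive + 1 if breaches(value) else 0
        if consecutive >= pending_intervals:
            return index
    return None


# Hypothetical per-minute means of RequestQueueDepth.
per_minute_mean = [300, 520, 480, 510, 530, 540]
fired_at = first_firing_index(per_minute_mean, lambda v: v > 500, pending_intervals=2)
print(fired_at)
```

In this sample the single spike in the second minute does not fire the alarm, because the next minute drops back under the threshold and resets the streak.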

		

The OCI Function that handles the scaling action:

			
import io
import json
import oci
import logging
logger = logging.getLogger()
POOL_ID       = "ocid1.instancepool.oc1..."
MIN_SIZE      = 2
MAX_SIZE      = 20
SCALE_OUT_BY  = 2
SCALE_IN_BY   = 1
def handler(ctx, data: io.BytesIO = None):
    try:
        body = json.loads(data.getvalue())
        alarm_body = body.get("body", "")
        logger.info(f"Received alarm notification: {alarm_body}")
    except Exception as ex:
        logger.error(f"Failed to parse notification: {ex}")
        return
    signer = oci.auth.signers.get_resource_principals_signer()
    compute_mgmt = oci.core.ComputeManagementClient(config={}, signer=signer)
    pool = compute_mgmt.get_instance_pool(POOL_ID).data
    current_size = pool.size
    if "Scaling out" in alarm_body:
        new_size = min(current_size + SCALE_OUT_BY, MAX_SIZE)
        action = "scale-out"
    elif "Scaling in" in alarm_body:
        new_size = max(current_size - SCALE_IN_BY, MIN_SIZE)
        action = "scale-in"
    else:
        logger.info("Unrecognized alarm body, no action taken")
        return
    if new_size == current_size:
        logger.info(f"Already at {'max' if action == 'scale-out' else 'min'} size ({current_size}), no action")
        return
    update_details = oci.core.models.UpdateInstancePoolDetails(size=new_size)
    compute_mgmt.update_instance_pool(POOL_ID, update_details)
    logger.info(f"Pool resize triggered: {current_size} to {new_size} ({action})")

		

The function uses get_resource_principals_signer() to authenticate as a resource principal, and the Dynamic Group policy grants that principal permission to manage the instance pool. No credentials are stored in the function configuration.
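
Because the handler mixes the sizing decision with OCI API calls, it is awkward to unit test. One option, offered as a suggestion and not as the original design, is to pull the decision into a pure function. decide_new_size is a hypothetical helper; the constants mirror the ones in the handler.

```python
from typing import Optional

MIN_SIZE = 2
MAX_SIZE = 20
SCALE_OUT_BY = 2
SCALE_IN_BY = 1


def decide_new_size(current_size: int, alarm_body: str) -> Optional[int]:
    """Return the target pool size, or None when no resize is needed."""
    if "Scaling out" in alarm_body:
        target = min(current_size + SCALE_OUT_BY, MAX_SIZE)
    elif "Scaling in" in alarm_body:
        target = max(current_size - SCALE_IN_BY, MIN_SIZE)
    else:
        return None  # unrecognized alarm body, take no action
    # Already at the bound: nothing to change.
    return None if target == current_size else target


print(decide_new_size(19, "Queue depth exceeded 500 for 2 minutes. Scaling out instance pool."))
print(decide_new_size(2, "Queue depth below 100 for 10 minutes. Scaling in instance pool."))
```

The first call is clamped to the maximum of 20; the second returns None because the pool is already at its minimum.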

Step 6: Schedule-Based Scaling

			
resource "oci_autoscaling_auto_scaling_configuration" "orders_api_schedule" {
  compartment_id       = var.compartment_id
  display_name         = "orders-api-schedule-scaling"
  is_enabled           = true
  cool_down_in_seconds = 300
  auto_scaling_resources {
    id   = oci_core_instance_pool.orders_api_pool.id
    type = "instancePool"
  }
  policies {
    display_name = "business-hours-scale-up"
    policy_type  = "scheduled"
    execution_schedule {
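      # OCI expects a Quartz-style cron expression and a UTC schedule.
      # 04:00 UTC is 07:00 in Riyadh (UTC+3), Sunday through Thursday.
      expression = "0 0 4 ? * SUN-THU *"
      timezone   = "UTC"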
      type       = "cron"
    }
    capacity {
      initial = 6
      min     = 6
      max     = 20
    }
  }
  policies {
    display_name = "after-hours-scale-down"
    policy_type  = "scheduled"
    execution_schedule {
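      # OCI expects a Quartz-style cron expression and a UTC schedule.
      # 17:00 UTC is 20:00 in Riyadh (UTC+3), Sunday through Thursday.
      expression = "0 0 17 ? * SUN-THU *"
      timezone   = "UTC"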
      type       = "cron"
    }
    capacity {
      initial = 2
      min     = 2
      max     = 20
    }
  }
}
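
As a quick way to reason about what the two policies should produce, here is a small sketch of the pool floor over the week. It assumes a naive datetime expressed in Riyadh local time, and that the floor only changes at the two scheduled times (07:00 and 20:00, Sunday through Thursday). expected_min_size is a hypothetical helper, not part of any OCI SDK.

```python
from datetime import datetime

# Python's weekday(): Monday=0 ... Sunday=6. The policies run Sunday-Thursday.
WORKING_DAYS = {6, 0, 1, 2, 3}
BUSINESS_MIN = 6     # capacity.min of business-hours-scale-up
AFTER_HOURS_MIN = 2  # capacity.min of after-hours-scale-down


def expected_min_size(local_time: datetime) -> int:
    """Return the pool floor expected at a given Riyadh local time."""
    in_business_hours = (
        local_time.weekday() in WORKING_DAYS and 7 <= local_time.hour < 20
    )
    return BUSINESS_MIN if in_business_hours else AFTER_HOURS_MIN


print(expected_min_size(datetime(2026, 7, 12, 8, 0)))   # a Sunday morning
print(expected_min_size(datetime(2026, 7, 10, 12, 0)))  # a Friday midday
```

A check like this is handy in a smoke test: compare the expected floor with the size the pool actually reports shortly after each scheduled time.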

		

Step 7: Rolling Instance Configuration Update

When you need to deploy a new application version, you update the Instance Configuration and then replace pool instances without taking the pool offline.

			
# Create new instance configuration with updated app version
resource "oci_core_instance_configuration" "app_instance_config_v2" {
  compartment_id = var.compartment_id
  display_name   = "orders-api-instance-config-v${var.new_app_version}"
  # Same configuration as v1 with updated user_data referencing new_app_version
  instance_details {
    instance_type = "compute"
    launch_details {
      # ... identical to v1 except user_data references new_app_version
    }
  }
}
# Update the pool to use the new configuration
resource "oci_core_instance_pool" "orders_api_pool" {
  instance_configuration_id = oci_core_instance_configuration.app_instance_config_v2.id
  # ... rest of pool config unchanged
}

		

			
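# Attach an existing instance to the pool
oci compute-management instance-pool-instance attach \
  --instance-pool-id <pool-ocid> \
  --instance-id <instance-ocid>
# softreset restarts every instance in the pool in place (ACPI shutdown, then power on).
# It does not recreate instances, so those launched from the old configuration
# still have to be terminated and replaced to pick up the new one.
oci compute-management instance-pool softreset \
  --instance-pool-id <pool-ocid>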
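
If you replace instances yourself instead of relying on a single pool-wide action, the main design decision is how many instances to take out at once. Below is a minimal sketch of that batching. The instance names and the max_unavailable parameter are hypothetical, and the code only plans the batches; it makes no OCI calls.

```python
from typing import Iterator, List, Sequence


def plan_rolling_batches(
    instance_ids: Sequence[str], max_unavailable: int
) -> Iterator[List[str]]:
    """Yield batches of instances to replace, at most max_unavailable at a time."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    for start in range(0, len(instance_ids), max_unavailable):
        yield list(instance_ids[start:start + max_unavailable])


pool = ["instance-a", "instance-b", "instance-c", "instance-d", "instance-e"]
for batch in plan_rolling_batches(pool, max_unavailable=2):
    # In a real rollout you would replace the batch, then wait for health checks.
    print(batch)
```

Keeping max_unavailable small relative to the pool size trades rollout speed for spare serving capacity during the update.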

		

Validating the Autoscaling Behavior

Generate artificial CPU load to test the scale-out policy:

			
# SSH into one of the pool instances via OCI Bastion
# Then stress the CPU
stress-ng --cpu 2 --cpu-load 90 --timeout 600

Watch the pool size change in real time:

			
watch -n 10 'oci compute-management instance-pool get \
  --instance-pool-id <pool-ocid> \
  --query "data.{size:size, state:\"lifecycle-state\"}" \
  --output table'

List all instances currently in the pool with their health status:

			
oci compute-management instance-pool-instance list \
  --instance-pool-id <pool-ocid> \
  --query 'data[*].{id:"id", state:"state", ad:"availability-domain", fault_domain:"fault-domain"}' \
  --output table

List the autoscaling configurations in the compartment to confirm that each one is enabled:

			
oci autoscaling auto-scaling-configuration list \
  --compartment-id <compartment-ocid> \
  --query 'data[*].{name:"display-name", enabled:"is-enabled"}' \
  --output table

Operational Notes

A few things that matter in production but are easy to miss.

Regards,
Osama

OCI DevOps: Building a Production CI/CD Pipeline with Terraform

Posted on June 16, 2026 by Osama Mustafa in Uncategorized

Service Architecture

OCI DevOps has five main components that work together.

The Project is the top-level container. It groups all related resources: code repositories, build pipelines, deployment pipelines, and environments.

Code Repositories mirror external Git repositories (GitHub, GitLab, Bitbucket) or host code natively inside OCI. Mirroring syncs on a schedule or on webhook trigger.

Build Pipelines execute build stages: managed build (runs your build spec on Oracle-managed runners), deliver artifact (pushes to Container Registry or Artifact Registry), and trigger deployment.

Artifact Registry stores generic versioned artifacts: Helm charts, Terraform modules, JAR files, and deployment manifests.

Deployment Pipelines run the actual deployment to a target environment. They support blue-green, canary, and rolling deployment strategies with built-in approval gates.

Step 1: IAM Setup

OCI DevOps needs a Dynamic Group that matches the build and deployment pipeline resources, and a policy that grants them the permissions to do their work.

			
resource "oci_identity_dynamic_group" "devops_build_dg" {
  compartment_id = var.tenancy_ocid
  name           = "devops-build-pipelines"
  description    = "Dynamic group for OCI DevOps build pipeline runners"
  matching_rule  = "All {resource.type = 'devopsbuildpipeline', resource.compartment.id = '${var.compartment_id}'}"
}
resource "oci_identity_dynamic_group" "devops_deploy_dg" {
  compartment_id = var.tenancy_ocid
  name           = "devops-deploy-pipelines"
  description    = "Dynamic group for OCI DevOps deployment pipelines"
  matching_rule  = "All {resource.type = 'devopsdeploypipeline', resource.compartment.id = '${var.compartment_id}'}"
}
resource "oci_identity_policy" "devops_policy" {
  compartment_id = var.compartment_id
  name           = "devops-pipeline-policy"
  description    = "Permissions for OCI DevOps build and deploy pipelines"
  statements = [
    # Build pipelines need to read secrets and push to container registry
    "Allow dynamic-group devops-build-pipelines to manage repos in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-build-pipelines to read secret-family in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-build-pipelines to manage artifacts in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-build-pipelines to manage devops-family in compartment id ${var.compartment_id}",
    # Deploy pipelines need to manage OKE workloads and read artifacts
    "Allow dynamic-group devops-deploy-pipelines to manage cluster-family in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-deploy-pipelines to use artifacts in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-deploy-pipelines to manage devops-family in compartment id ${var.compartment_id}",
    "Allow dynamic-group devops-deploy-pipelines to read secret-family in compartment id ${var.compartment_id}"
  ]
}

		

Step 2: Create the DevOps Project

			
resource "oci_devops_project" "orders_api_project" {
  compartment_id = var.compartment_id
  name           = "orders-api"
  description    = "CI/CD pipeline for the orders API service"
  notification_config {
    topic_id = oci_ons_notification_topic.devops_alerts.id
  }
  defined_tags = {
    "Operations.Environment" = "production"
    "Operations.ManagedBy"   = "terraform"
  }
}
resource "oci_ons_notification_topic" "devops_alerts" {
  compartment_id = var.compartment_id
  name           = "devops-pipeline-alerts"
  description    = "Notifications for DevOps pipeline events"
}
resource "oci_ons_subscription" "devops_email" {
  compartment_id = var.compartment_id
  topic_id       = oci_ons_notification_topic.devops_alerts.id
  protocol       = "EMAIL"
  endpoint       = var.devops_alert_email
}

		

Step 3: Mirror the GitHub Repository

			
resource "oci_devops_repository" "orders_api_repo" {
  project_id      = oci_devops_project.orders_api_project.id
  name            = "orders-api"
  description     = "Mirror of GitHub orders-api repository"
  repository_type = "MIRRORED"
  default_branch  = "main"
  mirror_repository_config {
    repository_url    = "https://github.com/your-org/orders-api.git"
    connector_id      = oci_devops_connection.github_connection.id
    trigger_schedule {
      schedule_type   = "CUSTOM"
      # RFC 5545 recurrence rule: sync every six hours
      custom_schedule = "FREQ=HOURLY;INTERVAL=6"
    }
  }
}
resource "oci_devops_connection" "github_connection" {
  project_id      = oci_devops_project.orders_api_project.id
  display_name    = "github-connection"
  connection_type = "GITHUB_ACCESS_TOKEN"
  description     = "Connection to GitHub using PAT stored in OCI Vault"
  access_token    = oci_vault_secret.github_pat.id
}
resource "oci_vault_secret" "github_pat" {
  compartment_id = var.compartment_id
  vault_id       = var.vault_id
  key_id         = var.vault_key_id
  secret_name    = "github-pat-devops"
  secret_content {
    content_type = "BASE64"
    content      = base64encode(var.github_personal_access_token)
  }
}

		

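The GitHub PAT is stored in OCI Vault, so it never appears in pipeline configuration or as an environment variable on a CI runner. The DevOps connection reads the secret at runtime using the Dynamic Group policy. Note that Terraform still receives the token through var.github_personal_access_token, so mark that variable as sensitive and protect the state file.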

Step 4: Build Spec

The build spec is a YAML file committed to your repository at build_spec.yaml. It defines the steps the managed build runner executes.

			
version: 0.1
component: build
timeoutInSeconds: 1800
env:
  exportedVariables:
    - BUILDRUN_HASH
steps:
  - type: Command
    name: Set build hash
    command: |
      export BUILDRUN_HASH=$(echo ${OCI_BUILD_RUN_ID} | tail -c 8)
      echo "BUILDRUN_HASH: ${BUILDRUN_HASH}"
  - type: Command
    name: Install dependencies
    command: |
      cd orders-api
      pip install -r requirements.txt --quiet
  - type: Command
    name: Run unit tests
    command: |
      cd orders-api
      python -m pytest tests/unit/ -v --tb=short --junitxml=test-results.xml
      if [ $? -ne 0 ]; then
        echo "Unit tests failed. Aborting build."
        exit 1
      fi
  - type: Command
    name: Run security scan
    command: |
      pip install bandit --quiet
      cd orders-api
      bandit -r src/ -f json -o bandit-report.json -lll
      if [ $? -eq 1 ]; then
        echo "High severity security issues found. Aborting build."
        exit 1
      fi
  - type: Command
    name: Build container image
    command: |
      cd orders-api
      IMAGE_TAG="${CONTAINER_REGISTRY}/${NAMESPACE}/orders-api:${BUILDRUN_HASH}"
      docker build -t orders-api:latest -t ${IMAGE_TAG} .
      echo "IMAGE_TAG=${IMAGE_TAG}" >> ${OCI_PRIMARY_SOURCE_DIR}/build_output.env
  - type: Command
    name: Push image to OCI Container Registry
    command: |
      docker push ${IMAGE_TAG}
outputArtifacts:
  - name: orders-api-image
    type: DOCKER_IMAGE
    location: ${IMAGE_TAG}
  - name: kubernetes-manifests
    type: BINARY
    location: ${OCI_PRIMARY_SOURCE_DIR}/orders-api/k8s/

		

The security scan step uses Bandit to flag high-severity Python security issues and fails the build if any are found. This happens before the image is built, not after.
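
Relying on Bandit's exit code works, but if you want the gate to be explicit you could inspect bandit-report.json instead. The sketch below assumes the report has a top-level results list whose entries carry an issue_severity field. The helper name and the sample data are invented for illustration.

```python
import json
from typing import Any, Dict, List


def high_severity_findings(report: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the findings in a Bandit JSON report marked HIGH severity."""
    return [
        finding
        for finding in report.get("results", [])
        if finding.get("issue_severity") == "HIGH"
    ]


# Hypothetical, trimmed-down report; a real one is produced by `bandit -f json`.
sample_report = json.loads("""
{
  "results": [
    {"filename": "src/db.py", "issue_severity": "HIGH", "issue_text": "example finding"},
    {"filename": "src/util.py", "issue_severity": "LOW", "issue_text": "example finding"}
  ]
}
""")

findings = high_severity_findings(sample_report)
print(len(findings))
```

A build step could load the real report the same way and exit non-zero whenever the returned list is not empty.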

Step 5: Build Pipeline

			
resource "oci_devops_build_pipeline" "orders_api_build" {
  project_id   = oci_devops_project.orders_api_project.id
  display_name = "orders-api-build"
  description  = "Build, test, scan, and push the orders API container image"
  build_pipeline_parameters {
    items {
      name          = "CONTAINER_REGISTRY"
      default_value = "${var.oci_region_key}.ocir.io"
      description   = "OCI Container Registry endpoint"
    }
    items {
      name          = "NAMESPACE"
      default_value = var.tenancy_namespace
      description   = "OCI tenancy namespace for Container Registry"
    }
  }
}
# Stage 1: Managed Build
resource "oci_devops_build_pipeline_stage" "managed_build" {
  build_pipeline_id = oci_devops_build_pipeline.orders_api_build.id
  display_name      = "managed-build"
  description       = "Execute build spec on managed runner"
  build_pipeline_stage_type = "BUILD"
  build_spec_file                    = "build_spec.yaml"
  stage_execution_timeout_in_seconds = 1800
  image                              = "OL7_X86_64_STANDARD_10"
  build_source_collection {
    items {
      connection_type = "DEVOPS_CODE_REPOSITORY"
      repository_id   = oci_devops_repository.orders_api_repo.id
      name            = "orders-api"
      branch          = "main"
      repository_url  = oci_devops_repository.orders_api_repo.http_url
    }
  }
  build_pipeline_stage_predecessor_collection {
    items {
      id = oci_devops_build_pipeline.orders_api_build.id
    }
  }
}
# Stage 2: Deliver Artifact to Container Registry
resource "oci_devops_build_pipeline_stage" "deliver_artifact" {
  build_pipeline_id = oci_devops_build_pipeline.orders_api_build.id
  display_name      = "deliver-artifact"
  description       = "Push built image to OCI Container Registry"
  build_pipeline_stage_type = "DELIVER_ARTIFACT"
  deliver_artifact_collection {
    items {
      artifact_name = "orders-api-image"
      artifact_id   = oci_devops_deploy_artifact.orders_api_image.id
    }
    items {
      artifact_name = "kubernetes-manifests"
      artifact_id   = oci_devops_deploy_artifact.k8s_manifests.id
    }
  }
  build_pipeline_stage_predecessor_collection {
    items {
      id = oci_devops_build_pipeline_stage.managed_build.id
    }
  }
}
# Stage 3: Trigger Deployment Pipeline
resource "oci_devops_build_pipeline_stage" "trigger_deploy" {
  build_pipeline_id = oci_devops_build_pipeline.orders_api_build.id
  display_name      = "trigger-deployment"
  description       = "Trigger the deployment pipeline on successful build"
  build_pipeline_stage_type = "TRIGGER_DEPLOYMENT_PIPELINE"
  deploy_pipeline_id        = oci_devops_deploy_pipeline.orders_api_deploy.id
  is_pass_all_parameters_enabled = true
  build_pipeline_stage_predecessor_collection {
    items {
      id = oci_devops_build_pipeline_stage.deliver_artifact.id
    }
  }
}
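
The predecessor collections are what define stage order: the first stage names the pipeline itself as its predecessor, and each later stage names the stage before it. As an illustration only (this is not how OCI evaluates the graph internally), the same relationships can be resolved into an execution order with Python's standard graphlib module. The keys reuse the display names from the Terraform above.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages (or the pipeline itself) it waits on.
predecessors = {
    "managed-build": {"orders-api-build"},  # first stage points at the pipeline
    "deliver-artifact": {"managed-build"},
    "trigger-deployment": {"deliver-artifact"},
}

execution_order = list(TopologicalSorter(predecessors).static_order())
print(execution_order)
```

Two stages that share the same predecessor would have no fixed order between them, which is the case where a pipeline can run stages in parallel.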

		

Step 6: Artifact Registry

			
resource "oci_artifacts_repository" "k8s_manifests_repo" {
  compartment_id  = var.compartment_id
  display_name    = "orders-api-manifests"
  description     = "Kubernetes deployment manifests for orders API"
  is_immutable    = false
  repository_type = "GENERIC"
}
resource "oci_devops_deploy_artifact" "orders_api_image" {
  project_id             = oci_devops_project.orders_api_project.id
  display_name           = "orders-api-container-image"
  argument_substitution_mode = "SUBSTITUTE_PLACEHOLDERS"
  deploy_artifact_type   = "DOCKER_IMAGE"
  deploy_artifact_source {
    deploy_artifact_source_type = "OCIR"
    image_uri    = "${var.oci_region_key}.ocir.io/${var.tenancy_namespace}/orders-api:$${BUILDRUN_HASH}"
    image_digest = " "
  }
}
resource "oci_devops_deploy_artifact" "k8s_manifests" {
  project_id             = oci_devops_project.orders_api_project.id
  display_name           = "orders-api-k8s-manifests"
  argument_substitution_mode = "SUBSTITUTE_PLACEHOLDERS"
  deploy_artifact_type   = "KUBERNETES_MANIFEST"
  deploy_artifact_source {
    deploy_artifact_source_type = "GENERIC_ARTIFACT"
    repository_id  = oci_artifacts_repository.k8s_manifests_repo.id
    deploy_artifact_path    = "k8s/deployment.yaml"
    deploy_artifact_version = "$${BUILDRUN_HASH}"
  }
}

		

Step 7: Deployment Environment and Pipeline

The deployment pipeline targets the OKE cluster. Define the environment first, then the pipeline stages.

			
resource "oci_devops_deploy_environment" "oke_prod" {
  project_id              = oci_devops_project.orders_api_project.id
  display_name            = "oke-production"
  description             = "Production OKE cluster"
  deploy_environment_type = "OKE_CLUSTER"
  cluster_id              = var.oke_cluster_id
}
resource "oci_devops_deploy_pipeline" "orders_api_deploy" {
  project_id   = oci_devops_project.orders_api_project.id
  display_name = "orders-api-deploy"
  description  = "Blue-green deployment of orders API to production OKE"
  deploy_pipeline_parameters {
    items {
      name          = "NAMESPACE"
      default_value = "orders"
      description   = "Kubernetes namespace for the deployment"
    }
    items {
      name          = "IMAGE_TAG"
      default_value = "latest"
      description   = "Container image tag to deploy"
    }
  }
}
# Stage 1: Approval gate before production deployment
resource "oci_devops_deploy_stage" "approval_gate" {
  deploy_pipeline_id = oci_devops_deploy_pipeline.orders_api_deploy.id
  display_name       = "production-approval"
  description        = "Manual approval required before deploying to production"
  deploy_stage_type               = "MANUAL_APPROVAL"
  approval_policy {
    approval_policy_type         = "COUNT_BASED_APPROVAL"
    number_of_approvals_required = 1
  }
  deploy_stage_predecessor_collection {
    items {
      id = oci_devops_deploy_pipeline.orders_api_deploy.id
    }
  }
}
# Stage 2: Blue-green deploy to OKE
resource "oci_devops_deploy_stage" "oke_blue_green_deploy" {
  deploy_pipeline_id = oci_devops_deploy_pipeline.orders_api_deploy.id
  display_name       = "oke-blue-green-deploy"
  description        = "Deploy new version to green environment"
  deploy_stage_type = "OKE_BLUE_GREEN_DEPLOYMENT"
  kubernetes_manifest_deploy_artifact_ids = [
    oci_devops_deploy_artifact.k8s_manifests.id
  ]
  oke_cluster_deploy_environment_id = oci_devops_deploy_environment.oke_prod.id
  blue_green_strategy {
    strategy_type = "NGINX_INGRESS_STRATEGY"
    namespace_a   = "orders-blue"
    namespace_b   = "orders-green"
    ingress_name  = "orders-api-ingress"
  }
  deploy_stage_predecessor_collection {
    items {
      id = oci_devops_deploy_stage.approval_gate.id
    }
  }
}
# Stage 3: Traffic shift after successful deployment validation
resource "oci_devops_deploy_stage" "traffic_shift" {
  deploy_pipeline_id = oci_devops_deploy_pipeline.orders_api_deploy.id
  display_name       = "shift-traffic-to-green"
  description        = "Shift 100% of traffic to the newly deployed green environment"
  deploy_stage_type = "OKE_BLUE_GREEN_TRAFFIC_SHIFT"
  oke_blue_green_deploy_stage_id = oci_devops_deploy_stage.oke_blue_green_deploy.id
  deploy_stage_predecessor_collection {
    items {
      id = oci_devops_deploy_stage.oke_blue_green_deploy.id
    }
  }
}

		

Step 8: Trigger on Code Push

The trigger watches the mirrored repository and fires the build pipeline when a push lands on the main branch.

			
resource "oci_devops_trigger" "main_branch_push" {
  project_id     = oci_devops_project.orders_api_project.id
  display_name   = "main-branch-push-trigger"
  description    = "Trigger build pipeline on every push to main"
  trigger_source = "DEVOPS_CODE_REPOSITORY"
  repository_id  = oci_devops_repository.orders_api_repo.id
  actions {
    type        = "TRIGGER_BUILD_PIPELINE"
    build_pipeline_id = oci_devops_build_pipeline.orders_api_build.id
    filter {
      trigger_source = "DEVOPS_CODE_REPOSITORY"
      events         = ["PUSH"]
      include {
        head_ref = "main"
      }
      exclude {
        file_filter {
          file_paths = ["docs/*", "*.md", ".github/*"]
        }
      }
    }
  }
}

		

The exclude block prevents documentation-only changes from triggering a full build and deploy. Pushing a README update does not kick off the pipeline.
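
The intent of the filter can be sketched in a few lines of Python. This is an approximation using the standard fnmatch module. OCI's own glob matching may differ in detail, for example in whether * crosses directory separators, so treat it as a way to reason about the rule and not as a reimplementation.

```python
from fnmatch import fnmatch
from typing import Iterable

EXCLUDED_PATTERNS = ["docs/*", "*.md", ".github/*"]


def is_excluded(path: str) -> bool:
    """True when the path matches any excluded pattern."""
    return any(fnmatch(path, pattern) for pattern in EXCLUDED_PATTERNS)


def should_trigger_build(changed_files: Iterable[str]) -> bool:
    """Trigger only when at least one changed file is not excluded."""
    return any(not is_excluded(path) for path in changed_files)


print(should_trigger_build(["README.md", "docs/setup.txt"]))
print(should_trigger_build(["README.md", "src/app.py"]))
```

The first push touches only excluded paths, so it is skipped; the second includes a source file, so it triggers a build.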

Step 9: Verifying the Pipeline

Once Terraform applies, validate the end-to-end flow.

Check mirror sync status:

			
oci devops repository get \
  --repository-id <your-repo-ocid> \
  --query 'data.{name:name, mirror_status:"mirror-repository-config"}' \
  --output table

Manually trigger a build to test without waiting for a push:

			
oci devops build-run create \
  --build-pipeline-id <your-build-pipeline-ocid> \
  --display-name "manual-validation-run" \
  --build-run-arguments '{"items": [{"name": "IMAGE_TAG", "value": "validation-test"}]}'

Watch the build run progress:

			
oci devops build-run get \
  --build-run-id <build-run-ocid> \
  --query 'data.{status:"lifecycle-state", phase:"build-run-progress"."build-pipeline-stage-run-progress"}' \
  --output table

List deployment history to confirm deployments are being tracked:

			
oci devops deployment list \
  --project-id <project-ocid> \
  --sort-by timeCreated \
  --sort-order DESC \
  --limit 10 \
  --query 'data.items[*].{name:"display-name", status:"lifecycle-state", time:"time-created"}' \
  --output table

		

Rollback

			
oci devops deployment approve \
  --deployment-id <deployment-ocid> \
  --deploy-stage-id <traffic-shift-stage-ocid> \
  --reason "Rolling back: latency regression detected in green environment" \
  --action REJECT

		

The green environment is torn down, the blue environment continues serving traffic, and the deployment is marked as failed with the reason recorded in the audit log.

Where This Fits in a Real Team

For teams already running everything inside OCI, OCI DevOps is the most operationally coherent choice.

Regards,
Osama

Orchestrating Production Workflows with AWS Step Functions

Posted on June 1, 2026 by Osama Mustafa in AWS, Cloud

I want to tell you about a production incident that still bothers me.

We had a payment processing system built on Lambda. Each function did one thing: validate the card, charge the customer, update the order, send the receipt, trigger fulfillment. Clean separation of concerns. Looked great on paper.

Then a Lambda timed out in the middle of the charge step. The card had been charged. The order had not been updated. The receipt never went out. Fulfillment never started. And because there was no central record of what had run, we had no way to resume from where things broke. We ended up with a manual cleanup process, a refund, and an angry customer.

The root problem was not the timeout. The root problem was that we had orchestration logic scattered across function calls, SQS queues, and environment variables. When something went wrong, we had no visibility and no way to recover cleanly.

AWS Step Functions exists to solve exactly this problem. It gives you a managed, visual, stateful orchestration layer that sits above your compute. In this article I will walk you through how Step Functions actually works, the patterns that matter in production, and the mistakes I see teams make when they first adopt it.

What Step Functions Actually Does

Step Functions is a serverless orchestration service. You define a workflow as a state machine using Amazon States Language, a JSON-based specification. Each state in the machine can invoke a Lambda function, call an AWS service directly, wait for a human approval, run a parallel branch, or retry on failure with configurable backoff.

The key thing that separates Step Functions from gluing Lambdas together with SQS is that the state machine itself is the source of truth. Every execution has a complete audit trail. You can look at any execution and see exactly which states ran, what input and output they received, when they ran, and whether they succeeded or failed. When something goes wrong you have a complete picture.

There are two workflow types and the choice matters.

Standard Workflows are designed for long-running, durable processes. They can run for up to a year. Every state transition is recorded in the execution history. You pay per state transition. This is what you want for anything involving payments, order processing, document workflows, or human approvals.

Express Workflows are designed for high-volume, short-duration workloads. They run for up to five minutes, asynchronous executions have at-least-once semantics, and you pay for the number of executions plus their duration and memory consumption. Use them for event processing pipelines where you need to handle thousands of events per second and idempotency is handled at the application level.

Your First Production State Machine

Let me walk through a real example: an e-commerce order processing workflow. This is a Standard Workflow since order processing is exactly the kind of thing you need full durability and auditability for.

			
{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:validate-order",
      "Next": "CheckInventory",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["OrderValidationError"],
          "Next": "OrderRejected",
          "ResultPath": "$.error"
        }
      ]
    },
    "CheckInventory": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:check-inventory",
      "Next": "ProcessPayment",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 5,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["InsufficientInventoryError"],
          "Next": "NotifyOutOfStock",
          "ResultPath": "$.error"
        }
      ]
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789:function:process-payment",
      "Next": "FulfillmentAndNotification",
      "Retry": [
        {
          "ErrorEquals": ["Lambda.ServiceException"],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["PaymentDeclinedError"],
          "Next": "NotifyPaymentFailed",
          "ResultPath": "$.error"
        },
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "OrderProcessingFailed",
          "ResultPath": "$.error"
        }
      ]
    },
    "FulfillmentAndNotification": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "TriggerFulfillment",
          "States": {
            "TriggerFulfillment": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789:function:trigger-fulfillment",
              "End": true
            }
          }
        },
        {
          "StartAt": "SendConfirmationEmail",
          "States": {
            "SendConfirmationEmail": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789:function:send-email",
              "End": true
            }
          }
        }
      ],
      "Next": "OrderComplete"
    },
    "OrderComplete":         { "Type": "Succeed" },
    "OrderRejected":         { "Type": "Fail", "Error": "OrderRejected" },
    "NotifyOutOfStock":      { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789:function:notify-out-of-stock", "End": true },
    "NotifyPaymentFailed":   { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789:function:notify-payment-failed", "End": true },
    "OrderProcessingFailed": { "Type": "Fail", "Error": "ProcessingFailed" }
  }
}

		

A few things worth pointing out in this definition.

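The Retry blocks on ValidateOrder, CheckInventory, and ProcessPayment handle transient failures automatically. Each one pairs an initial IntervalSeconds with a BackoffRate multiplier, so the wait grows exponentially between attempts. You get this behavior for free without writing any retry logic in your Lambda functions themselves. One caveat: Retry is evaluated before Catch, and States.TaskFailed also matches custom errors, so CheckInventory retries an InsufficientInventoryError twice before the Catch routes it to NotifyOutOfStock.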
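
To make the backoff concrete, here is a small sketch that computes the wait before each retry from the three Retry fields. It ignores MaxDelaySeconds and jitter, and retry_delays is a hypothetical helper, not part of any AWS SDK.

```python
from typing import List


def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> List[float]:
    """Wait time before each retry attempt, ignoring MaxDelaySeconds and jitter."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


print(retry_delays(2, 3, 2))    # ValidateOrder
print(retry_delays(5, 2, 1.5))  # CheckInventory
```

For ValidateOrder this gives waits of 2, 4, and 8 seconds; for CheckInventory, 5 and 7.5 seconds. Summing the list gives the worst-case time a state spends waiting before its Catch block takes over.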

The Catch blocks handle business-logic failures separately from infrastructure failures. A PaymentDeclinedError routes to a notification state. Any other error in ProcessPayment falls through to the States.ALL catcher and routes to a generic failure state. The ResultPath of $.error ensures the error detail is added to the state input under the error key, not written over the original input.
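
The effect of ResultPath is easy to picture as a merge. The sketch below mimics it for simple paths of the form $.field or $.a.b only; apply_result_path is a hypothetical helper, and the order and error values are made-up sample data.

```python
import copy
from typing import Any, Dict


def apply_result_path(state_input: Dict[str, Any], result: Any, result_path: str) -> Dict[str, Any]:
    """Merge result into a copy of state_input at a simple '$.a.b' path."""
    if not result_path.startswith("$."):
        raise ValueError("this sketch only supports paths of the form $.field")
    output = copy.deepcopy(state_input)
    *parents, leaf = result_path[2:].split(".")
    node = output
    for key in parents:
        # Create intermediate objects as needed.
        node = node.setdefault(key, {})
    node[leaf] = result
    return output


order = {"orderId": "o-1001", "totalAmount": 42.5}
error = {"Error": "PaymentDeclinedError", "Cause": "card declined"}
print(apply_result_path(order, error, "$.error"))
```

The next state therefore sees both the original order fields and the error object, which is what lets a notification state mention the order ID alongside the failure reason.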

The Parallel state in FulfillmentAndNotification runs fulfillment and email simultaneously. Both branches must complete before the workflow advances to OrderComplete. If either branch fails, the entire Parallel state fails. This is often exactly the behavior you want: do not mark the order complete until both downstream systems have been notified.

SDK Integrations: Stop Writing Wrapper Lambdas

One of the most common mistakes I see is writing Lambda functions whose only job is to call another AWS service. A Lambda that calls DynamoDB to write a record. A Lambda that sends an SNS message. A Lambda that starts a Glue job.

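Step Functions offers AWS SDK integrations with more than 200 AWS services, plus a smaller set of optimized integrations for commonly used services such as DynamoDB, SNS, and SQS. You can call these services directly from a state definition without a Lambda in the middle.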

Here is a state that writes directly to DynamoDB:

			
"SaveOrderToDynamo": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Parameters": {
    "TableName": "orders",
    "Item": {
      "orderId":    { "S.$": "$.orderId" },
      "customerId": { "S.$": "$.customerId" },
      "status":     { "S": "CONFIRMED" },
      "totalAmount":{ "N.$": "States.Format('{}', $.totalAmount)" },
      "createdAt":  { "S.$": "$$.Execution.StartTime" }
    }
  },
  "Next": "SendToSNS"
}

		

And a state that publishes to SNS:

			
"SendToSNS": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish",
  "Parameters": {
    "TopicArn": "arn:aws:sns:us-east-1:123456789:order-events",
    "Message": {
      "orderId.$":    "$.orderId",
      "customerId.$": "$.customerId",
      "status":       "CONFIRMED"
    }
  },
  "Next": "OrderComplete"
}

		

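The .$ suffix on a key tells Step Functions to treat the value as a JSONPath expression (or an intrinsic function such as States.Format) and resolve it at runtime, here against the state input. $$.Execution.StartTime is a reference into the context object, which carries metadata about the current execution. These small conveniences add up significantly when building real workflows.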
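As a rough mental model of that resolution step, here is a small Python sketch. `lookup` and `resolve_parameters` are hypothetical helpers that support only simple dotted paths; intrinsic functions such as States.Format, filters, and array indexes are deliberately left out.

```python
def lookup(path: str, document: dict):
    """Resolve a dotted path such as "$.order.id" against a dict."""
    value = document
    for part in path.split(".")[1:]:
        value = value[part]
    return value


def resolve_parameters(parameters: dict, state_input: dict, context: dict) -> dict:
    """Replace every "name.$" key with "name" and the value its path points to."""
    resolved = {}
    for key, value in parameters.items():
        if isinstance(value, dict):
            resolved[key] = resolve_parameters(value, state_input, context)
        elif key.endswith(".$"):
            # "$$." reads from the context object, "$." from the state input
            from_context = value.startswith("$$.")
            path = value[1:] if from_context else value
            resolved[key[:-2]] = lookup(path, context if from_context else state_input)
        else:
            resolved[key] = value
    return resolved


parameters = {
    "Message": {
        "orderId.$": "$.orderId",
        "status": "CONFIRMED",
        "startedAt.$": "$$.Execution.StartTime",
    }
}
state_input = {"orderId": "o-1001"}
context = {"Execution": {"StartTime": "2026-01-01T00:00:00Z"}}
print(resolve_parameters(parameters, state_input, context))
```

Static keys pass through untouched, which is why "status" needs no suffix.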

Removing wrapper Lambdas reduces cold starts, lowers your Lambda invocation costs, simplifies your IAM surface, and makes the workflow easier to read because every state’s purpose is self-evident.

The Wait for Callback Pattern

Some workflows cannot move forward until something external happens. A human needs to approve a refund. A third-party payment processor needs to call back. A document needs to pass a review queue.

Step Functions handles this with the waitForTaskToken integration pattern. The state machine pauses, sends a token to an external system, and resumes only when that token is returned.

Here is the state definition:

			
"WaitForManagerApproval": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789/approval-queue",
    "MessageBody": {
      "taskToken.$":  "$$.Task.Token",
      "orderId.$":    "$.orderId",
      "amount.$":     "$.totalAmount",
      "requestedBy.$":"$.customerId"
    }
  },
  "HeartbeatSeconds": 3600,
  "Next": "ProcessApprovedRefund",
  "Catch": [
    {
      "ErrorEquals": ["ApprovalRejected"],
      "Next": "NotifyRejected"
    },
    {
      "ErrorEquals": ["States.HeartbeatTimeout"],
      "Next": "EscalateApproval"
    }
  ]
}

		

The approval service picks up the message, presents it to a manager, and then calls back:

			
import json

import boto3

sfn = boto3.client("stepfunctions")


def handle_approval_decision(task_token: str, approved: bool, reason: str):
    if approved:
        # Resume the workflow; output becomes the result of the waiting state
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "approvedBy": "manager@company.com"})
        )
    else:
        # Fail the state with the error name the Catch block matches on
        sfn.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",
            cause=reason
        )

		

The HeartbeatSeconds field is important. If the external system neither calls SendTaskHeartbeat nor completes the task within that window, the state fails with a States.HeartbeatTimeout error. In the example above that routes to an escalation state rather than silently hanging forever. Always set HeartbeatSeconds, TimeoutSeconds, or both on any waitForTaskToken state.
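The worker side of the heartbeat can be sketched as follows. `send_heartbeats` is a hypothetical helper; the only AWS call it makes is `send_task_heartbeat(taskToken=...)`, which exists on the boto3 Step Functions client. A fake client stands in for boto3 here so the sketch runs without AWS credentials.

```python
import time
from typing import Callable


def send_heartbeats(sfn_client, task_token: str, is_finished: Callable[[], bool],
                    interval_seconds: float) -> int:
    """Call SendTaskHeartbeat until is_finished() returns True; return the count sent."""
    sent = 0
    while not is_finished():
        sfn_client.send_task_heartbeat(taskToken=task_token)
        sent += 1
        time.sleep(interval_seconds)
    return sent


class FakeStepFunctions:
    """Stand-in for boto3.client("stepfunctions") used only in this example."""

    def __init__(self):
        self.tokens = []

    def send_task_heartbeat(self, taskToken: str):
        self.tokens.append(taskToken)


checks = iter([False, False, True])  # the review finishes on the third check
fake_client = FakeStepFunctions()
count = send_heartbeats(fake_client, "token-123", lambda: next(checks), interval_seconds=0)
print(count)  # 2
```

In a real worker the interval should sit well below HeartbeatSeconds, so that one slow or dropped call does not trip the timeout.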

Deploying with Terraform

Defining your state machine in the console is fine for exploration. In production, everything should be in code.

			
resource "aws_sfn_state_machine" "order_processing" {
  name     = "order-processing-workflow"
  role_arn = aws_iam_role.step_functions_role.arn
  type     = "STANDARD"
  definition = templatefile("${path.module}/state_machine.json", {
    validate_order_arn    = aws_lambda_function.validate_order.arn
    check_inventory_arn   = aws_lambda_function.check_inventory.arn
    process_payment_arn   = aws_lambda_function.process_payment.arn
    trigger_fulfillment_arn = aws_lambda_function.trigger_fulfillment.arn
    send_email_arn        = aws_lambda_function.send_email.arn
  })
  logging_configuration {
    level                  = "ALL"
    include_execution_data = true
    log_destination        = "${aws_cloudwatch_log_group.sfn_logs.arn}:*"
  }
  tracing_configuration {
    enabled = true
  }
}
resource "aws_iam_role" "step_functions_role" {
  name = "step-functions-order-processing-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "states.amazonaws.com" }
    }]
  })
}
resource "aws_iam_role_policy" "sfn_policy" {
  name = "sfn-order-processing-policy"
  role = aws_iam_role.step_functions_role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["lambda:InvokeFunction"]
        Resource = [
          aws_lambda_function.validate_order.arn,
          aws_lambda_function.check_inventory.arn,
          aws_lambda_function.process_payment.arn,
          aws_lambda_function.trigger_fulfillment.arn,
          aws_lambda_function.send_email.arn
        ]
      },
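      {
        # Permissions Step Functions needs to deliver execution logs to CloudWatch Logs
        Effect = "Allow"
        Action = [
          "logs:CreateLogDelivery",
          "logs:GetLogDelivery",
          "logs:UpdateLogDelivery",
          "logs:DeleteLogDelivery",
          "logs:ListLogDeliveries",
          "logs:PutLogEvents",
          "logs:PutResourcePolicy",
          "logs:DescribeResourcePolicies",
          "logs:DescribeLogGroups"
        ]
        Resource = "*"
      },
      {
        # Permissions for X-Ray tracing, including the sampling rule lookups
        Effect = "Allow"
        Action = [
          "xray:PutTraceSegments",
          "xray:PutTelemetryRecords",
          "xray:GetSamplingRules",
          "xray:GetSamplingTargets"
        ]
        Resource = "*"
      }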
    ]
  })
}
resource "aws_cloudwatch_log_group" "sfn_logs" {
  name              = "/aws/states/order-processing"
  retention_in_days = 30
}

		

Using templatefile to inject Lambda ARNs into the state machine definition keeps your infrastructure code clean and makes it easy to reference the correct function ARN for each environment without hardcoding anything.
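If you want to sanity-check a template outside Terraform, the substitution can be approximated in Python, since `string.Template` happens to use the same `${name}` placeholder syntax. This is only an approximation: it breaks on templates that contain other `$` characters, such as JSONPath values like `$.orderId`, which Terraform leaves alone. The template text and ARN below are made up for the example.

```python
import json
from string import Template

# Shortened stand-in for state_machine.json; the real file holds the full definition
definition_template = """
{
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "${validate_order_arn}",
      "End": true
    }
  }
}
"""


def render_definition(template_text: str, variables: dict) -> dict:
    """Substitute ${name} placeholders, then parse to confirm the result is valid JSON."""
    rendered = Template(template_text).substitute(variables)
    return json.loads(rendered)


definition = render_definition(
    definition_template,
    {"validate_order_arn": "arn:aws:lambda:us-east-1:123456789012:function:validate-order"},
)
print(definition["States"]["ValidateOrder"]["Resource"])
```

A missing variable raises KeyError here, much as Terraform refuses to render a template that references a variable you did not pass.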

Observability in Production

Step Functions gives you three layers of observability out of the box when you configure them properly.

CloudWatch Metrics publishes execution counts, failure rates, and durations for every state machine automatically. Set alarms on ExecutionsFailed and ExecutionsTimedOut. For payment or order workflows, a single failed execution is worth an alert. For high-volume event pipelines, set a threshold based on your acceptable failure rate.
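As a sketch of the single-failure alarm, the function below builds the keyword arguments for CloudWatch's `put_metric_alarm`. Step Functions publishes its metrics in the AWS/States namespace with a StateMachineArn dimension; the alarm name, period, and missing-data handling are assumptions you would adjust, and `failed_execution_alarm` is a hypothetical helper.

```python
def failed_execution_alarm(state_machine_arn: str, threshold: int = 1,
                           period_seconds: int = 300) -> dict:
    """Build keyword arguments for put_metric_alarm on the ExecutionsFailed metric."""
    name = state_machine_arn.rsplit(":", 1)[-1]
    return {
        "AlarmName": f"{name}-executions-failed",
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No data points means no failures, so do not treat gaps as breaching
        "TreatMissingData": "notBreaching",
    }


alarm = failed_execution_alarm(
    "arn:aws:states:us-east-1:123456789012:stateMachine:order-processing-workflow"
)
print(alarm["AlarmName"])  # order-processing-workflow-executions-failed
# With credentials available: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

For a high-volume pipeline you would raise `threshold` instead of alerting on every single failure.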

CloudWatch Logs with include_execution_data = true captures the full input and output of every state transition. This is the setting that makes debugging possible. Without it, you know a state failed but not what data it received. With it, you can replay the exact scenario that caused the failure.

X-Ray tracing propagates trace context through Lambda invocations triggered by your state machine. In the AWS console, you get a service map showing exactly where time was spent across each execution. For workflows where latency matters, this is the fastest way to identify the bottleneck.

One practical tip: write a CloudWatch Logs Insights query that you can run immediately when an incident starts.

			
fields @timestamp, execution_arn, type, details.name, details.status
| filter type in ["ExecutionFailed", "TaskFailed", "TaskStateExited"]
| sort @timestamp desc
| limit 50

Save this query before you need it. Running it during an incident is much faster than clicking through individual executions.

Common Mistakes

Not setting ResultPath on Catch handlers. By default, a Catch block replaces the entire state input with the error object. Your downstream states then receive only the error, not the original order data they need. Always use "ResultPath": "$.error" to merge the error into the existing input.

Using Express Workflows for payment processing. Asynchronous Express Workflows have at-least-once semantics, so an execution and the states in it can run more than once under failure conditions. For anything involving money or external side effects, use Standard Workflows, which run each execution exactly once, and still put idempotency keys in your Lambda functions because a Retry block can invoke the same task again.

Ignoring the execution history limit. Standard Workflow execution history is capped at 25,000 events. For very long-running workflows with many state transitions, you can hit this limit. If your workflow runs for days or weeks with thousands of steps, split the work into chunks that run as child executions, for example with a Map state in Distributed mode, so that each execution keeps its own manageable history.

Hardcoding ARNs in state machine definitions. Environment-specific ARNs belong in Terraform variables or SSM Parameter Store, not in your state machine JSON. The pattern shown above with templatefile keeps this clean.
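The idempotency keys mentioned for payment workflows can be sketched in a few lines of Python. `IdempotentProcessor` and `charge_card` are hypothetical names, and the in-memory dict is for illustration only; a real Lambda would need a durable store with conditional writes, for example a DynamoDB table.

```python
class IdempotentProcessor:
    """Run a side-effecting handler at most once per idempotency key."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}

    def process(self, idempotency_key: str, payload: dict) -> dict:
        if idempotency_key in self._results:
            # Duplicate delivery: return the stored result, skip the side effect
            return self._results[idempotency_key]
        result = self._handler(payload)
        self._results[idempotency_key] = result
        return result


charges = []


def charge_card(payload: dict) -> dict:
    charges.append(payload["amount"])
    return {"status": "CHARGED", "amount": payload["amount"]}


processor = IdempotentProcessor(charge_card)
first = processor.process("order-1001", {"amount": 59.9})
second = processor.process("order-1001", {"amount": 59.9})  # retried or duplicated call
print(len(charges))     # 1
print(first == second)  # True
```

The order ID works as the key here because one order maps to one charge; pick a key that identifies the business operation, not the individual invocation.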

Step Functions does not eliminate complexity. What it does is make complexity visible and manageable. Your business logic lives in Lambda. Your orchestration logic lives in the state machine. When something fails, you have a complete, queryable record of exactly what happened and where.

The teams that get the most value from Step Functions are the ones that resist the temptation to build orchestration logic into their Lambda functions. Keep each function focused on a single responsibility. Let the state machine handle sequencing, retries, error routing, and parallelism. The result is a system where debugging takes minutes instead of hours and where new team members can understand the full workflow by reading a single JSON file.

Enjoy the cloud.

Osama

OCI Network Firewall: Building a Centralized Inspection Architecture with Terraform

Posted on May 30, 2026 by Osama Mustafa in Uncategorized

Architecture: Hub-and-Spoke with Centralized Inspection

Step 1: Hub VCN and Firewall Subnet

			
resource "oci_core_vcn" "hub_vcn" {
  compartment_id = var.compartment_id
  cidr_blocks    = ["192.168.0.0/16"]
  display_name   = "hub-inspection-vcn"
  dns_label      = "hubvcn"
}
# Firewall subnet - the firewall VNIC lives here
resource "oci_core_subnet" "firewall_subnet" {
  compartment_id             = var.compartment_id
  vcn_id                     = oci_core_vcn.hub_vcn.id
  cidr_block                 = "192.168.1.0/24"
  display_name               = "firewall-subnet"
  dns_label                  = "fwsubnet"
  prohibit_public_ip_on_vnic = true
  route_table_id             = oci_core_route_table.firewall_subnet_rt.id
  security_list_ids          = [oci_core_security_list.firewall_sl.id]
}
# Internet Gateway for north-south traffic
resource "oci_core_internet_gateway" "hub_igw" {
  compartment_id = var.compartment_id
  vcn_id         = oci_core_vcn.hub_vcn.id
  display_name   = "hub-internet-gateway"
  enabled        = true
}
# DRG for spoke VCN attachment
resource "oci_core_drg" "hub_drg" {
  compartment_id = var.compartment_id
  display_name   = "hub-drg"
}
resource "oci_core_drg_attachment" "hub_vcn_attachment" {
  drg_id       = oci_core_drg.hub_drg.id
  display_name = "hub-vcn-attachment"
  network_details {
    id   = oci_core_vcn.hub_vcn.id
    type = "VCN"
  }
}

		

The firewall subnet must not have a public IP on its VNIC. The firewall receives traffic through routing, not through a public endpoint.

Step 2: Firewall Policy

			
resource "oci_network_firewall_network_firewall_policy" "production_policy" {
  compartment_id = var.compartment_id
  display_name   = "production-inspection-policy"
}
# IP address list for trusted internal RFC1918 ranges
resource "oci_network_firewall_network_firewall_policy_address_list" "internal_ranges" {
  name                       = "internal-rfc1918"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  type                       = "IP"
  addresses = [
    "10.0.0.0/8",
    "172.16.0.0/12",
    "192.168.0.0/16"
  ]
}
# FQDN list for allowed outbound SaaS destinations
resource "oci_network_firewall_network_firewall_policy_address_list" "allowed_saas" {
  name                       = "allowed-saas-fqdns"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  type                       = "FQDN"
  addresses = [
    "*.oracle.com",
    "*.oraclecloud.com",
    "*.github.com",
    "registry-1.docker.io",
    "auth.docker.io",
    "production.cloudflare.docker.com"
  ]
}
# URL list for blocked categories
resource "oci_network_firewall_network_firewall_policy_url_list" "blocked_urls" {
  name                       = "blocked-url-categories"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  urls {
    pattern = "*.pastebin.com"
    type    = "SIMPLE"
  }
  urls {
    pattern = "*.ngrok.io"
    type    = "SIMPLE"
  }
  urls {
    pattern = "*.ngrok-free.app"
    type    = "SIMPLE"
  }
}
# Application group scoping web traffic (HTTP, HTTPS, SSL)
resource "oci_network_firewall_network_firewall_policy_application_group" "web_apps" {
  name                       = "web-traffic"
  network_firewall_policy_id = oci_network_firewall_network_firewall_policy.production_policy.id
  apps = ["HTTP", "HTTPS", "SSL"]
}

		

Step 3: Security Rules

OCI Dedicated Region Cloud@Customer: Architecture, Deployment Patterns, and Terraform Automation

Posted on May 2, 2026 by Osama Mustafa in Uncategorized

Most cloud conversations start with a simple assumption: your workloads go to the cloud provider’s data center. For a large number of organizations, particularly in government, financial services, healthcare, and defense, that assumption is the exact problem. Data sovereignty laws, regulatory requirements, and security classification levels mean that certain workloads cannot leave a specific physical location, full stop.

OCI Dedicated Region Cloud@Customer, commonly referred to as DRCC, solves this without forcing a compromise. Oracle deploys a full OCI region — not a subset of services, not a gateway appliance, but a complete cloud region with the same hardware, software stack, APIs, and SLAs — inside your own data center. You get every OCI service you would use in a public region, with the control plane managed by Oracle and the physical infrastructure sitting on your floor.

In this post I will cover how DRCC is architected, how it differs from OCI Exadata Cloud@Customer and Roving Edge, the networking requirements, IAM federation considerations, and how to automate workload deployment using Terraform once the region is live.

What DRCC Actually Delivers

The distinction between DRCC and other on-premises cloud appliances matters technically. Most cloud-at-customer offerings give you a subset of services through a dedicated appliance: a handful of compute shapes, object storage, and maybe a managed database. DRCC is architecturally different.

Oracle physically ships and installs the same rack infrastructure used in public OCI regions into your facility. The region runs the same OCI control plane software, exposes the same REST APIs, and integrates with OCI IAM and Oracle Cloud Console using the same tooling. When you run a Terraform plan against a DRCC region, the provider configuration is identical to a public region. You change the region identifier in your config and the code works without modification.

The full service catalog available in DRCC includes Compute (including bare metal and GPU shapes), OKE (Oracle Kubernetes Engine), Autonomous Database, Exadata Database Service, Object Storage, Block Volumes, File Storage, VCN, Load Balancer, API Gateway, Functions, Streaming, OCI Vault, Identity and Access Management, Monitoring, Logging, Events, and Notifications. This is not a stripped-down subset — it is the complete stack.

The minimum hardware footprint is a base rack configuration sized to support production workloads. Oracle handles all hardware maintenance, software patching, and control plane operations. Your team manages what runs on top: compartments, IAM policies, networking, and workloads.

How DRCC Differs from Related Oracle Offerings

Before going further it is worth clarifying where DRCC sits relative to two commonly confused offerings.

OCI Exadata Cloud@Customer deploys Exadata Database Service hardware into your data center. It is a database-specific offering. You get Autonomous Database and Exadata Database Service on-premises, but not the broader OCI service catalog. If you need compute, containers, serverless, and object storage alongside the database layer, Exadata Cloud@Customer alone does not cover it.

OCI Roving Edge Infrastructure is a ruggedized portable device designed for disconnected or intermittently connected environments: ships, remote field operations, military forward deployments. It runs a subset of OCI services and is designed to operate without a persistent connection to the OCI control plane. DRCC requires a reliable network connection back to Oracle for control plane operations and is designed for fixed, well-connected facilities.

DRCC is the right choice when you need the full OCI service catalog, the workloads must stay on-premises for regulatory or sovereignty reasons, and you have a proper data center with the power, cooling, and network capacity to host the infrastructure.

Network Architecture Requirements

DRCC has specific network requirements that you need to understand before the hardware arrives. Getting these wrong means the region cannot operate.

The DRCC racks need connectivity on three planes: the management network, the customer data network, and the Oracle back-channel.

The management network connects Oracle’s control plane software running inside your facility to Oracle’s global control plane over the internet or a dedicated circuit. Oracle uses this path for software updates, monitoring, and operational management of the region. This connection is outbound-initiated from the DRCC hardware, encrypted with TLS, and authenticated with certificates. Oracle publishes the specific IP ranges that need to be permitted through your firewall. You do not control what flows over this channel, but Oracle’s contractual commitments define exactly what does.

The customer data network connects your existing on-premises infrastructure to the DRCC region. This is a standard 25G or 100G ethernet connection depending on the rack configuration. You configure VCN peering or FastConnect-equivalent local connections to bridge your existing network into the DRCC VCN.

Here is how you configure a VCN in DRCC using Terraform, which is identical to a public region:

			
terraform {
  required_providers {
    oci = {
      source  = "oracle/oci"
      version = ">= 5.0.0"
    }
  }
}
provider "oci" {
  tenancy_ocid     = var.tenancy_ocid
  user_ocid        = var.user_ocid
  fingerprint      = var.fingerprint
  private_key_path = var.private_key_path
  # This is your DRCC region identifier
  # Oracle assigns this during provisioning, format: us-yourdatacenter-1
  region           = var.drcc_region
}
resource "oci_core_vcn" "drcc_primary_vcn" {
  compartment_id = var.compartment_id
  cidr_blocks    = ["10.100.0.0/16"]
  display_name   = "drcc-primary-vcn"
  dns_label      = "drccprimary"
}
# Application tier subnet - private
resource "oci_core_subnet" "app_subnet" {
  compartment_id             = var.compartment_id
  vcn_id                     = oci_core_vcn.drcc_primary_vcn.id
  cidr_block                 = "10.100.1.0/24"
  display_name               = "app-private-subnet"
  dns_label                  = "apppriv"
  prohibit_public_ip_on_vnic = true
  route_table_id             = oci_core_route_table.private_rt.id
  security_list_ids          = [oci_core_security_list.app_sl.id]
}
# Database tier subnet - private
resource "oci_core_subnet" "db_subnet" {
  compartment_id             = var.compartment_id
  vcn_id                     = oci_core_vcn.drcc_primary_vcn.id
  cidr_block                 = "10.100.2.0/24"
  display_name               = "db-private-subnet"
  dns_label                  = "dbpriv"
  prohibit_public_ip_on_vnic = true
  route_table_id             = oci_core_route_table.private_rt.id
  security_list_ids          = [oci_core_security_list.db_sl.id]
}
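# Dynamic Routing Gateway to connect the DRCC VCN to your on-premises network
resource "oci_core_drg" "onprem_drg" {
  compartment_id = var.compartment_id
  display_name   = "onprem-drg"
}
resource "oci_core_drg_attachment" "onprem_drg_vcn" {
  drg_id       = oci_core_drg.onprem_drg.id
  display_name = "drcc-primary-vcn-attachment"
  network_details {
    id   = oci_core_vcn.drcc_primary_vcn.id
    type = "VCN"
  }
}

		

The Dynamic Routing Gateway (DRG) is the OCI construct for routing between a VCN and an on-premises network; a Local Peering Gateway only peers two VCNs in the same region. In a DRCC context, the DRG connects the DRCC VCN to your on-premises routed network via the physical data network. This gives your existing on-premises workloads direct, low-latency access to everything running in the DRCC region without traffic ever leaving your facility.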
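Because the DRCC VCN is routed directly into your existing network, it is worth checking the CIDR plan before applying it. The sketch below uses Python's standard `ipaddress` module to confirm that each subnet sits inside the VCN range and that no two subnets overlap. `validate_subnets` is a hypothetical helper, and the values mirror the Terraform above.

```python
import ipaddress


def validate_subnets(vcn_cidr: str, subnet_cidrs: dict[str, str]) -> list[str]:
    """Return a list of problems: subnets outside the VCN or overlapping each other."""
    vcn = ipaddress.ip_network(vcn_cidr)
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []
    for name, net in networks.items():
        if not net.subnet_of(vcn):
            problems.append(f"{name} ({net}) is outside {vcn}")
    names = sorted(networks)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if networks[first].overlaps(networks[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


print(validate_subnets("10.100.0.0/16", {"app": "10.100.1.0/24", "db": "10.100.2.0/24"}))
# []
```

The same function can be pointed at your on-premises ranges to catch a clash between the DRCC VCN and an existing corporate network before any routes are advertised.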

IAM Federation in a DRCC Deployment

DRCC shares the OCI IAM control plane with the public region associated with your tenancy. This has important implications for how you manage identities.

Your DRCC region is part of your existing OCI tenancy. Users, groups, and dynamic groups created in OCI IAM apply to DRCC resources the same way they apply to public region resources. If you already federate OCI IAM with your corporate identity provider (Active Directory, Okta, Azure AD), those federated identities work in DRCC without additional configuration.

Here is the IAM federation configuration for Active Directory using SAML:

			
# Identity Provider configuration for AD FS
resource "oci_identity_identity_provider" "ad_federation" {
  compartment_id      = var.tenancy_ocid
  name                = "corporate-adfs"
  description         = "Corporate Active Directory Federation Services"
  product_type        = "ADFS"
  protocol            = "SAML2"
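  metadata            = file("${path.module}/adfs-metadata.xml")
  # The provider also expects metadata_url; var.adfs_metadata_url is a placeholder variable
  metadata_url        = var.adfs_metadata_url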
  freeform_tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
# Map AD group to OCI group for DRCC operations team
resource "oci_identity_idp_group_mapping" "drcc_admins_mapping" {
  idp_id        = oci_identity_identity_provider.ad_federation.id
  idp_group_name = "CN=DRCC-Admins,OU=CloudTeams,DC=corp,DC=example,DC=com"
  group_id      = oci_identity_group.drcc_admins.id
}
resource "oci_identity_group" "drcc_admins" {
  compartment_id = var.tenancy_ocid
  name           = "drcc-platform-admins"
  description    = "DRCC platform administration team"
}
# Compartment structure for DRCC workload isolation
resource "oci_identity_compartment" "drcc_root" {
  compartment_id = var.tenancy_ocid
  name           = "drcc-production"
  description    = "Root compartment for all DRCC production workloads"
}
resource "oci_identity_compartment" "drcc_networking" {
  compartment_id = oci_identity_compartment.drcc_root.id
  name           = "drcc-networking"
  description    = "Networking resources for DRCC region"
}
resource "oci_identity_compartment" "drcc_workloads" {
  compartment_id = oci_identity_compartment.drcc_root.id
  name           = "drcc-workloads"
  description    = "Application workloads running in DRCC"
}
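# Admin policy for the DRCC platform team.
# Attached at the tenancy root because it contains an "in tenancy" statement
# and names compartments by their path from the root.
resource "oci_identity_policy" "drcc_admin_policy" {
  compartment_id = var.tenancy_ocid
  name           = "drcc-admin-policy"
  description    = "Platform admin permissions scoped to DRCC compartment"
  statements = [
    "Allow group drcc-platform-admins to manage all-resources in compartment drcc-production",
    # request.region is matched against the three-letter region key, not the region identifier.
    # var.drcc_region_key is a placeholder variable holding that key.
    "Allow group drcc-platform-admins to read all-resources in tenancy where request.region = '${var.drcc_region_key}'",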
    "Allow group drcc-platform-admins to manage virtual-network-family in compartment drcc-production:drcc-networking",
    "Allow group drcc-platform-admins to manage instance-family in compartment drcc-production:drcc-workloads",
    "Allow group drcc-platform-admins to manage autonomous-database-family in compartment drcc-production:drcc-workloads"
  ]
}

		

One critical IAM behavior specific to DRCC: you can write IAM policies that restrict actions to your DRCC region using the request.region condition, which is matched against the region's three-letter key. This means a group can have full admin rights in DRCC but zero access to your public OCI regions, or vice versa. For organizations with strict separation between on-premises and cloud teams, this is an important control.

Deploying OKE on DRCC

OKE on DRCC runs the same as OKE in a public region. The control plane components run inside the DRCC rack. The API server endpoint is reachable from within your data center network without any traffic leaving the facility.

			
resource "oci_containerengine_cluster" "drcc_cluster" {
  compartment_id     = oci_identity_compartment.drcc_workloads.id
  kubernetes_version = "v1.29.1"
  name               = "drcc-production-cluster"
  vcn_id             = oci_core_vcn.drcc_primary_vcn.id
  endpoint_config {
    is_public_ip_enabled = false
    subnet_id            = oci_core_subnet.app_subnet.id
  }
  options {
    service_lb_subnet_ids = [oci_core_subnet.app_subnet.id]
    kubernetes_network_config {
      pods_cidr     = "10.244.0.0/16"
      services_cidr = "10.96.0.0/16"
    }
    add_ons {
      is_kubernetes_dashboard_enabled = false
      is_tiller_enabled               = false
    }
  }
}
resource "oci_containerengine_node_pool" "drcc_workers" {
  cluster_id         = oci_containerengine_cluster.drcc_cluster.id
  compartment_id     = oci_identity_compartment.drcc_workloads.id
  kubernetes_version = "v1.29.1"
  name               = "drcc-worker-pool"
  node_config_details {
    size = 3
    placement_configs {
      availability_domain = data.oci_identity_availability_domains.drcc_ads.availability_domains[0].name
      subnet_id           = oci_core_subnet.app_subnet.id
    }
  }
  node_shape = "VM.Standard3.Flex"
  node_shape_config {
    memory_in_gbs = 64
    ocpus         = 8
  }
  node_source_details {
    image_id    = data.oci_core_images.ol8_image.images[0].id
    source_type = "IMAGE"
    boot_volume_size_in_gbs = 100
  }
  initial_node_labels {
    key   = "workload-tier"
    value = "application"
  }
}

		

The is_public_ip_enabled = false on the endpoint config is non-negotiable in a DRCC context. The API server should only be reachable from within your data center network. Any tooling that manages the cluster (Argo CD, Flux, CI pipelines) connects to the internal endpoint directly.

Deploying Autonomous Database on DRCC

Autonomous Database on DRCC is identical in API and behavior to the public region version. The database runs entirely within your facility.

			
resource "oci_database_autonomous_database" "drcc_adb" {
  compartment_id           = oci_identity_compartment.drcc_workloads.id
  db_name                  = "DRCCPROD"
  display_name             = "drcc-production-adb"
  db_workload              = "OLTP"
  cpu_core_count           = 4
  data_storage_size_in_tbs = 2
  admin_password           = var.adb_admin_password
  is_auto_scaling_enabled  = true
  is_dedicated             = false
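  # Private endpoint configuration - no public access
  subnet_id              = oci_core_subnet.db_subnet.id
  private_endpoint_label = "drccprodadb"
  # Access to a private endpoint is controlled with network security groups.
  # oci_core_network_security_group.adb_nsg is assumed to be defined elsewhere,
  # with an ingress rule that allows the app subnet on the database listener port.
  nsg_ids = [oci_core_network_security_group.adb_nsg.id]
  defined_tags = {
    "Operations.Environment" = "production"
    "Operations.Region"      = "drcc"
    "Operations.ManagedBy"   = "terraform"
  }
}

		

The subnet_id and private_endpoint_label fields configure the database with a private endpoint inside the db subnet. The network security group attached through nsg_ids determines which sources can connect, so scope its ingress rules to the app subnet. No public endpoint is created.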

Security Baseline for DRCC Deployments

DRCC gives you physical control over the hardware, but that does not mean you can skip the standard OCI security baseline. The software layer still requires proper configuration.

Enable Cloud Guard at the tenancy level scoped to your DRCC compartments:

			
resource "oci_cloud_guard_cloud_guard_configuration" "drcc_cloud_guard" {
  compartment_id   = var.tenancy_ocid
  reporting_region = var.drcc_region
  status           = "ENABLED"
}
resource "oci_cloud_guard_target" "drcc_target" {
  compartment_id       = oci_identity_compartment.drcc_root.id
  display_name         = "drcc-production-target"
  target_resource_id   = oci_identity_compartment.drcc_root.id
  target_resource_type = "COMPARTMENT"
  target_detector_recipes {
    detector_recipe_id = data.oci_cloud_guard_detector_recipes.config_recipe.detector_recipe_collection[0].items[0].id
  }
  target_responder_recipes {
    responder_recipe_id = data.oci_cloud_guard_responder_recipes.oci_responder.responder_recipe_collection[0].items[0].id
  }
}

		

Enable Vault for all secrets, keys, and credentials used by workloads running in DRCC. Because the Vault service runs inside the rack, key material never leaves your facility:

			
resource "oci_kms_vault" "drcc_vault" {
  compartment_id = oci_identity_compartment.drcc_workloads.id
  display_name   = "drcc-workloads-vault"
  vault_type     = "VIRTUAL_PRIVATE"
}
resource "oci_kms_key" "drcc_master_key" {
  compartment_id      = oci_identity_compartment.drcc_workloads.id
  display_name        = "drcc-master-encryption-key"
  management_endpoint = oci_kms_vault.drcc_vault.management_endpoint
  key_shape {
    algorithm = "AES"
    length    = 32
  }
  protection_mode = "HSM"
}

		

The VIRTUAL_PRIVATE vault type and HSM protection mode ensure the key material is stored in the hardware security module inside the DRCC rack. Combined with the fact that the rack is physically in your data center, you have full chain-of-custody over the cryptographic material protecting your data.

Operational Considerations

A few things are specific to operating DRCC and do not come up when working with public regions.

Oracle is responsible for hardware maintenance and software patching of the control plane. You receive advance notification of maintenance windows. During a control plane maintenance window, the management APIs may be briefly unavailable, but running workloads continue without interruption. Plan your deployment pipelines to account for these windows.

Capacity planning is different from the public cloud. In a public region, you scale up by requesting more resources and the cloud absorbs the demand. In DRCC, you have a fixed hardware footprint. If you need to scale beyond the initial rack configuration, you work with Oracle to add capacity. Build capacity planning reviews into your quarterly operations cycle and monitor resource utilization with OCI Monitoring the same way you would in a public region.
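A quarterly capacity review can start from a calculation as simple as the one below. It is a sketch under loud assumptions: growth is linear, OCPUs are the constrained resource, and 20 percent of the rack is held back as headroom. `months_until_full` and all the numbers are hypothetical; the utilization figures would come from OCI Monitoring.

```python
import math


def months_until_full(total_ocpus: int, used_ocpus: int, monthly_growth_ocpus: float,
                      headroom_percent: int = 20):
    """Months until usage reaches the usable ceiling (total minus reserved headroom).

    Returns 0 if the ceiling is already reached and None if usage is not growing.
    """
    usable = total_ocpus * (100 - headroom_percent) / 100
    if used_ocpus >= usable:
        return 0
    if monthly_growth_ocpus <= 0:
        return None
    return math.floor((usable - used_ocpus) / monthly_growth_ocpus)


# Hypothetical footprint: 1000 OCPUs installed, 500 in use, growing by 25 per month
print(months_until_full(1000, 500, 25))  # 12
```

Compare the result with the lead time Oracle quotes for adding capacity; if the two numbers are close, the expansion conversation should already be under way.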

The Oracle back-channel for management operations needs to be permanently open. If your network team applies a firewall rule that blocks this traffic, the control plane loses contact with Oracle and becomes degraded. Work with Oracle to get the exact IP ranges and port requirements before go-live and document them clearly in your firewall change management process.

When DRCC Is the Right Choice

DRCC makes sense when at least one of these conditions is true: your regulatory framework requires data residency within a specific physical location you control, your security classification means workloads cannot traverse public internet infrastructure at any point, your latency requirements for database and application tiers demand co-location in your own facility, or you have existing on-premises infrastructure that needs tight integration with cloud services without egress cost or latency overhead.

It is not the right choice for organizations that want cloud economics without data center investment, for workloads with highly variable capacity requirements that would benefit from elastic public cloud scaling, or for teams that want to avoid the operational overhead of maintaining physical infrastructure.

For those who do meet the criteria, DRCC is one of the more complete sovereign cloud offerings on the market. The fact that the APIs and tooling are identical to the public cloud means your engineers do not need to learn a second system, your Terraform code travels unchanged, and your OKE workloads run without modification.

Regards,

Osama

Cross-Cloud Secret Synchronization: AWS Secrets Manager and OCI Vault in a Production Multi-Cloud Setup

Posted on April 24, 2026 by Osama Mustafa in Uncategorized

One of the most overlooked problems in multi-cloud environments is secrets management across providers. Teams usually solve it badly: they store the same secret in both clouds manually, forget to rotate one of them, and find out during an outage that the credentials have been out of sync for three months.

In this post I will walk through building an automated secrets synchronization pipeline between AWS Secrets Manager and OCI Vault. When a secret rotates in AWS, the pipeline detects the rotation event, retrieves the new value, and pushes it into OCI Vault automatically. Everything is built with Terraform, an AWS Lambda function, and OCI IAM. No manual steps after the initial deployment.

This is a pattern I have used in environments where the database layer runs on OCI (leveraging Oracle Database pricing and performance) while the application layer runs on AWS. Both sides need the same database credentials, and both sides need to stay in sync without human intervention.

Architecture

The flow works like this:

An AWS Secrets Manager rotation event fires via EventBridge, which triggers a Lambda function. The Lambda retrieves the new secret value, authenticates to OCI using an API key that it loads from AWS Secrets Manager at runtime (not hardcoded, and not held in plaintext environment variables), and calls the OCI Vault API to update the corresponding secret version. OCI Vault stores the new value and makes it available to workloads running in OCI.

Prerequisites

Before starting you need:

AWS account with permissions to manage Secrets Manager, Lambda, EventBridge, and IAM
OCI tenancy with permissions to manage Vault, Keys, and IAM policies
Terraform 1.5 or later
Python 3.11 for the Lambda function
An existing OCI Vault and master encryption key (or we will create one)

Step 1: OCI Vault and IAM Setup

Start with OCI. We need a Vault, a master key, and an IAM user whose API key the Lambda will use to authenticate.

hcl

			
# OCI Vault
resource "oci_kms_vault" "app_vault" {
  compartment_id = var.compartment_id
  display_name   = "multi-cloud-secrets-vault"
  vault_type     = "DEFAULT"
}
# Master Encryption Key inside the Vault
resource "oci_kms_key" "secrets_key" {
  compartment_id      = var.compartment_id
  display_name        = "secrets-master-key"
  management_endpoint = oci_kms_vault.app_vault.management_endpoint
  key_shape {
    algorithm = "AES"
    length    = 32
  }
}
# IAM user for cross-cloud access
resource "oci_identity_user" "sync_user" {
  compartment_id = var.tenancy_ocid
  name           = "aws-secrets-sync-user"
  description    = "Service user for AWS Lambda to push secrets into OCI Vault"
  email          = "sync-user@internal.example.com"
}
# API key for the sync user (you will generate the actual key pair separately)
resource "oci_identity_api_key" "sync_user_key" {
  user_id   = oci_identity_user.sync_user.id
  key_value = var.oci_sync_user_public_key_pem
}
# IAM group for the sync user
resource "oci_identity_group" "sync_group" {
  compartment_id = var.tenancy_ocid
  name           = "secrets-sync-group"
  description    = "Group for cross-cloud secrets sync service users"
}
resource "oci_identity_user_group_membership" "sync_membership" {
  group_id = oci_identity_group.sync_group.id
  user_id  = oci_identity_user.sync_user.id
}
# Minimal IAM policy - only what is needed, nothing more
resource "oci_identity_policy" "sync_policy" {
  compartment_id = var.compartment_id
  name           = "secrets-sync-policy"
  description    = "Allows sync user to manage secrets in the app vault only"
  statements = [
    "Allow group secrets-sync-group to manage secret-family in compartment id ${var.compartment_id} where target.vault.id = '${oci_kms_vault.app_vault.id}'",
    "Allow group secrets-sync-group to use keys in compartment id ${var.compartment_id} where target.key.id = '${oci_kms_key.secrets_key.id}'"
  ]
}

		

The policy scope is intentionally narrow. The sync user can only manage secrets inside this specific vault and can only use this specific key. If the AWS Lambda credentials are ever compromised, the blast radius is limited to this vault.

Step 2: Create the Initial Secret in OCI Vault

We need a secret placeholder in OCI Vault that the Lambda will update. The initial value does not matter since it will be overwritten on the first sync.

hcl

			
resource "oci_vault_secret" "db_password" {
  compartment_id = var.compartment_id
  vault_id       = oci_kms_vault.app_vault.id
  key_id         = oci_kms_key.secrets_key.id
  secret_name    = "prod-db-password"
  secret_content {
    content_type = "BASE64"
    content      = base64encode("initial-placeholder-value")
    name         = "v1"
    stage        = "CURRENT"
  }
  metadata = {
    source      = "aws-secrets-manager"
    aws_secret  = "prod/database/password"
    environment = "production"
  }
}

		

Step 3: AWS Secrets Manager and the Source Secret

On the AWS side, create the authoritative secret and enable automatic rotation.

hcl

			
resource "aws_secretsmanager_secret" "db_password" {
  name                    = "prod/database/password"
  description             = "Production database password - synced to OCI Vault"
  recovery_window_in_days = 7
  tags = {
    Environment   = "production"
    SyncTarget    = "oci-vault"
    OciSecretName = "prod-db-password"
  }
}
resource "aws_secretsmanager_secret_version" "db_password_v1" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = jsonencode({
    username = "db_admin",
    password = var.initial_db_password,
    host     = var.db_host,
    port     = 1521,
    database = "PRODDB"
  })
}
# Rotation configuration - rotate every 30 days
# (db_rotation_lambda is the rotation function for this secret; it is not defined in this post)
resource "aws_secretsmanager_secret_rotation" "db_password_rotation" {
  secret_id           = aws_secretsmanager_secret.db_password.id
  rotation_lambda_arn = aws_lambda_function.db_rotation_lambda.arn
  rotation_rules {
    automatically_after_days = 30
  }
}

		

Step 4: Store OCI Credentials in AWS Secrets Manager

The Lambda needs OCI API credentials to authenticate. Store them as a secret in AWS Secrets Manager so they never appear in Lambda environment variables in plaintext.

hcl

			
resource "aws_secretsmanager_secret" "oci_credentials" {
  name        = "internal/oci-sync-credentials"
  description = "OCI API key credentials for secrets sync Lambda"
  tags = {
    Environment = "production"
    Purpose     = "cross-cloud-sync"
  }
}
resource "aws_secretsmanager_secret_version" "oci_credentials_v1" {
  secret_id = aws_secretsmanager_secret.oci_credentials.id
  secret_string = jsonencode({
    tenancy_ocid  = var.oci_tenancy_ocid,
    user_ocid     = var.oci_sync_user_ocid,
    fingerprint   = var.oci_api_key_fingerprint,
    private_key   = var.oci_private_key_pem,
    region        = var.oci_region
  })
}

		

Step 5: The Lambda Function

This is the core of the pipeline. The Lambda retrieves the rotated secret from AWS Secrets Manager, loads OCI credentials from its own secrets store, and calls the OCI Vault API to create a new secret version.

python

			
import boto3
import json
import base64
import oci
import logging
import os
from datetime import datetime, timezone
from botocore.exceptions import ClientError
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def get_oci_config():
    """Retrieve OCI credentials from AWS Secrets Manager."""
    client = boto3.client("secretsmanager", region_name=os.environ["AWS_REGION"])
    
    try:
        response = client.get_secret_value(
            SecretId=os.environ["OCI_CREDENTIALS_SECRET_ARN"]
        )
        creds = json.loads(response["SecretString"])
        
        return {
            "tenancy": creds["tenancy_ocid"],
            "user": creds["user_ocid"],
            "fingerprint": creds["fingerprint"],
            "key_content": creds["private_key"],
            "region": creds["region"]
        }
    except ClientError as e:
        logger.error(f"Failed to retrieve OCI credentials: {e}")
        raise
def get_aws_secret(secret_arn: str) -> str:
    """Retrieve the current value of an AWS secret."""
    client = boto3.client("secretsmanager", region_name=os.environ["AWS_REGION"])
    
    try:
        response = client.get_secret_value(SecretId=secret_arn)
        return response.get("SecretString") or base64.b64decode(
            response["SecretBinary"]
        ).decode("utf-8")
    except ClientError as e:
        logger.error(f"Failed to retrieve AWS secret {secret_arn}: {e}")
        raise
def push_to_oci_vault(
    oci_config: dict,
    vault_id: str,
    key_id: str,
    secret_ocid: str,
    secret_value: str
):
    """Create a new version of an OCI Vault secret."""
    vaults_client = oci.vault.VaultsClient(oci_config)
    
    encoded_value = base64.b64encode(secret_value.encode("utf-8")).decode("utf-8")
    
    update_details = oci.vault.models.UpdateSecretDetails(
        secret_content=oci.vault.models.Base64SecretContentDetails(
            content_type=oci.vault.models.SecretContentDetails.CONTENT_TYPE_BASE64,
            content=encoded_value,
            name=f"sync-{datetime.now(timezone.utc).strftime('%Y%m%d%H%M%S')}",
            stage="CURRENT"
        ),
        metadata={
            "synced_from": "aws-secrets-manager",
            "synced_at": datetime.now(timezone.utc).isoformat()
        }
    )
    
    response = vaults_client.update_secret(
        secret_id=secret_ocid,
        update_secret_details=update_details
    )
    
    logger.info(
        f"OCI secret updated. OCID: {secret_ocid}, "
        f"New version: {response.data.current_version_number}"
    )
    
    return response.data
def handler(event, context):
    """
    EventBridge trigger handler.
    Expects event detail to contain:
      - aws_secret_arn: ARN of the rotated AWS secret
      - oci_secret_ocid: OCID of the target OCI Vault secret
      - oci_vault_id: OCID of the target OCI Vault
      - oci_key_id: OCID of the OCI KMS key
    """
    logger.info(f"Received event: {json.dumps(event)}")
    
    detail = event.get("detail", {})
    aws_secret_arn  = detail.get("aws_secret_arn")
    oci_secret_ocid = detail.get("oci_secret_ocid")
    oci_vault_id    = detail.get("oci_vault_id")
    oci_key_id      = detail.get("oci_key_id")
    
    if not all([aws_secret_arn, oci_secret_ocid, oci_vault_id, oci_key_id]):
        logger.error("Missing required fields in event detail")
        raise ValueError("Event detail must include aws_secret_arn, oci_secret_ocid, oci_vault_id, oci_key_id")
    
    logger.info(f"Syncing secret: {aws_secret_arn} to OCI: {oci_secret_ocid}")
    
    # Step 1: Get OCI credentials
    oci_config = get_oci_config()
    
    # Step 2: Retrieve the rotated AWS secret
    secret_value = get_aws_secret(aws_secret_arn)
    
    # Step 3: Push to OCI Vault
    result = push_to_oci_vault(
        oci_config=oci_config,
        vault_id=oci_vault_id,
        key_id=oci_key_id,
        secret_ocid=oci_secret_ocid,
        secret_value=secret_value
    )
    
    return {
        "statusCode": 200,
        "body": {
            "message": "Secret synced successfully",
            "oci_secret_ocid": oci_secret_ocid,
            "oci_version": result.current_version_number
        }
    }

		

Step 6: Lambda IAM Role and Deployment

hcl

			
data "aws_iam_policy_document" "lambda_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}
data "aws_iam_policy_document" "lambda_permissions" {
  statement {
    effect = "Allow"
    actions = [
      "secretsmanager:GetSecretValue",
      "secretsmanager:DescribeSecret"
    ]
    resources = [
      aws_secretsmanager_secret.db_password.arn,
      aws_secretsmanager_secret.oci_credentials.arn
    ]
  }
  statement {
    effect = "Allow"
    actions = [
      "logs:CreateLogGroup",
      "logs:CreateLogStream",
      "logs:PutLogEvents"
    ]
    resources = ["arn:aws:logs:*:*:*"]
  }
}
resource "aws_iam_role" "sync_lambda_role" {
  name               = "secrets-sync-lambda-role"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}
resource "aws_iam_role_policy" "sync_lambda_policy" {
  name   = "secrets-sync-lambda-policy"
  role   = aws_iam_role.sync_lambda_role.id
  policy = data.aws_iam_policy_document.lambda_permissions.json
}
resource "aws_lambda_function" "secrets_sync" {
  filename         = "${path.module}/lambda/secrets_sync.zip"
  function_name    = "oci-secrets-sync"
  role             = aws_iam_role.sync_lambda_role.arn
  handler          = "main.handler"
  runtime          = "python3.11"
  timeout          = 60
  memory_size      = 256
  source_code_hash = filebase64sha256("${path.module}/lambda/secrets_sync.zip")
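  environment {
    variables = {
      # AWS_REGION is a reserved key that Lambda sets itself; defining it here
      # makes the function deployment fail, so only the secret ARN is passed.
      OCI_CREDENTIALS_SECRET_ARN = aws_secretsmanager_secret.oci_credentials.arn
    }
  }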
  layers = [aws_lambda_layer_version.oci_sdk_layer.arn]
}

		

Bundle the OCI Python SDK as a Lambda Layer so the function does not need to package it inline:

bash

			
mkdir -p lambda_layer/python
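# The oci package pulls in compiled dependencies (cryptography), so request
# Linux wheels that match the Lambda runtime even when packaging on macOS or Windows
pip install oci --target lambda_layer/python \
  --platform manylinux2014_x86_64 --only-binary=:all: \
  --implementation cp --python-version 3.11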
cd lambda_layer && zip -r ../oci_sdk_layer.zip python/

hcl

			
resource "aws_lambda_layer_version" "oci_sdk_layer" {
  filename            = "${path.module}/oci_sdk_layer.zip"
  layer_name          = "oci-python-sdk"
  compatible_runtimes = ["python3.11"]
  source_code_hash    = filebase64sha256("${path.module}/oci_sdk_layer.zip")
}

		

Step 7: EventBridge Rule to Trigger on Rotation

hcl

			
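resource "aws_cloudwatch_event_rule" "secret_rotation_rule" {
  name        = "detect-secret-rotation"
  description = "Fires on Secrets Manager RotateSecret and PutSecretValue API calls"
  # Note: RotateSecret is logged when a rotation is requested, not when it
  # finishes, and a rotation function's PutSecretValue stages the new value as
  # AWSPENDING. The sync Lambda reads AWSCURRENT, so it can run before the new
  # value is promoted; the scheduled reconciliation described later covers that gap.
  # This pattern also matches every secret in the account and region. Narrow it
  # with a requestParameters.secretId filter if other secrets live alongside this one.
  event_pattern = jsonencode({
    source      = ["aws.secretsmanager"],
    detail-type = ["AWS API Call via CloudTrail"],
    detail = {
      eventSource = ["secretsmanager.amazonaws.com"],
      eventName   = ["RotateSecret", "PutSecretValue"]
    }
  })
}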
resource "aws_cloudwatch_event_target" "sync_lambda_target" {
  rule      = aws_cloudwatch_event_rule.secret_rotation_rule.name
  target_id = "SyncToOCI"
  arn       = aws_lambda_function.secrets_sync.arn
  input_transformer {
    input_paths = {
      secret_arn = "$.detail.requestParameters.secretId"
    }
    input_template = <<EOF
{
  "detail": {
    "aws_secret_arn": "<secret_arn>",
    "oci_secret_ocid": "${var.oci_db_password_secret_ocid}",
    "oci_vault_id": "${oci_kms_vault.app_vault.id}",
    "oci_key_id": "${oci_kms_key.secrets_key.id}"
  }
}
EOF
  }
}
resource "aws_lambda_permission" "allow_eventbridge" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.secrets_sync.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.secret_rotation_rule.arn
}
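
To make the input transformer above easier to reason about, here is a minimal Python sketch of the reshaping it performs. The OCIDs and the sample event are placeholders invented for illustration; only the `detail.requestParameters.secretId` path and the four output keys come from the configuration above.

```python
import json

# Placeholder OCIDs, for illustration only (not real resources).
TARGET = {
    "oci_secret_ocid": "ocid1.vaultsecret.oc1..example",
    "oci_vault_id": "ocid1.vault.oc1..example",
    "oci_key_id": "ocid1.key.oc1..example",
}


def transform_event(cloudtrail_event: dict, target: dict = TARGET) -> dict:
    """Mimic the input_transformer: extract secretId from the CloudTrail
    event and wrap it with the static OCI identifiers."""
    secret_id = cloudtrail_event["detail"]["requestParameters"]["secretId"]
    return {"detail": {"aws_secret_arn": secret_id, **target}}


# Trimmed-down example of the event shape the rule matches.
sample_event = {
    "source": "aws.secretsmanager",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "secretsmanager.amazonaws.com",
        "eventName": "PutSecretValue",
        "requestParameters": {"secretId": "prod/database/password"},
    },
}

payload = transform_event(sample_event)
print(json.dumps(payload, indent=2))
```

Note that `secretId` carries whatever the caller passed to the API, which may be a secret name rather than a full ARN. `get_secret_value` accepts either, so the Lambda handler works in both cases.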

		

Step 8: Verifying the Pipeline

Manually trigger a rotation to test the full pipeline without waiting 30 days:

bash

			
# Force a rotation in AWS
aws secretsmanager rotate-secret \
  --secret-id prod/database/password \
  --region us-east-1
# Check Lambda execution logs
aws logs tail /aws/lambda/oci-secrets-sync --follow
# Verify the new version appeared in OCI Vault
oci vault secret get \
  --secret-id <your-oci-secret-ocid> \
  --query 'data.{name:"secret-name", version:"current-version-number", updated:"time-of-current-version-need-rotation"}' \
  --output table

		

A successful sync produces output similar to this in the Lambda logs:

			
INFO: Syncing secret: arn:aws:secretsmanager:us-east-1:123456789:secret:prod/database/password to OCI: ocid1.vaultsecret.oc1...
INFO: OCI secret updated. OCID: ocid1.vaultsecret.oc1..., New version: 3

Handling Failures and Drift

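The pipeline as built is purely event-driven with no safety net, which means if the Lambda fails, the OCI secret does not get updated. Add a dead-letter queue and a reconciliation function that runs on a schedule to catch any drift. The Lambda execution role also needs sqs:SendMessage on the queue for the failure destination to work.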

hcl

			
resource "aws_sqs_queue" "sync_dlq" {
  name                      = "secrets-sync-dlq"
  message_retention_seconds = 86400
}
resource "aws_lambda_function_event_invoke_config" "sync_retry" {
  function_name                = aws_lambda_function.secrets_sync.function_name
  maximum_retry_attempts       = 2
  maximum_event_age_in_seconds = 300
  destination_config {
    on_failure {
      destination = aws_sqs_queue.sync_dlq.arn
    }
  }
}

		

For reconciliation, a scheduled Lambda that runs every hour compares the LastRotatedDate on the AWS secret against the synced_at metadata tag on the OCI secret. If they differ by more than five minutes, it triggers a forced sync.
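
The comparison at the heart of that reconciliation job can be sketched as a small pure function. This is one possible interpretation, not a definitive implementation: the function name, the `now` parameter, and the decision to treat the five minutes as a grace period for in-flight syncs are my assumptions. In a real job, `last_rotated` would come from DescribeSecret and `synced_at_iso` from the OCI secret metadata.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed grace period: give the event-driven sync five minutes to finish
# before the reconciler steps in.
GRACE_PERIOD = timedelta(minutes=5)


def needs_resync(
    last_rotated: datetime,
    synced_at_iso: Optional[str],
    now: datetime,
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Return True when the OCI copy looks older than the AWS secret."""
    if not synced_at_iso:
        # Never synced (placeholder value still in place).
        return True
    synced_at = datetime.fromisoformat(synced_at_iso)
    if synced_at >= last_rotated:
        # OCI was updated after the latest rotation: nothing to do.
        return False
    # Rotation is newer than the last sync; only act once the grace period passed.
    return now - last_rotated > grace


now = datetime(2026, 4, 24, 12, 0, tzinfo=timezone.utc)
rotated = datetime(2026, 4, 24, 11, 0, tzinfo=timezone.utc)
print(needs_resync(rotated, "2026-04-24T10:00:00+00:00", now))  # stale copy
print(needs_resync(rotated, "2026-04-24T11:00:05+00:00", now))  # in sync
```

Comparing "rotation newer than last sync" rather than the absolute difference avoids a loop in which every forced sync moves synced_at further from LastRotatedDate and triggers another sync an hour later.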

Security Considerations

A few things to keep in mind when running this in production.

The OCI private key stored in AWS Secrets Manager should be rotated periodically, just like any other credential. Add it to your rotation schedule.

Enable CloudTrail in AWS and OCI Audit logging so every access to both secrets stores is recorded. If something is off with the sync, the audit logs tell you exactly which principal made the change and when.

Use VPC endpoints for Secrets Manager in AWS so the Lambda traffic never crosses the public internet when retrieving credentials.

On the OCI side, enable Vault audit logging to the OCI Logging service so every secret version write is captured.

Wrapping Up

This pipeline solves a real operational problem without requiring a third-party secrets broker. AWS Secrets Manager stays the authoritative source. OCI Vault stays current automatically. The only manual step is the initial deployment.

The pattern extends to other cross-cloud credential types. Database connection strings, API tokens, TLS certificates — any secret that needs to exist on both clouds can follow the same EventBridge to Lambda to OCI Vault flow. Extend the Lambda to support a mapping table of AWS secret ARNs to OCI secret OCIDs and one function handles your entire secrets estate across both providers.
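
As a rough sketch of that mapping table, assuming it is keyed by secret name and kept in a plain dictionary (in practice it might live in a config file or a parameter store), the lookup could look like this. The second entry, the OCIDs, and the helper names are hypothetical.

```python
import re
from typing import Optional

# Hypothetical mapping of AWS secret names to OCI secret OCIDs.
SECRET_MAP = {
    "prod/database/password": "ocid1.vaultsecret.oc1..exampledb",
    "prod/payments/api-token": "ocid1.vaultsecret.oc1..exampletoken",
}

# Secrets Manager appends a hyphen and six random characters to the name in the ARN.
_ARN_SUFFIX = re.compile(r"-[A-Za-z0-9]{6}$")


def secret_name(secret_id: str) -> str:
    """Accept either a secret name or a full ARN and return the bare name."""
    if secret_id.startswith("arn:"):
        name = secret_id.split(":secret:", 1)[1]
        return _ARN_SUFFIX.sub("", name)
    return secret_id


def resolve_oci_target(secret_id: str, mapping: dict = SECRET_MAP) -> Optional[str]:
    """Return the OCI secret OCID for an AWS secret, or None if it is unmapped."""
    return mapping.get(secret_name(secret_id))


arn = "arn:aws:secretsmanager:us-east-1:111122223333:secret:prod/database/password-AbCdEf"
print(resolve_oci_target(arn))
print(resolve_oci_target("some/unmapped/secret"))
```

Returning None for unmapped secrets lets the handler skip events for secrets that are not meant to leave AWS, instead of failing on them.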

Regards,
Osama

Building a Production Serverless API with OCI API Gateway and OCI Functions

Posted on April 13, 2026 by Osama Mustafa in Uncategorized

I have seen many teams deploy OCI Functions and call it done. The function works, the test passes, and then they realize there is no authentication, no rate limiting, no proper routing, and no way to version the API. The function URL is just floating there, exposed.

OCI API Gateway is what turns a collection of serverless functions into an actual production API. In this post I will walk through building a complete serverless API stack from scratch using OCI API Gateway, OCI Functions, and Terraform. Everything here is production-oriented: proper IAM, CORS, authentication via JWT, rate limiting, and a deployment pipeline.

Architecture Overview

The stack we are building looks like this:

Client Request → OCI API Gateway → Route Policy (auth + rate limit) → OCI Function → Response

The API Gateway sits in a public subnet and handles all the cross-cutting concerns: TLS termination, JWT validation, CORS headers, and usage plans. The Functions sit in a private subnet with no public exposure. The Gateway invokes them over OCI’s internal network.

Prerequisites

You need the following before starting:

OCI CLI configured with a valid profile
Terraform 1.5 or later
Docker installed locally (for building function images)
An OCI tenancy with permissions to manage API Gateway, Functions, IAM, and Networking

Setting Up the Network

Functions should never run in a public subnet. We need a VCN with a public subnet for the Gateway and a private subnet for Functions.

			
resource "oci_core_vcn" "api_vcn" {
  compartment_id = var.compartment_id
  cidr_block     = "10.0.0.0/16"
  display_name   = "api-gateway-vcn"
  dns_label      = "apivcn"
}
resource "oci_core_subnet" "public_subnet" {
  compartment_id    = var.compartment_id
  vcn_id            = oci_core_vcn.api_vcn.id
  cidr_block        = "10.0.1.0/24"
  display_name      = "gateway-public-subnet"
  dns_label         = "gatewaypub"
  route_table_id    = oci_core_route_table.public_rt.id
  security_list_ids = [oci_core_security_list.gateway_sl.id]
}
resource "oci_core_subnet" "private_subnet" {
  compartment_id             = var.compartment_id
  vcn_id                     = oci_core_vcn.api_vcn.id
  cidr_block                 = "10.0.2.0/24"
  display_name               = "functions-private-subnet"
  dns_label                  = "funcpriv"
  prohibit_public_ip_on_vnic = true
  route_table_id             = oci_core_route_table.private_rt.id
  security_list_ids          = [oci_core_security_list.functions_sl.id]
}

		

The private subnet has prohibit_public_ip_on_vnic = true. This is not optional — it ensures no Function instance can accidentally get a public IP assigned.
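
Before applying the network configuration, it can help to sanity-check the address plan. The following is a small standalone sketch using Python's ipaddress module with the CIDR blocks from the Terraform above; the function name is my own, and the check is not part of the deployment.

```python
import ipaddress

VCN = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "gateway-public-subnet": ipaddress.ip_network("10.0.1.0/24"),
    "functions-private-subnet": ipaddress.ip_network("10.0.2.0/24"),
}


def check_address_plan(vcn, subnets) -> list:
    """Return a list of problems: subnets outside the VCN or overlapping each other."""
    problems = []
    for name, net in subnets.items():
        if not net.subnet_of(vcn):
            problems.append(f"{name} ({net}) is outside the VCN range {vcn}")
    names = list(subnets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if subnets[first].overlaps(subnets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


print(check_address_plan(VCN, SUBNETS) or "address plan looks consistent")
```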

For the private subnet to reach OCI services (like Container Registry to pull images), add a Service Gateway:

			
resource "oci_core_service_gateway" "sgw" {
  compartment_id = var.compartment_id
  vcn_id         = oci_core_vcn.api_vcn.id
  display_name   = "functions-service-gateway"
  services {
    service_id = data.oci_core_services.all_services.services[0].id
  }
}
resource "oci_core_route_table" "private_rt" {
  compartment_id = var.compartment_id
  vcn_id         = oci_core_vcn.api_vcn.id
  display_name   = "private-route-table"
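  route_rules {
    network_entity_id = oci_core_service_gateway.sgw.id
    # Take the CIDR label from the same service the gateway enables, so the
    # route is not tied to one region (the "all-iad-..." label is Ashburn only).
    destination       = data.oci_core_services.all_services.services[0].cidr_block
    destination_type  = "SERVICE_CIDR_BLOCK"
  }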
}

		

IAM: Dynamic Groups and Policies

API Gateway needs explicit permission to invoke OCI Functions. This is granted through Dynamic Groups and policies, not hardcoded credentials.

			
resource "oci_identity_dynamic_group" "api_gateway_dg" {
  compartment_id = var.tenancy_ocid
  name           = "api-gateway-dynamic-group"
  description    = "Allows API Gateway to invoke Functions"
  matching_rule  = "ALL {resource.type = 'ApiGateway', resource.compartment.id = '${var.compartment_id}'}"
}
resource "oci_identity_policy" "gateway_invoke_policy" {
  compartment_id = var.compartment_id
  name           = "gateway-invoke-functions-policy"
  description    = "Grants API Gateway permission to invoke Functions"
  statements = [
    "Allow dynamic-group api-gateway-dynamic-group to use functions-family in compartment id ${var.compartment_id}"
  ]
}

		

Without this policy, the Gateway will return a 500 when it tries to invoke your function, and the error message is not always obvious about the cause.

Building and Pushing the Function

We will write a simple order validation function in Python. Create this directory structure:

order-validator/
├── func.py
├── func.yaml
└── requirements.txt

func.yaml:

			
schema_version: 20180708
name: order-validator
version: 0.0.1
runtime: python3.11
build_image: fnproject/python:3.11-dev
run_image: fnproject/python:3.11
entrypoint: /python/bin/fdk /function/func.py handler
memory: 256

		

requirements.txt:

fdk>=0.1.57

func.py:

			
import io
import json
import logging
from fdk import response
def handler(ctx, data: io.BytesIO = None):
    logger = logging.getLogger()
    
    try:
        body = json.loads(data.getvalue())
    except (ValueError, TypeError, AttributeError) as ex:  # malformed JSON or missing body
        logger.error("Failed to parse request body: " + str(ex))
        return response.Response(
            ctx,
            status_code=400,
            response_data=json.dumps({"error": "Invalid JSON in request body"}),
            headers={"Content-Type": "application/json"}
        )
    required_fields = ["order_id", "customer_id", "items", "total_amount"]
    missing = [f for f in required_fields if f not in body]
    if missing:
        return response.Response(
            ctx,
            status_code=422,
            response_data=json.dumps({
                "error": "Missing required fields",
                "fields": missing
            }),
            headers={"Content-Type": "application/json"}
        )
    if not isinstance(body.get("items"), list) or len(body["items"]) == 0:
        return response.Response(
            ctx,
            status_code=422,
            response_data=json.dumps({"error": "Order must contain at least one item"}),
            headers={"Content-Type": "application/json"}
        )
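    # Guard the comparison: a non-numeric total_amount would otherwise raise a TypeError
    if not isinstance(body["total_amount"], (int, float)) or body["total_amount"] <= 0: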
        return response.Response(
            ctx,
            status_code=422,
            response_data=json.dumps({"error": "total_amount must be greater than zero"}),
            headers={"Content-Type": "application/json"}
        )
    return response.Response(
        ctx,
        status_code=200,
        response_data=json.dumps({
            "status": "valid",
            "order_id": body["order_id"],
            "item_count": len(body["items"]),
            "validated_at": ctx.RequestID()
        }),
        headers={"Content-Type": "application/json"}
    )
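
To exercise the validation rules locally without the fdk or a deployed function, the same checks can be pulled into a plain function. This is a sketch that mirrors the logic of func.py; `validate_order` and `REQUIRED_FIELDS` are names I introduced for illustration, and the numeric type guard on total_amount is an added assumption.

```python
REQUIRED_FIELDS = ["order_id", "customer_id", "items", "total_amount"]


def validate_order(body: dict) -> tuple:
    """Return (status_code, payload) using the same rules as the function handler."""
    missing = [f for f in REQUIRED_FIELDS if f not in body]
    if missing:
        return 422, {"error": "Missing required fields", "fields": missing}
    if not isinstance(body.get("items"), list) or len(body["items"]) == 0:
        return 422, {"error": "Order must contain at least one item"}
    # Added assumption: reject non-numeric totals instead of raising TypeError.
    total = body["total_amount"]
    if not isinstance(total, (int, float)) or total <= 0:
        return 422, {"error": "total_amount must be greater than zero"}
    return 200, {
        "status": "valid",
        "order_id": body["order_id"],
        "item_count": len(body["items"]),
    }


order = {
    "order_id": "ORD-001",
    "customer_id": "CUST-999",
    "items": [{"sku": "ITEM-A", "qty": 2}],
    "total_amount": 49.99,
}
print(validate_order(order))
print(validate_order({"order_id": "ORD-001"}))
```

Keeping the rules in a pure function like this also makes them easy to unit test before building the image.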

		

Build and push the function image to OCI Container Registry:

			
# Log in to OCI Container Registry
docker login <region-key>.ocir.io -u '<tenancy-namespace>/<username>'
# Build the function image
fn build --verbose
# Tag and push
docker tag order-validator:0.0.1 <region-key>.ocir.io/<tenancy-namespace>/functions/order-validator:0.0.1
docker push <region-key>.ocir.io/<tenancy-namespace>/functions/order-validator:0.0.1

		

Deploying the Function with Terraform

			
resource "oci_functions_application" "orders_app" {
  compartment_id = var.compartment_id
  display_name   = "orders-api"
  subnet_ids     = [oci_core_subnet.private_subnet.id]
  config = {
    LOG_LEVEL = "INFO"
    ENV       = "production"
  }
  trace_config {
    is_enabled = true
    domain_id  = oci_apm_apm_domain.tracing.id
  }
}
resource "oci_functions_function" "order_validator" {
  application_id = oci_functions_application.orders_app.id
  display_name   = "order-validator"
  image          = "<region-key>.ocir.io/<tenancy-namespace>/functions/order-validator:0.0.1"
  memory_in_mbs  = 256
  timeout_in_seconds = 30
  provisioned_concurrency_config {
    strategy = "CONSTANT"
    count    = 2
  }
}

		

The provisioned_concurrency_config block with count = 2 keeps two warm instances running at all times. That removes cold starts for up to two concurrent requests, which matters for an API that needs consistent latency. Traffic beyond that level still pays the cold-start cost.

Creating the API Gateway and Deployment

This is where everything comes together. The Gateway deployment defines your routes, authentication, and rate limiting in a single resource:

			
resource "oci_apigateway_gateway" "orders_gateway" {
  compartment_id = var.compartment_id
  display_name   = "orders-api-gateway"
  endpoint_type  = "PUBLIC"
  subnet_id      = oci_core_subnet.public_subnet.id
  certificate_id = var.tls_certificate_ocid
}
resource "oci_apigateway_deployment" "orders_deployment" {
  compartment_id = var.compartment_id
  display_name   = "orders-api-v1"
  gateway_id     = oci_apigateway_gateway.orders_gateway.id
  path_prefix    = "/v1"
  specification {
    request_policies {
      authentication {
        type                        = "JWT_AUTHENTICATION"
        token_header                = "Authorization"
        token_auth_scheme           = "Bearer"
        is_anonymous_access_allowed = false
        public_keys {
          type                        = "REMOTE_JWKS"
          uri                         = "https://your-identity-provider.com/.well-known/jwks.json"
          max_cache_duration_in_hours = 1
        }
        verify_claims {
          key         = "iss"
          values      = ["https://your-identity-provider.com"]
          is_required = true
        }
        verify_claims {
          key         = "aud"
          values      = ["orders-api"]
          is_required = true
        }
      }
      rate_limiting {
        rate_in_requests_per_second = 100
        rate_key                    = "CLIENT_IP"
      }
      cors {
        allowed_origins              = ["https://yourdomain.com"]
        allowed_methods              = ["GET", "POST", "OPTIONS"]
        allowed_headers              = ["Authorization", "Content-Type"]
        max_age_in_seconds           = 3600
        is_allow_credentials_enabled = true
      }
    }
    routes {
      path    = "/orders/validate"
      methods = ["POST"]
      backend {
        type        = "ORACLE_FUNCTIONS_BACKEND"
        function_id = oci_functions_function.order_validator.id
      }
      request_policies {
        body_validation {
          required = true
          content {
            media_type      = "application/json"
            validation_type = "NONE"
          }
        }
      }
      response_policies {
        header_transformations {
          set_headers {
            items {
              name   = "X-Request-ID"
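              # $${...} escapes Terraform interpolation so the Gateway receives
              # the literal ${request.headers[x-request-id]} context variable
              values = ["$${request.headers[x-request-id]}"]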
            }
            items {
              name   = "Strict-Transport-Security"
              values = ["max-age=31536000; includeSubDomains"]
            }
          }
        }
      }
    }
    routes {
      path    = "/orders/validate"
      methods = ["OPTIONS"]
      backend {
        type   = "STOCK_RESPONSE_BACKEND"
        status = 204
        headers {
          name  = "Access-Control-Allow-Origin"
          value = "https://yourdomain.com"
        }
      }
    }
  }
}

		

A few things worth highlighting in this configuration.

The JWT authentication block uses REMOTE_JWKS, which means the Gateway fetches your identity provider’s public keys and caches them for one hour. It validates the signature, the issuer, and the audience on every request before your Function ever sees the traffic. Your function code does not need to do any token validation at all.
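
To illustrate what the two verify_claims rules check, here is a standard-library sketch that decodes a token payload and compares the iss and aud claims. It deliberately does not verify the signature, which the Gateway does against the JWKS keys, so it is a debugging aid and not a substitute for the Gateway's validation. The helper names and the sample token are made up for the example.

```python
import base64
import json


def decode_payload(token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT."""
    payload_b64 = token.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def claims_ok(claims: dict, issuer: str, audience: str) -> bool:
    """Check iss and aud; aud may be a single string or a list of strings."""
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    return claims.get("iss") == issuer and audience in audiences


def _b64url(obj: dict) -> str:
    raw = base64.urlsafe_b64encode(json.dumps(obj).encode("utf-8"))
    return raw.rstrip(b"=").decode("ascii")


# Hand-built sample token with a dummy signature segment.
token = ".".join([
    _b64url({"alg": "RS256", "typ": "JWT"}),
    _b64url({"iss": "https://your-identity-provider.com", "aud": "orders-api"}),
    "signature-not-checked",
])

claims = decode_payload(token)
print(claims_ok(claims, "https://your-identity-provider.com", "orders-api"))
print(claims_ok(claims, "https://your-identity-provider.com", "billing-api"))
```

When a request fails authentication at the Gateway, decoding the token this way is a quick method to see whether the issuer or audience simply does not match the deployment.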

The rate_limiting block uses CLIENT_IP as the rate key, which applies the 100 req/sec limit per caller rather than globally across all callers. Switch this to TOTAL if you want a single shared limit for the entire API.
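
The difference between the two rate keys is easier to see in a toy model. The sketch below is a simple fixed-window counter written only to illustrate per-caller versus shared limits. It makes no claim about how OCI API Gateway implements rate limiting internally, and the class name is hypothetical.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Toy limiter: counts requests per one-second window, per bucket key."""

    def __init__(self, limit_per_second: int, rate_key: str = "CLIENT_IP"):
        self.limit = limit_per_second
        self.rate_key = rate_key
        # (bucket key, whole second) -> request count; never pruned in this sketch
        self._counts = defaultdict(int)

    def allow(self, client_ip: str, now: float) -> bool:
        # CLIENT_IP gives every caller its own bucket; TOTAL shares one bucket.
        key = client_ip if self.rate_key == "CLIENT_IP" else "*"
        window = (key, int(now))
        if self._counts[window] >= self.limit:
            return False
        self._counts[window] += 1
        return True


per_ip = FixedWindowLimiter(2, "CLIENT_IP")
per_ip_results = [
    per_ip.allow("10.0.0.1", 0.1),
    per_ip.allow("10.0.0.1", 0.2),
    per_ip.allow("10.0.0.1", 0.3),  # third request from the same caller
    per_ip.allow("10.0.0.2", 0.4),  # a different caller has its own budget
]

shared = FixedWindowLimiter(2, "TOTAL")
shared_results = [
    shared.allow("10.0.0.1", 0.1),
    shared.allow("10.0.0.1", 0.2),
    shared.allow("10.0.0.2", 0.3),  # the shared budget is already used up
]

print(per_ip_results)
print(shared_results)
```

With CLIENT_IP, one noisy caller exhausts only its own budget. With TOTAL, the same caller can starve everyone else, which is why a shared limit is usually better suited to protecting a backend than to enforcing fairness.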

The OPTIONS route returns a 204 with no backend function invoked. This handles preflight CORS requests without consuming Function compute time.

Testing the Deployment

Once Terraform applies successfully, get your Gateway endpoint:

terraform output gateway_endpoint

Test without a token (should get 401):

			
curl -X POST https://<gateway-id>.apigateway.<region>.oci.customer-oci.com/v1/orders/validate \
  -H "Content-Type: application/json" \
  -d '{"order_id": "ORD-001"}'

Test with a valid JWT:

			
TOKEN=$(curl -s -X POST https://your-idp.com/oauth/token \
  -d "grant_type=client_credentials&client_id=...&client_secret=..." | jq -r .access_token)
curl -X POST https://<gateway-id>.apigateway.<region>.oci.customer-oci.com/v1/orders/validate \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "order_id": "ORD-001",
    "customer_id": "CUST-999",
    "items": [{"sku": "ITEM-A", "qty": 2}],
    "total_amount": 49.99
  }'

		

Expected response:

			
{
  "status": "valid",
  "order_id": "ORD-001",
  "item_count": 1,
  "validated_at": "ocid1.apigateway.request..."
}

		

Enabling Execution Logs

Without logs, debugging a Gateway issue is nearly impossible. Enable execution logs on both the Gateway and the Functions application. The Gateway deployment log is defined like this:

			
resource "oci_logging_log" "gateway_execution_log" {
  display_name = "gateway-execution-log"
  log_group_id = oci_logging_log_group.api_logs.id
  log_type     = "SERVICE"
  configuration {
    source {
      category    = "execution"
      resource    = oci_apigateway_deployment.orders_deployment.id
      service     = "apigateway"
      source_type = "OCISERVICE"
    }
    compartment_id = var.compartment_id
  }
  retention_duration = 30
  is_enabled         = true
}

		

Gateway execution logs include the full request path, JWT claim values, matched route, backend invocation time, and response status for every request. These logs are the first place to look when a request is failing.

Final Thoughts

The combination of OCI API Gateway and OCI Functions gives you a production API stack with almost no infrastructure to manage. The Gateway handles authentication, rate limiting, CORS, and TLS. The Functions handle business logic. Terraform manages the entire configuration as code, so every change is reviewed, versioned, and repeatable.

The pieces that trip most people up are the Dynamic Group IAM policy (the Gateway cannot invoke Functions without it), provisioned concurrency (without it your p99 latency will be terrible on cold paths), and execution logging (without it you are debugging blind).

Get those three right and the rest follows naturally.

Regards, Osama

Building Kubernetes Sentinel: An AI-Powered Cluster Health Dashboard

Posted on April 6, 2026 (updated April 13, 2026) by Osama Mustafa in Application

When you manage Kubernetes clusters at scale, the hardest part is not keeping things running. It is knowing when something is about to break, understanding why it broke, and fixing it before it affects users. Traditional monitoring tools give you metrics and alerts, but they leave the diagnosis entirely up to you. You still have to correlate events, read logs, cross-reference namespaces, and figure out the right kubectl commands to run.

I wanted to change that. So I built Kubernetes Sentinel, an open-source dashboard that not only watches your entire cluster in real time but also uses Claude AI to explain what went wrong and tell you exactly how to fix it.

The Problem with Kubernetes Observability

Anyone who has been on call for a Kubernetes cluster knows the feeling. Your phone goes off at 2am. A pod is crashlooping. You open your terminal, start running kubectl commands, and spend the next twenty minutes piecing together what happened from logs, events, and resource descriptions spread across multiple namespaces.

The tooling has not kept up with the complexity. Prometheus and Grafana are powerful, but they require significant setup and expertise to use effectively. Most teams end up with dashboards full of graphs they never look at and alerts that fire so often they get ignored.

What I wanted was something simpler. A single view of the entire cluster, automatic detection of anything that looks wrong, and an AI that could look at the same data an experienced SRE would look at and tell me what is happening in plain English.

What Kubernetes Sentinel Does

Kubernetes Sentinel is a FastAPI backend that runs either locally or as a pod inside your cluster. It polls the Kubernetes API every 15 seconds across all namespaces, not just one, and stores the current state in memory. A React frontend connects to it over HTTP and receives live updates via Server-Sent Events.

The dashboard gives you four things at once. A health score from 0 to 100 that reflects the overall state of your cluster. A live pod table showing every pod across every namespace with restart counts, phase, and node assignment. An event stream showing everything Kubernetes has logged, filtered and color-coded by severity. And a resources view covering your nodes, deployments, services, and persistent volume claims.

On top of that, the backend continuously runs seven anomaly detection rules: CrashLoopBackOff, OOMKilled, NodeNotReady, FailedMount, BackOff, CPUThrottling, and high restart counts. When any of these fires, an anomaly banner appears at the top of the dashboard immediately.
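
Rules of this kind can be sketched as plain functions over the polled pod state. This is a minimal illustration, not the project's actual code: the shape of the pod dictionary, the `RESTART_THRESHOLD` value, and the function name are all assumptions, and only three of the seven rules are shown.

```python
from typing import Dict, List

# Hypothetical threshold; the real project may use a different value.
RESTART_THRESHOLD = 5


def detect_pod_anomalies(pod: Dict) -> List[str]:
    """Return the names of the rules that fire for one pod snapshot."""
    anomalies = []
    waiting_reason = pod.get("waiting_reason")               # e.g. "CrashLoopBackOff"
    terminated_reason = pod.get("last_terminated_reason")    # e.g. "OOMKilled"

    if waiting_reason == "CrashLoopBackOff":
        anomalies.append("CrashLoopBackOff")
    if terminated_reason == "OOMKilled":
        anomalies.append("OOMKilled")
    if pod.get("restart_count", 0) >= RESTART_THRESHOLD:
        anomalies.append("HighRestartCount")
    return anomalies


# Made-up pod snapshots for illustration.
pods = [
    {"name": "api-7d9f", "namespace": "prod", "waiting_reason": "CrashLoopBackOff",
     "last_terminated_reason": "OOMKilled", "restart_count": 12},
    {"name": "web-5c2a", "namespace": "prod", "restart_count": 0},
]

findings = {p["name"]: detect_pod_anomalies(p) for p in pods}
print(findings)
```

Keeping each rule as a simple predicate over already-polled data means the detection step needs no extra calls to the Kubernetes API.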

The AI diagnosis feature is where it gets interesting. When you click Run Diagnosis, the backend assembles the current cluster state into a structured prompt and sends it to Claude. Within seconds you get back a plain-English summary of what is wrong, a root cause explanation, and three kubectl commands you can copy and run immediately to fix it. No more correlating events manually. No more searching Stack Overflow for the right flags.
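
Assembling the cluster state into a structured prompt could look roughly like the sketch below. The function name, the snapshot fields, and the cap of 20 events are assumptions made for illustration; the actual prompt used by the project is not shown in this post.

```python
import json
from typing import Dict, List


def build_diagnosis_prompt(health_score: int, anomalies: List[Dict], events: List[Dict]) -> str:
    """Assemble a structured prompt from the current cluster snapshot."""
    snapshot = {
        "health_score": health_score,
        "anomalies": anomalies,
        "recent_events": events[-20:],  # cap the context; 20 is an arbitrary choice
    }
    return (
        "You are an experienced SRE. Given the cluster state below, reply with:\n"
        "1. A plain-English summary of what is wrong\n"
        "2. The most likely root cause\n"
        "3. Three kubectl commands to investigate or fix it\n\n"
        "Cluster state (JSON):\n"
        + json.dumps(snapshot, indent=2)
    )


prompt = build_diagnosis_prompt(
    health_score=62,
    anomalies=[{"rule": "OOMKilled", "pod": "api-7d9f", "namespace": "prod"}],
    events=[{"type": "Warning", "reason": "BackOff", "object": "pod/api-7d9f"}],
)
print(prompt)
```

Serializing the snapshot as JSON keeps the prompt deterministic for a given cluster state, which makes diagnoses easier to reproduce.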

The Technical Decisions

I made a few deliberate architectural choices that I think are worth explaining.

The backend runs as a single process with one Uvicorn worker. This is intentional. The background polling thread lives inside the same process, so multiple workers would each start their own independent loop and you would end up with redundant API calls and inconsistent state. One process, one source of truth.
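
The single-process design can be sketched with one background thread and a lock-protected snapshot. This is an illustrative sketch only: `ClusterState`, `start_poller`, and the stand-in `fetch` function are hypothetical names, and the short interval is there so the example finishes quickly.

```python
import threading
import time
from typing import Callable, Dict


class ClusterState:
    """In-memory snapshot shared between the poller thread and request handlers."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._snapshot: Dict = {}

    def update(self, snapshot: Dict) -> None:
        with self._lock:
            self._snapshot = snapshot

    def read(self) -> Dict:
        with self._lock:
            return dict(self._snapshot)


def start_poller(state: ClusterState, fetch: Callable[[], Dict],
                 interval: float, stop: threading.Event) -> threading.Thread:
    """Run fetch() every `interval` seconds until `stop` is set."""
    def loop() -> None:
        while not stop.is_set():
            state.update(fetch())
            stop.wait(interval)  # sleeps, but wakes early on shutdown

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


state = ClusterState()
stop = threading.Event()
# fetch is a stand-in for the real Kubernetes API calls.
poller = start_poller(state, fetch=lambda: {"pods": 3, "polled_at": time.time()},
                      interval=0.05, stop=stop)
time.sleep(0.2)
stop.set()
poller.join(timeout=1)
print(state.read())
```

With a second worker process, each process would hold its own `ClusterState` and run its own loop, which is exactly the duplication the single-worker setup avoids.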

Authentication with the Kubernetes API uses the official Python client, which handles both scenarios automatically. When the sentinel runs inside a cluster as a pod, it reads the ServiceAccount token that Kubernetes mounts automatically at a well-known path. When you run it locally for development, it falls back to your kubeconfig. The same code works in both environments without any changes.

The RBAC configuration is strictly read-only. The ClusterRole I wrote grants get, list, and watch on pods, events, nodes, services, persistent volume claims, configmaps, secrets, deployments, statefulsets, daemonsets, and replicasets. Nothing else. The sentinel can observe everything but change nothing. This was a hard requirement for me. A monitoring tool should never have write access to the cluster it is watching.

For the frontend I deliberately chose a single React file with no build step. The dashboard runs as a Claude.ai artifact or drops straight into any React project. There is nothing to compile, no node_modules to install, no webpack config to debug. The entire UI is one file you can read and understand in an afternoon.

I also added a DEV_MODE flag that bypasses the Kubernetes connection entirely and loads realistic mock data instead. This means anyone can clone the repo, set DEV_MODE=true, start the backend, and see the full dashboard working within five minutes even if they have never touched Kubernetes before. It made development much faster and makes the project far more accessible for contributors.
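
A flag like this can be as small as one environment check in front of the data source. The sketch below assumes hypothetical names (`MOCK_PODS`, `fetch_pods`, `fetch_pods_from_cluster`) and made-up mock data; it is not the project's code.

```python
import os
from typing import Dict, List

# Made-up mock data for illustration.
MOCK_PODS: List[Dict] = [
    {"name": "api-7d9f", "namespace": "prod", "phase": "Running", "restart_count": 12},
    {"name": "web-5c2a", "namespace": "prod", "phase": "Running", "restart_count": 0},
    {"name": "job-1b3e", "namespace": "batch", "phase": "Pending", "restart_count": 0},
]


def dev_mode_enabled() -> bool:
    """Treat DEV_MODE=true (case-insensitive) as on; anything else is off."""
    return os.environ.get("DEV_MODE", "false").strip().lower() == "true"


def fetch_pods_from_cluster() -> List[Dict]:
    # Placeholder for the real Kubernetes API call.
    raise RuntimeError("no cluster connection in this sketch")


def fetch_pods() -> List[Dict]:
    """Return mock pods in DEV_MODE, otherwise query the cluster."""
    if dev_mode_enabled():
        return MOCK_PODS
    return fetch_pods_from_cluster()


os.environ["DEV_MODE"] = "true"
print(len(fetch_pods()))
```

Because the switch sits at the data-source boundary, everything downstream (anomaly rules, the API, the frontend) runs the same code in both modes.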

The Stack

The backend is Python 3.12 with FastAPI and the official Kubernetes client library. I used sse-starlette for Server-Sent Events, httpx for calling the Claude API, and Pydantic v2 for data validation. The Docker image is a two-stage build that ends up running as a non-root user.

The frontend is React 18 with no external UI library. All styling is plain inline JavaScript objects, which makes it trivially portable and means there are zero CSS conflicts when you embed it somewhere else.

Kubernetes manifests cover the full production deployment: namespace, ClusterRole, ClusterRoleBinding, ServiceAccount, ConfigMap, Deployment with liveness and readiness probes, and a ClusterIP Service. The Anthropic API key is never stored in any manifest file. It goes into a Kubernetes Secret created directly with kubectl.

What I Learned Building This

The biggest challenge was not the Kubernetes integration or the AI features. It was the import path problem. Claude Code generated all the backend files correctly, but because the server is started from inside the backend directory, every import had to be relative to that directory as the root. Files using “from backend.core.x import y” worked fine in isolation but crashed immediately when uvicorn tried to load them. Once I understood the issue it was a one-line fix in every file, but it cost me an hour of debugging.

The second thing I learned is that mock data is not optional for a project like this. Without DEV_MODE, you need a running Kubernetes cluster to develop against, which means either paying for cloud infrastructure or running a local cluster with kind. Adding ten lines of mock data to the poller made the development loop dramatically faster and opened the project up to contributors who want to work on the frontend without needing any cluster at all.

The AI diagnosis feature turned out to be far more useful than I expected. I assumed it would be a nice addition but not something I would rely on. After running it against realistic failure scenarios, the quality of the root cause analysis was genuinely impressive. It correctly identified memory limit misconfiguration from OOMKill events, correlated restart back-off with recent image pull failures, and suggested the right sequence of commands to investigate and resolve each issue.

Running It Yourself

The project is open source and available on GitHub. There are three ways to run it.

If you just want to see the dashboard without any cluster setup, clone the repo, copy .env.example to .env, set DEV_MODE=true and your Anthropic API key, then run uvicorn from the backend directory. The whole setup takes under five minutes.

If you have a Kubernetes cluster, set DEV_MODE=false and point it at your kubeconfig. The backend will start polling your real cluster immediately and the dashboard will show live data.

If you want to run it inside your cluster, build the Docker image, push it to your registry, create a Kubernetes Secret with your API key, and apply the manifests with kubectl. The deploy script handles the apply order automatically.

The repository is at https://github.com/OsamaOracle/k8s-sentinel/. Contributions, issues, and feedback are welcome.

Regards
Osama

Building Generative AI Applications with Vector Databases on AWS

Posted on April 3, 2026 by Osama Mustafa in Uncategorized

A few months ago, I was helping a team that had just integrated an LLM into their product. The use case was straightforward: users ask questions, the LLM answers. They had it running. The demos looked great. Then they went to production.

The model kept confidently making things up. It had no idea about the company’s internal documentation, the latest product specs, or anything that happened after its training cutoff. The team was frustrated. They had the right model, the right infrastructure, but the wrong architecture.

The fix was not fine-tuning. Fine-tuning is expensive, slow, and you have to redo it every time your data changes. The fix was Retrieval Augmented Generation, or RAG. And at the heart of RAG is something called a vector database.

In this article, I will walk you through building a production-grade RAG architecture on AWS. We will cover what vector databases actually are, when to use Aurora pgvector versus OpenSearch versus Amazon Bedrock Knowledge Bases, and how to wire everything together with real code.

What Is a Vector Database and Why Does It Matter

Before writing any infrastructure code, let me explain what problem we are actually solving.

When you work with text, images, or audio in AI systems, the raw data is not what gets compared. Instead, you pass the data through an embedding model, which converts it into a list of numbers called a vector. That vector captures the semantic meaning of the content.

Two sentences that mean the same thing will have vectors that are close to each other in vector space, even if they use completely different words. “The server is down” and “the system is not responding” will be closer to each other than “the server is down” and “I had pasta for lunch.”
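
To make "close in vector space" concrete, here is a small sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-made stand-ins, not output from a real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made 3-dimensional "embeddings" for illustration only.
server_down = [0.9, 0.1, 0.0]
system_unresponsive = [0.8, 0.2, 0.1]
pasta_for_lunch = [0.0, 0.1, 0.9]

related = cosine_similarity(server_down, system_unresponsive)
unrelated = cosine_similarity(server_down, pasta_for_lunch)
print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

The related pair scores close to 1 and the unrelated pair close to 0, which is the property a vector database exploits when it ranks results.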

A vector database is optimized for one specific operation: given a query vector, find me the N closest vectors in the collection. This is called approximate nearest neighbor search, and it is fundamentally different from SQL WHERE clauses or text search.

In a RAG architecture, the flow looks like this:

1. You chunk your documents and generate embeddings for each chunk
2. You store those embeddings in a vector database
3. When a user asks a question, you generate an embedding for the question
4. You query the vector database to retrieve the most semantically similar chunks
5. You pass the question plus those chunks to your LLM as context
6. The LLM answers based on actual, grounded information
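
The flow above can be sketched end to end in a few lines. This is a toy: `embed` uses word counts instead of a real embedding model, the store is a Python list instead of a vector database, and the final LLM call is left out. All names here are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system calls an embedding model."""
    return Counter(text.lower().split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Chunk the documents and store an embedding for each chunk.
chunks = [
    "the power button is on the right side of the device",
    "the warranty covers hardware defects for two years",
    "hold the power button for ten seconds to force a restart",
]
store: List[Tuple[str, Counter]] = [(c, embed(c)) for c in chunks]


def retrieve(question: str, k: int = 2) -> List[str]:
    """Embed the question and return the k most similar stored chunks."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: similarity(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str) -> str:
    """Combine the question with the retrieved chunks as context for the LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


print(build_prompt("where is the power button"))
```

The warranty chunk never reaches the prompt for this question, which is the whole point: the model only sees the material that is relevant to what was asked.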

The result is a model that knows your data, stays current as your data changes, and is far less likely to hallucinate facts about your knowledge base because the facts are right there in the prompt.

Options on AWS

AWS gives you three serious paths for vector storage, and choosing the wrong one will cost you performance and money.

Amazon Aurora PostgreSQL with pgvector

pgvector is an open source PostgreSQL extension that adds native vector storage and similarity search. If you already run Aurora PostgreSQL, this is often the right starting point.

The extension supports several distance metrics, including L2 (Euclidean), inner product, and cosine distance. For most text embedding use cases, cosine is what you want.

Here is a minimal setup to get you started:

			
-- Enable the extension on your Aurora instance
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table for your document chunks
CREATE TABLE document_chunks (
    id          BIGSERIAL PRIMARY KEY,
    doc_id      TEXT NOT NULL,
    chunk_text  TEXT NOT NULL,
    source_url  TEXT,
    -- 1536 dims for text-embedding-3-small; the size must match the output
    -- of whichever embedding model you use (Titan Text Embeddings V2 is 1024)
    embedding   vector(1536),
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- IVFFlat index for approximate nearest neighbor search.
-- Build it after the table has data. pgvector's guidance for lists is
-- rows / 1000 up to about 1M rows and sqrt(rows) beyond that.
CREATE INDEX ON document_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

		

			
-- Retrieve the five chunks closest to the query embedding, passed as $1
SELECT
    chunk_text,
    source_url,
    1 - (embedding <=> $1::vector) AS similarity_score
FROM document_chunks
ORDER BY embedding <=> $1::vector
LIMIT 5;

		

The <=> operator computes cosine distance. One minus that gives you similarity.

For production, tune the ivfflat.probes parameter at query time, for example by running SET ivfflat.probes = 10; in the session before the search query. Higher probes means more accuracy but slower queries. For most use cases, setting it between 10 and 20 is a reasonable balance.

Aurora pgvector is the right choice when your team already knows PostgreSQL, you want to join vector search results with relational data in the same query, or you have an existing Aurora cluster and want to avoid managing another service.

The limitation is scale. Once you push past 10 to 20 million vectors, or you need sub-10ms latency at high concurrency, you will start to feel the ceiling.

Amazon OpenSearch Service with Vector Engine

OpenSearch’s vector engine is built for scale. It uses the HNSW (Hierarchical Navigable Small World) algorithm, which delivers excellent recall and latency even at hundreds of millions of vectors.

Setting up an index for vector search:

			
PUT /documents
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 512
    }
  },
  "mappings": {
    "properties": {
      "doc_id":     { "type": "keyword" },
      "chunk_text": { "type": "text" },
      "source_url": { "type": "keyword" },
      "embedding": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "name":       "hnsw",
          "space_type": "cosinesimil",
          "engine":     "nmslib",
          "parameters": {
            "ef_construction": 512,
            "m": 16
          }
        }
      }
    }
  }
}

		

The ef_construction and m parameters control the index build quality. Higher values give better recall but increase memory usage and indexing time. For most production workloads, m=16 and ef_construction=512 is a solid baseline.

Indexing a document:

			
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth

region = "us-east-1"
service = "es"  # use "aoss" for OpenSearch Serverless collections

# Sign requests with the caller's IAM credentials (SigV4)
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                  region, service, session_token=credentials.token)

# your_opensearch_endpoint is a placeholder for the domain hostname
client = OpenSearch(
    hosts=[{"host": your_opensearch_endpoint, "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

# generate_embedding() is assumed to be defined elsewhere and to return
# a list of 1536 floats, matching the dimension in the index mapping
document = {
    "doc_id":     "product-manual-v3-page-42",
    "chunk_text": "The power button is located on the right side of the device...",
    "source_url": "s3://your-bucket/manuals/product-v3.pdf",
    "embedding":  generate_embedding("The power button is located...")
}
client.index(index="documents", body=document)

		

Querying for semantic similarity:

			
query = {
    "size": 5,
    "query": {
        "knn": {
            "embedding": {
                "vector": generate_embedding(user_question),
                "k": 5
            }
        }
    },
    "_source": ["chunk_text", "source_url"]
}
response = client.search(index="documents", body=query)

		

OpenSearch also lets you combine vector search with traditional filters, which is something pgvector struggles with at scale:

			
hybrid_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": [
                {
                    "knn": {
                        "embedding": {
                            "vector": generate_embedding(user_question),
                            "k": 20
                        }
                    }
                }
            ],
            "filter": [
                { "term": { "product_line": "enterprise" } },
                { "range": { "doc_date": { "gte": "2024-01-01" } } }
            ]
        }
    }
}

		

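Retrieving 20 candidates via vector search and then filtering them down with metadata is post-filtering: the filter runs after the k-NN search, so a restrictive filter can leave you with fewer results than you asked for. That is why k is set higher than size here. Metadata filtering of this kind is critical when your knowledge base spans multiple products, teams, or access tiers.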
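
The order in which the vector search and the metadata filter run changes what comes back. The sketch below uses made-up similarity scores and a brute-force ranking in place of a real k-NN index; the function names are assumptions for illustration.

```python
from typing import Dict, List

# Each document has a precomputed similarity to the query (made-up numbers)
# and a metadata field we want to filter on.
docs: List[Dict] = [
    {"id": "a", "score": 0.95, "product_line": "consumer"},
    {"id": "b", "score": 0.93, "product_line": "consumer"},
    {"id": "c", "score": 0.90, "product_line": "enterprise"},
    {"id": "d", "score": 0.70, "product_line": "enterprise"},
    {"id": "e", "score": 0.65, "product_line": "enterprise"},
]


def top_k(candidates: List[Dict], k: int) -> List[Dict]:
    """Brute-force stand-in for a k-NN search: the k highest-scoring documents."""
    return sorted(candidates, key=lambda d: d["score"], reverse=True)[:k]


def search_then_filter(k: int, product_line: str) -> List[str]:
    """Take the k nearest documents first, then drop those that fail the filter."""
    return [d["id"] for d in top_k(docs, k) if d["product_line"] == product_line]


def filter_then_search(k: int, product_line: str) -> List[str]:
    """Restrict to matching documents first, then take the k nearest of those."""
    matching = [d for d in docs if d["product_line"] == product_line]
    return [d["id"] for d in top_k(matching, k)]


print(search_then_filter(3, "enterprise"))   # ['c']
print(filter_then_search(3, "enterprise"))   # ['c', 'd', 'e']
```

When the filter is applied after the search, two of the three nearest documents are discarded and only one result survives. Asking for more candidates than you need, as the query above does with k of 20 for a size of 5, is the usual way to compensate.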

Amazon Bedrock Knowledge Bases

If you want the fastest path to production and do not want to manage chunking, embedding, or indexing yourself, Bedrock Knowledge Bases handles all of it.

You point it at an S3 bucket. It crawls your documents, chunks them, generates embeddings using your chosen model, and stores them in an OpenSearch Serverless collection. When you query it, it handles the retrieval and optionally the generation too.

			
resource "aws_bedrockagent_knowledge_base" "product_docs" {
  name     = "product-documentation-kb"
  role_arn = aws_iam_role.bedrock_kb_role.arn
  knowledge_base_configuration {
    type = "VECTOR"
    vector_knowledge_base_configuration {
      embedding_model_arn = "arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-embed-text-v2:0"
    }
  }
  storage_configuration {
    type = "OPENSEARCH_SERVERLESS"
    opensearch_serverless_configuration {
      collection_arn    = aws_opensearchserverless_collection.kb_vectors.arn
      vector_index_name = "bedrock-knowledge-base-default-index"
      field_mapping {
        vector_field   = "bedrock-knowledge-base-default-vector"
        text_field     = "AMAZON_BEDROCK_TEXT_CHUNK"
        metadata_field = "AMAZON_BEDROCK_METADATA"
      }
    }
  }
}
resource "aws_bedrockagent_data_source" "s3_docs" {
  knowledge_base_id = aws_bedrockagent_knowledge_base.product_docs.id
  name              = "s3-product-documentation"
  data_source_configuration {
    type = "S3"
    s3_configuration {
      bucket_arn = aws_s3_bucket.documentation.arn
    }
  }
  vector_ingestion_configuration {
    chunking_configuration {
      chunking_strategy = "SEMANTIC"
      semantic_chunking_configuration {
        max_token                       = 300
        buffer_size                     = 0
        breakpoint_percentile_threshold = 95
      }
    }
  }
}

		

Querying it from your application:

			
import boto3
bedrock_agent = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
response = bedrock_agent.retrieve_and_generate(
    input={
        "text": user_question
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    "overrideSearchType": "HYBRID"
                }
            }
        }
    }
)
answer = response["output"]["text"]
citations = response["citations"]

		

The HYBRID search type combines vector similarity with keyword search under the hood, which improves recall for queries that contain specific product names, version numbers, or technical terms that embeddings alone sometimes miss.

Chunking Strategy: The Part Everyone Gets Wrong

The quality of your RAG system depends more on how you chunk your documents than on which vector database you choose. I have seen teams spend weeks optimizing their similarity search while their chunking strategy was destroying recall.

A few rules that hold up in practice:

Chunk size matters. Too small and you lose context. Too large and you dilute the semantic signal. For most document types, 300 to 500 tokens with a 50-token overlap between chunks is a reasonable starting point. The overlap ensures that sentences that fall on chunk boundaries are still retrievable.
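
A fixed-size chunker with overlap is only a few lines. This sketch works on an already-tokenized list; the function name is an assumption, and the synthetic tokens stand in for the output of whatever tokenizer your embedding model uses.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], chunk_size: int = 400, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Synthetic tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_with_overlap(tokens, chunk_size=400, overlap=50)
print([len(c) for c in chunks])  # [400, 400, 300]
```

Each chunk starts 350 tokens after the previous one, so the last 50 tokens of one chunk are repeated at the start of the next.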

Chunk by structure when you can. If your documents have headers, sections, or natural breaks, use those as chunk boundaries rather than fixed token counts. A section about “Troubleshooting Network Errors” should stay together rather than getting split at 400 tokens.
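
Structure-based chunking can be sketched as grouping lines under the most recent header. The example assumes headers are lines starting with "#", which is a simplification; real documents may need a parser for their own format. The function and field names are hypothetical.

```python
from typing import Dict, List


def chunk_by_headers(document: str, header_prefix: str = "#") -> List[Dict[str, str]]:
    """Group lines under the most recent header so each section stays together."""
    sections: List[Dict[str, str]] = []
    title, body = "Introduction", []   # holds any text before the first header

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append({"section_title": title, "chunk_text": text})

    for line in document.splitlines():
        if line.startswith(header_prefix):
            flush()                     # close the previous section
            title, body = line.lstrip(header_prefix).strip(), []
        else:
            body.append(line)
    flush()                             # close the final section
    return sections


doc = """# Troubleshooting Network Errors
Check the cable first.
Then restart the router.
# Warranty
Hardware is covered for two years."""

sections = chunk_by_headers(doc)
for s in sections:
    print(s["section_title"], "->", repr(s["chunk_text"]))
```

Keeping the section title alongside the text also gives you a piece of metadata to store with each chunk. Sections that turn out too long for the embedding model can be split further with a fixed-size chunker.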

Store metadata with every chunk. The chunk text alone is not enough. You need the source document, the section title, the creation date, the product version. This metadata enables the filtering patterns we covered in OpenSearch and prevents your model from citing a three-year-old document when a current one exists.

Test with real queries. The only way to validate your chunking strategy is to run the queries your users will actually ask and check whether the right chunks are being retrieved. Build a small evaluation set early, before you optimize anything else.
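
A small evaluation harness can be as simple as a hit rate: for each test query, does the chunk that answers it appear in the top-k results? The evaluation set, chunk ids, and the canned retriever below are all made up for illustration; swap in a call to your own vector store.

```python
from typing import Callable, Dict, List


def hit_rate_at_k(eval_set: List[Dict], retrieve: Callable[[str, int], List[str]], k: int = 5) -> float:
    """Fraction of queries whose expected chunk id appears in the top-k results."""
    hits = sum(1 for case in eval_set if case["expected_chunk_id"] in retrieve(case["query"], k))
    return hits / len(eval_set)


# Hypothetical evaluation set: user questions paired with the chunk that answers them.
eval_set = [
    {"query": "where is the power button", "expected_chunk_id": "manual-p42"},
    {"query": "how long is the warranty", "expected_chunk_id": "warranty-p1"},
    {"query": "how do I reset the device", "expected_chunk_id": "manual-p57"},
    {"query": "is water damage covered", "expected_chunk_id": "warranty-p3"},
]

# Stand-in retriever with canned results; replace with your vector store query.
canned = {
    "where is the power button": ["manual-p42", "manual-p7"],
    "how long is the warranty": ["warranty-p1"],
    "how do I reset the device": ["manual-p12", "manual-p57"],
    "is water damage covered": ["warranty-p1", "manual-p3"],
}


def fake_retrieve(query: str, k: int) -> List[str]:
    return canned[query][:k]


print(hit_rate_at_k(eval_set, fake_retrieve, k=5))  # 0.75
print(hit_rate_at_k(eval_set, fake_retrieve, k=1))  # 0.5
```

Re-running the same set after every chunking or indexing change tells you whether retrieval got better or worse, before any prompt tuning muddies the picture.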

Embedding Model Selection

For AWS workloads, you have two main options through Bedrock:

Amazon Titan Text Embeddings V2 produces 1024-dimensional vectors. It is fast, cheap, and fine for general English text. If you are building an internal knowledge base over English documents, this is the right default.

Cohere Embed v3 supports multilingual embeddings and produces 1024-dimensional vectors with better performance on technical and domain-specific text. If your documents cover specialized subject matter (legal, medical, engineering), Cohere will typically outperform Titan on retrieval quality.

A critical point that is easy to overlook: you must use the same embedding model at indexing time and query time. If you indexed your documents with Titan and query with Cohere, the vectors live in different spaces and your similarity scores will be meaningless. Build this constraint into your infrastructure from day one.
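
One way to enforce this is to record the embedding model with the index and reject any write or query that names a different one. The sketch below is an in-memory stand-in, not a real vector store client; the class name and the tiny dimension are assumptions, and the model ids are used only as labels.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VectorIndex:
    """In-memory stand-in for a vector store that remembers its embedding model."""
    embedding_model_id: str
    dimension: int
    vectors: Dict[str, List[float]] = field(default_factory=dict)

    def _check(self, model_id: str, vector: List[float]) -> None:
        if model_id != self.embedding_model_id:
            raise ValueError(f"index was built with {self.embedding_model_id}, got {model_id}")
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")

    def add(self, chunk_id: str, vector: List[float], model_id: str) -> None:
        self._check(model_id, vector)
        self.vectors[chunk_id] = vector

    def query(self, vector: List[float], model_id: str) -> List[str]:
        self._check(model_id, vector)
        return list(self.vectors)  # similarity ranking omitted in this sketch


# A 4-dimensional index keeps the example short; real models use 1024 or more.
index = VectorIndex(embedding_model_id="amazon.titan-embed-text-v2:0", dimension=4)
index.add("chunk-1", [0.1, 0.2, 0.3, 0.4], model_id="amazon.titan-embed-text-v2:0")

try:
    index.query([0.4, 0.3, 0.2, 0.1], model_id="cohere.embed-english-v3")
except ValueError as err:
    print(f"rejected: {err}")
```

Note that the dimension check alone is not enough: Titan V2 and Cohere Embed v3 both produce 1024-dimensional vectors, so a mismatch between them would pass a size check and still return meaningless scores.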

Architecture Summary

For a production RAG system on AWS, here is the architecture that has worked well for teams I have worked with.

Document ingestion: an S3 bucket triggers a Lambda function, or Step Functions for large files. The function chunks the document, generates embeddings via Bedrock, and writes to your vector store with metadata.

Vector storage: Aurora pgvector for under 5 million vectors with heavy relational joins. OpenSearch for everything larger, or when you need metadata filtering at scale. Bedrock Knowledge Bases when you want fully managed infrastructure and your team does not want to own the pipeline.

Query path: API Gateway triggers a Lambda function that embeds the user query, retrieves top-k chunks from the vector store, builds a context-enriched prompt, and calls Claude or another Bedrock model for the final response.

Observability: CloudWatch captures embedding latency, retrieval similarity scores, and end-to-end response time. Set alerts for drops in retrieval quality, since a drop is usually a signal that something changed in your document pipeline.

Regards
Osama