Designing a Disaster Recovery Strategy on Oracle Cloud Infrastructure: A Practical Guide

Posted on February 14, 2026February 19, 2026 by Osama Mustafa in OCI

Let me be honest with you. Nobody likes thinking about disasters. It’s one of those topics we all know is important, but it often gets pushed to the bottom of the priority list until something goes wrong. And when it does go wrong, it’s usually at 3 AM on a Saturday.

I’ve seen organizations lose days of productivity, thousands of dollars, and sometimes customer trust because they didn’t have a proper disaster recovery plan. The good news? OCI makes disaster recovery achievable without breaking the bank or requiring a dedicated team of engineers.

In this article, I’ll walk you through building a realistic DR strategy on OCI. Not the theoretical stuff you find in whitepapers, but the practical decisions you’ll actually face when setting this up.

Understanding Recovery Objectives

Before we touch any OCI console, we need to talk about two numbers that will drive every decision we make.

Recovery Time Objective (RTO) answers the question: How long can your business survive without this system? If your e-commerce platform goes down, can you afford to be offline for 4 hours? 1 hour? 5 minutes?

Recovery Point Objective (RPO) answers a different question: How much data can you afford to lose? If we restore from a backup taken 2 hours ago, is that acceptable? Or do you need every single transaction preserved?

These aren’t technical questions. They’re business questions. And honestly, the answers might surprise you. I’ve worked with clients who assumed they needed zero RPO for everything, only to realize that most of their systems could tolerate 15-30 minutes of data loss without significant business impact.

Here’s how I typically categorize systems:

Tier	RTO	RPO	Examples
Critical	< 15 min	Near zero	Payment processing, core databases
Important	1-4 hours	< 1 hour	Customer portals, internal apps
Standard	4-24 hours	< 24 hours	Dev environments, reporting systems

Once you know your tiers, the technical implementation becomes much clearer.

OCI Regions and Availability Domains

OCI’s physical infrastructure is your foundation for DR. Let me explain how it works in plain terms.

Regions are geographically separate data center locations. Think Dubai, Jeddah, Frankfurt, London. They’re far enough apart that a natural disaster affecting one region won’t touch another.

Availability Domains (ADs) are independent data centers within a region. Not all regions have multiple ADs, but the larger ones do. Each AD has its own power, cooling, and networking.

Fault Domains are groupings within an AD that protect against hardware failures. Think of them as different racks or sections of the data center.

For disaster recovery, you’ll typically replicate across regions. For high availability within normal operations, you spread across ADs and fault domains.

Here’s what this looks like in practice:

			
Primary Region: Dubai (me-dubai-1)
├── Availability Domain 1
│   ├── Fault Domain 1: Web servers (set 1)
│   ├── Fault Domain 2: Web servers (set 2)
│   └── Fault Domain 3: Application servers
└── Availability Domain 2
    └── Database primary + standby
DR Region: Jeddah (me-jeddah-1)
└── Full replica (activated during disaster)

		

Database Disaster Recovery with Data Guard

Let’s start with databases because that’s usually where the most critical data lives. OCI Autonomous Database and Base Database Service both support Data Guard, which handles replication automatically.

For Autonomous Database, enabling DR is surprisingly simple:

bash

			
# Create a cross-region standby for Autonomous Database
oci db autonomous-database create-cross-region-disaster-recovery-details \
  --autonomous-database-id ocid1.autonomousdatabase.oc1.me-dubai-1.xxx \
  --disaster-recovery-type BACKUP_BASED \
  --remote-disaster-recovery-type SNAPSHOT \
  --dr-region-name me-jeddah-1

		

But here’s where it gets interesting. You have choices:

Backup-Based DR copies backups to the remote region. It’s cheaper but has higher RPO (you might lose the data since the last backup). Good for Tier 2 and Tier 3 systems.

Real-Time DR uses Active Data Guard to replicate changes continuously. Near-zero RPO but costs more because you’re running a standby database. Essential for Tier 1 systems.

For Base Database Service with Data Guard, you configure it like this:

bash

			
# Enable Data Guard for DB System
oci db data-guard-association create \
  --database-id ocid1.database.oc1.me-dubai-1.xxx \
  --creation-type NewDbSystem \
  --database-admin-password "YourSecurePassword123!" \
  --protection-mode MAXIMUM_PERFORMANCE \
  --transport-type ASYNC \
  --peer-db-system-id ocid1.dbsystem.oc1.me-jeddah-1.xxx

		

The protection modes matter:

Maximum Performance: Transactions commit without waiting for standby confirmation. Best performance, slight risk of data loss during failover.
Maximum Availability: Transactions wait for standby acknowledgment but fall back to Maximum Performance if standby is unreachable.
Maximum Protection: Transactions fail if standby is unreachable. Zero data loss, but availability depends on standby.

Most production systems use Maximum Performance or Maximum Availability. Maximum Protection is rare because it can halt your primary if the network between regions has issues.

Compute and Application Layer DR

Databases are just one piece. Your application servers, load balancers, and supporting infrastructure also need DR planning.

Option 1: Pilot Light

This is my favorite approach for most organizations. You keep a minimal footprint running in the DR region, just enough to start recovery quickly.

hcl

			
# Terraform for pilot light infrastructure in DR region
# Minimal compute that can be scaled up during disaster
resource "oci_core_instance" "dr_pilot" {
  availability_domain = data.oci_identity_availability_domain.dr_ad.name
  compartment_id      = var.compartment_id
  shape               = "VM.Standard.E4.Flex"
  
  shape_config {
    ocpus         = 1  # Minimal during normal ops
    memory_in_gbs = 8
  }
  
  display_name = "dr-pilot-instance"
  
  source_details {
    source_type = "image"
    source_id   = var.application_image_id
  }
  
  metadata = {
    ssh_authorized_keys = var.ssh_public_key
    user_data = base64encode(file("./scripts/pilot-light-startup.sh"))
  }
}
# Load balancer ready but with no backends attached
resource "oci_load_balancer" "dr_lb" {
  compartment_id = var.compartment_id
  display_name   = "dr-load-balancer"
  shape          = "flexible"
  
  shape_details {
    minimum_bandwidth_in_mbps = 10
    maximum_bandwidth_in_mbps = 100
  }
  
  subnet_ids = [oci_core_subnet.dr_public_subnet.id]
}

		

The startup script keeps the instance ready without consuming resources:

bash

			
#!/bin/bash
# pilot-light-startup.sh
# Install application but don't start it
yum install -y application-server
# Pull latest configuration from Object Storage
oci os object get \
  --bucket-name dr-config-bucket \
  --name app-config.tar.gz \
  --file /opt/app/config.tar.gz
tar -xzf /opt/app/config.tar.gz -C /opt/app/
# Leave application stopped until failover activation
echo "Pilot light instance ready. Application not started."

		

Option 2: Warm Standby

For systems that need faster recovery, you run a scaled-down version of your production environment continuously:

hcl

			
# Warm standby with reduced capacity
resource "oci_core_instance_pool" "dr_app_pool" {
  compartment_id            = var.compartment_id
  instance_configuration_id = oci_core_instance_configuration.app_config.id
  
  placement_configurations {
    availability_domain = data.oci_identity_availability_domain.dr_ad.name
    primary_subnet_id   = oci_core_subnet.dr_app_subnet.id
  }
  
  size         = 2  # Production runs 6, DR runs 2
  display_name = "dr-app-pool"
}
# Autoscaling policy to expand during failover
resource "oci_autoscaling_auto_scaling_configuration" "dr_scaling" {
  compartment_id       = var.compartment_id
  auto_scaling_resources {
    id   = oci_core_instance_pool.dr_app_pool.id
    type = "instancePool"
  }
  
  policies {
    display_name = "failover-scale-up"
    policy_type  = "threshold"
    
    rules {
      action {
        type  = "CHANGE_COUNT_BY"
        value = 4  # Add 4 instances to match production
      }
      
      metric {
        metric_type = "CPU_UTILIZATION"
        threshold {
          operator = "GT"
          value    = 70
        }
      }
    }
  }
}

		

Object Storage Replication

Your files, backups, and static assets need protection too. OCI Object Storage supports cross-region replication:

bash

			
# Create replication policy
oci os replication create-replication-policy \
  --bucket-name production-assets \
  --destination-bucket-name dr-assets \
  --destination-region me-jeddah-1 \
  --name "prod-to-dr-replication"

		

One thing people often miss: replication is asynchronous. For critical files that absolutely cannot be lost, consider writing to both regions from your application:

python

			
# Python example: Writing to both regions
import oci
def upload_critical_file(file_path, object_name):
    config_primary = oci.config.from_file(profile_name="PRIMARY")
    config_dr = oci.config.from_file(profile_name="DR")
    
    primary_client = oci.object_storage.ObjectStorageClient(config_primary)
    dr_client = oci.object_storage.ObjectStorageClient(config_dr)
    
    with open(file_path, 'rb') as f:
        file_content = f.read()
    
    # Write to primary
    primary_client.put_object(
        namespace_name="your-namespace",
        bucket_name="critical-files",
        object_name=object_name,
        put_object_body=file_content
    )
    
    # Write to DR region
    dr_client.put_object(
        namespace_name="your-namespace",
        bucket_name="critical-files-dr",
        object_name=object_name,
        put_object_body=file_content
    )
    
    print(f"File {object_name} written to both regions")

		

DNS and Traffic Management

When disaster strikes, you need to redirect users to your DR region. OCI DNS with Traffic Management makes this manageable:

hcl

			
# Traffic Management Steering Policy
resource "oci_dns_steering_policy" "failover" {
  compartment_id = var.compartment_id
  display_name   = "app-failover-policy"
  template       = "FAILOVER"
  
  # Primary region answers
  answers {
    name        = "primary"
    rtype       = "A"
    rdata       = var.primary_lb_ip
    pool        = "primary-pool"
    is_disabled = false
  }
  
  # DR region answers
  answers {
    name        = "dr"
    rtype       = "A"
    rdata       = var.dr_lb_ip
    pool        = "dr-pool"
    is_disabled = false
  }
  
  rules {
    rule_type = "FILTER"
  }
  
  rules {
    rule_type = "HEALTH"
  }
  
  rules {
    rule_type = "PRIORITY"
    default_answer_data {
      answer_condition = "answer.pool == 'primary-pool'"
      value            = 1
    }
    default_answer_data {
      answer_condition = "answer.pool == 'dr-pool'"
      value            = 2
    }
  }
}
# Health check for primary region
resource "oci_health_checks_http_monitor" "primary_health" {
  compartment_id      = var.compartment_id
  display_name        = "primary-region-health"
  interval_in_seconds = 30
  
  targets     = [var.primary_lb_ip]
  protocol    = "HTTPS"
  port        = 443
  path        = "/health"
  
  timeout_in_seconds = 10
}

		

The Failover Runbook

All this infrastructure means nothing without a clear process. Here’s a realistic runbook:

Automated Detection

python

			
# OCI Function to detect and alert on regional issues
import oci
import json
def handler(ctx, data: io.BytesIO = None):
    signer = oci.auth.signers.get_resource_principals_signer()
    monitoring_client = oci.monitoring.MonitoringClient(config={}, signer=signer)
    
    # Check critical metrics
    response = monitoring_client.summarize_metrics_data(
        compartment_id="ocid1.compartment.xxx",
        summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
            namespace="oci_lbaas",
            query='UnHealthyBackendServers[5m].sum() > 2'
        )
    )
    
    if response.data:
        # Trigger alert
        notifications_client = oci.ons.NotificationDataPlaneClient(config={}, signer=signer)
        notifications_client.publish_message(
            topic_id="ocid1.onstopic.xxx",
            message_details=oci.ons.models.MessageDetails(
                title="DR Alert: Primary Region Degraded",
                body="Multiple backend servers unhealthy. Consider initiating failover."
            )
        )
    
    return response

		

Manual Failover Steps

bash

			
#!/bin/bash
# failover.sh - Execute with caution
set -e
echo "=== OCI DISASTER RECOVERY FAILOVER ==="
echo "This will switch production traffic to the DR region."
read -p "Type 'FAILOVER' to confirm: " confirmation
if [ "$confirmation" != "FAILOVER" ]; then
    echo "Failover cancelled."
    exit 1
fi
echo "[1/5] Initiating database switchover..."
oci db data-guard-association switchover \
  --database-id $PRIMARY_DB_ID \
  --data-guard-association-id $DG_ASSOCIATION_ID
echo "[2/5] Scaling up DR compute instances..."
oci compute instance-pool update \
  --instance-pool-id $DR_INSTANCE_POOL_ID \
  --size 6
echo "[3/5] Waiting for instances to be running..."
sleep 120
echo "[4/5] Updating load balancer backends..."
oci lb backend-set update \
  --load-balancer-id $DR_LB_ID \
  --backend-set-name "app-backend-set" \
  --backends file://dr-backends.json
echo "[5/5] Updating DNS steering policy..."
oci dns steering-policy update \
  --steering-policy-id $STEERING_POLICY_ID \
  --rules file://failover-rules.json
echo "=== FAILOVER COMPLETE ==="
echo "Verify application at: https://app.example.com"

		

Testing Your DR Plan

Here’s the uncomfortable truth: a DR plan that hasn’t been tested is just documentation. You need to actually run failovers.

I recommend this schedule:

Monthly: Tabletop exercise. Walk through the runbook with your team without actually executing anything.
Quarterly: Partial failover. Switch one non-critical component to DR and back.
Annually: Full DR test. Fail over completely and run production from the DR region for at least 4 hours.

Document everything:

markdown

			
## DR Test Report - Q4 2025
**Date**: December 15, 2025
**Participants**: Ahmed, Sarah, Mohammed
**Test Type**: Full failover
### Timeline
- 09:00 - Initiated failover sequence
- 09:03 - Database switchover complete
- 09:08 - Compute instances running in DR
- 09:12 - DNS propagation confirmed
- 09:15 - Application accessible from DR region
### Issues Discovered
1. SSL certificate for DR load balancer had expired
   - Resolution: Renewed certificate, added calendar reminder
2. One microservice had hardcoded primary region endpoint
   - Resolution: Updated to use DNS name instead
### RTO Achieved
15 minutes (Target: 30 minutes) ✓
### RPO Achieved
< 30 seconds of transaction loss ✓
### Action Items
- [ ] Automate certificate renewal monitoring
- [ ] Audit all services for hardcoded endpoints
- [ ] Update runbook with SSL verification step

		

Cost Optimization

DR doesn’t have to be expensive. Here are real strategies I use:

Right-size your DR tier: Not everything needs instant failover. Be honest about what’s truly critical.

Use preemptible instances for testing: When you’re just validating your DR setup works, you don’t need full-price compute:

hcl

			
resource "oci_core_instance" "dr_test" {
  # ... other config ...
  
  preemptible_instance_config {
    preemption_action {
      type = "TERMINATE"
      preserve_boot_volume = false
    }
  }
}

		

Schedule DR resources: If you’re running warm standby, scale it down during your off-peak hours:

bash

			
# Scale down at night, scale up in morning
# Cron job or OCI Scheduler
0 22 * * * oci compute instance-pool update --instance-pool-id $POOL_ID --size 1
0 6 * * * oci compute instance-pool update --instance-pool-id $POOL_ID --size 2

Leverage reserved capacity: If you’re committed to DR, reserved capacity in your DR region is cheaper than on-demand.

Deep Dive into Oracle Kubernetes Engine Security and Networking in Production

Posted on December 22, 2025 by Osama Mustafa in Cloud, OCI

Oracle Kubernetes Engine is often introduced as a managed Kubernetes service, but its real strength only becomes clear when you operate it in production. OKE tightly integrates with OCI networking, identity, and security services, which gives you a very different operational model compared to other managed Kubernetes platforms.

This article walks through OKE from a production perspective, focusing on security boundaries, networking design, ingress exposure, private access, and mutual TLS. The goal is not to explain Kubernetes basics, but to explain how OKE behaves when you run regulated, enterprise workloads.

Understanding the OKE Networking Model

OKE does not abstract networking away from you. Every cluster is deeply tied to OCI VCN constructs.

Core Components

An OKE cluster consists of:

A managed Kubernetes control plane
Worker nodes running in OCI subnets
OCI networking primitives controlling traffic flow

Key OCI resources involved:

Virtual Cloud Network
Subnets for control plane and workers
Network Security Groups
Route tables
OCI Load Balancers

Unlike some platforms, security in OKE is enforced at multiple layers simultaneously.

Worker Node and Pod Networking

OKE uses OCI VCN-native networking. Pods receive IPs from the subnet CIDR through the OCI CNI plugin.

What this means in practice

Pods are first-class citizens on the VCN
Pod IPs are routable within the VCN
Network policies and OCI NSGs both apply

Example subnet design:

VCN: 10.0.0.0/16

Worker Subnet: 10.0.10.0/24
Load Balancer Subnet: 10.0.20.0/24
Private Endpoint Subnet: 10.0.30.0/24

This design allows you to:

Keep workers private
Expose only ingress through OCI Load Balancer
Control east-west traffic using Kubernetes NetworkPolicies and OCI NSGs together

Security Boundaries in OKE

Security in OKE is layered by design.

Layer 1: OCI IAM and Compartments

OKE clusters live inside OCI compartments. IAM policies control:

Who can create or modify clusters
Who can access worker nodes
Who can manage load balancers and subnets

Example IAM policy snippet:

Allow group OKE-Admins to manage cluster-family in compartment OKE-PROD
Allow group OKE-Admins to manage virtual-network-family in compartment OKE-PROD

This separation is critical for regulated environments.

Layer 2: Network Security Groups

Network Security Groups act as virtual firewalls at the VNIC level.

Typical NSG rules:

Allow node-to-node communication
Allow ingress from load balancer subnet only
Block all public inbound traffic

Example inbound NSG rule:

Source: 10.0.20.0/24
Protocol: TCP
Port: 443

This ensures only the OCI Load Balancer can reach your ingress controller.

Layer 3: Kubernetes Network Policies

NetworkPolicies control pod-level traffic.

Example policy allowing traffic only from ingress namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: app-prod
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              role: ingress

This blocks all lateral movement by default.

Ingress Design in OKE

OKE integrates natively with OCI Load Balancer.

Public vs Private Ingress

You can deploy ingress in two modes:

Public Load Balancer
Internal Load Balancer

For production workloads, private ingress is strongly recommended.

Example service annotation for private ingress:

service.beta.kubernetes.io/oci-load-balancer-internal: "true"
service.beta.kubernetes.io/oci-load-balancer-subnet1: ocid1.subnet.oc1..

This ensures the load balancer has no public IP.

Private Access to the Cluster Control Plane

OKE supports private API endpoints.

When enabled:

The Kubernetes API is accessible only from the VCN
No public endpoint exists

This is critical for Zero Trust environments.

Operational impact:

kubectl access requires VPN, Bastion, or OCI Cloud Shell inside the VCN
CI/CD runners must have private connectivity

This dramatically reduces the attack surface.

Mutual TLS Inside OKE

TLS termination at ingress is not enough for sensitive workloads. Many enterprises require mTLS between services.

Typical mTLS Architecture

TLS termination at ingress
Internal mTLS between services
Certificate management via Vault or cert-manager

Example cert-manager issuer using OCI Vault:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: oci-vault-issuer
spec:
  vault:
    server: https://vault.oci.oraclecloud.com
    path: pki/sign/oke

Each service receives:

Its own certificate
Short-lived credentials
Automatic rotation

Traffic Flow Example

End-to-end request path:

Client connects to OCI Load Balancer
Load Balancer forwards traffic to NGINX Ingress
Ingress enforces TLS and headers
Service-to-service traffic uses mTLS
NetworkPolicy restricts lateral movement
NSGs enforce VCN-level boundaries

Every hop is authenticated and encrypted.

Observability and Security Visibility

OKE integrates with:

OCI Logging
OCI Flow Logs
Kubernetes audit logs

This allows:

Tracking ingress traffic
Detecting unauthorized access attempts
Correlating pod-level events with network flows

Regards
Osama

Basic Guide to Build a Production-Architecture on OCI

Posted on December 3, 2025 by Osama Mustafa in Cloud, OCI

1. Why OCI for Modern Architecture?

Many architects underestimate how much OCI has matured. Today, OCI offers:

Low-latency networking with deterministic performance.
Flexible compute shapes (standard, dense I/O, high memory).
A Kubernetes service (OKE) with enterprise-level resilience.
Cloud-native storage (Block, Object, File).
A full security stack (Vault, Cloud Guard, WAF, IAM policies).
A pricing model that is often 30–50% cheaper than equivalent hyperscaler deployments.

Reference: OCI Overview
https://docs.oracle.com/en-us/iaas/Content/home.htm

2. Multi-Tier Production Architecture Overview

A typical production workload on OCI includes:

Network Layer: VCN, subnets, NAT, DRG, Load Balancers
Compute Layer: OKE, VMs, Functions
Data Layer: Autonomous DB, PostgreSQL, MySQL, Object Storage
Security Layer: OCI Vault, WAF, IAM policies
Observability Layer: Logging, Monitoring, Alarms, Prometheus/Grafana
Automation Layer: Terraform, OCI CLI, GitHub Actions/Azure DevOps

3. Networking Foundation

You start with a Virtual Cloud Network (VCN), structured in a way that isolates traffic properly:

VCN Example Layout

10.10.0.0/16 — VCN Root
- 10.10.1.0/24 — Public Subnet (Load Balancers)
- 10.10.2.0/24 — Private Subnet (Applications / OKE Nodes)
- 10.10.3.0/24 — DB Subnet
- 10.10.4.0/24 — Bastion Subnet

Terraform Example

resource "oci_core_vcn" "main" {
  cidr_block = "10.10.0.0/16"
  compartment_id = var.compartment_ocid
  display_name = "prod-vcn"
}

resource "oci_core_subnet" "private_app" {
  vcn_id = oci_core_vcn.main.id
  cidr_block = "10.10.2.0/24"
  prohibit_public_ip_on_vnic = true
  display_name = "app-private-subnet"
}

Reference: OCI Networking Concepts
https://docs.oracle.com/en-us/iaas/Content/Network/Concepts/overview.htm

4. Deploying Workloads on OKE (Oracle Kubernetes Engine)

OKE is one of OCI’s strongest services due to:

Native integration with VCN
Worker nodes running inside your own subnets
The ability to use OCI Load Balancers or NGINX ingress
Strong security by default

Cluster Creation Example (CLI)

oci ce cluster create \
  --name prod-oke \
  --vcn-id ocid1.vcn.oc1... \
  --kubernetes-version "1.30.1" \
  --compartment-id <compartment_ocid>

Node Pool Example

oci ce node-pool create \
  --name prod-nodepool \
  --cluster-id <cluster_ocid> \
  --node-shape VM.Standard3.Flex \
  --node-shape-config '{"ocpus":4,"memoryInGBs":32}' \
  --subnet-ids '["<subnet_ocid>"]'

5. Adding Ingress Traffic: OCI LB + NGINX

In multi-cloud architectures (Azure, GCP, OCI), it’s common to use Cloudflare or F5 for global routing, but within OCI you typically rely on:

OCI Load Balancer (Layer 4/7)
NGINX Ingress Controller on OKE

Example: Basic Ingress for Microservices

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payments-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: payments.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: payments-svc
            port:
              number: 8080

6. Secure Secrets With OCI Vault

Never store secrets in ConfigMaps or Docker images.
OCI Vault integrates tightly with:

Kubernetes Secrets via CSI Driver
Database credential rotation
Key management (KMS)

Example: Using OCI Vault with Kubernetes

apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
stringData:
  username: appuser
  password: ${OCI_VAULT_SECRET_DB_PASSWORD}

7. Observability: Logging + Monitoring + Prometheus

OCI Monitoring handles metrics out of the box (CPU, memory, LB metrics, OKE metrics).
But for application-level observability, you deploy Prometheus/Grafana.

Prometheus Helm Install

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring

Add ServiceMonitor for your applications:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: payments-monitor
spec:
  selector:
    matchLabels:
      app: payments
  endpoints:
  - port: http

8. Disaster Recovery and Multi-Region Strategy

OCI provides:

Block Volume replication
Object Storage Cross-Region Replication
Multi-AD (Availability Domain) deployment
Cross-region DR using Remote Peering

Example: Autonomous DB Cross-Region DR

oci db autonomous-database create-adb-cross-region-disaster-recovery \
  --autonomous-database-id <db_ocid> \
  --disaster-recovery-region "eu-frankfurt-1"

9. CI/CD on OCI Using GitHub Actions

Example pipeline to build a Docker image and deploy to OKE:

name: Deploy to OKE

on:
  push:
    branches: [ "main" ]

jobs:
  build-deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3

    - name: Build Docker Image
      run: docker build -t myapp:${{ github.sha }} .

    - name: OCI CLI Login
      run: |
        oci session authenticate

    - name: Push Image to OCIR
      run: |
        docker tag myapp:${{ github.sha }} \
        iad.ocir.io/tenancy/myapp:${{ github.sha }}
        docker push iad.ocir.io/tenancy/myapp:${{ github.sha }}

    - name: Deploy to OKE
      run: |
        kubectl set image deployment/myapp myapp=iad.ocir.io/tenancy/myapp:${{ github.sha }}

The Final Architecture will look like this

Building a Fully Private, Zero-Trust API Platform on OCI Using API Gateway, Private Endpoints, and VCN Integration

Posted on November 28, 2025November 28, 2025 by Osama Mustafa in Cloud, OCI

1. Why a Private API Gateway Matters

A typical API Gateway sits at the edge and exposes public REST endpoints.
But some environments require:

APIs callable only from internal systems
Backend microservices running in private subnets
Zero inbound public access
Authentication and authorization enforced at gateway level
Isolation between dev, test, pprd, prod

These requirements push you toward a private deployment using Private Endpoint Mode.

This means:

The API Gateway receives traffic only from inside your VCN
Clients must be inside the private network (on-prem, FastConnect, VPN, or private OCI services)
The entire flow stays within the private topology

2. Architecture Overview

A private API Gateway requires several OCI components working together:

API Gateway (Private Endpoint Mode)
VCN with private subnets
Service Gateway for private object storage access
Private Load Balancer for backend microservices
IAM policies controlling which groups can deploy APIs
VCN routing configuration to direct requests correctly
Optional WAF (private) for east-west inspection inside the VCN

The call flow:

A client inside your VCN sends a request to the Gateway’s private IP.
The Gateway handles authentication, request validation, and OCI IAM signature checks.
The Gateway forwards traffic to a backend private LB or private OKE services.
Logs go privately to Logging service via the service gateway.

All traffic stays private. No NAT, no public egress.

3. Deploying the Gateway in Private Endpoint Mode

When creating the API Gateway:

Choose Private Gateway Type
Select the VCN and Private Subnet
Ensure the subnet has no internet gateway
Disable public routing

You will receive a private IP instead of a public endpoint.

Example shape:

Private Gateway IP: 10.0.4.15
Subnet: app-private-subnet-1
VCN CIDR: 10.0.0.0/16

Only systems inside the 10.x.x.x network (or connected networks) can call it.

4. Routing APIs to Private Microservices

Your backend might be:

A microservice running in OKE
A VM instance
A container on Container Instances
A private load balancer
A function in a private subnet
An internal Oracle DB REST endpoint

For reliable routing:

a. Attach a Private Load Balancer

It’s best practice to put microservices behind an internal load balancer.

Example LB private IP: 10.0.20.10

b. Add Route Table Entries

Ensure the subnet hosting the API Gateway can route to the backend:

Destination: 10.0.20.0/24
Target: local

If OKE is involved, ensure proper security list or NSG rules:

Allow port 80 or 443 from Gateway subnet to LB subnet
Allow health checks

5. Creating an API Deployment (Technical Example)

Here is a minimal private deployment using a backend running at internal LB:

Deployment specification

{
  "routes": [
    {
      "path": "/v1/customer",
      "methods": ["GET"],
      "backend": {
        "type": "HTTP_BACKEND",
        "url": "http://10.0.20.10:8080/api/customer"
      }
    }
  ]
}

Upload this JSON file and create a new deployment under your private API Gateway.

The Gateway privately calls 10.0.20.10 using internal routing.

6. Adding Authentication and Authorization

OCI API Gateway supports:

OCI IAM Authorization (for IAM-authenticated clients)
JWT validation (OIDC tokens)
Custom authorizers using Functions

Example: validate a token from an internal identity provider.

"authentication": {
  "type": "JWT_AUTHENTICATION",
  "tokenHeader": "Authorization",
  "jwksUri": "https://id.internal.example.com/.well-known/jwks.json"
}

This ensures zero-trust by requiring token validation even inside the private network.

7. Logging, Metrics, and Troubleshooting 100 Percent Privately

Because we are running in private-only mode, logs and metrics must also stay private.

Use:

Service Gateway for Logging service
VCN Flow Logs for traffic inspection
WAF (private deployment) if deeper L7 filtering is needed

Enable Access Logs:

Enable access logs: Yes
Retention: 90 days

You will see logs in the Logging service with no public egress.

8. Common Mistakes and How to Avoid Them

Route table missing entries

Most issues come from mismatched route tables between:

Gateway subnet
Backend subnet
OKE node pools

Security Lists or NSGs blocking traffic

Ensure the backend allows inbound traffic from the Gateway subnet.

Incorrect backend URL

Use private IP or private LB hostname.

Backend certificate errors

If using HTTPS internally, ensure trusted CA is loaded on Gateway.

Regards

Osama

Implementing a Real-Time Anomaly Detection Pipeline on OCI Using Streaming Data, Oracle Autonomous Database & ML

Posted on November 22, 2025November 22, 2025 by Osama Mustafa in Cloud, OCI

Detecting unusual patterns in real time is critical to preventing outages, catching fraud, ensuring SLA compliance, and maintaining high-quality user experiences.
In this post, we build a real working pipeline on OCI that:

Ingests streaming data
Computes features in near-real time
Stores results in Autonomous Database
Runs anomaly detection logic
Sends alerts and exposes dashboards

This guide contains every technical step, including:
Streaming → Function → Autonomous DB → Anomaly Logic → Notifications → Dashboards

1. Architecture Overview

Components Used

OCI Streaming
OCI Functions
Oracle Autonomous Database
DBMS_SCHEDULER for anomaly detection job
OCI Notifications
Oracle Analytics Cloud / Grafana

2. Step-by-Step Implementation

2.1 Create OCI Streaming Stream

oci streaming stream create \
  --compartment-id $COMPARTMENT_OCID \
  --display-name "anomaly-events-stream" \
  --partitions 3

2.2 Autonomous Database Table

CREATE TABLE raw_events (
  event_id       VARCHAR2(50),
  event_time     TIMESTAMP,
  metric_value   NUMBER,
  feature1       NUMBER,
  feature2       NUMBER,
  processed_flag CHAR(1) DEFAULT 'N',
  anomaly_flag   CHAR(1) DEFAULT 'N',
  CONSTRAINT pk_raw_events PRIMARY KEY(event_id)
);

2.3 OCI Function – Feature Extraction

func.py:

import oci
import cx_Oracle
import json
from datetime import datetime

def handler(ctx, data: bytes=None):
    event = json.loads(data.decode('utf-8'))

    evt_id = event['id']
    evt_time = datetime.fromisoformat(event['time'])
    value = event['metric']

    # DB Connection
    conn = cx_Oracle.connect(user='USER', password='PWD', dsn='dsn')
    cur = conn.cursor()

    # Fetch previous value if exists
    cur.execute("SELECT metric_value FROM raw_events WHERE event_id=:1", (evt_id,))
    prev = cur.fetchone()
    prev_val = prev[0] if prev else 1.0

    # Compute features
    feature1 = value - prev_val
    feature2 = value / prev_val

    # Insert new event
    cur.execute("""
        INSERT INTO raw_events(event_id, event_time, metric_value, feature1, feature2)
        VALUES(:1, :2, :3, :4, :5)
    """, (evt_id, evt_time, value, feature1, feature2))

    conn.commit()
    cur.close()
    conn.close()

    return "ok"

Deploy the function and attach the streaming trigger.

2.4 Anomaly Detection Job (DBMS_SCHEDULER)

BEGIN
  FOR rec IN (
    SELECT event_id, feature1
    FROM raw_events
    WHERE processed_flag = 'N'
  ) LOOP
    DECLARE
      meanv NUMBER;
      stdv  NUMBER;
      zscore NUMBER;
    BEGIN
      SELECT AVG(feature1), STDDEV(feature1) INTO meanv, stdv FROM raw_events;

      zscore := (rec.feature1 - meanv) / NULLIF(stdv, 0);

      IF ABS(zscore) > 3 THEN
        UPDATE raw_events SET anomaly_flag='Y' WHERE event_id=rec.event_id;
      END IF;

      UPDATE raw_events SET processed_flag='Y' WHERE event_id=rec.event_id;
    END;
  END LOOP;
END;

Schedule this to run every 2 minutes:

BEGIN
  DBMS_SCHEDULER.CREATE_JOB (
    job_name        => 'ANOMALY_JOB',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN anomaly_detection_proc; END;',
    repeat_interval => 'FREQ=MINUTELY;INTERVAL=2;',
    enabled         => TRUE
  );
END;

2.5 Notifications

oci ons topic create \
  --compartment-id $COMPARTMENT_OCID \
  --name "AnomalyAlerts"

In the DB, add a trigger:

CREATE OR REPLACE TRIGGER notify_anomaly
AFTER UPDATE ON raw_events
FOR EACH ROW
WHEN (NEW.anomaly_flag='Y' AND OLD.anomaly_flag='N')
BEGIN
  DBMS_OUTPUT.PUT_LINE('Anomaly detected for event ' || :NEW.event_id);
END;
/

2.6 Dashboarding

You may use:

Oracle Analytics Cloud (OAC)
Grafana + ADW Integration
Any BI tool with SQL

Example Query:

SELECT event_time, metric_value, anomaly_flag 
FROM raw_events
ORDER BY event_time;

2. Terraform + OCI CLI Script Bundle

Terraform – Streaming + Function + Policies

resource "oci_streaming_stream" "anomaly" {
  name           = "anomaly-events-stream"
  partitions     = 3
  compartment_id = var.compartment_id
}

resource "oci_functions_application" "anomaly_app" {
  compartment_id = var.compartment_id
  display_name   = "anomaly-function-app"
  subnet_ids     = var.subnets
}

Terraform Notification Topic

resource "oci_ons_notification_topic" "anomaly" {
  compartment_id = var.compartment_id
  name           = "AnomalyAlerts"
}

CLI Insert Test Events

oci streaming stream message put \
  --stream-id $STREAM_OCID \
  --messages '[{"key":"1","value":"{\"id\":\"1\",\"time\":\"2025-01-01T10:00:00\",\"metric\":58}"}]'

Automating Cost-Governance Workflows in Oracle Cloud Infrastructure (OCI) with APIs & Infrastructure as Code

Posted on October 24, 2025 by Osama Mustafa in Cloud, OCI

Introduction

Cloud cost management isn’t just about checking invoices once a month — it’s about embedding automation, governance, and insights into your infrastructure so that your engineering teams make cost-aware decisions in real time. With OCI, you have native tools (Cost Analysis, Usage APIs, Budgets, etc.) and infrastructure-as-code (IaC) tooling that can help turn cost governance from an after-thought into a proactive part of your DevOps workflow.

In this article you’ll learn how to:

Extract usage and cost data via the OCI Usage API / Cost Reports.
Define IaC workflows (e.g., with Terraform) that enforce budget/usage guardrails.
Build a simple example where you automatically tag resources, monitor spend by tag, and alert/correct when thresholds are exceeded.
Discuss best practices, pitfalls, and governance recommendations for embedding FinOps into OCI operations.

1. Understanding OCI Cost & Usage Data

What data is available?

OCI provides several cost/usage-data mechanisms:

The Cost Analysis tool in the console allows you to view trends by service, compartment, tag, etc. Oracle Docs+1
The Usage/Cost Reports (CSV format) which you can download or programmatically access via the Usage API. Oracle Docs+1
The Usage API (CLI/SDK) to query usage-and-cost programmatically. Oracle Docs+1

Why this matters

By surfacing cost data at a resource, compartment, or tag level, teams can answer questions like:

“Which tag values are consuming cost disproportionately?”
“Which compartments have heavy spend growth month-over-month?”
“Which services (Compute, Storage, Database, etc.) are the highest spenders and require optimization?”

Example: Downloading a cost report via CLI

Here’s a Python/CLI snippet that shows how to download a cost-report CSV from your tenancy:

oci os object get \
  --namespace-name bling \
  --bucket-name <your-tenancy-OCID> \
  --name reports/usage-csv/<report_name>.csv.gz \
  --file local_report.csv.gz

import oci
config = oci.config.from_file("~/.oci/config", "DEFAULT")
os_client = oci.object_storage.ObjectStorageClient(config)
namespace = "bling"
bucket = "<your-tenancy-OCID>"
object_name = "reports/usage-csv/2025-10-19-report-00001.csv.gz"

resp = os_client.get_object(namespace, bucket, object_name)
with open("report-2025-10-19.csv.gz", "wb") as f:
    for chunk in resp.data.raw.stream(1024*1024, decode_content=False):
        f.write(chunk)

2. Defining Cost-Governance Workflows with IaC

Once you have data flowing in, you can enforce guardrails and automate actions. Here’s one example pattern.

a) Enforce tagging rules

Ensure that every resource created in a compartment has a cost_center tag (for example). You can do this via policy + IaC.

# Example Terraform policy for tagging requirement
resource "oci_identity_tag_namespace" "governance" {
  compartment_id = var.compartment_id
  display_name   = "governance_tags"
  is_retired     = false
}

resource "oci_identity_tag_definition" "cost_center" {
  compartment_id = var.compartment_id
  tag_namespace_id = oci_identity_tag_namespace.governance.id
  name            = "cost_center"
  description     = "Cost Center code for FinOps tracking"
  is_retired      = false
}

You can then add an IAM policy that prevents creation of resources if the tag isn’t applied (or fails to meet allowed values). For example:

Allow group ComputeAdmins to manage instance-family in compartment Prod
  where request.operation = “CreateInstance”
  and request.resource.tag.cost_center is not null

b) Monitor vs budget

Use the Usage API or Cost Reports to pull monthly spend per tag, then compare against defined budgets. If thresholds are exceeded, trigger an alert or remediation.

Here’s an example Python pseudo-code:

from datetime import datetime, timedelta
import oci

config = oci.config.from_file()
usage_client = oci.usage_api.UsageapiClient(config)

today = datetime.utcnow()
start = today.replace(day=1)
end = today

req = oci.usage_api.models.RequestSummarizedUsagesDetails(
    tenant_id = config["tenancy"],
    time_usage_started = start,
    time_usage_ended   = end,
    granularity        = "DAILY",
    group_by           = ["tag.cost_center"]
)

resp = usage_client.request_summarized_usages(req)
for item in resp.data.items:
    tag_value = item.tag_map.get("cost_center", "untagged")
    cost     = float(item.computed_amount or 0)
    print(f"Cost for cost_center={tag_value}: {cost}")

    if cost > budget_for(tag_value):
        send_alert(tag_value, cost)
        take_remediation(tag_value)

c) Automated remediation

Remediation could mean:

Auto-shut down non-production instances in compartments after hours.
Resize or terminate idle resources.
Notify owners of over-spend via email/Slack.

Terraform, OCI Functions and Event-Service can help orchestrate that. For example, set up an Event when “cost by compartment exceeds X” → invoke Function → tag resources with “cost_alerted” → optional shutdown.

3. Putting It All Together

Here is a step-by-step scenario:

Define budget categories – e.g., cost_center codes: CC-101, CC-202, CC-303.
Tag resources on creation – via policy/IaC ensure all resources include cost_center tag with one of those codes.
Collect cost data – using Usage API daily, group by tag.cost_center.
Evaluate current spend vs budget – for each code, compare cumulative cost for current month against budget.
If over budget – then:
- send an alert to the team (via SNS, email, Slack)
- optionally trigger remediation: e.g., stop non-critical compute in that cost center’s compartments.
Dashboard & visibility – load cost data into a BI tool (could be OCI Analytics Cloud or Oracle Analytics) with trends, forecasts, anomaly detection. Use the “Show cost” in OCI Ops Insights to view usage & forecast cost. Oracle Docs
Continuous improvement – right-size instances, pause dev/test at night, switch to cheaper shapes or reserved/commit models (depending on your discount model). See OCI best practice guide for optimizing cost. Oracle Docs

Example snippet – alerting logic in CLI

# example command to get summarized usage for last 7 days
oci usage-api request-summarized-usages \
  --tenant-id $TENANCY_OCID \
  --time-usage-started $(date -u -d '-7 days' +%Y-%m-%dT00:00:00Z) \
  --time-usage-ended   $(date -u +%Y-%m-%dT00:00:00Z) \
  --granularity DAILY \
  --group-by "tag.cost_center" \
  --query "data.items[?tagMap.cost_center=='CC-101'].computedAmount" \
  --raw-output

Enjoy the OCI
Osama

Building a Real-Time Recommendation Engine on Oracle Cloud Infrastructure (OCI) Using Generative AI & Streaming

Posted on October 22, 2025 by Osama Mustafa in Cloud, OCI

Introduction

In many modern applications — e-commerce, media platforms, SaaS services — providing real-time personalized recommendations is a key differentiator. With OCI’s streaming, AI/ML and serverless capabilities you can build a recommendation engine that:

Ingests user events (clicks, views, purchases) in real time
Applies a generative-AI model (or fine-tuned model) to generate suggestions
Stores, serves, and updates recommendations frequently
Enables feedback loop to refine model based on real usage

In this article you’ll learn how to:

Set up a streaming pipeline using OCI Streaming Service to ingest user events.
Use OCI Data Science or OCI AI Services + a generative model (e.g., GPT-style) to produce recommendation outputs.
Build a serving layer to deliver recommendations (via OCI Functions + API Gateway).
Create the feedback loop — capturing user interactions, updating model or embeddings, automating retraining.
Walk through code snippets, architectural decisions, best practices and pitfalls.

1. Architecture Overview

Here’s a high-level architecture for our recommendation engine:

Event Ingestion: User activities → publish to OCI Streaming (Kafka-compatible)
Processing Layer: A consumer application (OCI Functions or Data Flow) reads events, preprocesses, enriches with user/profile/context data (from Autonomous DB or NoSQL).
Model Layer: A generative model (e.g., fine-tuned GPT or embedding-based recommender) inside OCI Data Science. It takes context + user history → produces N recommendations.
Serving Layer: OCI API Gateway + OCI Functions deliver recommendations to front-end or mobile apps.
Feedback Loop: User clicks or ignores recommendations → events fed back into streaming topic → periodic retraining/refinement of model or embedding space.
Storage / Feature Store: Use Autonomous NoSQL DB or Autonomous Database for storing user profiles, item embeddings, transaction history.

2. Setting Up Streaming Ingestion

Create an OCI Streaming topic

oci streaming stream create \
  --compartment-id $COMPARTMENT_OCID \
  --display-name "user-event-stream" \
  --partitions 4

Produce events (example with Python)

import oci
from oci.streaming import StreamClient
from oci.streaming.models import PutMessagesDetails, Message

config = oci.config.from_file()
stream_client = StreamClient(config)
stream_id = "<your_stream_OCID>"

def send_event(user_id, item_id, event_type, timestamp):
    msg = Message(value=f"{user_id},{item_id},{event_type},{timestamp}")
    resp = stream_client.put_messages(
        put_messages_details=PutMessagesDetails(
            stream_id=stream_id,
            messages=[msg]
        )
    )
    return resp

# Example
send_event("U123", "I456", "view", "2025-10-19T10:15:00Z")

3. Model Layer: Generative/Embedding-Based Recommendations

Option A: Embedding + similarity lookup

We pre-compute embeddings for users and items (e.g., using a transformer or collaborative model) and store them in a vector database (or NoSQL). When a new event arrives, we update the user embedding (incrementally) and compute top-K similar items.

Option B: Fine-tuned generative model

We fine-tune a GPT-style model on historical user → recommendation sequences so that given “User U123 last 5 items: I234, I456, I890… context: browsing category Sports” we get suggestions like “I333, I777, I222”.

Example snippet using OCI Data Science and Python

import oci
# assume model endpoint is deployed
from some_sdk import RecommendationModelClient  

config = oci.config.from_file()
model_client = RecommendationModelClient(config)
endpoint = "<model_endpoint_url>"

def get_recommendations(user_id, recent_items, context, top_k=5):
    prompt = f"""User: {user_id}
RecentItems: {','.join(recent_items)}
Context: {context}
Provide {top_k} item IDs with reasons:"""
    response = model_client.predict(endpoint, prompt)
    recommended = response['recommendations']
    return recommended

# example
recs = get_recommendations("U123", ["I234","I456","I890"], "Looking for running shoes", 5)
print(recs)

Model deployment

Train/fine-tune in OCI Data Science environment
Deploy as a real-time endpoint (OCI Data Science Model Deployment)
Or optionally use OCI Functions for low-latency, light-weight inference

4. Serving Layer & Feedback Loop

Serving via API Gateway + Functions

Create an OCI Function getRecommendations that takes user_id & context and returns recommendations by calling the model endpoint or embedding lookup
Expose via OCI API Gateway for external apps

Feedback capture

After the user sees recommendations and either clicks, ignores or purchases, capture that as event rec_click, rec_ignore, purchase and publish it back to the streaming topic
Use this feedback to:
- Incrementally update user embedding
- Record reinforcement signal for later batch retraining

Scheduled retraining / embedding update

Use OCI Data Science scheduled jobs or Data Flow to run nightly or weekly batch jobs: aggregate events, update embeddings, fine-tune model
Example pseudo-code:

from datetime import datetime, timedelta
import pandas as pd
# fetch events last 7 days
events = load_events(start=datetime.utcnow()-timedelta(days=7))
# update embeddings, retrain model

Conclusion

Building a real-time recommendation engine on OCI, combining streaming ingestion, generative AI or embedding-based models, and serverless serving, enables you to deliver personalized experiences at scale. By capturing user behaviour in real time, serving timely recommendations, and closing the feedback loop, you shift from static “top N” lists to dynamic, context-aware suggestions. With careful architecture, you can deliver high performance, relevance, and scalability.

Power of the OCI AI
Enjoy
Osama

Advanced OCI Cost Management Resource Optimization and Predictive Budget Control

Posted on August 15, 2025August 15, 2025 by Osama Mustafa in Cloud, OCI

Cloud cost management has evolved from simple monitoring to sophisticated FinOps practices that combine financial accountability with operational efficiency. Oracle Cloud Infrastructure provides powerful cost management capabilities that, when combined with intelligent automation, enable organizations to optimize spending while maintaining performance and availability. This comprehensive guide explores advanced cost optimization strategies, predictive analytics, and automated governance frameworks for enterprise OCI environments.

FinOps Framework and OCI Cost Architecture

Financial Operations (FinOps) represents a cultural shift where engineering, finance, and operations teams collaborate to maximize cloud value. OCI’s cost management architecture supports this collaboration through comprehensive billing analytics, resource tagging strategies, and automated policy enforcement mechanisms.

The cost management ecosystem integrates multiple data sources including usage metrics, billing information, and performance indicators to provide holistic visibility into cloud spending patterns. Unlike traditional cost tracking approaches, modern FinOps implementations use machine learning algorithms to predict future costs and recommend optimization actions proactively.

OCI’s native cost management tools include detailed billing analytics, budget controls with automated alerts, and resource usage tracking at granular levels. The platform supports advanced tagging strategies that enable cost allocation across business units, projects, and environments while maintaining operational flexibility.

Resource lifecycle management becomes critical for cost optimization, with automated policies that right-size instances, schedule non-production workloads, and implement tiered storage strategies based on access patterns and business requirements.

Intelligent Cost Analytics and Forecasting

Advanced cost analytics goes beyond simple billing reports to provide predictive insights and optimization recommendations. Machine learning models analyze historical usage patterns, seasonal variations, and growth trends to forecast future spending with high accuracy.

Anomaly detection algorithms identify unusual spending patterns that may indicate configuration drift, unauthorized resource creation, or inefficient resource utilization. These systems can detect cost anomalies within hours rather than waiting for monthly billing cycles.

Cost attribution models enable accurate allocation of shared resources across business units while maintaining transparency in cross-functional projects. Advanced algorithms can apportion costs for shared networking, storage, and security services based on actual usage metrics rather than static allocation formulas.

Predictive scaling models combine cost forecasting with performance requirements to recommend optimal resource configurations that minimize costs while meeting service level objectives.

Production Implementation with Automated Optimization

Here’s a comprehensive implementation of intelligent cost management with automated optimization and predictive analytics:

Infrastructure Cost Monitoring and Optimization Framework

#!/usr/bin/env python3
"""
Advanced OCI Cost Management and FinOps Automation Platform
Provides intelligent cost optimization, predictive analytics, and automated
governance for enterprise Oracle Cloud Infrastructure environments.
"""

import oci
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Optional, Any
from dataclasses import dataclass, field
from enum import Enum
import logging
import asyncio
import json
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email import encoders
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class CostSeverity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class OptimizationAction(Enum):
    RIGHT_SIZE = "right_size"
    SCHEDULE = "schedule"
    MIGRATE_STORAGE = "migrate_storage"
    TERMINATE = "terminate"
    UPGRADE_COMMITMENT = "upgrade_commitment"

@dataclass
class CostAnomaly:
    """Container for cost anomaly detection results"""
    resource_id: str
    resource_type: str
    resource_name: str
    expected_cost: float
    actual_cost: float
    anomaly_score: float
    severity: CostSeverity
    detected_at: datetime
    description: str
    recommended_action: OptimizationAction
    potential_savings: float = 0.0

@dataclass
class OptimizationRecommendation:
    """Container for cost optimization recommendations"""
    resource_id: str
    resource_type: str
    current_config: Dict[str, Any]
    recommended_config: Dict[str, Any]
    current_monthly_cost: float
    projected_monthly_cost: float
    potential_savings: float
    confidence_score: float
    implementation_effort: str
    risk_level: str
    business_impact: str

@dataclass
class BudgetAlert:
    """Container for budget alert information"""
    budget_name: str
    current_spend: float
    budget_amount: float
    utilization_percentage: float
    forecast_spend: float
    days_remaining: int
    severity: CostSeverity
    recommendations: List[str]

class OCICostOptimizer:
    def __init__(self, config_file: str = 'cost_config.yaml'):
        """Initialize the cost optimization system"""
        self.config = self._load_config(config_file)
        self.signer = oci.auth.signers.get_resource_principals_signer()
        
        # Initialize OCI clients
        self.usage_client = oci.usage_api.UsageapiClient({}, signer=self.signer)
        self.compute_client = oci.core.ComputeClient({}, signer=self.signer)
        self.network_client = oci.core.VirtualNetworkClient({}, signer=self.signer)
        self.storage_client = oci.core.BlockstorageClient({}, signer=self.signer)
        self.monitoring_client = oci.monitoring.MonitoringClient({}, signer=self.signer)
        self.budgets_client = oci.budget.BudgetClient({}, signer=self.signer)
        
        # Cost tracking and ML models
        self.cost_history = pd.DataFrame()
        self.anomaly_detector = IsolationForest(contamination=0.1, random_state=42)
        self.cost_forecaster = LinearRegression()
        self.scaler = StandardScaler()
        
        # Cost optimization thresholds
        self.thresholds = {
            'cost_spike_factor': 2.0,
            'utilization_threshold': 20.0,
            'savings_threshold': 50.0,
            'risk_tolerance': 'medium'
        }

    def _load_config(self, config_file: str) -> Dict:
        """Load configuration from file"""
        import yaml
        try:
            with open(config_file, 'r') as f:
                return yaml.safe_load(f)
        except FileNotFoundError:
            logger.warning(f"Config file {config_file} not found, using defaults")
            return {
                'tenancy_id': 'your-tenancy-id',
                'compartment_id': 'your-compartment-id',
                'time_granularity': 'DAILY',
                'forecast_days': 30,
                'optimization_enabled': True
            }

    async def analyze_cost_trends(self, days_back: int = 90) -> Dict[str, Any]:
        """Analyze cost trends and identify patterns"""
        end_date = datetime.utcnow()
        start_date = end_date - timedelta(days=days_back)
        
        try:
            # Get usage data from OCI
            usage_data = await self._fetch_usage_data(start_date, end_date)
            
            if usage_data.empty:
                logger.warning("No usage data available for analysis")
                return {}
            
            # Perform trend analysis
            trends = {
                'total_cost_trend': self._calculate_cost_trend(usage_data),
                'service_cost_breakdown': self._analyze_service_costs(usage_data),
                'daily_cost_variation': self._analyze_daily_patterns(usage_data),
                'cost_efficiency_metrics': self._calculate_efficiency_metrics(usage_data),
                'anomalies': await self._detect_cost_anomalies(usage_data)
            }
            
            # Generate cost forecast
            trends['cost_forecast'] = await self._forecast_costs(usage_data)
            
            return trends
            
        except Exception as e:
            logger.error(f"Failed to analyze cost trends: {str(e)}")
            return {}

    async def _fetch_usage_data(self, start_date: datetime, end_date: datetime) -> pd.DataFrame:
        """Fetch usage and cost data from OCI"""
        try:
            request_details = oci.usage_api.models.RequestSummarizedUsagesDetails(
                tenant_id=self.config['tenancy_id'],
                time_usage_started=start_date,
                time_usage_ended=end_date,
                granularity=self.config.get('time_granularity', 'DAILY'),
                compartment_depth=6,
                group_by=['compartmentName', 'service', 'resource']
            )
            
            response = self.usage_client.request_summarized_usages(
                request_details=request_details
            )
            
            # Convert to DataFrame
            usage_records = []
            for item in response.data.items:
                usage_records.append({
                    'date': item.time_usage_started,
                    'compartment': item.compartment_name,
                    'service': item.service,
                    'resource': item.resource_name,
                    'computed_amount': float(item.computed_amount) if item.computed_amount else 0.0,
                    'computed_quantity': float(item.computed_quantity) if item.computed_quantity else 0.0,
                    'currency': item.currency,
                    'unit': item.unit,
                    'tags': item.tags if item.tags else {}
                })
            
            df = pd.DataFrame(usage_records)
            if not df.empty:
                df['date'] = pd.to_datetime(df['date'])
                df = df.sort_values('date')
            
            return df
            
        except Exception as e:
            logger.error(f"Failed to fetch usage data: {str(e)}")
            return pd.DataFrame()

    def _calculate_cost_trend(self, usage_data: pd.DataFrame) -> Dict[str, Any]:
        """Calculate overall cost trends"""
        if usage_data.empty:
            return {}
        
        # Group by date and sum costs
        daily_costs = usage_data.groupby('date')['computed_amount'].sum().reset_index()
        
        if len(daily_costs) < 7:
            return {'trend': 'insufficient_data'}
        
        # Calculate trend metrics
        days = np.arange(len(daily_costs))
        costs = daily_costs['computed_amount'].values
        
        # Linear regression for trend
        slope, intercept = np.polyfit(days, costs, 1)
        trend_direction = 'increasing' if slope > 0 else 'decreasing'
        
        # Calculate period-over-period growth
        recent_period = costs[-7:].mean()
        previous_period = costs[-14:-7].mean() if len(costs) >= 14 else costs[:-7].mean()
        
        growth_rate = ((recent_period - previous_period) / previous_period * 100) if previous_period > 0 else 0
        
        # Cost volatility
        volatility = np.std(costs) / np.mean(costs) * 100 if np.mean(costs) > 0 else 0
        
        return {
            'trend': trend_direction,
            'growth_rate_percent': round(growth_rate, 2),
            'volatility_percent': round(volatility, 2),
            'average_daily_cost': round(np.mean(costs), 2),
            'total_period_cost': round(np.sum(costs), 2),
            'trend_slope': slope
        }

    def _analyze_service_costs(self, usage_data: pd.DataFrame) -> Dict[str, Any]:
        """Analyze costs by service type"""
        if usage_data.empty:
            return {}
        
        service_costs = usage_data.groupby('service')['computed_amount'].agg([
            'sum', 'mean', 'count'
        ]).round(2)
        
        service_costs.columns = ['total_cost', 'avg_cost', 'usage_count']
        service_costs['cost_percentage'] = (
            service_costs['total_cost'] / service_costs['total_cost'].sum() * 100
        ).round(2)
        
        # Identify top cost drivers
        top_services = service_costs.nlargest(10, 'total_cost')
        
        # Calculate service growth rates
        service_growth = {}
        for service in usage_data['service'].unique():
            service_data = usage_data[usage_data['service'] == service]
            if len(service_data) >= 14:
                recent_cost = service_data.tail(7)['computed_amount'].sum()
                previous_cost = service_data.iloc[-14:-7]['computed_amount'].sum()
                
                if previous_cost > 0:
                    growth = (recent_cost - previous_cost) / previous_cost * 100
                    service_growth[service] = round(growth, 2)
        
        return {
            'service_breakdown': top_services.to_dict('index'),
            'service_growth_rates': service_growth,
            'total_services': len(service_costs),
            'cost_concentration': service_costs['cost_percentage'].iloc[0]  # Top service percentage
        }

    def _analyze_daily_patterns(self, usage_data: pd.DataFrame) -> Dict[str, Any]:
        """Analyze daily usage patterns"""
        if usage_data.empty:
            return {}
        
        usage_data['day_of_week'] = usage_data['date'].dt.day_name()
        usage_data['hour'] = usage_data['date'].dt.hour
        
        # Daily patterns
        daily_avg = usage_data.groupby('day_of_week')['computed_amount'].mean()
        
        # Identify peak and off-peak periods
        peak_day = daily_avg.idxmax()
        off_peak_day = daily_avg.idxmin()
        
        # Weekend vs weekday analysis
        weekends = ['Saturday', 'Sunday']
        weekend_avg = usage_data[usage_data['day_of_week'].isin(weekends)]['computed_amount'].mean()
        weekday_avg = usage_data[~usage_data['day_of_week'].isin(weekends)]['computed_amount'].mean()
        
        weekend_ratio = weekend_avg / weekday_avg if weekday_avg > 0 else 0
        
        return {
            'daily_averages': daily_avg.to_dict(),
            'peak_day': peak_day,
            'off_peak_day': off_peak_day,
            'weekend_to_weekday_ratio': round(weekend_ratio, 2),
            'cost_variation_coefficient': round(daily_avg.std() / daily_avg.mean(), 2) if daily_avg.mean() > 0 else 0
        }

    def _calculate_efficiency_metrics(self, usage_data: pd.DataFrame) -> Dict[str, Any]:
        """Calculate cost efficiency metrics"""
        if usage_data.empty:
            return {}
        
        # Cost per unit metrics
        efficiency_metrics = {}
        
        for service in usage_data['service'].unique():
            service_data = usage_data[usage_data['service'] == service]
            
            if service_data['computed_quantity'].sum() > 0:
                cost_per_unit = (
                    service_data['computed_amount'].sum() / 
                    service_data['computed_quantity'].sum()
                )
                efficiency_metrics[service] = {
                    'cost_per_unit': round(cost_per_unit, 4),
                    'total_units': service_data['computed_quantity'].sum(),
                    'unit_type': service_data['unit'].iloc[0] if len(service_data) > 0 else 'unknown'
                }
        
        # Overall efficiency trends
        total_cost = usage_data['computed_amount'].sum()
        total_quantity = usage_data['computed_quantity'].sum()
        
        return {
            'service_efficiency': efficiency_metrics,
            'overall_cost_per_unit': round(total_cost / total_quantity, 4) if total_quantity > 0 else 0,
            'efficiency_score': self._calculate_efficiency_score(usage_data)
        }

    def _calculate_efficiency_score(self, usage_data: pd.DataFrame) -> float:
        """Calculate overall efficiency score (0-100)"""
        if usage_data.empty:
            return 0.0
        
        # Factors that contribute to efficiency score
        factors = []
        
        # Cost volatility (lower is better)
        daily_costs = usage_data.groupby('date')['computed_amount'].sum()
        if len(daily_costs) > 1:
            volatility = daily_costs.std() / daily_costs.mean()
            volatility_score = max(0, 100 - (volatility * 100))
            factors.append(volatility_score)
        
        # Resource utilization (mock calculation - would need actual metrics)
        # In real implementation, this would come from monitoring data
        utilization_score = 75  # Placeholder
        factors.append(utilization_score)
        
        # Cost trend (stable or decreasing is better)
        if len(daily_costs) >= 7:
            recent_avg = daily_costs.tail(7).mean()
            previous_avg = daily_costs.head(7).mean()
            
            if previous_avg > 0:
                trend_factor = (previous_avg - recent_avg) / previous_avg
                trend_score = min(100, max(0, 50 + (trend_factor * 50)))
                factors.append(trend_score)
        
        return round(np.mean(factors), 1) if factors else 50.0

    async def _detect_cost_anomalies(self, usage_data: pd.DataFrame) -> List[CostAnomaly]:
        """Detect cost anomalies using machine learning"""
        anomalies = []
        
        if usage_data.empty or len(usage_data) < 30:
            return anomalies
        
        try:
            # Prepare data for anomaly detection
            daily_costs = usage_data.groupby(['date', 'service'])['computed_amount'].sum().reset_index()
            
            for service in daily_costs['service'].unique():
                service_data = daily_costs[daily_costs['service'] == service]
                
                if len(service_data) < 14:  # Need sufficient data
                    continue
                
                costs = service_data['computed_amount'].values.reshape(-1, 1)
                
                # Fit anomaly detector
                detector = IsolationForest(contamination=0.1, random_state=42)
                detector.fit(costs)
                
                # Detect anomalies
                anomaly_scores = detector.decision_function(costs)
                is_anomaly = detector.predict(costs) == -1
                
                # Process anomalies
                for i, (anomaly, score) in enumerate(zip(is_anomaly, anomaly_scores)):
                    if anomaly:
                        date = service_data.iloc[i]['date']
                        actual_cost = service_data.iloc[i]['computed_amount']
                        
                        # Calculate expected cost (median of recent normal values)
                        normal_costs = costs[~is_anomaly]
                        expected_cost = np.median(normal_costs) if len(normal_costs) > 0 else actual_cost
                        
                        # Determine severity
                        cost_factor = actual_cost / expected_cost if expected_cost > 0 else 1
                        
                        if cost_factor >= 3:
                            severity = CostSeverity.CRITICAL
                        elif cost_factor >= 2:
                            severity = CostSeverity.HIGH
                        elif cost_factor >= 1.5:
                            severity = CostSeverity.MEDIUM
                        else:
                            severity = CostSeverity.LOW
                        
                        anomaly = CostAnomaly(
                            resource_id=f"{service}-{date.strftime('%Y%m%d')}",
                            resource_type=service,
                            resource_name=service,
                            expected_cost=expected_cost,
                            actual_cost=actual_cost,
                            anomaly_score=abs(score),
                            severity=severity,
                            detected_at=datetime.utcnow(),
                            description=f"Cost spike detected: {actual_cost:.2f} vs expected {expected_cost:.2f}",
                            recommended_action=OptimizationAction.RIGHT_SIZE,
                            potential_savings=actual_cost - expected_cost
                        )
                        
                        anomalies.append(anomaly)
            
            return sorted(anomalies, key=lambda x: x.potential_savings, reverse=True)
            
        except Exception as e:
            logger.error(f"Failed to detect cost anomalies: {str(e)}")
            return []

    async def _forecast_costs(self, usage_data: pd.DataFrame, forecast_days: int = 30) -> Dict[str, Any]:
        """Forecast future costs using machine learning"""
        if usage_data.empty or len(usage_data) < 14:
            return {'status': 'insufficient_data'}
        
        try:
            # Prepare data for forecasting
            daily_costs = usage_data.groupby('date')['computed_amount'].sum().reset_index()
            daily_costs['days'] = (daily_costs['date'] - daily_costs['date'].min()).dt.days
            
            X = daily_costs[['days']].values
            y = daily_costs['computed_amount'].values
            
            # Fit forecasting model
            self.cost_forecaster.fit(X, y)
            
            # Generate forecast
            last_day = daily_costs['days'].max()
            future_days = np.arange(last_day + 1, last_day + forecast_days + 1).reshape(-1, 1)
            forecasted_costs = self.cost_forecaster.predict(future_days)
            
            # Calculate confidence intervals (simplified)
            residuals = y - self.cost_forecaster.predict(X)
            std_error = np.std(residuals)
            
            forecast_dates = [
                daily_costs['date'].max() + timedelta(days=i) 
                for i in range(1, forecast_days + 1)
            ]
            
            forecast_data = []
            for i, (date, cost) in enumerate(zip(forecast_dates, forecasted_costs)):
                forecast_data.append({
                    'date': date.strftime('%Y-%m-%d'),
                    'forecasted_cost': round(max(0, cost), 2),
                    'confidence_lower': round(max(0, cost - 1.96 * std_error), 2),
                    'confidence_upper': round(cost + 1.96 * std_error, 2)
                })
            
            return {
                'status': 'success',
                'forecast_period_days': forecast_days,
                'total_forecasted_cost': round(sum(forecasted_costs), 2),
                'average_daily_cost': round(np.mean(forecasted_costs), 2),
                'forecast_accuracy': round(self.cost_forecaster.score(X, y), 3),
                'daily_forecasts': forecast_data
            }
            
        except Exception as e:
            logger.error(f"Failed to forecast costs: {str(e)}")
            return {'status': 'error', 'message': str(e)}

    async def discover_optimization_opportunities(self) -> List[OptimizationRecommendation]:
        """Discover cost optimization opportunities across resources"""
        recommendations = []
        
        try:
            # Discover compute instances
            compute_recommendations = await self._analyze_compute_costs()
            recommendations.extend(compute_recommendations)
            
            # Discover storage optimization
            storage_recommendations = await self._analyze_storage_costs()
            recommendations.extend(storage_recommendations)
            
            # Discover network optimization
            network_recommendations = await self._analyze_network_costs()
            recommendations.extend(network_recommendations)
            
            # Sort by potential savings
            recommendations.sort(key=lambda x: x.potential_savings, reverse=True)
            
            return recommendations
            
        except Exception as e:
            logger.error(f"Failed to discover optimization opportunities: {str(e)}")
            return []

    async def _analyze_compute_costs(self) -> List[OptimizationRecommendation]:
        """Analyze compute instance costs and recommend optimizations"""
        recommendations = []
        
        try:
            # Get all compute instances
            instances = self.compute_client.list_instances(
                compartment_id=self.config['compartment_id'],
                lifecycle_state='RUNNING'
            ).data
            
            for instance in instances:
                # Get instance metrics (simplified - would use actual monitoring data)
                utilization_data = await self._get_instance_utilization(instance.id)
                
                # Calculate current cost (simplified pricing)
                current_cost = self._calculate_instance_cost(instance)
                
                # Analyze for right-sizing opportunities
                if utilization_data.get('cpu_utilization', 50) < 20:
                    # Recommend smaller shape
                    recommended_shape = self._recommend_smaller_shape(instance.shape)
                    
                    if recommended_shape:
                        projected_cost = current_cost * 0.6  # Approximate cost reduction
                        savings = current_cost - projected_cost
                        
                        recommendation = OptimizationRecommendation(
                            resource_id=instance.id,
                            resource_type='compute_instance',
                            current_config={
                                'shape': instance.shape,
                                'ocpus': getattr(instance.shape_config, 'ocpus', 'unknown'),
                                'memory_gb': getattr(instance.shape_config, 'memory_in_gbs', 'unknown')
                            },
                            recommended_config={
                                'shape': recommended_shape,
                                'action': 'resize_instance'
                            },
                            current_monthly_cost=current_cost,
                            projected_monthly_cost=projected_cost,
                            potential_savings=savings,
                            confidence_score=0.8,
                            implementation_effort='medium',
                            risk_level='low',
                            business_impact='minimal'
                        )
                        
                        recommendations.append(recommendation)
                
                # Check for unused instances
                if utilization_data.get('cpu_utilization', 50) < 5:
                    recommendation = OptimizationRecommendation(
                        resource_id=instance.id,
                        resource_type='compute_instance',
                        current_config={'shape': instance.shape, 'state': 'running'},
                        recommended_config={'action': 'terminate_or_stop'},
                        current_monthly_cost=current_cost,
                        projected_monthly_cost=0,
                        potential_savings=current_cost,
                        confidence_score=0.9,
                        implementation_effort='low',
                        risk_level='medium',
                        business_impact='requires_validation'
                    )
                    
                    recommendations.append(recommendation)
            
            return recommendations
            
        except Exception as e:
            logger.error(f"Failed to analyze compute costs: {str(e)}")
            return []

    async def _get_instance_utilization(self, instance_id: str) -> Dict[str, float]:
        """Get instance utilization metrics (simplified)"""
        try:
            # In a real implementation, this would query OCI Monitoring
            # For demo purposes, returning mock data
            return {
                'cpu_utilization': np.random.uniform(5, 95),
                'memory_utilization': np.random.uniform(10, 90),
                'network_utilization': np.random.uniform(1, 50)
            }
        except Exception as e:
            logger.error(f"Failed to get utilization for {instance_id}: {str(e)}")
            return {}

    def _calculate_instance_cost(self, instance) -> float:
        """Calculate monthly cost for instance (simplified)"""
        # Simplified cost calculation - in reality would use OCI pricing API
        shape_costs = {
            'VM.Standard2.1': 67.0,
            'VM.Standard2.2': 134.0,
            'VM.Standard2.4': 268.0,
            'VM.Standard2.8': 536.0,
            'VM.Standard.E3.Flex': 50.0,  # Base cost
            'VM.Standard.E4.Flex': 45.0   # Base cost
        }
        
        base_cost = shape_costs.get(instance.shape, 100.0)
        
        # Adjust for flex shapes based on OCPUs
        if 'Flex' in instance.shape and hasattr(instance, 'shape_config'):
            if hasattr(instance.shape_config, 'ocpus'):
                base_cost *= float(instance.shape_config.ocpus)
        
        return base_cost

    def _recommend_smaller_shape(self, current_shape: str) -> Optional[str]:
        """Recommend a smaller instance shape"""
        shape_hierarchy = {
            'VM.Standard2.8': 'VM.Standard2.4',
            'VM.Standard2.4': 'VM.Standard2.2',
            'VM.Standard2.2': 'VM.Standard2.1',
            'VM.Standard.E4.Flex': 'VM.Standard.E3.Flex'
        }
        
        return shape_hierarchy.get(current_shape)

    async def _analyze_storage_costs(self) -> List[OptimizationRecommendation]:
        """Analyze storage costs and recommend optimizations"""
        recommendations = []
        
        try:
            # Get block volumes
            volumes = self.storage_client.list_volumes(
                compartment_id=self.config['compartment_id'],
                lifecycle_state='AVAILABLE'
            ).data
            
            for volume in volumes:
                # Analyze volume usage patterns (simplified)
                usage_pattern = await self._analyze_volume_usage(volume.id)
                
                current_cost = volume.size_in_gbs * 0.0255  # Simplified cost per GB
                
                # Check for infrequent access patterns
                if usage_pattern.get('access_frequency', 'high') == 'low':
                    # Recommend moving to lower performance tier
                    projected_cost = current_cost * 0.7  # Lower tier pricing
                    savings = current_cost - projected_cost
                    
                    recommendation = OptimizationRecommendation(
                        resource_id=volume.id,
                        resource_type='block_volume',
                        current_config={
                            'size_gb': volume.size_in_gbs,
                            'vpus_per_gb': getattr(volume, 'vpus_per_gb', 10)
                        },
                        recommended_config={
                            'action': 'change_volume_performance',
                            'new_vpus_per_gb': 0
                        },
                        current_monthly_cost=current_cost,
                        projected_monthly_cost=projected_cost,
                        potential_savings=savings,
                        confidence_score=0.7,
                        implementation_effort='low',
                        risk_level='low',
                        business_impact='minimal'
                    )
                    
                    recommendations.append(recommendation)
                
                # Check for oversized volumes
                if usage_pattern.get('utilization_percent', 50) < 30:
                    # Recommend volume resize
                    new_size = int(volume.size_in_gbs * 0.6)
                    projected_cost = new_size * 0.0255
                    savings = current_cost - projected_cost
                    
                    recommendation = OptimizationRecommendation(
                        resource_id=volume.id,
                        resource_type='block_volume',
                        current_config={'size_gb': volume.size_in_gbs},
                        recommended_config={
                            'action': 'resize_volume',
                            'new_size_gb': new_size
                        },
                        current_monthly_cost=current_cost,
                        projected_monthly_cost=projected_cost,
                        potential_savings=savings,
                        confidence_score=0.6,
                        implementation_effort='medium',
                        risk_level='medium',
                        business_impact='requires_validation'
                    )
                    
                    recommendations.append(recommendation)
            
            return recommendations
            
        except Exception as e:
            logger.error(f"Failed to analyze storage costs: {str(e)}")
            return []

    async def _analyze_volume_usage(self, volume_id: str) -> Dict[str, Any]:
        """Analyze volume usage patterns (simplified)"""
        # In reality, this would analyze metrics from OCI Monitoring
        return {
            'access_frequency': np.random.choice(['high', 'medium', 'low'], p=[0.3, 0.4, 0.3]),
            'utilization_percent': np.random.uniform(10, 95),
            'iops_usage': np.random.uniform(100, 10000)
        }

    async def _analyze_network_costs(self) -> List[OptimizationRecommendation]:
        """Analyze network costs and recommend optimizations"""
        recommendations = []
        
        try:
            # Get load balancers
            load_balancers = self.network_client.list_load_balancers(
                compartment_id=self.config['compartment_id']
            ).data
            
            for lb in load_balancers:
                # Analyze load balancer utilization
                utilization = await self._analyze_lb_utilization(lb.id)
                
                # Calculate current cost (simplified)
                if hasattr(lb, 'shape_details') and lb.shape_details:
                    current_bandwidth = lb.shape_details.maximum_bandwidth_in_mbps
                    current_cost = current_bandwidth * 0.008  # Simplified pricing
                    
                    # Check for over-provisioning
                    if utilization.get('avg_bandwidth_usage', 50) < current_bandwidth * 0.3:
                        recommended_bandwidth = max(10, int(current_bandwidth * 0.5))
                        projected_cost = recommended_bandwidth * 0.008
                        savings = current_cost - projected_cost
                        
                        recommendation = OptimizationRecommendation(
                            resource_id=lb.id,
                            resource_type='load_balancer',
                            current_config={
                                'max_bandwidth_mbps': current_bandwidth,
                                'shape': getattr(lb, 'shape_name', 'flexible')
                            },
                            recommended_config={
                                'action': 'resize_load_balancer',
                                'new_max_bandwidth_mbps': recommended_bandwidth
                            },
                            current_monthly_cost=current_cost,
                            projected_monthly_cost=projected_cost,
                            potential_savings=savings,
                            confidence_score=0.75,
                            implementation_effort='low',
                            risk_level='low',
                            business_impact='minimal'
                        )
                        
                        recommendations.append(recommendation)
            
            return recommendations
            
        except Exception as e:
            logger.error(f"Failed to analyze network costs: {str(e)}")
            return []

    async def _analyze_lb_utilization(self, lb_id: str) -> Dict[str, Any]:
        """Analyze load balancer utilization (simplified)"""
        return {
            'avg_bandwidth_usage': np.random.uniform(5, 100),
            'peak_bandwidth_usage': np.random.uniform(20, 150),
            'avg_requests_per_second': np.random.uniform(10, 1000)
        }

    async def monitor_budgets(self) -> List[BudgetAlert]:
        """Monitor budget usage and generate alerts"""
        alerts = []
        
        try:
            # Get all budgets
            budgets = self.budgets_client.list_budgets(
                compartment_id=self.config['compartment_id']
            ).data
            
            for budget in budgets:
                # Get current spend
                current_spend = await self._get_current_budget_spend(budget.id)
                budget_amount = float(budget.amount)
                
                utilization_percentage = (current_spend / budget_amount * 100) if budget_amount > 0 else 0
                
                # Forecast end-of-period spend
                forecast_spend = await self._forecast_budget_spend(budget.id)
                
                # Calculate days remaining in budget period
                days_remaining = self._calculate_days_remaining(budget)
                
                # Determine severity
                if utilization_percentage >= 90 or forecast_spend > budget_amount * 1.1:
                    severity = CostSeverity.CRITICAL
                elif utilization_percentage >= 75 or forecast_spend > budget_amount:
                    severity = CostSeverity.HIGH
                elif utilization_percentage >= 60:
                    severity = CostSeverity.MEDIUM
                else:
                    severity = CostSeverity.LOW
                
                # Generate recommendations based on severity
                recommendations = []
                if severity in [CostSeverity.HIGH, CostSeverity.CRITICAL]:
                    recommendations = await self._generate_budget_recommendations(budget.id)
                
                alert = BudgetAlert(
                    budget_name=budget.display_name,
                    current_spend=current_spend,
                    budget_amount=budget_amount,
                    utilization_percentage=utilization_percentage,
                    forecast_spend=forecast_spend,
                    days_remaining=days_remaining,
                    severity=severity,
                    recommendations=recommendations
                )
                
                alerts.append(alert)
            
            return alerts
            
        except Exception as e:
            logger.error(f"Failed to monitor budgets: {str(e)}")
            return []

    async def _get_current_budget_spend(self, budget_id: str) -> float:
        """Get current spend for a budget (simplified)"""
        # In reality, this would query actual spend data
        return np.random.uniform(1000, 50000)

    async def _forecast_budget_spend(self, budget_id: str) -> float:
        """Forecast end-of-period spend for budget"""
        current_spend = await self._get_current_budget_spend(budget_id)
        # Simplified forecast - would use actual trend analysis
        growth_factor = np.random.uniform(1.05, 1.3)
        return current_spend * growth_factor

    def _calculate_days_remaining(self, budget) -> int:
        """Calculate days remaining in budget period"""
        # Simplified calculation - would use actual budget period
        return np.random.randint(1, 30)

    async def _generate_budget_recommendations(self, budget_id: str) -> List[str]:
        """Generate recommendations for budget management"""
        recommendations = [
            "Review and optimize underutilized compute instances",
            "Implement automated scheduling for non-production workloads",
            "Consider Reserved Instances for predictable workloads",
            "Review storage usage and archive old data",
            "Optimize load balancer configurations"
        ]
        
        return recommendations[:3]  # Return top 3 recommendations

    async def generate_cost_report(self, trends: Dict[str, Any], 
                                 recommendations: List[OptimizationRecommendation],
                                 budget_alerts: List[BudgetAlert]) -> str:
        """Generate comprehensive cost management report"""
        
        report_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S UTC')
        
        # Calculate summary metrics
        total_potential_savings = sum(r.potential_savings for r in recommendations)
        high_impact_recommendations = [r for r in recommendations if r.potential_savings > 100]
        critical_budget_alerts = [a for a in budget_alerts if a.severity == CostSeverity.CRITICAL]
        
        report = f"""
# OCI Cost Management and FinOps Report
**Generated:** {report_time}

## Executive Summary

### Cost Overview
- **Total Potential Monthly Savings:** ${total_potential_savings:.2f}
- **High-Impact Opportunities:** {len(high_impact_recommendations)} recommendations
- **Critical Budget Alerts:** {len(critical_budget_alerts)} budgets requiring attention
- **Overall Cost Efficiency Score:** {trends.get('cost_efficiency_metrics', {}).get('efficiency_score', 'N/A')}

### Key Insights
"""
        
        # Add cost trend insights
        cost_trend = trends.get('total_cost_trend', {})
        if cost_trend:
            report += f"""
- **Cost Trend:** {cost_trend.get('trend', 'Unknown')} ({cost_trend.get('growth_rate_percent', 0):+.1f}% growth)
- **Daily Average Cost:** ${cost_trend.get('average_daily_cost', 0):.2f}
- **Cost Volatility:** {cost_trend.get('volatility_percent', 0):.1f}%
"""
        
        # Service cost breakdown
        service_costs = trends.get('service_cost_breakdown', {})
        if service_costs and service_costs.get('service_breakdown'):
            report += f"""

## Service Cost Analysis

### Top Cost Drivers
"""
            for service, data in list(service_costs['service_breakdown'].items())[:5]:
                report += f"- **{service}:** ${data['total_cost']:.2f} ({data['cost_percentage']:.1f}%)\n"
        
        # Cost anomalies
        anomalies = trends.get('anomalies', [])
        if anomalies:
            report += f"""

## Cost Anomalies Detected

Found {len(anomalies)} cost anomalies requiring investigation:
"""
            for anomaly in anomalies[:5]:  # Show top 5 anomalies
                report += f"""
### {anomaly.resource_name}
- **Severity:** {anomaly.severity.value.upper()}
- **Expected Cost:** ${anomaly.expected_cost:.2f}
- **Actual Cost:** ${anomaly.actual_cost:.2f}
- **Potential Savings:** ${anomaly.potential_savings:.2f}
- **Recommended Action:** {anomaly.recommended_action.value}
"""
        
        # Optimization recommendations
        if recommendations:
            report += f"""

## Cost Optimization Recommendations

### Top Savings Opportunities
"""
            
            for i, rec in enumerate(recommendations[:10], 1):
                report += f"""
#### {i}. {rec.resource_type.replace('_', ' ').title()} Optimization
- **Current Monthly Cost:** ${rec.current_monthly_cost:.2f}
- **Projected Monthly Cost:** ${rec.projected_monthly_cost:.2f}
- **Monthly Savings:** ${rec.potential_savings:.2f}
- **Confidence Score:** {rec.confidence_score:.0%}
- **Implementation Effort:** {rec.implementation_effort}
- **Risk Level:** {rec.risk_level}
"""
        
        # Budget alerts
        if budget_alerts:
            report += f"""

## Budget Monitoring

### Budget Status Overview
"""
            for alert in budget_alerts:
                status_emoji = "🔴" if alert.severity == CostSeverity.CRITICAL else "🟡" if alert.severity == CostSeverity.HIGH else "🟢"
                
                report += f"""
#### {status_emoji} {alert.budget_name}
- **Current Spend:** ${alert.current_spend:.2f} / ${alert.budget_amount:.2f}
- **Utilization:** {alert.utilization_percentage:.1f}%
- **Forecast Spend:** ${alert.forecast_spend:.2f}
- **Days Remaining:** {alert.days_remaining}
"""
                
                if alert.recommendations:
                    report += "- **Recommendations:**\n"
                    for rec in alert.recommendations:
                        report += f"  - {rec}\n"
        
        # Cost forecast
        forecast = trends.get('cost_forecast', {})
        if forecast.get('status') == 'success':
            report += f"""

## Cost Forecast

### Next 30 Days Projection
- **Total Forecasted Cost:** ${forecast.get('total_forecasted_cost', 0):.2f}
- **Average Daily Cost:** ${forecast.get('average_daily_cost', 0):.2f}
- **Forecast Accuracy:** {forecast.get('forecast_accuracy', 0):.1%}
"""
        
        # Action items and recommendations
        report += f"""

## Recommended Actions

### Immediate Actions (Next 7 Days)
1. **Review Critical Budget Alerts** - {len(critical_budget_alerts)} budgets need immediate attention
2. **Implement High-Impact Optimizations** - Focus on recommendations with savings > $100/month
3. **Investigate Cost Anomalies** - {len([a for a in anomalies if a.severity in [CostSeverity.HIGH, CostSeverity.CRITICAL]])} critical anomalies detected

### Short-term Actions (Next 30 Days)
1. **Resource Right-sizing** - Implement compute and storage optimizations
2. **Automation Implementation** - Set up automated scheduling for non-production workloads
3. **Policy Enforcement** - Implement cost governance policies

### Long-term Initiatives (Next Quarter)
1. **Reserved Instance Strategy** - Evaluate commitment-based pricing for predictable workloads
2. **Architecture Optimization** - Review overall architecture for cost efficiency
3. **FinOps Process Maturity** - Enhance cross-team collaboration and cost accountability

## Cost Optimization Priorities

Based on the analysis, focus on these optimization areas:
"""
        
        # Prioritize recommendations by savings and confidence
        priority_areas = {}
        for rec in recommendations:
            resource_type = rec.resource_type
            if resource_type not in priority_areas:
                priority_areas[resource_type] = {
                    'total_savings': 0,
                    'count': 0,
                    'avg_confidence': 0
                }
            
            priority_areas[resource_type]['total_savings'] += rec.potential_savings
            priority_areas[resource_type]['count'] += 1
            priority_areas[resource_type]['avg_confidence'] += rec.confidence_score
        
        # Calculate averages and sort by impact
        for area in priority_areas.values():
            area['avg_confidence'] = area['avg_confidence'] / area['count']
        
        sorted_areas = sorted(
            priority_areas.items(), 
            key=lambda x: x[1]['total_savings'], 
            reverse=True
        )
        
        for i, (area, data) in enumerate(sorted_areas[:5], 1):
            report += f"""
{i}. **{area.replace('_', ' ').title()}** - ${data['total_savings']:.2f} potential monthly savings
   - {data['count']} optimization opportunities
   - {data['avg_confidence']:.0%} average confidence score
"""
        
        return report

# Automated cost optimization workflow
async def run_cost_optimization_workflow():
    """Run comprehensive cost optimization workflow"""
    optimizer = OCICostOptimizer()
    
    try:
        logger.info("Starting cost optimization workflow...")
        
        # Step 1: Analyze cost trends
        logger.info("Analyzing cost trends...")
        trends = await optimizer.analyze_cost_trends(days_back=90)
        
        # Step 2: Discover optimization opportunities
        logger.info("Discovering optimization opportunities...")
        recommendations = await optimizer.discover_optimization_opportunities()
        
        # Step 3: Monitor budgets
        logger.info("Monitoring budget status...")
        budget_alerts = await optimizer.monitor_budgets()
        
        # Step 4: Generate comprehensive report
        logger.info("Generating cost management report...")
        report = await optimizer.generate_cost_report(trends, recommendations, budget_alerts)
        
        # Step 5: Save report and send notifications
        timestamp = datetime.utcnow().strftime('%Y%m%d_%H%M%S')
        report_filename = f"oci_cost_report_{timestamp}.md"
        
        with open(report_filename, 'w') as f:
            f.write(report)
        
        logger.info(f"Cost optimization report saved to {report_filename}")
        
        # Send alerts for critical issues
        critical_issues = []
        critical_issues.extend([a for a in trends.get('anomalies', []) if a.severity == CostSeverity.CRITICAL])
        critical_issues.extend([a for a in budget_alerts if a.severity == CostSeverity.CRITICAL])
        
        if critical_issues:
            await send_critical_cost_alerts(critical_issues, report_filename)
        
        # Return summary for API consumers
        return {
            'status': 'success',
            'report_file': report_filename,
            'summary': {
                'total_potential_savings': sum(r.potential_savings for r in recommendations),
                'optimization_opportunities': len(recommendations),
                'critical_budget_alerts': len([a for a in budget_alerts if a.severity == CostSeverity.CRITICAL]),
                'cost_anomalies': len(trends.get('anomalies', [])),
                'efficiency_score': trends.get('cost_efficiency_metrics', {}).get('efficiency_score', 0)
            }
        }
        
    except Exception as e:
        logger.error(f"Cost optimization workflow failed: {str(e)}")
        return {'status': 'error', 'message': str(e)}

async def send_critical_cost_alerts(critical_issues: List, report_file: str):
    """Send alerts for critical cost issues"""
    try:
        # Prepare alert message
        alert_message = f"""
CRITICAL COST ALERT - OCI Environment

{len(critical_issues)} critical cost issues detected requiring immediate attention.

Issues:
"""
        for issue in critical_issues[:5]:  # Limit to top 5
            if hasattr(issue, 'resource_name'):
                alert_message += f"- {issue.resource_name}: ${getattr(issue, 'potential_savings', 0):.2f} potential savings\n"
            else:
                alert_message += f"- {issue.budget_name}: {issue.utilization_percentage:.1f}% budget utilization\n"
        
        alert_message += f"\nFull report available in: {report_file}"
        
        # Send to configured notification channels
        # Implementation would depend on your notification preferences
        logger.warning(f"CRITICAL COST ALERT: {len(critical_issues)} issues detected")
        
    except Exception as e:
        logger.error(f"Failed to send critical cost alerts: {str(e)}")

if __name__ == "__main__":
    # Run the cost optimization workflow
    import asyncio
    result = asyncio.run(run_cost_optimization_workflow())
    print(f"Cost optimization completed: {result}")

Automated Cost Governance and Policy Enforcement

Advanced FinOps implementations require automated governance mechanisms that prevent cost overruns before they occur. Policy-as-code frameworks enable organizations to define spending rules, approval workflows, and automated remediation actions that maintain cost discipline across development teams.

Budget enforcement policies can automatically halt resource provisioning when spending thresholds are exceeded, while notification workflows ensure appropriate stakeholders receive timely alerts about budget utilization. These policies integrate with existing CI/CD pipelines to provide cost validation during infrastructure deployments.

Resource tagging policies ensure consistent cost allocation across business units and projects, with automated compliance checking that flags untagged resources or incorrect tag values. This standardization enables accurate chargebacks and cost center reporting.

Automated resource lifecycle management implements policies for non-production environments, automatically stopping development instances outside business hours and deleting temporary resources after predefined periods.

Real-time Cost Monitoring and Alerting

Production FinOps requires real-time cost monitoring that provides immediate visibility into spending changes. Integration with OCI Events service enables automatic notifications when resource costs exceed predefined thresholds or when unusual spending patterns are detected.

Custom dashboards aggregate cost data across multiple dimensions including service type, environment, project, and business unit. These dashboards provide executives with high-level spending trends while giving engineers detailed cost attribution for their specific resources.

Anomaly detection algorithms continuously monitor spending patterns and automatically alert teams when costs deviate significantly from established baselines. Machine learning models learn normal spending patterns and adapt to seasonal variations while maintaining sensitivity to genuine cost anomalies.

Predictive cost modeling uses historical data and planned deployments to forecast future spending with confidence intervals, enabling proactive budget management and capacity planning decisions.

Integration with Enterprise Financial Systems

Enterprise FinOps implementations require integration with existing financial systems for seamless cost allocation and reporting. APIs enable automatic synchronization of OCI billing data with enterprise resource planning (ERP) systems and financial management platforms.

Automated chargeback mechanisms calculate costs by business unit, project, or customer based on resource utilization and predefined allocation rules. These calculations integrate with billing systems to generate accurate invoices for internal cost centers or external customers.

Cost center mapping enables automatic allocation of shared infrastructure costs across multiple business units based on actual usage metrics rather than static percentages. This approach provides more accurate cost attribution while maintaining fairness across different usage patterns.

Integration with procurement systems enables automatic validation of spending against approved budgets and purchase orders, with workflow integration for approval processes when costs exceed authorized amounts.

This comprehensive FinOps approach establishes a mature cost management practice that balances financial accountability with operational agility, enabling organizations to optimize cloud spending while maintaining innovation velocity and service quality.

Enjoy the Cloud
Osama Mustafa

Implementing GitOps with OCI Resource Manager: Advanced Infrastructure Drift Detection and Automated Remediation

Posted on July 10, 2025 by Osama Mustafa in Cloud, OCI

Modern cloud infrastructure demands continuous monitoring and automated governance to ensure configurations remain compliant with intended designs. Oracle Cloud Infrastructure Resource Manager provides powerful capabilities for implementing GitOps workflows with sophisticated drift detection and automated remediation mechanisms. This comprehensive guide explores advanced patterns for infrastructure state management, policy enforcement, and automated compliance restoration.

GitOps Architecture with OCI Resource Manager

GitOps represents a paradigm shift where Git repositories serve as the single source of truth for infrastructure definitions. OCI Resource Manager extends this concept by providing native integration with Git repositories, enabling automatic infrastructure updates triggered by code commits and sophisticated state reconciliation mechanisms.

The architecture centers around declarative infrastructure definitions stored in Git repositories, with Resource Manager continuously monitoring for changes and automatically applying updates. This approach eliminates configuration drift, ensures audit trails, and enables rapid rollback capabilities when issues arise.

Unlike traditional infrastructure management approaches, GitOps with Resource Manager provides immutable infrastructure deployments where every change goes through version control, peer review, and automated testing before reaching production environments.

Advanced Drift Detection Mechanisms

Infrastructure drift occurs when deployed resources deviate from their intended configurations due to manual changes, external automation, or service updates. OCI Resource Manager’s drift detection capabilities go beyond simple configuration comparison to provide comprehensive analysis of resource state variations.

The drift detection engine continuously compares actual resource configurations against Terraform state files, identifying discrepancies in real-time. Advanced algorithms analyze configuration changes, resource dependencies, and policy violations to provide actionable insights into infrastructure deviations.

Machine learning models enhance drift detection by establishing baseline behaviors and identifying anomalous configuration changes that might indicate security incidents or operational issues. This intelligent analysis reduces false positives while ensuring critical deviations receive immediate attention.

Production Implementation with Automated Remediation

Here’s a comprehensive implementation demonstrating GitOps workflows with advanced drift detection and automated remediation capabilities:

Terraform Configuration with Policy Enforcement

# main.tf - Infrastructure with built-in compliance checks
terraform {
  required_providers {
    oci = {
      source  = "oracle/oci"
      version = "~> 5.0"
    }
  }
  
  backend "remote" {
    organization = "your-org"
    workspaces {
      name = "oci-production"
    }
  }
}

# Data source for compliance validation
data "oci_identity_policies" "required_policies" {
  compartment_id = var.tenancy_ocid
  
  filter {
    name   = "name"
    values = ["security-baseline-policy"]
  }
}

# Compliance validation resource
resource "null_resource" "compliance_check" {
  triggers = {
    always_run = timestamp()
  }
  
  provisioner "local-exec" {
    command = "python3 ${path.module}/scripts/compliance_validator.py --compartment ${var.compartment_id}"
  }
  
  lifecycle {
    precondition {
      condition     = length(data.oci_identity_policies.required_policies.policies) > 0
      error_message = "Required security baseline policy not found."
    }
  }
}

# VCN with mandatory security configurations
resource "oci_core_vcn" "production_vcn" {
  compartment_id = var.compartment_id
  display_name   = "production-vcn"
  cidr_blocks    = ["10.0.0.0/16"]
  dns_label      = "prodvcn"
  
  # Mandatory tags for compliance
  defined_tags = {
    "Security.Classification" = "Confidential"
    "Operations.Environment"  = "Production"
    "Compliance.Required"     = "True"
  }
  
  lifecycle {
    prevent_destroy = true
    
    postcondition {
      condition     = contains(self.defined_tags["Security.Classification"], "Confidential")
      error_message = "Production VCN must be tagged as Confidential."
    }
  }
}

# Security list with drift detection webhook
resource "oci_core_security_list" "production_security_list" {
  compartment_id = var.compartment_id
  vcn_id        = oci_core_vcn.production_vcn.id
  display_name  = "production-security-list"
  
  # Ingress rules with compliance requirements
  dynamic "ingress_security_rules" {
    for_each = var.allowed_ingress_rules
    content {
      protocol    = ingress_security_rules.value.protocol
      source      = ingress_security_rules.value.source
      source_type = "CIDR_BLOCK"
      
      tcp_options {
        min = ingress_security_rules.value.port_range.min
        max = ingress_security_rules.value.port_range.max
      }
    }
  }
  
  # Egress rules with monitoring
  egress_security_rules {
    destination      = "0.0.0.0/0"
    destination_type = "CIDR_BLOCK"
    protocol        = "6"
    
    tcp_options {
      min = 443
      max = 443
    }
  }
  
  # Custom webhook for change notifications
  provisioner "local-exec" {
    when    = create
    command = "curl -X POST ${var.webhook_url} -H 'Content-Type: application/json' -d '{\"event\":\"security_list_created\",\"resource_id\":\"${self.id}\"}'"
  }
  
  lifecycle {
    postcondition {
      condition = length([
        for rule in self.egress_security_rules : rule
        if rule.destination == "0.0.0.0/0" && rule.protocol == "6"
      ]) <= 2
      error_message = "Security list has too many permissive egress rules."
    }
  }
}

# Compute instance with security baseline
resource "oci_core_instance" "production_instance" {
  count               = var.instance_count
  availability_domain = data.oci_identity_availability_domains.ads.availability_domains[count.index % 3].name
  compartment_id     = var.compartment_id
  display_name       = "prod-instance-${count.index + 1}"
  shape             = var.instance_shape
  
  shape_config {
    ocpus         = var.instance_ocpus
    memory_in_gbs = var.instance_memory
  }
  
  create_vnic_details {
    subnet_id        = oci_core_subnet.production_subnet.id
    display_name     = "primaryvnic-${count.index + 1}"
    assign_public_ip = false
    nsg_ids         = [oci_core_network_security_group.production_nsg.id]
  }
  
  source_details {
    source_type = "image"
    source_id   = data.oci_core_images.oracle_linux.images[0].id
  }
  
  metadata = {
    ssh_authorized_keys = var.ssh_public_key
    user_data = base64encode(templatefile("${path.module}/templates/cloud-init.yaml", {
      monitoring_agent_config = var.monitoring_config
      security_agent_config   = var.security_config
    }))
  }
  
  # Anti-drift configuration
  defined_tags = {
    "Security.Classification" = "Confidential"
    "Operations.Environment"  = "Production"
    "Drift.Monitor"          = "Enabled"
    "Compliance.Baseline"    = "CIS-1.0"
  }
  
  lifecycle {
    postcondition {
      condition     = self.state == "RUNNING"
      error_message = "Instance must be in RUNNING state after creation."
    }
    
    replace_triggered_by = [
      oci_core_security_list.production_security_list
    ]
  }
}

# Network Security Group with policy enforcement
resource "oci_core_network_security_group" "production_nsg" {
  compartment_id = var.compartment_id
  vcn_id        = oci_core_vcn.production_vcn.id
  display_name  = "production-nsg"
  
  defined_tags = {
    "Security.Purpose" = "Application-Tier"
    "Drift.Monitor"   = "Enabled"
  }
}

# Dynamic NSG rules based on compliance requirements
resource "oci_core_network_security_group_security_rule" "application_rules" {
  for_each = {
    for rule in var.security_rules : "${rule.direction}-${rule.protocol}-${rule.port}" => rule
    if rule.enabled
  }
  
  network_security_group_id = oci_core_network_security_group.production_nsg.id
  direction                 = each.value.direction
  protocol                  = each.value.protocol
  
  source      = each.value.direction == "INGRESS" ? each.value.source : null
  source_type = each.value.direction == "INGRESS" ? "CIDR_BLOCK" : null
  
  destination      = each.value.direction == "EGRESS" ? each.value.destination : null
  destination_type = each.value.direction == "EGRESS" ? "CIDR_BLOCK" : null
  
  dynamic "tcp_options" {
    for_each = each.value.protocol == "6" ? [1] : []
    content {
      destination_port_range {
        min = each.value.port
        max = each.value.port
      }
    }
  }
}

# Load balancer with health monitoring
resource "oci_load_balancer_load_balancer" "production_lb" {
  compartment_id = var.compartment_id
  display_name   = "production-lb"
  shape         = "flexible"
  subnet_ids    = [oci_core_subnet.production_subnet.id]
  
  shape_details {
    minimum_bandwidth_in_mbps = 10
    maximum_bandwidth_in_mbps = 100
  }
  
  defined_tags = {
    "Security.Classification" = "Confidential"
    "Drift.Monitor"          = "Enabled"
    "HealthCheck.Critical"   = "True"
  }
  
  lifecycle {
    postcondition {
      condition     = self.state == "ACTIVE"
      error_message = "Load balancer must be in ACTIVE state."
    }
  }
}

# Backend set with automated health checks
resource "oci_load_balancer_backend_set" "production_backend" {
  load_balancer_id = oci_load_balancer_load_balancer.production_lb.id
  name            = "production-backend"
  policy          = "ROUND_ROBIN"
  
  health_checker {
    port                = 8080
    protocol           = "HTTP"
    url_path           = "/health"
    interval_ms        = 10000
    timeout_in_millis  = 3000
    retries            = 3
    return_code        = 200
  }
  
  session_persistence_configuration {
    cookie_name      = "X-Oracle-BMC-LBS-Route"
    disable_fallback = false
  }
}

# Backend instances with drift monitoring
resource "oci_load_balancer_backend" "production_backends" {
  count            = length(oci_core_instance.production_instance)
  load_balancer_id = oci_load_balancer_load_balancer.production_lb.id
  backendset_name  = oci_load_balancer_backend_set.production_backend.name
  ip_address      = oci_core_instance.production_instance[count.index].private_ip
  port            = 8080
  backup          = false
  drain           = false
  offline         = false
  weight          = 1
  
  lifecycle {
    postcondition {
      condition     = self.health_status == "OK"
      error_message = "Backend must be healthy after creation."
    }
  }
}

Advanced Drift Detection and Remediation Script

#!/usr/bin/env python3
"""
Advanced Infrastructure Drift Detection and Remediation System
Provides comprehensive monitoring, analysis, and automated remediation
of infrastructure configuration drift in OCI environments.
"""

import json
import logging
import asyncio
import hashlib
import difflib
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Optional, Any
from dataclasses import dataclass, field
from enum import Enum
import oci
import git
import requests
import yaml
from pathlib import Path

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class DriftSeverity(Enum):
    LOW = "low"
    MEDIUM = "medium" 
    HIGH = "high"
    CRITICAL = "critical"

class RemediationAction(Enum):
    ALERT_ONLY = "alert_only"
    AUTO_REMEDIATE = "auto_remediate"
    MANUAL_APPROVAL = "manual_approval"
    ROLLBACK = "rollback"

@dataclass
class DriftDetection:
    """Container for drift detection results"""
    resource_id: str
    resource_type: str
    resource_name: str
    expected_config: Dict[str, Any]
    actual_config: Dict[str, Any]
    drift_items: List[str]
    severity: DriftSeverity
    confidence_score: float
    detected_at: datetime
    remediation_action: RemediationAction = RemediationAction.ALERT_ONLY
    tags: Dict[str, str] = field(default_factory=dict)

@dataclass
class RemediationResult:
    """Container for remediation operation results"""
    drift_detection: DriftDetection
    action_taken: str
    success: bool
    error_message: Optional[str] = None
    execution_time: float = 0.0
    rollback_info: Optional[Dict] = None

class InfrastructureDriftMonitor:
    def __init__(self, config_file: str = 'drift_config.yaml'):
        """Initialize the drift monitoring system"""
        self.config = self._load_config(config_file)
        self.signer = oci.auth.signers.get_resource_principals_signer()
        
        # Initialize OCI clients
        self.compute_client = oci.core.ComputeClient({}, signer=self.signer)
        self.network_client = oci.core.VirtualNetworkClient({}, signer=self.signer)
        self.lb_client = oci.load_balancer.LoadBalancerClient({}, signer=self.signer)
        self.resource_manager_client = oci.resource_manager.ResourceManagerClient({}, signer=self.signer)
        
        # Drift detection state
        self.baseline_configs = {}
        self.drift_history = []
        self.remediation_queue = asyncio.Queue()
        
        # Initialize Git repository for state tracking
        self.git_repo = self._initialize_git_repo()
        
    def _load_config(self, config_file: str) -> Dict:
        """Load configuration from YAML file"""
        try:
            with open(config_file, 'r') as f:
                return yaml.safe_load(f)
        except FileNotFoundError:
            logger.error(f"Configuration file {config_file} not found")
            return {}

    def _initialize_git_repo(self) -> git.Repo:
        """Initialize Git repository for configuration tracking"""
        repo_path = self.config.get('git_repo_path', './infrastructure-state')
        
        try:
            if Path(repo_path).exists():
                repo = git.Repo(repo_path)
            else:
                repo = git.Repo.clone_from(
                    self.config['git_repo_url'], 
                    repo_path
                )
            
            return repo
        except Exception as e:
            logger.error(f"Failed to initialize Git repository: {str(e)}")
            raise

    async def discover_resources(self) -> Dict[str, List[Dict]]:
        """Discover all monitored resources in the compartment"""
        compartment_id = self.config['compartment_id']
        discovered_resources = {
            'compute_instances': [],
            'vcns': [],
            'security_lists': [],
            'network_security_groups': [],
            'load_balancers': []
        }
        
        try:
            # Discover compute instances
            instances = self.compute_client.list_instances(
                compartment_id=compartment_id,
                lifecycle_state='RUNNING'
            ).data
            
            for instance in instances:
                if self._should_monitor_resource(instance):
                    discovered_resources['compute_instances'].append({
                        'id': instance.id,
                        'name': instance.display_name,
                        'type': 'compute_instance',
                        'config': await self._get_instance_config(instance.id)
                    })
            
            # Discover VCNs
            vcns = self.network_client.list_vcns(
                compartment_id=compartment_id,
                lifecycle_state='AVAILABLE'
            ).data
            
            for vcn in vcns:
                if self._should_monitor_resource(vcn):
                    discovered_resources['vcns'].append({
                        'id': vcn.id,
                        'name': vcn.display_name,
                        'type': 'vcn',
                        'config': await self._get_vcn_config(vcn.id)
                    })
            
            # Discover Security Lists
            security_lists = self.network_client.list_security_lists(
                compartment_id=compartment_id,
                lifecycle_state='AVAILABLE'
            ).data
            
            for sl in security_lists:
                if self._should_monitor_resource(sl):
                    discovered_resources['security_lists'].append({
                        'id': sl.id,
                        'name': sl.display_name,
                        'type': 'security_list',
                        'config': await self._get_security_list_config(sl.id)
                    })
            
            # Discover Load Balancers
            load_balancers = self.lb_client.list_load_balancers(
                compartment_id=compartment_id,
                lifecycle_state='ACTIVE'
            ).data
            
            for lb in load_balancers:
                if self._should_monitor_resource(lb):
                    discovered_resources['load_balancers'].append({
                        'id': lb.id,
                        'name': lb.display_name,
                        'type': 'load_balancer',
                        'config': await self._get_load_balancer_config(lb.id)
                    })
            
            logger.info(f"Discovered {sum(len(resources) for resources in discovered_resources.values())} resources")
            return discovered_resources
            
        except Exception as e:
            logger.error(f"Failed to discover resources: {str(e)}")
            return discovered_resources

    def _should_monitor_resource(self, resource) -> bool:
        """Determine if a resource should be monitored for drift"""
        if not hasattr(resource, 'defined_tags'):
            return False
        
        defined_tags = resource.defined_tags or {}
        drift_monitor = defined_tags.get('Drift', {}).get('Monitor', 'Disabled')
        
        return drift_monitor.lower() == 'enabled'

    async def _get_instance_config(self, instance_id: str) -> Dict:
        """Get detailed configuration for a compute instance"""
        try:
            instance = self.compute_client.get_instance(instance_id).data
            
            # Get VNIC attachments
            vnic_attachments = self.compute_client.list_vnic_attachments(
                compartment_id=instance.compartment_id,
                instance_id=instance_id
            ).data
            
            vnics = []
            for attachment in vnic_attachments:
                vnic = self.network_client.get_vnic(attachment.vnic_id).data
                vnics.append({
                    'vnic_id': vnic.id,
                    'subnet_id': vnic.subnet_id,
                    'private_ip': vnic.private_ip,
                    'public_ip': vnic.public_ip,
                    'nsg_ids': vnic.nsg_ids
                })
            
            return {
                'display_name': instance.display_name,
                'lifecycle_state': instance.lifecycle_state,
                'availability_domain': instance.availability_domain,
                'shape': instance.shape,
                'shape_config': instance.shape_config.__dict__ if instance.shape_config else {},
                'defined_tags': instance.defined_tags,
                'freeform_tags': instance.freeform_tags,
                'vnics': vnics,
                'metadata': instance.metadata
            }
            
        except Exception as e:
            logger.error(f"Failed to get instance config for {instance_id}: {str(e)}")
            return {}

    async def _get_vcn_config(self, vcn_id: str) -> Dict:
        """Get detailed configuration for a VCN"""
        try:
            vcn = self.network_client.get_vcn(vcn_id).data
            
            return {
                'display_name': vcn.display_name,
                'cidr_blocks': vcn.cidr_blocks,
                'dns_label': vcn.dns_label,
                'lifecycle_state': vcn.lifecycle_state,
                'defined_tags': vcn.defined_tags,
                'freeform_tags': vcn.freeform_tags
            }
            
        except Exception as e:
            logger.error(f"Failed to get VCN config for {vcn_id}: {str(e)}")
            return {}

    async def _get_security_list_config(self, security_list_id: str) -> Dict:
        """Get detailed configuration for a security list"""
        try:
            sl = self.network_client.get_security_list(security_list_id).data
            
            return {
                'display_name': sl.display_name,
                'lifecycle_state': sl.lifecycle_state,
                'ingress_security_rules': [rule.__dict__ for rule in sl.ingress_security_rules],
                'egress_security_rules': [rule.__dict__ for rule in sl.egress_security_rules],
                'defined_tags': sl.defined_tags,
                'freeform_tags': sl.freeform_tags
            }
            
        except Exception as e:
            logger.error(f"Failed to get security list config for {security_list_id}: {str(e)}")
            return {}

    async def _get_load_balancer_config(self, lb_id: str) -> Dict:
        """Get detailed configuration for a load balancer"""
        try:
            lb = self.lb_client.get_load_balancer(lb_id).data
            
            # Get backend sets
            backend_sets = {}
            for name, backend_set in lb.backend_sets.items():
                backend_sets[name] = {
                    'policy': backend_set.policy,
                    'health_checker': backend_set.health_checker.__dict__,
                    'backends': [backend.__dict__ for backend in backend_set.backends]
                }
            
            return {
                'display_name': lb.display_name,
                'shape_name': lb.shape_name,
                'lifecycle_state': lb.lifecycle_state,
                'backend_sets': backend_sets,
                'listeners': {name: listener.__dict__ for name, listener in lb.listeners.items()},
                'defined_tags': lb.defined_tags,
                'freeform_tags': lb.freeform_tags
            }
            
        except Exception as e:
            logger.error(f"Failed to get load balancer config for {lb_id}: {str(e)}")
            return {}

    async def detect_drift(self, current_resources: Dict[str, List[Dict]]) -> List[DriftDetection]:
        """Detect configuration drift across all monitored resources"""
        detected_drifts = []
        
        # Load baseline configurations from Git
        baseline_configs = await self._load_baseline_configs()
        
        for resource_type, resources in current_resources.items():
            for resource in resources:
                resource_id = resource['id']
                current_config = resource['config']
                
                # Get baseline configuration
                baseline_key = f"{resource_type}:{resource_id}"
                baseline_config = baseline_configs.get(baseline_key, {})
                
                if not baseline_config:
                    # First time seeing this resource, establish baseline
                    await self._establish_baseline(resource_type, resource_id, current_config)
                    continue
                
                # Compare configurations
                drift_items = self._compare_configurations(baseline_config, current_config)
                
                if drift_items:
                    # Calculate drift severity and confidence
                    severity, confidence = self._analyze_drift_severity(drift_items, resource_type)
                    
                    # Determine remediation action
                    remediation_action = self._determine_remediation_action(
                        severity, resource_type, resource['config']
                    )
                    
                    drift_detection = DriftDetection(
                        resource_id=resource_id,
                        resource_type=resource_type,
                        resource_name=resource.get('name', 'Unknown'),
                        expected_config=baseline_config,
                        actual_config=current_config,
                        drift_items=drift_items,
                        severity=severity,
                        confidence_score=confidence,
                        detected_at=datetime.utcnow(),
                        remediation_action=remediation_action,
                        tags=current_config.get('defined_tags', {})
                    )
                    
                    detected_drifts.append(drift_detection)
                    logger.warning(f"Drift detected in {resource_type} {resource_id}: {len(drift_items)} changes")
        
        return detected_drifts

    async def _load_baseline_configs(self) -> Dict[str, Dict]:
        """Load baseline configurations from Git repository"""
        try:
            # Pull latest changes
            self.git_repo.remotes.origin.pull()
            
            baseline_file = Path(self.git_repo.working_dir) / 'baseline_configs.json'
            
            if baseline_file.exists():
                with open(baseline_file, 'r') as f:
                    return json.load(f)
            else:
                return {}
                
        except Exception as e:
            logger.error(f"Failed to load baseline configurations: {str(e)}")
            return {}

    async def _establish_baseline(self, resource_type: str, resource_id: str, config: Dict):
        """Establish baseline configuration for a new resource"""
        try:
            baseline_file = Path(self.git_repo.working_dir) / 'baseline_configs.json'
            
            # Load existing baselines
            if baseline_file.exists():
                with open(baseline_file, 'r') as f:
                    baselines = json.load(f)
            else:
                baselines = {}
            
            # Add new baseline
            baseline_key = f"{resource_type}:{resource_id}"
            baselines[baseline_key] = {
                'config': config,
                'established_at': datetime.utcnow().isoformat(),
                'checksum': hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()
            }
            
            # Save to file
            with open(baseline_file, 'w') as f:
                json.dump(baselines, f, indent=2, default=str)
            
            # Commit to Git
            self.git_repo.index.add([str(baseline_file)])
            self.git_repo.index.commit(f"Establish baseline for {resource_type} {resource_id}")
            self.git_repo.remotes.origin.push()
            
            logger.info(f"Established baseline for {resource_type} {resource_id}")
            
        except Exception as e:
            logger.error(f"Failed to establish baseline: {str(e)}")

    def _compare_configurations(self, baseline: Dict, current: Dict) -> List[str]:
        """Compare two configurations and identify differences"""
        drift_items = []
        
        def deep_compare(base_obj, curr_obj, path=""):
            if isinstance(base_obj, dict) and isinstance(curr_obj, dict):
                # Compare dictionary keys
                base_keys = set(base_obj.keys())
                curr_keys = set(curr_obj.keys())
                
                # Added keys
                for key in curr_keys - base_keys:
                    drift_items.append(f"Added: {path}.{key} = {curr_obj[key]}")
                
                # Removed keys
                for key in base_keys - curr_keys:
                    drift_items.append(f"Removed: {path}.{key} = {base_obj[key]}")
                
                # Changed values
                for key in base_keys & curr_keys:
                    deep_compare(base_obj[key], curr_obj[key], f"{path}.{key}" if path else key)
                    
            elif isinstance(base_obj, list) and isinstance(curr_obj, list):
                # Compare lists
                if len(base_obj) != len(curr_obj):
                    drift_items.append(f"List size changed: {path} from {len(base_obj)} to {len(curr_obj)}")
                
                for i, (base_item, curr_item) in enumerate(zip(base_obj, curr_obj)):
                    deep_compare(base_item, curr_item, f"{path}[{i}]")
                    
            else:
                # Compare primitive values
                if base_obj != curr_obj:
                    drift_items.append(f"Changed: {path} from '{base_obj}' to '{curr_obj}'")
        
        # Exclude timestamp and volatile fields
        baseline_filtered = self._filter_volatile_fields(baseline)
        current_filtered = self._filter_volatile_fields(current)
        
        deep_compare(baseline_filtered, current_filtered)
        return drift_items

    def _filter_volatile_fields(self, config: Dict) -> Dict:
        """Filter out volatile fields that change frequently"""
        volatile_fields = {
            'time_created', 'time_updated', 'etag', 'lifecycle_details',
            'system_tags', 'time_maintenance_begin', 'time_maintenance_end'
        }
        
        def filter_recursive(obj):
            if isinstance(obj, dict):
                return {
                    k: filter_recursive(v) 
                    for k, v in obj.items() 
                    if k not in volatile_fields
                }
            elif isinstance(obj, list):
                return [filter_recursive(item) for item in obj]
            else:
                return obj
        
        return filter_recursive(config)

    def _analyze_drift_severity(self, drift_items: List[str], resource_type: str) -> Tuple[DriftSeverity, float]:
        """Analyze drift severity based on changes and resource type"""
        severity_scores = {
            'security': 0,
            'availability': 0,
            'performance': 0,
            'compliance': 0
        }
        
        # Analyze each drift item
        for item in drift_items:
            if any(keyword in item.lower() for keyword in ['security', 'ingress', 'egress', 'port', 'protocol']):
                severity_scores['security'] += 10
            
            if any(keyword in item.lower() for keyword in ['availability_domain', 'fault_domain', 'backup']):
                severity_scores['availability'] += 8
            
            if any(keyword in item.lower() for keyword in ['shape', 'cpu', 'memory', 'bandwidth']):
                severity_scores['performance'] += 6
            
            if any(keyword in item.lower() for keyword in ['tag', 'compliance', 'classification']):
                severity_scores['compliance'] += 7
        
        # Calculate overall severity
        total_score = sum(severity_scores.values())
        confidence = min(len(drift_items) * 0.1, 1.0)
        
        if total_score >= 30 or severity_scores['security'] >= 20:
            return DriftSeverity.CRITICAL, confidence
        elif total_score >= 20:
            return DriftSeverity.HIGH, confidence
        elif total_score >= 10:
            return DriftSeverity.MEDIUM, confidence
        else:
            return DriftSeverity.LOW, confidence

    def _determine_remediation_action(self, severity: DriftSeverity, resource_type: str, config: Dict) -> RemediationAction:
        """Determine appropriate remediation action based on severity and resource type"""
        tags = config.get('defined_tags', {})
        auto_remediate = tags.get('Drift', {}).get('AutoRemediate', 'False').lower() == 'true'
        
        if severity == DriftSeverity.CRITICAL:
            if auto_remediate and resource_type in ['security_list', 'network_security_group']:
                return RemediationAction.AUTO_REMEDIATE
            else:
                return RemediationAction.MANUAL_APPROVAL
        
        elif severity == DriftSeverity.HIGH:
            if auto_remediate:
                return RemediationAction.AUTO_REMEDIATE
            else:
                return RemediationAction.MANUAL_APPROVAL
        
        elif severity == DriftSeverity.MEDIUM:
            return RemediationAction.MANUAL_APPROVAL
        
        else:
            return RemediationAction.ALERT_ONLY

    async def remediate_drift(self, drift_detections: List[DriftDetection]) -> List[RemediationResult]:
        """Execute remediation actions for detected drift"""
        remediation_results = []
        
        for drift in drift_detections:
            start_time = datetime.utcnow()
            
            try:
                if drift.remediation_action == RemediationAction.AUTO_REMEDIATE:
                    result = await self._auto_remediate_drift(drift)
                
                elif drift.remediation_action == RemediationAction.MANUAL_APPROVAL:
                    result = await self._request_manual_approval(drift)
                
                elif drift.remediation_action == RemediationAction.ROLLBACK:
                    result = await self._rollback_changes(drift)
                
                else:  # ALERT_ONLY
                    result = await self._send_drift_alert(drift)
                
                execution_time = (datetime.utcnow() - start_time).total_seconds()
                result.execution_time = execution_time
                
                remediation_results.append(result)
                
            except Exception as e:
                logger.error(f"Remediation failed for {drift.resource_id}: {str(e)}")
                
                remediation_results.append(RemediationResult(
                    drift_detection=drift,
                    action_taken="remediation_failed",
                    success=False,
                    error_message=str(e),
                    execution_time=(datetime.utcnow() - start_time).total_seconds()
                ))
        
        return remediation_results

    async def _auto_remediate_drift(self, drift: DriftDetection) -> RemediationResult:
        """Automatically remediate detected drift"""
        try:
            # Create backup before remediation
            backup_info = await self._create_resource_backup(drift.resource_id, drift.resource_type)
            
            # Apply expected configuration
            if drift.resource_type == 'security_list':
                success = await self._remediate_security_list(drift)
            
            elif drift.resource_type == 'network_security_group':
                success = await self._remediate_network_security_group(drift)
            
            elif drift.resource_type == 'load_balancer':
                success = await self._remediate_load_balancer(drift)
            
            else:
                success = False
                raise NotImplementedError(f"Auto-remediation not implemented for {drift.resource_type}")
            
            if success:
                # Update baseline configuration
                await self._update_baseline_config(drift.resource_id, drift.resource_type, drift.expected_config)
                
                # Send success notification
                await self._send_remediation_notification(drift, "success")
            
            return RemediationResult(
                drift_detection=drift,
                action_taken="auto_remediated",
                success=success,
                rollback_info=backup_info
            )
            
        except Exception as e:
            return RemediationResult(
                drift_detection=drift,
                action_taken="auto_remediation_failed",
                success=False,
                error_message=str(e)
            )

    async def _remediate_security_list(self, drift: DriftDetection) -> bool:
        """Remediate security list configuration drift"""
        try:
            security_list_id = drift.resource_id
            expected_config = drift.expected_config
            
            # Prepare update details
            update_details = oci.core.models.UpdateSecurityListDetails(
                display_name=expected_config.get('display_name'),
                ingress_security_rules=[
                    oci.core.models.IngressSecurityRule(**rule) 
                    for rule in expected_config.get('ingress_security_rules', [])
                ],
                egress_security_rules=[
                    oci.core.models.EgressSecurityRule(**rule) 
                    for rule in expected_config.get('egress_security_rules', [])
                ],
                defined_tags=expected_config.get('defined_tags'),
                freeform_tags=expected_config.get('freeform_tags')
            )
            
            # Update security list
            response = self.network_client.update_security_list(
                security_list_id=security_list_id,
                update_security_list_details=update_details
            )
            
            # Wait for update to complete
            oci.wait_until(
                self.network_client,
                self.network_client.get_security_list(security_list_id),
                'lifecycle_state',
                'AVAILABLE'
            )
            
            logger.info(f"Successfully remediated security list {security_list_id}")
            return True
            
        except Exception as e:
            logger.error(f"Failed to remediate security list {drift.resource_id}: {str(e)}")
            return False

    async def _create_resource_backup(self, resource_id: str, resource_type: str) -> Dict:
        """Create backup of current resource configuration before remediation"""
        try:
            backup_info = {
                'resource_id': resource_id,
                'resource_type': resource_type,
                'backup_time': datetime.utcnow().isoformat(),
                'backup_id': f"backup-{resource_id}-{int(datetime.utcnow().timestamp())}"
            }
            
            # Get current configuration
            if resource_type == 'security_list':
                current_config = await self._get_security_list_config(resource_id)
            elif resource_type == 'network_security_group':
                # Implementation for NSG backup
                current_config = {}
            else:
                current_config = {}
            
            backup_info['configuration'] = current_config
            
            # Store backup in Git repository
            backup_file = Path(self.git_repo.working_dir) / 'backups' / f"{backup_info['backup_id']}.json"
            backup_file.parent.mkdir(exist_ok=True)
            
            with open(backup_file, 'w') as f:
                json.dump(backup_info, f, indent=2, default=str)
            
            # Commit backup to Git
            self.git_repo.index.add([str(backup_file)])
            self.git_repo.index.commit(f"Backup {resource_type} {resource_id} before remediation")
            
            return backup_info
            
        except Exception as e:
            logger.error(f"Failed to create backup for {resource_id}: {str(e)}")
            return {}

    async def _send_drift_alert(self, drift: DriftDetection) -> RemediationResult:
        """Send drift detection alert"""
        try:
            alert_payload = {
                'event_type': 'infrastructure_drift_detected',
                'severity': drift.severity.value,
                'resource_id': drift.resource_id,
                'resource_type': drift.resource_type,
                'resource_name': drift.resource_name,
                'drift_count': len(drift.drift_items),
                'confidence_score': drift.confidence_score,
                'detected_at': drift.detected_at.isoformat(),
                'drift_details': drift.drift_items[:10],  # Limit details
                'remediation_action': drift.remediation_action.value
            }
            
            # Send to webhook if configured
            webhook_url = self.config.get('alert_webhook_url')
            if webhook_url:
                response = requests.post(
                    webhook_url,
                    json=alert_payload,
                    timeout=30
                )
                response.raise_for_status()
            
            # Send to OCI Notifications if configured
            notification_topic = self.config.get('notification_topic_ocid')
            if notification_topic:
                ons_client = oci.ons.NotificationDataPlaneClient({}, signer=self.signer)
                
                message_details = oci.ons.models.MessageDetails(
                    body=json.dumps(alert_payload, indent=2),
                    title=f"Infrastructure Drift Detected - {drift.severity.value.upper()}"
                )
                
                ons_client.publish_message(
                    topic_id=notification_topic,
                    message_details=message_details
                )
            
            logger.info(f"Drift alert sent for {drift.resource_id}")
            
            return RemediationResult(
                drift_detection=drift,
                action_taken="alert_sent",
                success=True
            )
            
        except Exception as e:
            return RemediationResult(
                drift_detection=drift,
                action_taken="alert_failed",
                success=False,
                error_message=str(e)
            )

    async def generate_drift_report(self, drift_detections: List[DriftDetection], 
                                  remediation_results: List[RemediationResult]) -> str:
        """Generate comprehensive drift detection and remediation report"""
        
        report_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S UTC')
        
        # Statistics
        total_drifts = len(drift_detections)
        critical_drifts = len([d for d in drift_detections if d.severity == DriftSeverity.CRITICAL])
        high_drifts = len([d for d in drift_detections if d.severity == DriftSeverity.HIGH])
        auto_remediated = len([r for r in remediation_results if r.action_taken == "auto_remediated" and r.success])
        
        report = f"""
# Infrastructure Drift Detection Report
**Generated:** {report_time}

## Executive Summary
- **Total Drift Detections:** {total_drifts}
- **Critical Severity:** {critical_drifts}
- **High Severity:** {high_drifts}
- **Successfully Auto-Remediated:** {auto_remediated}
- **Detection Accuracy:** {sum(d.confidence_score for d in drift_detections) / max(total_drifts, 1):.2%}

## Drift Analysis by Resource Type
"""
        
        # Group by resource type
        drift_by_type = {}
        for drift in drift_detections:
            if drift.resource_type not in drift_by_type:
                drift_by_type[drift.resource_type] = []
            drift_by_type[drift.resource_type].append(drift)
        
        for resource_type, drifts in drift_by_type.items():
            report += f"""
### {resource_type.replace('_', ' ').title()}
- **Count:** {len(drifts)}
- **Average Confidence:** {sum(d.confidence_score for d in drifts) / len(drifts):.2%}
- **Severity Distribution:**
  - Critical: {len([d for d in drifts if d.severity == DriftSeverity.CRITICAL])}
  - High: {len([d for d in drifts if d.severity == DriftSeverity.HIGH])}
  - Medium: {len([d for d in drifts if d.severity == DriftSeverity.MEDIUM])}
  - Low: {len([d for d in drifts if d.severity == DriftSeverity.LOW])}
"""
        
        # Detailed drift information
        if drift_detections:
            report += "\n## Detailed Drift Analysis\n"
            
            for drift in sorted(drift_detections, key=lambda x: x.severity.value, reverse=True)[:20]:
                report += f"""
### {drift.resource_name} ({drift.resource_type})
- **Severity:** {drift.severity.value.upper()}
- **Confidence:** {drift.confidence_score:.2%}
- **Remediation Action:** {drift.remediation_action.value}
- **Changes Detected:** {len(drift.drift_items)}

**Key Changes:**
"""
                for change in drift.drift_items[:5]:  # Show top 5 changes
                    report += f"- {change}\n"
                
                if len(drift.drift_items) > 5:
                    report += f"- ... and {len(drift.drift_items) - 5} more changes\n"
        
        # Remediation results
        if remediation_results:
            report += "\n## Remediation Results\n"
            
            successful_remediations = [r for r in remediation_results if r.success]
            failed_remediations = [r for r in remediation_results if not r.success]
            
            report += f"""
- **Successful Actions:** {len(successful_remediations)}
- **Failed Actions:** {len(failed_remediations)}
- **Average Execution Time:** {sum(r.execution_time for r in remediation_results) / max(len(remediation_results), 1):.2f} seconds
"""
            
            if failed_remediations:
                report += "\n### Failed Remediations\n"
                for result in failed_remediations:
                    report += f"""
- **Resource:** {result.drift_detection.resource_name}
- **Action:** {result.action_taken}
- **Error:** {result.error_message}
"""
        
        # Recommendations
        report += f"""
## Recommendations

### Immediate Actions Required
"""
        
        critical_items = [d for d in drift_detections if d.severity == DriftSeverity.CRITICAL]
        if critical_items:
            report += "- **Critical drift detected** - Review and remediate immediately\n"
            for item in critical_items[:3]:
                report += f"  - {item.resource_name}: {len(item.drift_items)} critical changes\n"
        
        report += f"""
### Process Improvements
- Enable auto-remediation for {len([d for d in drift_detections if d.remediation_action == RemediationAction.MANUAL_APPROVAL])} resources with manual approval requirements
- Review baseline configurations for {len([d for d in drift_detections if d.confidence_score < 0.7])} resources with low confidence scores
- Implement preventive controls for {len(drift_by_type.get('security_list', []))} security list changes

### Monitoring Enhancements
- Increase monitoring frequency for critical resources
- Implement real-time alerting for security-related changes
- Establish automated testing for configuration changes
"""
        
        return report

# GitOps Integration Functions
async def setup_gitops_pipeline():
    """Set up GitOps pipeline for continuous infrastructure monitoring"""
    pipeline_config = {
        'git_repo': 'https://github.com/your-org/infrastructure-config.git',
        'branch': 'main',
        'polling_interval': 300,  # 5 minutes
        'webhook_url': 'https://your-webhook-endpoint.com/drift-alerts',
        'auto_remediation_enabled': True,
        'notification_topic': 'ocid1.onstopic.oc1..example'
    }
    
    # Initialize monitoring system
    monitor = InfrastructureDriftMonitor('drift_config.yaml')
    
    while True:
        try:
            logger.info("Starting drift detection cycle...")
            
            # Discover current infrastructure
            current_resources = await monitor.discover_resources()
            
            # Detect drift
            drift_detections = await monitor.detect_drift(current_resources)
            
            if drift_detections:
                logger.warning(f"Detected {len(drift_detections)} configuration drifts")
                
                # Execute remediation
                remediation_results = await monitor.remediate_drift(drift_detections)
                
                # Generate and save report
                report = await monitor.generate_drift_report(drift_detections, remediation_results)
                
                # Save report to Git repository
                report_file = Path(monitor.git_repo.working_dir) / 'reports' / f"drift_report_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}.md"
                report_file.parent.mkdir(exist_ok=True)
                
                with open(report_file, 'w') as f:
                    f.write(report)
                
                # Commit report
                monitor.git_repo.index.add([str(report_file)])
                monitor.git_repo.index.commit(f"Drift detection report - {len(drift_detections)} issues found")
                monitor.git_repo.remotes.origin.push()
                
            else:
                logger.info("No configuration drift detected")
            
            # Wait for next cycle
            await asyncio.sleep(pipeline_config['polling_interval'])
            
        except Exception as e:
            logger.error(f"GitOps pipeline error: {str(e)}")
            await asyncio.sleep(60)  # Wait before retrying

if __name__ == "__main__":
    asyncio.run(setup_gitops_pipeline())

Advanced GitOps implementations require policy enforcement mechanisms that prevent configuration drift before it occurs. OCI Resource Manager integrates with Open Policy Agent (OPA) to provide policy-as-code capabilities that validate infrastructure changes against organizational standards.

Policy definitions stored in Git repositories enable version-controlled governance rules that automatically reject non-compliant infrastructure changes. These policies can enforce security baselines, cost optimization requirements, and operational standards across all infrastructure deployments.

The integration supports both admission control policies that prevent problematic changes and monitoring policies that detect violations in existing infrastructure. This dual approach ensures comprehensive coverage while maintaining operational flexibility.

Regards
Osama

Advanced OCI Autonomous Database Automation with Fleet Management and Performance Intelligence

Posted on June 18, 2025 by Osama Mustafa in Cloud, OCI

Oracle Cloud Infrastructure’s Autonomous Database represents a paradigm shift in database administration, combining machine learning-driven automation with enterprise-grade performance and security. This comprehensive guide explores advanced fleet management strategies, performance optimization techniques, and automated operational workflows that enable organizations to scale database operations efficiently while maintaining optimal performance.

Autonomous Database Architecture Deep Dive

Autonomous Database operates on a fully managed infrastructure where Oracle handles patching, tuning, scaling, and security updates automatically. The service leverages machine learning algorithms trained on Oracle’s extensive database performance dataset to make real-time optimization decisions without human intervention.
The architecture consists of multiple layers of automation. The infrastructure layer manages compute and storage scaling based on workload demands. The database layer continuously optimizes SQL execution plans, indexes, and memory allocation. The security layer automatically applies patches and implements threat detection mechanisms.
Unlike traditional database services, Autonomous Database provides predictable performance through automatic workload management. The service can handle mixed workloads by automatically prioritizing critical transactions and throttling less important background processes during peak periods.

Resource allocation occurs dynamically across CPU, memory, and I/O subsystems. The machine learning algorithms analyze query patterns and automatically adjust resource distribution to optimize for current workload characteristics while maintaining performance SLAs.

Fleet Management and Automation Strategies

Managing multiple Autonomous Databases across development, testing, and production environments requires sophisticated automation strategies. Fleet management enables consistent configuration, monitoring, and lifecycle management across database instances.

Automated provisioning workflows ensure new database instances follow organizational standards for security, backup policies, and resource allocation. Template-based deployment eliminates configuration drift and reduces manual errors during database creation.

Cross-database monitoring provides unified visibility into performance metrics, resource utilization, and cost optimization opportunities across the entire database fleet. Centralized alerting ensures rapid response to performance degradation or security incidents.

Production Implementation Example

Here’s a comprehensive implementation of automated Autonomous Database fleet management with advanced monitoring and optimization:
Terraform Infrastructure for Database Fleet

# Variables for fleet configuration
variable "database_environments" {
  description = "Database environments configuration"
  type = map(object({
    cpu_core_count          = number
    data_storage_size_in_tbs = number
    display_name           = string
    db_name               = string
    admin_password        = string
    db_workload           = string
    license_model         = string
    whitelisted_ips       = list(string)
    auto_scaling_enabled  = bool
    backup_retention_days = number
  }))
  default = {
    production = {
      cpu_core_count          = 4
      data_storage_size_in_tbs = 2
      display_name           = "Production ADB"
      db_name               = "PRODADB"
      admin_password        = "ComplexPassword123!"
      db_workload           = "OLTP"
      license_model         = "LICENSE_INCLUDED"
      whitelisted_ips       = ["10.0.0.0/16"]
      auto_scaling_enabled  = true
      backup_retention_days = 30
    }
    staging = {
      cpu_core_count          = 2
      data_storage_size_in_tbs = 1
      display_name           = "Staging ADB"
      db_name               = "STAGINGADB"
      admin_password        = "ComplexPassword123!"
      db_workload           = "OLTP"
      license_model         = "LICENSE_INCLUDED"
      whitelisted_ips       = ["10.0.0.0/16"]
      auto_scaling_enabled  = false
      backup_retention_days = 7
    }
  }
}

# Autonomous Database instances
resource "oci_database_autonomous_database" "fleet_databases" {
  for_each = var.database_environments
  
  compartment_id              = var.compartment_id
  cpu_core_count             = each.value.cpu_core_count
  data_storage_size_in_tbs   = each.value.data_storage_size_in_tbs
  db_name                    = each.value.db_name
  display_name               = each.value.display_name
  admin_password             = each.value.admin_password
  db_workload               = each.value.db_workload
  license_model             = each.value.license_model
  is_auto_scaling_enabled   = each.value.auto_scaling_enabled
  
  # Network security
  whitelisted_ips = each.value.whitelisted_ips
  subnet_id      = oci_core_subnet.database_subnet.id
  nsg_ids        = [oci_core_network_security_group.database_nsg.id]
  
  # Backup configuration
  backup_config {
    manual_backup_bucket_name = oci_objectstorage_bucket.backup_bucket[each.key].name
    manual_backup_type       = "OBJECT_STORE"
  }
  
  # Enable advanced features
  operations_insights_status = "ENABLED"
  database_management_status = "ENABLED"
  
  # Tags for fleet management
  defined_tags = {
    "Operations.Environment" = each.key
    "Operations.CostCenter" = "Database"
    "Operations.Owner"      = "DBA-Team"
  }
  
  lifecycle {
    ignore_changes = [
      admin_password,
    ]
  }
}

# Dedicated backup buckets per environment
resource "oci_objectstorage_bucket" "backup_bucket" {
  for_each       = var.database_environments
  compartment_id = var.compartment_id
  name          = "${each.key}-adb-backups"
  namespace     = data.oci_objectstorage_namespace.ns.namespace
  
  retention_rules {
    display_name = "backup-retention"
    duration {
      time_amount = each.value.backup_retention_days
      time_unit   = "DAYS"
    }
    time_rule_locked = formatdate("YYYY-MM-DD'T'hh:mm:ss'Z'", timeadd(timestamp(), "24h"))
  }
  
  object_events_enabled = true
  versioning           = "Enabled"
}

# Database monitoring alarms
resource "oci_monitoring_alarm" "cpu_utilization" {
  for_each                = var.database_environments
  compartment_id         = var.compartment_id
  destinations          = [oci_ons_notification_topic.database_alerts.id]
  display_name          = "${each.value.display_name} - High CPU"
  is_enabled            = true
  metric_compartment_id = var.compartment_id
  namespace             = "oci_autonomous_database"
  
  query = "CpuUtilization[5m]{resourceId = \"${oci_database_autonomous_database.fleet_databases[each.key].id}\"}.mean() > 80"
  
  severity = "WARNING"
  
  suppression {
    time_suppress_from  = "0T08:00:00Z"
    time_suppress_until = "0T09:00:00Z"
  }
  
  repeat_notification_duration = "PT2H"
}

resource "oci_monitoring_alarm" "storage_utilization" {
  for_each                = var.database_environments
  compartment_id         = var.compartment_id
  destinations          = [oci_ons_notification_topic.database_alerts.id]
  display_name          = "${each.value.display_name} - High Storage"
  is_enabled            = true
  metric_compartment_id = var.compartment_id
  namespace             = "oci_autonomous_database"
  
  query = "StorageUtilization[5m]{resourceId = \"${oci_database_autonomous_database.fleet_databases[each.key].id}\"}.mean() > 85"
  
  severity = "CRITICAL"
  repeat_notification_duration = "PT30M"
}

# Network Security Group for database access
resource "oci_core_network_security_group" "database_nsg" {
  compartment_id = var.compartment_id
  vcn_id        = oci_core_vcn.database_vcn.id
  display_name  = "database-nsg"
}

resource "oci_core_network_security_group_security_rule" "database_ingress_https" {
  network_security_group_id = oci_core_network_security_group.database_nsg.id
  direction                 = "INGRESS"
  protocol                  = "6"
  source                   = "10.0.0.0/16"
  source_type              = "CIDR_BLOCK"
  
  tcp_options {
    destination_port_range {
      max = 1522
      min = 1521
    }
  }
}

# Notification topic for database alerts
resource "oci_ons_notification_topic" "database_alerts" {
  compartment_id = var.compartment_id
  name          = "database-fleet-alerts"
  description   = "Alerts for Autonomous Database fleet"
}

Advanced Performance Monitoring Script

#!/usr/bin/env python3
"""
Advanced Autonomous Database Fleet Performance Monitor
Provides automated performance analysis, recommendation generation,
and proactive optimization suggestions.
"""

import oci
import json
import logging
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Optional
import cx_Oracle
import asyncio
import aiohttp
from dataclasses import dataclass
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

@dataclass
class DatabaseMetrics:
    """Database performance metrics container"""
    database_id: str
    database_name: str
    cpu_utilization: float
    memory_utilization: float
    storage_utilization: float
    active_sessions: int
    blocked_sessions: int
    average_response_time: float
    throughput_transactions: float
    wait_events: Dict[str, float]
    top_sql: List[Dict]
    timestamp: datetime

@dataclass
class PerformanceRecommendation:
    """Performance optimization recommendation"""
    database_id: str
    category: str
    severity: str
    title: str
    description: str
    impact_score: float
    implementation_effort: str
    sql_statements: List[str]

class AutonomousDatabaseFleetMonitor:
    def __init__(self, config_file: str = 'config.json'):
        """Initialize the fleet monitoring system"""
        self.config = self._load_config(config_file)
        self.signer = oci.auth.signers.get_resource_principals_signer()
        
        # Initialize OCI clients
        self.db_client = oci.database.DatabaseClient({}, signer=self.signer)
        self.monitoring_client = oci.monitoring.MonitoringClient({}, signer=self.signer)
        self.ons_client = oci.ons.NotificationDataPlaneClient({}, signer=self.signer)
        
        # Performance thresholds
        self.thresholds = {
            'cpu_warning': 70.0,
            'cpu_critical': 85.0,
            'memory_warning': 75.0,
            'memory_critical': 90.0,
            'storage_warning': 80.0,
            'storage_critical': 90.0,
            'response_time_warning': 2.0,
            'response_time_critical': 5.0
        }
        
        # Initialize database connections cache
        self.db_connections = {}

    def _load_config(self, config_file: str) -> Dict:
        """Load configuration from JSON file"""
        try:
            with open(config_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            logger.error(f"Configuration file {config_file} not found")
            return {}

    async def monitor_fleet(self) -> List[DatabaseMetrics]:
        """Monitor all databases in the fleet"""
        databases = await self._discover_databases()
        monitoring_tasks = [
            self._monitor_database(db) for db in databases
        ]
        
        results = await asyncio.gather(*monitoring_tasks, return_exceptions=True)
        
        # Filter out exceptions and return valid metrics
        valid_metrics = [
            result for result in results 
            if isinstance(result, DatabaseMetrics)
        ]
        
        # Log any errors
        for result in results:
            if isinstance(result, Exception):
                logger.error(f"Monitoring error: {str(result)}")
        
        return valid_metrics

    async def _discover_databases(self) -> List[Dict]:
        """Discover all Autonomous Databases in the compartment"""
        try:
            response = self.db_client.list_autonomous_databases(
                compartment_id=self.config['compartment_id'],
                lifecycle_state='AVAILABLE'
            )
            return response.data
        except Exception as e:
            logger.error(f"Failed to discover databases: {str(e)}")
            return []

    async def _monitor_database(self, database: Dict) -> DatabaseMetrics:
        """Monitor individual database performance"""
        db_id = database.id
        db_name = database.display_name
        
        try:
            # Get connection to database
            connection = await self._get_database_connection(database)
            
            # Collect performance metrics
            cpu_util = await self._get_cpu_utilization(db_id)
            memory_util = await self._get_memory_utilization(connection)
            storage_util = await self._get_storage_utilization(db_id)
            session_metrics = await self._get_session_metrics(connection)
            response_time = await self._get_response_time_metrics(connection)
            throughput = await self._get_throughput_metrics(connection)
            wait_events = await self._get_wait_events(connection)
            top_sql = await self._get_top_sql_statements(connection)
            
            return DatabaseMetrics(
                database_id=db_id,
                database_name=db_name,
                cpu_utilization=cpu_util,
                memory_utilization=memory_util,
                storage_utilization=storage_util,
                active_sessions=session_metrics['active'],
                blocked_sessions=session_metrics['blocked'],
                average_response_time=response_time,
                throughput_transactions=throughput,
                wait_events=wait_events,
                top_sql=top_sql,
                timestamp=datetime.utcnow()
            )
            
        except Exception as e:
            logger.error(f"Error monitoring database {db_name}: {str(e)}")
            raise

    async def _get_database_connection(self, database: Dict):
        """Get or create database connection"""
        db_id = database.id
        
        if db_id not in self.db_connections:
            try:
                # Get connection details
                wallet_response = self.db_client.generate_autonomous_database_wallet(
                    autonomous_database_id=db_id,
                    generate_autonomous_database_wallet_details=oci.database.models.GenerateAutonomousDatabaseWalletDetails(
                        password="WalletPassword123!"
                    )
                )
                
                # Create connection (implementation depends on wallet setup)
                # This is a simplified example
                connection_string = f"{database.connection_urls.sql_dev_web_url}"
                
                connection = cx_Oracle.connect(
                    user="ADMIN",
                    password=self.config['admin_password'],
                    dsn=connection_string
                )
                
                self.db_connections[db_id] = connection
                
            except Exception as e:
                logger.error(f"Failed to connect to database {database.display_name}: {str(e)}")
                raise
        
        return self.db_connections[db_id]

    async def _get_cpu_utilization(self, database_id: str) -> float:
        """Get CPU utilization from OCI Monitoring"""
        try:
            end_time = datetime.utcnow()
            start_time = end_time - timedelta(minutes=5)
            
            response = self.monitoring_client.summarize_metrics_data(
                compartment_id=self.config['compartment_id'],
                summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
                    namespace="oci_autonomous_database",
                    query=f'CpuUtilization[1m]{{resourceId = "{database_id}"}}.mean()',
                    start_time=start_time,
                    end_time=end_time
                )
            )
            
            if response.data and response.data[0].aggregated_datapoints:
                latest_datapoint = response.data[0].aggregated_datapoints[-1]
                return latest_datapoint.value
            
            return 0.0
            
        except Exception as e:
            logger.error(f"Failed to get CPU utilization: {str(e)}")
            return 0.0

    async def _get_memory_utilization(self, connection) -> float:
        """Get memory utilization from database"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT ROUND((1 - (bytes_free / bytes_total)) * 100, 2) as memory_usage_pct
                FROM (
                    SELECT SUM(bytes) as bytes_total
                    FROM v$sgainfo
                    WHERE name = 'Maximum SGA Size'
                ), (
                    SELECT SUM(bytes) as bytes_free
                    FROM v$sgastat
                    WHERE name = 'free memory'
                )
            """)
            result = cursor.fetchone()
            cursor.close()
            return float(result[0]) if result else 0.0
        except Exception as e:
            logger.error(f"Failed to get memory utilization: {str(e)}")
            return 0.0

    async def _get_storage_utilization(self, database_id: str) -> float:
        """Get storage utilization from OCI Monitoring"""
        try:
            end_time = datetime.utcnow()
            start_time = end_time - timedelta(minutes=5)
            
            response = self.monitoring_client.summarize_metrics_data(
                compartment_id=self.config['compartment_id'],
                summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
                    namespace="oci_autonomous_database",
                    query=f'StorageUtilization[1m]{{resourceId = "{database_id}"}}.mean()',
                    start_time=start_time,
                    end_time=end_time
                )
            )
            
            if response.data and response.data[0].aggregated_datapoints:
                latest_datapoint = response.data[0].aggregated_datapoints[-1]
                return latest_datapoint.value
            
            return 0.0
            
        except Exception as e:
            logger.error(f"Failed to get storage utilization: {str(e)}")
            return 0.0

    async def _get_session_metrics(self, connection) -> Dict[str, int]:
        """Get session metrics from database"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT 
                    COUNT(CASE WHEN status = 'ACTIVE' THEN 1 END) as active_sessions,
                    COUNT(CASE WHEN blocking_session IS NOT NULL THEN 1 END) as blocked_sessions
                FROM v$session
                WHERE type = 'USER'
            """)
            result = cursor.fetchone()
            cursor.close()
            
            return {
                'active': int(result[0]) if result[0] else 0,
                'blocked': int(result[1]) if result[1] else 0
            }
        except Exception as e:
            logger.error(f"Failed to get session metrics: {str(e)}")
            return {'active': 0, 'blocked': 0}

    async def _get_response_time_metrics(self, connection) -> float:
        """Get average response time metrics"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT AVG(elapsed_time) / 1000000 as avg_response_time_seconds
                FROM v$sql
                WHERE last_active_time > SYSDATE - 1/24
                AND executions > 0
            """)
            result = cursor.fetchone()
            cursor.close()
            return float(result[0]) if result and result[0] else 0.0
        except Exception as e:
            logger.error(f"Failed to get response time metrics: {str(e)}")
            return 0.0

    async def _get_throughput_metrics(self, connection) -> float:
        """Get transaction throughput metrics"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT value
                FROM v$sysstat
                WHERE name = 'user commits'
            """)
            result = cursor.fetchone()
            cursor.close()
            return float(result[0]) if result else 0.0
        except Exception as e:
            logger.error(f"Failed to get throughput metrics: {str(e)}")
            return 0.0

    async def _get_wait_events(self, connection) -> Dict[str, float]:
        """Get top wait events"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT event, time_waited_micro / 1000000 as time_waited_seconds
                FROM v$system_event
                WHERE wait_class != 'Idle'
                ORDER BY time_waited_micro DESC
                FETCH FIRST 10 ROWS ONLY
            """)
            results = cursor.fetchall()
            cursor.close()
            
            return {row[0]: float(row[1]) for row in results}
        except Exception as e:
            logger.error(f"Failed to get wait events: {str(e)}")
            return {}

    async def _get_top_sql_statements(self, connection) -> List[Dict]:
        """Get top SQL statements by various metrics"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT 
                    sql_id,
                    executions,
                    elapsed_time / 1000000 as elapsed_seconds,
                    cpu_time / 1000000 as cpu_seconds,
                    buffer_gets,
                    disk_reads,
                    SUBSTR(sql_text, 1, 100) as sql_text_preview
                FROM v$sql
                WHERE executions > 0
                ORDER BY elapsed_time DESC
                FETCH FIRST 20 ROWS ONLY
            """)
            results = cursor.fetchall()
            cursor.close()
            
            return [
                {
                    'sql_id': row[0],
                    'executions': int(row[1]),
                    'elapsed_seconds': float(row[2]),
                    'cpu_seconds': float(row[3]),
                    'buffer_gets': int(row[4]),
                    'disk_reads': int(row[5]),
                    'sql_text_preview': row[6]
                }
                for row in results
            ]
        except Exception as e:
            logger.error(f"Failed to get top SQL statements: {str(e)}")
            return []

    async def analyze_performance(self, metrics: List[DatabaseMetrics]) -> List[PerformanceRecommendation]:
        """Analyze performance metrics and generate recommendations"""
        recommendations = []
        
        for metric in metrics:
            # CPU analysis
            if metric.cpu_utilization > self.thresholds['cpu_critical']:
                recommendations.append(
                    PerformanceRecommendation(
                        database_id=metric.database_id,
                        category="CPU",
                        severity="CRITICAL",
                        title="High CPU Utilization",
                        description=f"CPU utilization is {metric.cpu_utilization:.1f}%, exceeding critical threshold",
                        impact_score=0.9,
                        implementation_effort="LOW",
                        sql_statements=["ALTER DATABASE SET auto_scaling = TRUE;"]
                    )
                )
            
            # Memory analysis
            if metric.memory_utilization > self.thresholds['memory_critical']:
                recommendations.append(
                    PerformanceRecommendation(
                        database_id=metric.database_id,
                        category="MEMORY",
                        severity="CRITICAL",
                        title="High Memory Utilization",
                        description=f"Memory utilization is {metric.memory_utilization:.1f}%, consider scaling up",
                        impact_score=0.8,
                        implementation_effort="MEDIUM",
                        sql_statements=["-- Consider increasing CPU cores to get more memory"]
                    )
                )
            
            # Storage analysis
            if metric.storage_utilization > self.thresholds['storage_critical']:
                recommendations.append(
                    PerformanceRecommendation(
                        database_id=metric.database_id,
                        category="STORAGE",
                        severity="CRITICAL",
                        title="High Storage Utilization",
                        description=f"Storage utilization is {metric.storage_utilization:.1f}%, expand storage immediately",
                        impact_score=0.95,
                        implementation_effort="LOW",
                        sql_statements=["-- Storage will auto-expand, monitor costs"]
                    )
                )
            
            # Session analysis
            if metric.blocked_sessions > 0:
                recommendations.append(
                    PerformanceRecommendation(
                        database_id=metric.database_id,
                        category="CONCURRENCY",
                        severity="WARNING",
                        title="Blocked Sessions Detected",
                        description=f"{metric.blocked_sessions} blocked sessions found, investigate locking",
                        impact_score=0.7,
                        implementation_effort="HIGH",
                        sql_statements=[
                            "SELECT * FROM v$lock WHERE block > 0;",
                            "SELECT * FROM v$session WHERE blocking_session IS NOT NULL;"
                        ]
                    )
                )
            
            # Response time analysis
            if metric.average_response_time > self.thresholds['response_time_critical']:
                recommendations.append(
                    PerformanceRecommendation(
                        database_id=metric.database_id,
                        category="PERFORMANCE",
                        severity="WARNING",
                        title="High Response Time",
                        description=f"Average response time is {metric.average_response_time:.2f}s, optimize queries",
                        impact_score=0.6,
                        implementation_effort="HIGH",
                        sql_statements=[
                            "-- Review top SQL statements for optimization opportunities",
                            "-- Consider adding indexes for frequently accessed data"
                        ]
                    )
                )
        
        return recommendations

    async def generate_fleet_report(self, metrics: List[DatabaseMetrics], 
                                  recommendations: List[PerformanceRecommendation]) -> str:
        """Generate comprehensive fleet performance report"""
        report = f"""
# Autonomous Database Fleet Performance Report
Generated: {datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S UTC')}

## Fleet Summary
- Total Databases: {len(metrics)}
- Databases with Issues: {len([m for m in metrics if any(r.database_id == m.database_id for r in recommendations)])}
- Critical Recommendations: {len([r for r in recommendations if r.severity == 'CRITICAL'])}

## Database Performance Overview
"""
        
        for metric in metrics:
            db_recommendations = [r for r in recommendations if r.database_id == metric.database_id]
            critical_issues = len([r for r in db_recommendations if r.severity == 'CRITICAL'])
            
            report += f"""
### {metric.database_name}
- CPU Utilization: {metric.cpu_utilization:.1f}%
- Memory Utilization: {metric.memory_utilization:.1f}%
- Storage Utilization: {metric.storage_utilization:.1f}%
- Active Sessions: {metric.active_sessions}
- Blocked Sessions: {metric.blocked_sessions}
- Average Response Time: {metric.average_response_time:.2f}s
- Critical Issues: {critical_issues}
"""
        
        if recommendations:
            report += "\n## Recommendations\n"
            for rec in sorted(recommendations, key=lambda x: x.impact_score, reverse=True):
                report += f"""
### {rec.title} - {rec.severity}
- Database: {next(m.database_name for m in metrics if m.database_id == rec.database_id)}
- Category: {rec.category}
- Impact Score: {rec.impact_score:.1f}
- Implementation Effort: {rec.implementation_effort}
- Description: {rec.description}
"""
        
        return report

# Main execution function
async def main():
    """Main monitoring execution"""
    monitor = AutonomousDatabaseFleetMonitor()
    
    try:
        # Monitor fleet
        logger.info("Starting fleet monitoring...")
        metrics = await monitor.monitor_fleet()
        logger.info(f"Collected metrics from {len(metrics)} databases")
        
        # Analyze performance
        recommendations = await monitor.analyze_performance(metrics)
        logger.info(f"Generated {len(recommendations)} recommendations")
        
        # Generate report
        report = await monitor.generate_fleet_report(metrics, recommendations)
        
        # Save report
        with open(f"fleet_report_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}.md", 'w') as f:
            f.write(report)
        
        logger.info("Fleet monitoring completed successfully")
        
    except Exception as e:
        logger.error(f"Fleet monitoring failed: {str(e)}")
        raise

if __name__ == "__main__":
    asyncio.run(main())

Advanced Performance Optimization Techniques

Autonomous Database provides several advanced optimization features that can be leveraged programmatically. Automatic indexing continuously monitors query patterns and creates or drops indexes based on actual usage patterns. This feature eliminates the traditional DBA task of index management while ensuring optimal query performance.

SQL plan management automatically captures and evolves execution plans, preventing performance regressions when statistics change or new Oracle versions are deployed. The system maintains a repository of proven execution plans and automatically selects the best plan for each SQL statement.

Real-time SQL monitoring provides detailed execution statistics for long-running queries, enabling identification of performance bottlenecks during execution rather than after completion. This capability is essential for optimizing complex analytical workloads and batch processing operations.

Automated Scaling and Cost Optimization

Autonomous Database’s auto-scaling feature dynamically adjusts CPU resources based on workload demands, but understanding the patterns enables better cost optimization. Monitoring CPU utilization patterns over time reveals opportunities for right-sizing base allocations while maintaining auto-scaling for peak periods.

Scheduled scaling operations can be implemented to proactively adjust resources for known workload patterns, such as batch processing windows or business reporting cycles. This approach optimizes costs by scaling down during predictable low-usage periods.

Storage auto-expansion occurs automatically, but monitoring growth patterns enables better capacity planning and cost forecasting. Integration with OCI Cost Management APIs provides automated cost tracking and budget alerting capabilities.

Security and Compliance Automation

Database security automation encompasses multiple layers of protection. Automatic patching ensures systems remain current with security updates without manual intervention. Data encryption occurs automatically for data at rest and in transit, with key rotation handled transparently.

Audit logging automation captures all database activities and integrates with OCI Logging Analytics for security event correlation and threat detection. Automated compliance reporting generates audit trails required for regulatory compliance frameworks.

Access control automation integrates with OCI Identity and Access Management to ensure consistent security policies across the database fleet. Database user lifecycle management can be automated through integration with enterprise identity management systems.

This comprehensive approach to Autonomous Database management enables organizations to operate enterprise