Technology Geek

Advanced OCI Autonomous Database Automation with Fleet Management and Performance Intelligence

Posted on June 18, 2025 by Osama Mustafa in Cloud, OCI

Oracle Cloud Infrastructure’s Autonomous Database represents a paradigm shift in database administration, combining machine learning-driven automation with enterprise-grade performance and security. This comprehensive guide explores advanced fleet management strategies, performance optimization techniques, and automated operational workflows that enable organizations to scale database operations efficiently while maintaining optimal performance.

Autonomous Database Architecture Deep Dive

Autonomous Database operates on a fully managed infrastructure where Oracle handles patching, tuning, scaling, and security updates automatically. The service leverages machine learning algorithms trained on Oracle’s extensive database performance dataset to make real-time optimization decisions without human intervention.
The architecture consists of multiple layers of automation. The infrastructure layer manages compute and storage scaling based on workload demands. The database layer continuously optimizes SQL execution plans, indexes, and memory allocation. The security layer automatically applies patches and implements threat detection mechanisms.
Unlike traditional database services, Autonomous Database provides predictable performance through automatic workload management. The service can handle mixed workloads by automatically prioritizing critical transactions and throttling less important background processes during peak periods.

Resource allocation occurs dynamically across CPU, memory, and I/O subsystems. The machine learning algorithms analyze query patterns and automatically adjust resource distribution to optimize for current workload characteristics while maintaining performance SLAs.

Fleet Management and Automation Strategies

Managing multiple Autonomous Databases across development, testing, and production environments requires sophisticated automation strategies. Fleet management enables consistent configuration, monitoring, and lifecycle management across database instances.

Automated provisioning workflows ensure new database instances follow organizational standards for security, backup policies, and resource allocation. Template-based deployment eliminates configuration drift and reduces manual errors during database creation.

Cross-database monitoring provides unified visibility into performance metrics, resource utilization, and cost optimization opportunities across the entire database fleet. Centralized alerting ensures rapid response to performance degradation or security incidents.

Production Implementation Example

Here’s a comprehensive implementation of automated Autonomous Database fleet management with advanced monitoring and optimization:
Terraform Infrastructure for Database Fleet

# Variables for fleet configuration
variable "database_environments" {
  description = "Database environments configuration"
  type = map(object({
    cpu_core_count          = number
    data_storage_size_in_tbs = number
    display_name           = string
    db_name               = string
    admin_password        = string
    db_workload           = string
    license_model         = string
    whitelisted_ips       = list(string)
    auto_scaling_enabled  = bool
    backup_retention_days = number
  }))
  default = {
    production = {
      cpu_core_count          = 4
      data_storage_size_in_tbs = 2
      display_name           = "Production ADB"
      db_name               = "PRODADB"
      admin_password        = "ComplexPassword123!"
      db_workload           = "OLTP"
      license_model         = "LICENSE_INCLUDED"
      whitelisted_ips       = ["10.0.0.0/16"]
      auto_scaling_enabled  = true
      backup_retention_days = 30
    }
    staging = {
      cpu_core_count          = 2
      data_storage_size_in_tbs = 1
      display_name           = "Staging ADB"
      db_name               = "STAGINGADB"
      admin_password        = "ComplexPassword123!"
      db_workload           = "OLTP"
      license_model         = "LICENSE_INCLUDED"
      whitelisted_ips       = ["10.0.0.0/16"]
      auto_scaling_enabled  = false
      backup_retention_days = 7
    }
  }
}

# Autonomous Database instances
resource "oci_database_autonomous_database" "fleet_databases" {
  for_each = var.database_environments
  
  compartment_id              = var.compartment_id
  cpu_core_count             = each.value.cpu_core_count
  data_storage_size_in_tbs   = each.value.data_storage_size_in_tbs
  db_name                    = each.value.db_name
  display_name               = each.value.display_name
  admin_password             = each.value.admin_password
  db_workload               = each.value.db_workload
  license_model             = each.value.license_model
  is_auto_scaling_enabled   = each.value.auto_scaling_enabled
  
  # Network security
  whitelisted_ips = each.value.whitelisted_ips
  subnet_id      = oci_core_subnet.database_subnet.id
  nsg_ids        = [oci_core_network_security_group.database_nsg.id]
  
  # Backup configuration
  backup_config {
    manual_backup_bucket_name = oci_objectstorage_bucket.backup_bucket[each.key].name
    manual_backup_type       = "OBJECT_STORE"
  }
  
  # Enable advanced features
  operations_insights_status = "ENABLED"
  database_management_status = "ENABLED"
  
  # Tags for fleet management
  defined_tags = {
    "Operations.Environment" = each.key
    "Operations.CostCenter" = "Database"
    "Operations.Owner"      = "DBA-Team"
  }
  
  lifecycle {
    ignore_changes = [
      admin_password,
    ]
  }
}

# Dedicated backup buckets per environment
resource "oci_objectstorage_bucket" "backup_bucket" {
  for_each       = var.database_environments
  compartment_id = var.compartment_id
  name          = "${each.key}-adb-backups"
  namespace     = data.oci_objectstorage_namespace.ns.namespace
  
  retention_rules {
    display_name = "backup-retention"
    duration {
      time_amount = each.value.backup_retention_days
      time_unit   = "DAYS"
    }
    time_rule_locked = formatdate("YYYY-MM-DD'T'hh:mm:ss'Z'", timeadd(timestamp(), "24h"))
  }
  
  object_events_enabled = true
  versioning           = "Enabled"
}

# Database monitoring alarms
resource "oci_monitoring_alarm" "cpu_utilization" {
  for_each                = var.database_environments
  compartment_id         = var.compartment_id
  destinations          = [oci_ons_notification_topic.database_alerts.id]
  display_name          = "${each.value.display_name} - High CPU"
  is_enabled            = true
  metric_compartment_id = var.compartment_id
  namespace             = "oci_autonomous_database"
  
  query = "CpuUtilization[5m]{resourceId = \"${oci_database_autonomous_database.fleet_databases[each.key].id}\"}.mean() > 80"
  
  severity = "WARNING"
  
  suppression {
    time_suppress_from  = "0T08:00:00Z"
    time_suppress_until = "0T09:00:00Z"
  }
  
  repeat_notification_duration = "PT2H"
}

resource "oci_monitoring_alarm" "storage_utilization" {
  for_each                = var.database_environments
  compartment_id         = var.compartment_id
  destinations          = [oci_ons_notification_topic.database_alerts.id]
  display_name          = "${each.value.display_name} - High Storage"
  is_enabled            = true
  metric_compartment_id = var.compartment_id
  namespace             = "oci_autonomous_database"
  
  query = "StorageUtilization[5m]{resourceId = \"${oci_database_autonomous_database.fleet_databases[each.key].id}\"}.mean() > 85"
  
  severity = "CRITICAL"
  repeat_notification_duration = "PT30M"
}

# Network Security Group for database access
resource "oci_core_network_security_group" "database_nsg" {
  compartment_id = var.compartment_id
  vcn_id        = oci_core_vcn.database_vcn.id
  display_name  = "database-nsg"
}

resource "oci_core_network_security_group_security_rule" "database_ingress_https" {
  network_security_group_id = oci_core_network_security_group.database_nsg.id
  direction                 = "INGRESS"
  protocol                  = "6"
  source                   = "10.0.0.0/16"
  source_type              = "CIDR_BLOCK"
  
  tcp_options {
    destination_port_range {
      max = 1522
      min = 1521
    }
  }
}

# Notification topic for database alerts
resource "oci_ons_notification_topic" "database_alerts" {
  compartment_id = var.compartment_id
  name          = "database-fleet-alerts"
  description   = "Alerts for Autonomous Database fleet"
}

Advanced Performance Monitoring Script

#!/usr/bin/env python3
"""
Advanced Autonomous Database Fleet Performance Monitor
Provides automated performance analysis, recommendation generation,
and proactive optimization suggestions.
"""

import oci
import json
import logging
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from typing import Dict, List, Tuple, Optional
import cx_Oracle
import asyncio
import aiohttp
from dataclasses import dataclass
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

@dataclass
class DatabaseMetrics:
    """Database performance metrics container"""
    database_id: str
    database_name: str
    cpu_utilization: float
    memory_utilization: float
    storage_utilization: float
    active_sessions: int
    blocked_sessions: int
    average_response_time: float
    throughput_transactions: float
    wait_events: Dict[str, float]
    top_sql: List[Dict]
    timestamp: datetime

@dataclass
class PerformanceRecommendation:
    """Performance optimization recommendation"""
    database_id: str
    category: str
    severity: str
    title: str
    description: str
    impact_score: float
    implementation_effort: str
    sql_statements: List[str]

class AutonomousDatabaseFleetMonitor:
    def __init__(self, config_file: str = 'config.json'):
        """Initialize the fleet monitoring system"""
        self.config = self._load_config(config_file)
        self.signer = oci.auth.signers.get_resource_principals_signer()
        
        # Initialize OCI clients
        self.db_client = oci.database.DatabaseClient({}, signer=self.signer)
        self.monitoring_client = oci.monitoring.MonitoringClient({}, signer=self.signer)
        self.ons_client = oci.ons.NotificationDataPlaneClient({}, signer=self.signer)
        
        # Performance thresholds
        self.thresholds = {
            'cpu_warning': 70.0,
            'cpu_critical': 85.0,
            'memory_warning': 75.0,
            'memory_critical': 90.0,
            'storage_warning': 80.0,
            'storage_critical': 90.0,
            'response_time_warning': 2.0,
            'response_time_critical': 5.0
        }
        
        # Initialize database connections cache
        self.db_connections = {}

    def _load_config(self, config_file: str) -> Dict:
        """Load configuration from JSON file"""
        try:
            with open(config_file, 'r') as f:
                return json.load(f)
        except FileNotFoundError:
            logger.error(f"Configuration file {config_file} not found")
            return {}

    async def monitor_fleet(self) -> List[DatabaseMetrics]:
        """Monitor all databases in the fleet"""
        databases = await self._discover_databases()
        monitoring_tasks = [
            self._monitor_database(db) for db in databases
        ]
        
        results = await asyncio.gather(*monitoring_tasks, return_exceptions=True)
        
        # Filter out exceptions and return valid metrics
        valid_metrics = [
            result for result in results 
            if isinstance(result, DatabaseMetrics)
        ]
        
        # Log any errors
        for result in results:
            if isinstance(result, Exception):
                logger.error(f"Monitoring error: {str(result)}")
        
        return valid_metrics

    async def _discover_databases(self) -> List[Dict]:
        """Discover all Autonomous Databases in the compartment"""
        try:
            response = self.db_client.list_autonomous_databases(
                compartment_id=self.config['compartment_id'],
                lifecycle_state='AVAILABLE'
            )
            return response.data
        except Exception as e:
            logger.error(f"Failed to discover databases: {str(e)}")
            return []

    async def _monitor_database(self, database: Dict) -> DatabaseMetrics:
        """Monitor individual database performance"""
        db_id = database.id
        db_name = database.display_name
        
        try:
            # Get connection to database
            connection = await self._get_database_connection(database)
            
            # Collect performance metrics
            cpu_util = await self._get_cpu_utilization(db_id)
            memory_util = await self._get_memory_utilization(connection)
            storage_util = await self._get_storage_utilization(db_id)
            session_metrics = await self._get_session_metrics(connection)
            response_time = await self._get_response_time_metrics(connection)
            throughput = await self._get_throughput_metrics(connection)
            wait_events = await self._get_wait_events(connection)
            top_sql = await self._get_top_sql_statements(connection)
            
            return DatabaseMetrics(
                database_id=db_id,
                database_name=db_name,
                cpu_utilization=cpu_util,
                memory_utilization=memory_util,
                storage_utilization=storage_util,
                active_sessions=session_metrics['active'],
                blocked_sessions=session_metrics['blocked'],
                average_response_time=response_time,
                throughput_transactions=throughput,
                wait_events=wait_events,
                top_sql=top_sql,
                timestamp=datetime.utcnow()
            )
            
        except Exception as e:
            logger.error(f"Error monitoring database {db_name}: {str(e)}")
            raise

    async def _get_database_connection(self, database: Dict):
        """Get or create database connection"""
        db_id = database.id
        
        if db_id not in self.db_connections:
            try:
                # Get connection details
                wallet_response = self.db_client.generate_autonomous_database_wallet(
                    autonomous_database_id=db_id,
                    generate_autonomous_database_wallet_details=oci.database.models.GenerateAutonomousDatabaseWalletDetails(
                        password="WalletPassword123!"
                    )
                )
                
                # Create connection (implementation depends on wallet setup)
                # This is a simplified example
                connection_string = f"{database.connection_urls.sql_dev_web_url}"
                
                connection = cx_Oracle.connect(
                    user="ADMIN",
                    password=self.config['admin_password'],
                    dsn=connection_string
                )
                
                self.db_connections[db_id] = connection
                
            except Exception as e:
                logger.error(f"Failed to connect to database {database.display_name}: {str(e)}")
                raise
        
        return self.db_connections[db_id]

    async def _get_cpu_utilization(self, database_id: str) -> float:
        """Get CPU utilization from OCI Monitoring"""
        try:
            end_time = datetime.utcnow()
            start_time = end_time - timedelta(minutes=5)
            
            response = self.monitoring_client.summarize_metrics_data(
                compartment_id=self.config['compartment_id'],
                summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
                    namespace="oci_autonomous_database",
                    query=f'CpuUtilization[1m]{{resourceId = "{database_id}"}}.mean()',
                    start_time=start_time,
                    end_time=end_time
                )
            )
            
            if response.data and response.data[0].aggregated_datapoints:
                latest_datapoint = response.data[0].aggregated_datapoints[-1]
                return latest_datapoint.value
            
            return 0.0
            
        except Exception as e:
            logger.error(f"Failed to get CPU utilization: {str(e)}")
            return 0.0

    async def _get_memory_utilization(self, connection) -> float:
        """Get memory utilization from database"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT ROUND((1 - (bytes_free / bytes_total)) * 100, 2) as memory_usage_pct
                FROM (
                    SELECT SUM(bytes) as bytes_total
                    FROM v$sgainfo
                    WHERE name = 'Maximum SGA Size'
                ), (
                    SELECT SUM(bytes) as bytes_free
                    FROM v$sgastat
                    WHERE name = 'free memory'
                )
            """)
            result = cursor.fetchone()
            cursor.close()
            return float(result[0]) if result else 0.0
        except Exception as e:
            logger.error(f"Failed to get memory utilization: {str(e)}")
            return 0.0

    async def _get_storage_utilization(self, database_id: str) -> float:
        """Get storage utilization from OCI Monitoring"""
        try:
            end_time = datetime.utcnow()
            start_time = end_time - timedelta(minutes=5)
            
            response = self.monitoring_client.summarize_metrics_data(
                compartment_id=self.config['compartment_id'],
                summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
                    namespace="oci_autonomous_database",
                    query=f'StorageUtilization[1m]{{resourceId = "{database_id}"}}.mean()',
                    start_time=start_time,
                    end_time=end_time
                )
            )
            
            if response.data and response.data[0].aggregated_datapoints:
                latest_datapoint = response.data[0].aggregated_datapoints[-1]
                return latest_datapoint.value
            
            return 0.0
            
        except Exception as e:
            logger.error(f"Failed to get storage utilization: {str(e)}")
            return 0.0

    async def _get_session_metrics(self, connection) -> Dict[str, int]:
        """Get session metrics from database"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT 
                    COUNT(CASE WHEN status = 'ACTIVE' THEN 1 END) as active_sessions,
                    COUNT(CASE WHEN blocking_session IS NOT NULL THEN 1 END) as blocked_sessions
                FROM v$session
                WHERE type = 'USER'
            """)
            result = cursor.fetchone()
            cursor.close()
            
            return {
                'active': int(result[0]) if result[0] else 0,
                'blocked': int(result[1]) if result[1] else 0
            }
        except Exception as e:
            logger.error(f"Failed to get session metrics: {str(e)}")
            return {'active': 0, 'blocked': 0}

    async def _get_response_time_metrics(self, connection) -> float:
        """Get average response time metrics"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT AVG(elapsed_time) / 1000000 as avg_response_time_seconds
                FROM v$sql
                WHERE last_active_time > SYSDATE - 1/24
                AND executions > 0
            """)
            result = cursor.fetchone()
            cursor.close()
            return float(result[0]) if result and result[0] else 0.0
        except Exception as e:
            logger.error(f"Failed to get response time metrics: {str(e)}")
            return 0.0

    async def _get_throughput_metrics(self, connection) -> float:
        """Get transaction throughput metrics"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT value
                FROM v$sysstat
                WHERE name = 'user commits'
            """)
            result = cursor.fetchone()
            cursor.close()
            return float(result[0]) if result else 0.0
        except Exception as e:
            logger.error(f"Failed to get throughput metrics: {str(e)}")
            return 0.0

    async def _get_wait_events(self, connection) -> Dict[str, float]:
        """Get top wait events"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT event, time_waited_micro / 1000000 as time_waited_seconds
                FROM v$system_event
                WHERE wait_class != 'Idle'
                ORDER BY time_waited_micro DESC
                FETCH FIRST 10 ROWS ONLY
            """)
            results = cursor.fetchall()
            cursor.close()
            
            return {row[0]: float(row[1]) for row in results}
        except Exception as e:
            logger.error(f"Failed to get wait events: {str(e)}")
            return {}

    async def _get_top_sql_statements(self, connection) -> List[Dict]:
        """Get top SQL statements by various metrics"""
        try:
            cursor = connection.cursor()
            cursor.execute("""
                SELECT 
                    sql_id,
                    executions,
                    elapsed_time / 1000000 as elapsed_seconds,
                    cpu_time / 1000000 as cpu_seconds,
                    buffer_gets,
                    disk_reads,
                    SUBSTR(sql_text, 1, 100) as sql_text_preview
                FROM v$sql
                WHERE executions > 0
                ORDER BY elapsed_time DESC
                FETCH FIRST 20 ROWS ONLY
            """)
            results = cursor.fetchall()
            cursor.close()
            
            return [
                {
                    'sql_id': row[0],
                    'executions': int(row[1]),
                    'elapsed_seconds': float(row[2]),
                    'cpu_seconds': float(row[3]),
                    'buffer_gets': int(row[4]),
                    'disk_reads': int(row[5]),
                    'sql_text_preview': row[6]
                }
                for row in results
            ]
        except Exception as e:
            logger.error(f"Failed to get top SQL statements: {str(e)}")
            return []

    async def analyze_performance(self, metrics: List[DatabaseMetrics]) -> List[PerformanceRecommendation]:
        """Analyze performance metrics and generate recommendations"""
        recommendations = []
        
        for metric in metrics:
            # CPU analysis
            if metric.cpu_utilization > self.thresholds['cpu_critical']:
                recommendations.append(
                    PerformanceRecommendation(
                        database_id=metric.database_id,
                        category="CPU",
                        severity="CRITICAL",
                        title="High CPU Utilization",
                        description=f"CPU utilization is {metric.cpu_utilization:.1f}%, exceeding critical threshold",
                        impact_score=0.9,
                        implementation_effort="LOW",
                        sql_statements=["ALTER DATABASE SET auto_scaling = TRUE;"]
                    )
                )
            
            # Memory analysis
            if metric.memory_utilization > self.thresholds['memory_critical']:
                recommendations.append(
                    PerformanceRecommendation(
                        database_id=metric.database_id,
                        category="MEMORY",
                        severity="CRITICAL",
                        title="High Memory Utilization",
                        description=f"Memory utilization is {metric.memory_utilization:.1f}%, consider scaling up",
                        impact_score=0.8,
                        implementation_effort="MEDIUM",
                        sql_statements=["-- Consider increasing CPU cores to get more memory"]
                    )
                )
            
            # Storage analysis
            if metric.storage_utilization > self.thresholds['storage_critical']:
                recommendations.append(
                    PerformanceRecommendation(
                        database_id=metric.database_id,
                        category="STORAGE",
                        severity="CRITICAL",
                        title="High Storage Utilization",
                        description=f"Storage utilization is {metric.storage_utilization:.1f}%, expand storage immediately",
                        impact_score=0.95,
                        implementation_effort="LOW",
                        sql_statements=["-- Storage will auto-expand, monitor costs"]
                    )
                )
            
            # Session analysis
            if metric.blocked_sessions > 0:
                recommendations.append(
                    PerformanceRecommendation(
                        database_id=metric.database_id,
                        category="CONCURRENCY",
                        severity="WARNING",
                        title="Blocked Sessions Detected",
                        description=f"{metric.blocked_sessions} blocked sessions found, investigate locking",
                        impact_score=0.7,
                        implementation_effort="HIGH",
                        sql_statements=[
                            "SELECT * FROM v$lock WHERE block > 0;",
                            "SELECT * FROM v$session WHERE blocking_session IS NOT NULL;"
                        ]
                    )
                )
            
            # Response time analysis
            if metric.average_response_time > self.thresholds['response_time_critical']:
                recommendations.append(
                    PerformanceRecommendation(
                        database_id=metric.database_id,
                        category="PERFORMANCE",
                        severity="WARNING",
                        title="High Response Time",
                        description=f"Average response time is {metric.average_response_time:.2f}s, optimize queries",
                        impact_score=0.6,
                        implementation_effort="HIGH",
                        sql_statements=[
                            "-- Review top SQL statements for optimization opportunities",
                            "-- Consider adding indexes for frequently accessed data"
                        ]
                    )
                )
        
        return recommendations

    async def generate_fleet_report(self, metrics: List[DatabaseMetrics], 
                                  recommendations: List[PerformanceRecommendation]) -> str:
        """Generate comprehensive fleet performance report"""
        report = f"""
# Autonomous Database Fleet Performance Report
Generated: {datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S UTC')}

## Fleet Summary
- Total Databases: {len(metrics)}
- Databases with Issues: {len([m for m in metrics if any(r.database_id == m.database_id for r in recommendations)])}
- Critical Recommendations: {len([r for r in recommendations if r.severity == 'CRITICAL'])}

## Database Performance Overview
"""
        
        for metric in metrics:
            db_recommendations = [r for r in recommendations if r.database_id == metric.database_id]
            critical_issues = len([r for r in db_recommendations if r.severity == 'CRITICAL'])
            
            report += f"""
### {metric.database_name}
- CPU Utilization: {metric.cpu_utilization:.1f}%
- Memory Utilization: {metric.memory_utilization:.1f}%
- Storage Utilization: {metric.storage_utilization:.1f}%
- Active Sessions: {metric.active_sessions}
- Blocked Sessions: {metric.blocked_sessions}
- Average Response Time: {metric.average_response_time:.2f}s
- Critical Issues: {critical_issues}
"""
        
        if recommendations:
            report += "\n## Recommendations\n"
            for rec in sorted(recommendations, key=lambda x: x.impact_score, reverse=True):
                report += f"""
### {rec.title} - {rec.severity}
- Database: {next(m.database_name for m in metrics if m.database_id == rec.database_id)}
- Category: {rec.category}
- Impact Score: {rec.impact_score:.1f}
- Implementation Effort: {rec.implementation_effort}
- Description: {rec.description}
"""
        
        return report

# Main execution function
async def main():
    """Main monitoring execution"""
    monitor = AutonomousDatabaseFleetMonitor()
    
    try:
        # Monitor fleet
        logger.info("Starting fleet monitoring...")
        metrics = await monitor.monitor_fleet()
        logger.info(f"Collected metrics from {len(metrics)} databases")
        
        # Analyze performance
        recommendations = await monitor.analyze_performance(metrics)
        logger.info(f"Generated {len(recommendations)} recommendations")
        
        # Generate report
        report = await monitor.generate_fleet_report(metrics, recommendations)
        
        # Save report
        with open(f"fleet_report_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}.md", 'w') as f:
            f.write(report)
        
        logger.info("Fleet monitoring completed successfully")
        
    except Exception as e:
        logger.error(f"Fleet monitoring failed: {str(e)}")
        raise

if __name__ == "__main__":
    asyncio.run(main())

Advanced Performance Optimization Techniques

Autonomous Database provides several advanced optimization features that can be leveraged programmatically. Automatic indexing continuously monitors query patterns and creates or drops indexes based on actual usage patterns. This feature eliminates the traditional DBA task of index management while ensuring optimal query performance.

SQL plan management automatically captures and evolves execution plans, preventing performance regressions when statistics change or new Oracle versions are deployed. The system maintains a repository of proven execution plans and automatically selects the best plan for each SQL statement.

Real-time SQL monitoring provides detailed execution statistics for long-running queries, enabling identification of performance bottlenecks during execution rather than after completion. This capability is essential for optimizing complex analytical workloads and batch processing operations.

Automated Scaling and Cost Optimization

Autonomous Database’s auto-scaling feature dynamically adjusts CPU resources based on workload demands, but understanding the patterns enables better cost optimization. Monitoring CPU utilization patterns over time reveals opportunities for right-sizing base allocations while maintaining auto-scaling for peak periods.

Scheduled scaling operations can be implemented to proactively adjust resources for known workload patterns, such as batch processing windows or business reporting cycles. This approach optimizes costs by scaling down during predictable low-usage periods.

Storage auto-expansion occurs automatically, but monitoring growth patterns enables better capacity planning and cost forecasting. Integration with OCI Cost Management APIs provides automated cost tracking and budget alerting capabilities.

Security and Compliance Automation

Database security automation encompasses multiple layers of protection. Automatic patching ensures systems remain current with security updates without manual intervention. Data encryption occurs automatically for data at rest and in transit, with key rotation handled transparently.

Audit logging automation captures all database activities and integrates with OCI Logging Analytics for security event correlation and threat detection. Automated compliance reporting generates audit trails required for regulatory compliance frameworks.

Access control automation integrates with OCI Identity and Access Management to ensure consistent security policies across the database fleet. Database user lifecycle management can be automated through integration with enterprise identity management systems.

This comprehensive approach to Autonomous Database management enables organizations to operate enterprise

Building Enterprise Event-Driven Architectures with OCI Functions and Streaming

Posted on June 12, 2025 by Osama Mustafa in Cloud, OCI

Oracle Cloud Infrastructure Functions provides a powerful serverless computing platform that integrates seamlessly with OCI’s event-driven services. This deep-dive explores advanced patterns for building resilient, scalable event-driven architectures using OCI Functions, Streaming, Events, and Notifications services with real-time data processing capabilities.

OCI Functions Architecture and Event Integration

OCI Functions operates on a containerized execution model where each function runs in isolated containers managed by the Fn Project runtime. The service automatically handles scaling, from zero to thousands of concurrent executions, based on incoming event volume.

The event integration architecture centers around multiple trigger mechanisms. HTTP triggers provide direct REST API endpoints for synchronous invocations. OCI Events service enables asynchronous function execution based on resource state changes across OCI services. Streaming triggers process high-volume data streams in real-time, while Object Storage triggers respond to bucket operations.

Unlike traditional serverless platforms, OCI Functions provides deep integration with Oracle’s enterprise services stack, including Autonomous Database, Analytics Cloud, and Integration Cloud. This native integration eliminates the need for complex authentication mechanisms and network configurations typically required in multi-cloud architectures.

Advanced Streaming Integration Patterns

OCI Streaming service provides Apache Kafka-compatible message streaming with enterprise-grade durability and performance. Functions can consume streaming data using multiple consumption patterns, each optimized for specific use cases.
Single-partition consumption works well for ordered processing requirements where message sequence matters. The function processes messages sequentially from a single partition, ensuring strict ordering but limiting throughput to single-function concurrency.
Multi-partition consumption enables parallel processing across multiple partitions, dramatically increasing throughput for scenarios where message ordering within the entire stream isn’t critical. Each partition can trigger separate function instances, enabling horizontal scaling based on partition count.
Batch processing consumption accumulates messages over configurable time windows or message count thresholds before triggering function execution. This pattern optimizes for cost efficiency and reduces per-invocation overhead for high-volume scenarios.

Production Implementation Example

Here’s a comprehensive implementation of a real-time fraud detection system using OCI Functions with streaming integration:

Infrastructure as Code Setup

# fn.yaml - Function Configuration
schema_version: 20180708
name: fraud-detection-app
version: 0.0.1
runtime: python
build_image: fnproject/python:3.9-dev
run_image: fnproject/python:3.9
entrypoint: /python/bin/fdk /function/func.py handler
memory: 512
timeout: 300
config:
  STREAM_OCID: ${STREAM_OCID}
  DB_CONNECTION_STRING: ${DB_CONNECTION_STRING}
  NOTIFICATION_TOPIC_OCID: ${NOTIFICATION_TOPIC_OCID}
  COMPARTMENT_OCID: ${COMPARTMENT_OCID}
triggers:
  - name: fraud-detection-trigger
    type: oracle-streaming
    source: ${STREAM_OCID}
    config:
      batchSize: 10
      parallelism: 5
      startingPosition: LATEST

Function Implementation with Advanced Error Handling

import io
import json
import logging
import oci
import cx_Oracle
from datetime import datetime, timedelta
from typing import Dict, List, Any
import asyncio
import aiohttp

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class FraudDetectionProcessor:
    def __init__(self):
        self.signer = oci.auth.signers.get_resource_principals_signer()
        self.streaming_client = oci.streaming.StreamClient({}, signer=self.signer)
        self.ons_client = oci.ons.NotificationDataPlaneClient({}, signer=self.signer)
        self.monitoring_client = oci.monitoring.MonitoringClient({}, signer=self.signer)
        
        # Database connection pool
        self.connection_pool = self._create_db_pool()
        
        # Fraud detection models
        self.velocity_threshold = 5  # transactions per minute
        self.amount_threshold = 1000.0
        self.geo_velocity_threshold = 100  # km/hour
        
    def _create_db_pool(self):
        """Create database connection pool for high concurrency"""
        try:
            pool = cx_Oracle.create_pool(
                user=os.environ['DB_USER'],
                password=os.environ['DB_PASSWORD'],
                dsn=os.environ['DB_CONNECTION_STRING'],
                min=2,
                max=10,
                increment=1,
                threaded=True
            )
            return pool
        except Exception as e:
            logger.error(f"Failed to create DB pool: {str(e)}")
            raise

    async def process_transaction_batch(self, transactions: List[Dict]) -> List[Dict]:
        """Process batch of transactions for fraud detection"""
        results = []
        
        # Process transactions concurrently
        tasks = [self._analyze_transaction(tx) for tx in transactions]
        analysis_results = await asyncio.gather(*tasks, return_exceptions=True)
        
        for i, result in enumerate(analysis_results):
            if isinstance(result, Exception):
                logger.error(f"Error processing transaction {transactions[i]['transaction_id']}: {str(result)}")
                results.append({
                    'transaction_id': transactions[i]['transaction_id'],
                    'status': 'error',
                    'error': str(result)
                })
            else:
                results.append(result)
        
        return results

    async def _analyze_transaction(self, transaction: Dict) -> Dict:
        """Analyze individual transaction for fraud indicators"""
        transaction_id = transaction['transaction_id']
        user_id = transaction['user_id']
        amount = float(transaction['amount'])
        location = transaction.get('location', {})
        timestamp = datetime.fromisoformat(transaction['timestamp'])
        
        fraud_score = 0
        fraud_indicators = []
        
        # Velocity analysis
        velocity_score = await self._check_velocity_fraud(user_id, timestamp)
        fraud_score += velocity_score
        if velocity_score > 0:
            fraud_indicators.append('high_velocity')
        
        # Amount analysis
        if amount > self.amount_threshold:
            amount_score = min((amount / self.amount_threshold) * 10, 50)
            fraud_score += amount_score
            fraud_indicators.append('high_amount')
        
        # Geographic analysis
        geo_score = await self._check_geographic_fraud(user_id, location, timestamp)
        fraud_score += geo_score
        if geo_score > 0:
            fraud_indicators.append('geographic_anomaly')
        
        # Pattern analysis
        pattern_score = await self._check_pattern_fraud(user_id, transaction)
        fraud_score += pattern_score
        if pattern_score > 0:
            fraud_indicators.append('suspicious_pattern')
        
        # Determine fraud status
        if fraud_score >= 70:
            status = 'blocked'
        elif fraud_score >= 40:
            status = 'review'
        else:
            status = 'approved'
        
        result = {
            'transaction_id': transaction_id,
            'user_id': user_id,
            'fraud_score': fraud_score,
            'fraud_indicators': fraud_indicators,
            'status': status,
            'processed_at': datetime.utcnow().isoformat()
        }
        
        # Store results and trigger alerts if needed
        await self._store_analysis_result(result)
        if status in ['blocked', 'review']:
            await self._trigger_fraud_alert(result, transaction)
        
        return result

    async def _check_velocity_fraud(self, user_id: str, timestamp: datetime) -> float:
        """Check transaction velocity for fraud indicators"""
        try:
            connection = self.connection_pool.acquire()
            cursor = connection.cursor()
            
            # Check transactions in last 5 minutes
            time_window = timestamp - timedelta(minutes=5)
            
            cursor.execute("""
                SELECT COUNT(*) 
                FROM transactions 
                WHERE user_id = :user_id 
                AND transaction_time > :time_window
                AND transaction_time <= :current_time
            """, {
                'user_id': user_id,
                'time_window': time_window,
                'current_time': timestamp
            })
            
            count = cursor.fetchone()[0]
            cursor.close()
            self.connection_pool.release(connection)
            
            if count >= self.velocity_threshold:
                return min(count * 5, 30)  # Cap at 30 points
            return 0
            
        except Exception as e:
            logger.error(f"Velocity check error for user {user_id}: {str(e)}")
            return 0

    async def _check_geographic_fraud(self, user_id: str, location: Dict, timestamp: datetime) -> float:
        """Check for impossible geographic velocity"""
        if not location or 'latitude' not in location:
            return 0
            
        try:
            connection = self.connection_pool.acquire()
            cursor = connection.cursor()
            
            # Get last transaction location within 2 hours
            time_window = timestamp - timedelta(hours=2)
            
            cursor.execute("""
                SELECT latitude, longitude, transaction_time
                FROM transactions 
                WHERE user_id = :user_id 
                AND transaction_time > :time_window
                AND transaction_time < :current_time
                AND latitude IS NOT NULL
                ORDER BY transaction_time DESC
                FETCH FIRST 1 ROW ONLY
            """, {
                'user_id': user_id,
                'time_window': time_window,
                'current_time': timestamp
            })
            
            result = cursor.fetchone()
            cursor.close()
            self.connection_pool.release(connection)
            
            if result:
                last_lat, last_lon, last_time = result
                distance = self._calculate_distance(
                    last_lat, last_lon,
                    location['latitude'], location['longitude']
                )
                
                time_diff = (timestamp - last_time).total_seconds() / 3600  # hours
                if time_diff > 0:
                    velocity = distance / time_diff  # km/hour
                    if velocity > self.geo_velocity_threshold:
                        return min(velocity / 10, 40)  # Cap at 40 points
            
            return 0
            
        except Exception as e:
            logger.error(f"Geographic check error for user {user_id}: {str(e)}")
            return 0

    def _calculate_distance(self, lat1: float, lon1: float, lat2: float, lon2: float) -> float:
        """Calculate distance between two points using Haversine formula"""
        from math import radians, cos, sin, asin, sqrt
        
        lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        return 2 * asin(sqrt(a)) * 6371  # Earth radius in km

    async def _check_pattern_fraud(self, user_id: str, transaction: Dict) -> float:
        """Check for suspicious transaction patterns"""
        try:
            connection = self.connection_pool.acquire()
            cursor = connection.cursor()
            
            # Check for round-number bias (common fraud indicator)
            amount = float(transaction['amount'])
            if amount % 100 == 0 and amount >= 500:
                return 15
            
            # Check for repeated exact amounts
            cursor.execute("""
                SELECT COUNT(*) 
                FROM transactions 
                WHERE user_id = :user_id 
                AND amount = :amount
                AND transaction_time > SYSDATE - 7
            """, {
                'user_id': user_id,
                'amount': amount
            })
            
            repeat_count = cursor.fetchone()[0]
            cursor.close()
            self.connection_pool.release(connection)
            
            if repeat_count >= 3:
                return min(repeat_count * 5, 25)
            
            return 0
            
        except Exception as e:
            logger.error(f"Pattern check error for user {user_id}: {str(e)}")
            return 0

    async def _store_analysis_result(self, result: Dict):
        """Store fraud analysis result in database"""
        try:
            connection = self.connection_pool.acquire()
            cursor = connection.cursor()
            
            cursor.execute("""
                INSERT INTO fraud_analysis 
                (transaction_id, user_id, fraud_score, fraud_indicators, 
                 status, processed_at, created_at)
                VALUES (:transaction_id, :user_id, :fraud_score, 
                        :fraud_indicators, :status, :processed_at, SYSDATE)
            """, {
                'transaction_id': result['transaction_id'],
                'user_id': result['user_id'],
                'fraud_score': result['fraud_score'],
                'fraud_indicators': ','.join(result['fraud_indicators']),
                'status': result['status'],
                'processed_at': result['processed_at']
            })
            
            connection.commit()
            cursor.close()
            self.connection_pool.release(connection)
            
        except Exception as e:
            logger.error(f"Failed to store analysis result: {str(e)}")

    async def _trigger_fraud_alert(self, result: Dict, transaction: Dict):
        """Trigger fraud alert through OCI Notifications"""
        try:
            message = {
                'alert_type': 'fraud_detection',
                'transaction_id': result['transaction_id'],
                'user_id': result['user_id'],
                'fraud_score': result['fraud_score'],
                'status': result['status'],
                'amount': transaction['amount'],
                'indicators': result['fraud_indicators'],
                'timestamp': result['processed_at']
            }
            
            # Publish to ONS topic
            publish_result = self.ons_client.publish_message(
                topic_id=os.environ['NOTIFICATION_TOPIC_OCID'],
                message_details=oci.ons.models.MessageDetails(
                    body=json.dumps(message),
                    title=f"Fraud Alert - {result['status'].upper()}"
                )
            )
            
            logger.info(f"Fraud alert sent for transaction {result['transaction_id']}")
            
            # Send custom metrics
            await self._send_metrics(result)
            
        except Exception as e:
            logger.error(f"Failed to send fraud alert: {str(e)}")

    async def _send_metrics(self, result: Dict):
        """Send custom metrics to OCI Monitoring"""
        try:
            metric_data = [
                oci.monitoring.models.MetricDataDetails(
                    namespace="fraud_detection",
                    compartment_id=os.environ['COMPARTMENT_OCID'],
                    name="fraud_score",
                    dimensions={'status': result['status']},
                    datapoints=[
                        oci.monitoring.models.Datapoint(
                            timestamp=datetime.utcnow(),
                            value=result['fraud_score']
                        )
                    ]
                )
            ]
            
            self.monitoring_client.post_metric_data(
                post_metric_data_details=oci.monitoring.models.PostMetricDataDetails(
                    metric_data=metric_data
                )
            )
            
        except Exception as e:
            logger.error(f"Failed to send metrics: {str(e)}")

# Function handler
def handler(ctx, data: io.BytesIO = None):
    """Main function handler for OCI Functions"""
    processor = FraudDetectionProcessor()
    
    try:
        # Parse streaming data
        streaming_data = json.loads(data.getvalue())
        transactions = []
        
        # Extract transactions from stream messages
        for message in streaming_data.get('messages', []):
            transaction_data = json.loads(message['value'])
            transactions.append(transaction_data)
        
        # Process transactions
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        results = loop.run_until_complete(
            processor.process_transaction_batch(transactions)
        )
        loop.close()
        
        # Prepare response
        response = {
            'processed_count': len(results),
            'results': results,
            'processing_time': datetime.utcnow().isoformat()
        }
        
        logger.info(f"Processed {len(results)} transactions")
        return response
        
    except Exception as e:
        logger.error(f"Function execution error: {str(e)}")
        return {
            'error': str(e),
            'timestamp': datetime.utcnow().isoformat()
        }

Deployment and Configuration Scripts

#!/bin/bash
# deploy.sh - Automated deployment script

set -e

# Configuration
APP_NAME="fraud-detection-app"
FUNCTION_NAME="fraud-processor"
COMPARTMENT_OCID="your-compartment-ocid"
STREAM_OCID="your-stream-ocid"
NOTIFICATION_TOPIC_OCID="your-notification-topic-ocid"

echo "Deploying fraud detection function..."

# Create application if it doesn't exist
fn create app $APP_NAME --annotation oracle.com/oci/subnetIds='["subnet-ocid"]' || true

# Configure environment variables
fn config app $APP_NAME STREAM_OCID $STREAM_OCID
fn config app $APP_NAME NOTIFICATION_TOPIC_OCID $NOTIFICATION_TOPIC_OCID
fn config app $APP_NAME COMPARTMENT_OCID $COMPARTMENT_OCID
fn config app $APP_NAME DB_CONNECTION_STRING $DB_CONNECTION_STRING

# Deploy function
fn deploy --app $APP_NAME --no-bump

# Create trigger for streaming
echo "Creating streaming trigger..."
oci fn trigger create \
    --display-name "fraud-detection-trigger" \
    --function-id $(fn inspect fn $APP_NAME $FUNCTION_NAME | jq -r '.id') \
    --type streaming \
    --source-details '{"streamId":"'$STREAM_OCID'","batchSizeInKbs":64,"batchTimeInSeconds":5}'

echo "Deployment completed successfully!"

Monitoring and Observability

Production serverless architectures require comprehensive monitoring and observability. OCI Functions integrates with multiple observability services to provide complete visibility into function performance and business metrics.

Function-level metrics automatically track invocation count, duration, errors, and memory utilization. These metrics feed into custom dashboards for operational monitoring and capacity planning.

Distributed tracing capabilities track request flows across multiple functions and services, essential for debugging complex event-driven workflows. Integration with OCI Application Performance Monitoring provides detailed transaction traces with performance bottleneck identification.

Custom business metrics can be published using the OCI Monitoring service, enabling tracking of domain-specific KPIs like fraud detection rates, processing latency, and accuracy metrics.

Error Handling and Resilience Patterns

Enterprise-grade serverless applications require robust error handling and resilience patterns. Dead letter queues capture failed events for later processing or manual investigation. Circuit breaker patterns prevent cascade failures by temporarily disabling failing downstream services.

Exponential backoff with jitter implements reliable retry mechanisms for transient failures. The implementation includes configurable retry limits and backoff multipliers to balance between quick recovery and system stability.

Bulkhead isolation separates different workload types using separate function applications and resource pools, preventing resource contention between critical and non-critical workloads.

This comprehensive approach to OCI Functions enables building production-grade, event-driven architectures that can handle enterprise-scale workloads with high reliability and performance requirements.

AWS Systems Manager Parameter Store: Secure Configuration Management and Automation

Posted on June 10, 2025 by Osama Mustafa in AWS, Cloud

Configuration management is a critical aspect of modern cloud infrastructure, and AWS Systems Manager Parameter Store provides an elegant solution for storing, retrieving, and managing configuration data securely. This centralized service eliminates the need to hardcode sensitive information in your applications while enabling dynamic configuration management across your AWS environment.

Understanding AWS Systems Manager Parameter Store

AWS Systems Manager Parameter Store is a secure, hierarchical storage service for configuration data and secrets management. It integrates seamlessly with other AWS services and provides fine-grained access control through IAM policies. The service supports both standard and advanced parameters, with advanced parameters offering enhanced capabilities like larger storage size, parameter policies, and intelligent tiering.

The service organizes parameters in a hierarchical structure using forward slashes, similar to a file system. This organization allows for logical grouping of related parameters and enables bulk operations on parameter trees. For example, you might organize database connection strings under /myapp/database/ and API keys under /myapp/api/.

Key Features and Capabilities

Parameter Store offers several parameter types to meet different use cases. String parameters store plain text values, while StringList parameters contain comma-separated values. SecureString parameters encrypt sensitive data using AWS Key Management Service (KMS), ensuring that secrets remain protected both at rest and in transit.

The service provides version control for parameters, maintaining a history of changes and allowing rollback to previous versions when needed. This versioning capability is particularly valuable in production environments where configuration changes need to be tracked and potentially reversed.

Parameter policies add another layer of sophistication, enabling automatic parameter expiration, notification policies, and lifecycle management. These policies help enforce security best practices and reduce operational overhead.

Practical Implementation: Multi-Environment Application Configuration

Let’s explore a comprehensive example that demonstrates Parameter Store’s capabilities in a real-world scenario. We’ll build a microservices application that uses Parameter Store for configuration management across development, staging, and production environments.

Setting Up the Parameter Hierarchy

First, we’ll establish a logical parameter hierarchy for our application:

# Database configuration parameters
aws ssm put-parameter \
    --name "/myapp/dev/database/host" \
    --value "dev-db.internal.company.com" \
    --type "String" \
    --description "Development database host"

aws ssm put-parameter \
    --name "/myapp/dev/database/port" \
    --value "5432" \
    --type "String" \
    --description "Development database port"

aws ssm put-parameter \
    --name "/myapp/dev/database/username" \
    --value "dev_user" \
    --type "String" \
    --description "Development database username"

aws ssm put-parameter \
    --name "/myapp/dev/database/password" \
    --value "dev_secure_password_123" \
    --type "SecureString" \
    --key-id "alias/parameter-store-key" \
    --description "Development database password"

# API configuration parameters
aws ssm put-parameter \
    --name "/myapp/dev/api/rate_limit" \
    --value "1000" \
    --type "String" \
    --description "API rate limit for development"

aws ssm put-parameter \
    --name "/myapp/dev/api/timeout" \
    --value "30" \
    --type "String" \
    --description "API timeout in seconds"

aws ssm put-parameter \
    --name "/myapp/dev/external/payment_api_key" \
    --value "sk_test_123456789" \
    --type "SecureString" \
    --key-id "alias/parameter-store-key" \
    --description "Payment gateway API key"

Python Application Integration

Here’s a Python application that demonstrates how to retrieve and use these parameters:

import boto3
import json
from botocore.exceptions import ClientError
from typing import Dict, Any, Optional

class ConfigurationManager:
    def __init__(self, environment: str = "dev", region: str = "us-east-1"):
        self.ssm_client = boto3.client('ssm', region_name=region)
        self.environment = environment
        self.parameter_cache = {}
        
    def get_parameter(self, parameter_name: str, decrypt: bool = True) -> Optional[str]:
        """
        Retrieve a single parameter from Parameter Store
        """
        try:
            response = self.ssm_client.get_parameter(
                Name=parameter_name,
                WithDecryption=decrypt
            )
            return response['Parameter']['Value']
        except ClientError as e:
            print(f"Error retrieving parameter {parameter_name}: {e}")
            return None
    
    def get_parameters_by_path(self, path: str, decrypt: bool = True) -> Dict[str, Any]:
        """
        Retrieve all parameters under a specific path
        """
        try:
            paginator = self.ssm_client.get_paginator('get_parameters_by_path')
            parameters = {}
            
            for page in paginator.paginate(
                Path=path,
                Recursive=True,
                WithDecryption=decrypt
            ):
                for param in page['Parameters']:
                    # Remove the path prefix and convert to nested dict
                    key = param['Name'].replace(path, '').lstrip('/')
                    parameters[key] = param['Value']
            
            return parameters
        except ClientError as e:
            print(f"Error retrieving parameters by path {path}: {e}")
            return {}
    
    def get_application_config(self) -> Dict[str, Any]:
        """
        Load complete application configuration
        """
        base_path = f"/myapp/{self.environment}"
        
        # Get all parameters for the environment
        all_params = self.get_parameters_by_path(base_path)
        
        # Organize into logical groups
        config = {
            'database': {
                'host': all_params.get('database/host'),
                'port': int(all_params.get('database/port', 5432)),
                'username': all_params.get('database/username'),
                'password': all_params.get('database/password')
            },
            'api': {
                'rate_limit': int(all_params.get('api/rate_limit', 100)),
                'timeout': int(all_params.get('api/timeout', 30))
            },
            'external': {
                'payment_api_key': all_params.get('external/payment_api_key')
            }
        }
        
        return config
    
    def update_parameter(self, parameter_name: str, value: str, 
                        parameter_type: str = "String", overwrite: bool = True):
        """
        Update or create a parameter
        """
        try:
            self.ssm_client.put_parameter(
                Name=parameter_name,
                Value=value,
                Type=parameter_type,
                Overwrite=overwrite
            )
            print(f"Successfully updated parameter: {parameter_name}")
        except ClientError as e:
            print(f"Error updating parameter {parameter_name}: {e}")

# Example usage in a Flask application
from flask import Flask, jsonify
import os

app = Flask(__name__)

# Initialize configuration manager
config_manager = ConfigurationManager(
    environment=os.getenv('ENVIRONMENT', 'dev')
)

# Load configuration at startup
app_config = config_manager.get_application_config()

@app.route('/health')
def health_check():
    return jsonify({
        'status': 'healthy',
        'environment': config_manager.environment,
        'database_host': app_config['database']['host']
    })

@app.route('/config')
def get_config():
    # Return non-sensitive configuration
    safe_config = {
        'database': {
            'host': app_config['database']['host'],
            'port': app_config['database']['port']
        },
        'api': app_config['api']
    }
    return jsonify(safe_config)

if __name__ == '__main__':
    app.run(debug=True)

Infrastructure as Code with CloudFormation

Here’s a CloudFormation template that creates the parameter hierarchy and associated IAM roles:

AWSTemplateFormatVersion: '2010-09-09'
Description: 'Parameter Store configuration for multi-environment application'

Parameters:
  Environment:
    Type: String
    Default: dev
    AllowedValues: [dev, staging, prod]
    Description: Environment name
  
  ApplicationName:
    Type: String
    Default: myapp
    Description: Application name

Resources:
  # KMS Key for SecureString parameters
  ParameterStoreKMSKey:
    Type: AWS::KMS::Key
    Properties:
      Description: KMS Key for Parameter Store SecureString parameters
      KeyPolicy:
        Statement:
          - Sid: Enable IAM User Permissions
            Effect: Allow
            Principal:
              AWS: !Sub 'arn:aws:iam::${AWS::AccountId}:root'
            Action: 'kms:*'
            Resource: '*'
          - Sid: Allow Parameter Store
            Effect: Allow
            Principal:
              Service: ssm.amazonaws.com
            Action:
              - kms:Decrypt
              - kms:DescribeKey
            Resource: '*'

  ParameterStoreKMSKeyAlias:
    Type: AWS::KMS::Alias
    Properties:
      AliasName: !Sub 'alias/${ApplicationName}-parameter-store-key'
      TargetKeyId: !Ref ParameterStoreKMSKey

  # Database configuration parameters
  DatabaseHostParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: !Sub '/${ApplicationName}/${Environment}/database/host'
      Type: String
      Value: !Sub '${Environment}-db.internal.company.com'
      Description: !Sub 'Database host for ${Environment} environment'

  DatabasePortParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: !Sub '/${ApplicationName}/${Environment}/database/port'
      Type: String
      Value: '5432'
      Description: Database port

  DatabaseUsernameParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: !Sub '/${ApplicationName}/${Environment}/database/username'
      Type: String
      Value: !Sub '${Environment}_user'
      Description: Database username

  DatabasePasswordParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: !Sub '/${ApplicationName}/${Environment}/database/password'
      Type: SecureString
      Value: !Sub '${Environment}_secure_password_123'
      KeyId: !Ref ParameterStoreKMSKey
      Description: Database password

  # API configuration parameters
  APIRateLimitParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: !Sub '/${ApplicationName}/${Environment}/api/rate_limit'
      Type: String
      Value: '1000'
      Description: API rate limit

  APITimeoutParameter:
    Type: AWS::SSM::Parameter
    Properties:
      Name: !Sub '/${ApplicationName}/${Environment}/api/timeout'
      Type: String
      Value: '30'
      Description: API timeout in seconds

  # IAM Role for application to access parameters
  ApplicationRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub '${ApplicationName}-${Environment}-parameter-access-role'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: 
                - ec2.amazonaws.com
                - ecs-tasks.amazonaws.com
                - lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: ParameterStoreAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ssm:GetParameter
                  - ssm:GetParameters
                  - ssm:GetParametersByPath
                Resource: 
                  - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/${ApplicationName}/${Environment}/*'
              - Effect: Allow
                Action:
                  - kms:Decrypt
                Resource: !GetAtt ParameterStoreKMSKey.Arn

  # Instance Profile for EC2 instances
  ApplicationInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - !Ref ApplicationRole

Outputs:
  ApplicationRoleArn:
    Description: ARN of the application role
    Value: !GetAtt ApplicationRole.Arn
    Export:
      Name: !Sub '${ApplicationName}-${Environment}-role-arn'
  
  KMSKeyId:
    Description: KMS Key ID for SecureString parameters
    Value: !Ref ParameterStoreKMSKey
    Export:
      Name: !Sub '${ApplicationName}-${Environment}-kms-key'

Advanced Automation with Parameter Policies

Parameter Store also supports parameter policies for advanced lifecycle management:

# Create a parameter with expiration policy
aws ssm put-parameter \
    --name "/myapp/dev/temp/session_token" \
    --value "temp_token_12345" \
    --type "SecureString" \
    --policies '[
        {
            "Type": "Expiration",
            "Version": "1.0",
            "Attributes": {
                "Timestamp": "2024-12-31T23:59:59.000Z"
            }
        }
    ]'

# Create a parameter with notification policy
aws ssm put-parameter \
    --name "/myapp/prod/database/password" \
    --value "prod_password_456" \
    --type "SecureString" \
    --policies '[
        {
            "Type": "ExpirationNotification",
            "Version": "1.0",
            "Attributes": {
                "Before": "30",
                "Unit": "Days"
            }
        }
    ]'

Security Best Practices and Considerations

When implementing Parameter Store in production environments, several security considerations are crucial. Always use SecureString parameters for sensitive data like passwords, API keys, and tokens. Implement least-privilege IAM policies that grant access only to the specific parameters and paths required by each service or role.

Use separate KMS keys for different environments and applications to maintain proper isolation. Regularly rotate sensitive parameters and implement parameter policies to enforce expiration dates. Monitor parameter access through CloudTrail to track who accessed which parameters and when.

Consider implementing parameter validation in your applications to ensure that retrieved values meet expected formats and constraints. This validation helps prevent configuration errors that could lead to service disruptions.

Cost Optimization and Performance

Parameter Store offers both standard and advanced parameters, with different pricing models. Standard parameters are free up to 10,000 parameters, while advanced parameters provide additional features at a cost. Choose the appropriate tier based on your requirements.

Implement intelligent caching in your applications to reduce API calls and improve performance. Cache parameters with reasonable TTL values, and implement cache invalidation strategies for critical configuration changes.

Use batch operations like get_parameters_by_path to retrieve multiple related parameters in a single API call, reducing latency and improving efficiency.

Conclusion

AWS Systems Manager Parameter Store provides a robust foundation for configuration management and secrets handling in cloud-native applications. Its integration with other AWS services, fine-grained access control, and advanced features like parameter policies make it an excellent choice for managing application configuration at scale.

By implementing the patterns and practices demonstrated in this guide, you can build more secure, maintainable, and scalable applications that properly separate configuration from code. The hierarchical organization, version control, and encryption capabilities ensure that your configuration management strategy can grow and evolve with your application needs.

Whether you’re building a simple web application or a complex microservices architecture, Parameter Store provides the tools and flexibility needed to manage configuration data securely and efficiently across multiple environments and use cases.

Advanced OCI Container Engine (OKE) with Network Security and Observability

Posted on June 8, 2025 by Osama Mustafa in Cloud, OCI

Oracle Cloud Infrastructure Container Engine for Kubernetes (OKE) provides enterprise-grade Kubernetes clusters with deep integration into OCI’s native services. This comprehensive guide explores advanced OKE configurations, focusing on network security policies, observability integration, and automated deployment strategies that enterprise teams need for production workloads.

OKE Architecture Deep Dive

OKE operates on a managed control plane architecture where Oracle handles the Kubernetes master nodes, etcd, and API server components. This design eliminates operational overhead while providing high availability across multiple availability domains.

The service integrates seamlessly with OCI’s networking fabric, allowing granular control over pod-to-pod communication, ingress traffic management, and service mesh implementations. Unlike managed Kubernetes services from other providers, OKE provides native integration with Oracle’s enterprise security stack, including Identity and Access Management (IAM), Key Management Service (KMS), and Web Application Firewall (WAF).

Worker nodes run on OCI Compute instances, providing flexibility in choosing instance shapes, including bare metal, GPU-enabled, and ARM-based Ampere processors. The networking layer supports both flannel and OCI VCN-native pod networking, enabling direct integration with existing network security policies.

Advanced Networking Configuration

OKE’s network architecture supports multiple pod networking modes. The VCN-native pod networking mode assigns each pod an IP address from your VCN’s CIDR range, enabling direct application of network security lists and route tables to pod traffic.

This approach provides several advantages over traditional overlay networking. Security policies become more granular since you can apply network security lists directly to pod traffic. Network troubleshooting becomes simpler as pod traffic flows through standard OCI networking constructs. Integration with existing network monitoring tools works seamlessly since pod traffic appears as regular VCN traffic.

Load balancing integrates deeply with OCI’s Load Balancing service, supporting both Layer 4 and Layer 7 load balancing with SSL termination, session persistence, and health checking capabilities.

Production-Ready Implementation Example

Here’s a comprehensive example that demonstrates deploying a highly available OKE cluster with advanced security and monitoring configurations:

Terraform Configuration for OKE Cluster

# OKE Cluster with Enhanced Security
resource "oci_containerengine_cluster" "production_cluster" {
  compartment_id     = var.compartment_id
  kubernetes_version = var.kubernetes_version
  name              = "production-oke-cluster"
  vcn_id            = oci_core_vcn.oke_vcn.id

  endpoint_config {
    is_public_ip_enabled = false
    subnet_id           = oci_core_subnet.oke_api_subnet.id
    nsg_ids             = [oci_core_network_security_group.oke_api_nsg.id]
  }

  cluster_pod_network_options {
    cni_type = "OCI_VCN_IP_NATIVE"
  }

  options {
    service_lb_subnet_ids = [oci_core_subnet.oke_lb_subnet.id]
    
    kubernetes_network_config {
      pods_cidr     = "10.244.0.0/16"
      services_cidr = "10.96.0.0/16"
    }

    add_ons {
      is_kubernetes_dashboard_enabled = false
      is_tiller_enabled              = false
    }

    admission_controller_options {
      is_pod_security_policy_enabled = true
    }
  }

  kms_key_id = oci_kms_key.oke_encryption_key.id
}

# Node Pool with Mixed Instance Types
resource "oci_containerengine_node_pool" "production_node_pool" {
  cluster_id         = oci_containerengine_cluster.production_cluster.id
  compartment_id     = var.compartment_id
  kubernetes_version = var.kubernetes_version
  name              = "production-workers"

  node_config_details {
    placement_configs {
      availability_domain = data.oci_identity_availability_domains.ads.availability_domains[0].name
      subnet_id          = oci_core_subnet.oke_worker_subnet.id
    }
    placement_configs {
      availability_domain = data.oci_identity_availability_domains.ads.availability_domains[1].name
      subnet_id          = oci_core_subnet.oke_worker_subnet.id
    }
    
    size                    = 3
    nsg_ids                = [oci_core_network_security_group.oke_worker_nsg.id]
    is_pv_encryption_in_transit_enabled = true
  }

  node_shape = "VM.Standard.E4.Flex"
  
  node_shape_config {
    ocpus         = 2
    memory_in_gbs = 16
  }

  node_source_details {
    image_id                = data.oci_containerengine_node_pool_option.oke_node_pool_option.sources[0].image_id
    source_type            = "IMAGE"
    boot_volume_size_in_gbs = 100
  }

  initial_node_labels {
    key   = "environment"
    value = "production"
  }

  ssh_public_key = var.ssh_public_key
}

# Network Security Group for API Server
resource "oci_core_network_security_group" "oke_api_nsg" {
  compartment_id = var.compartment_id
  vcn_id        = oci_core_vcn.oke_vcn.id
  display_name  = "oke-api-nsg"
}

resource "oci_core_network_security_group_security_rule" "oke_api_ingress" {
  network_security_group_id = oci_core_network_security_group.oke_api_nsg.id
  direction                 = "INGRESS"
  protocol                  = "6"
  source                   = "10.0.0.0/16"
  source_type              = "CIDR_BLOCK"
  
  tcp_options {
    destination_port_range {
      max = 6443
      min = 6443
    }
  }
}

# Network Security Group for Worker Nodes
resource "oci_core_network_security_group" "oke_worker_nsg" {
  compartment_id = var.compartment_id
  vcn_id        = oci_core_vcn.oke_vcn.id
  display_name  = "oke-worker-nsg"
}

# Allow pod-to-pod communication
resource "oci_core_network_security_group_security_rule" "worker_pod_communication" {
  network_security_group_id = oci_core_network_security_group.oke_worker_nsg.id
  direction                 = "INGRESS"
  protocol                  = "all"
  source                   = oci_core_network_security_group.oke_worker_nsg.id
  source_type              = "NETWORK_SECURITY_GROUP"
}

# KMS Key for Cluster Encryption
resource "oci_kms_key" "oke_encryption_key" {
  compartment_id = var.compartment_id
  display_name   = "oke-cluster-encryption-key"
  
  key_shape {
    algorithm = "AES"
    length    = 256
  }
  
  management_endpoint = oci_kms_vault.oke_vault.management_endpoint
}

resource "oci_kms_vault" "oke_vault" {
  compartment_id = var.compartment_id
  display_name   = "oke-vault"
  vault_type     = "DEFAULT"
}

Kubernetes Manifests with Network Policies



# Network Policy for Application Isolation
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: webapp-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: webapp
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    - podSelector:
        matchLabels:
          app: webapp-frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
  - to: []
    ports:
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 53
    - protocol: UDP
      port: 53

---
# Pod Security Policy
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-psp
spec:
  privileged: false
  allowPrivilegeEscalation: false
  requiredDropCapabilities:
    - ALL
  volumes:
    - 'configMap'
    - 'emptyDir'
    - 'projected'
    - 'secret'
    - 'downwardAPI'
    - 'persistentVolumeClaim'
  runAsUser:
    rule: 'MustRunAsNonRoot'
  seLinux:
    rule: 'RunAsAny'
  fsGroup:
    rule: 'RunAsAny'

---
# Deployment with Security Context
apiVersion: apps/v1
kind: Deployment
metadata:
  name: secure-webapp
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
      containers:
      - name: webapp
        image: nginx:1.21-alpine
        ports:
        - containerPort: 8080
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 250m
            memory: 256Mi
        volumeMounts:
        - name: tmp-volume
          mountPath: /tmp
        - name: cache-volume
          mountPath: /var/cache/nginx
      volumes:
      - name: tmp-volume
        emptyDir: {}
      - name: cache-volume
        emptyDir: {}

Monitoring and Observability Integration

OKE integrates natively with OCI Monitoring, Logging, and Logging Analytics services. This integration provides comprehensive observability without requiring additional third-party tools or complex configurations.

The monitoring integration automatically collects cluster-level metrics including CPU utilization, memory consumption, network throughput, and storage IOPS across all worker nodes. Custom metrics can be published using the OCI Monitoring SDK, enabling application-specific dashboards and alerting rules.

Logging integration captures both system logs from Kubernetes components and application logs from pods. The unified logging agent automatically forwards logs to OCI Logging service, where they can be searched, filtered, and analyzed using structured queries.

Security Best Practices Implementation

Enterprise OKE deployments require multiple layers of security controls. Network-level security starts with proper subnet segmentation, placing API servers in private subnets accessible only through bastion hosts or VPN connections.

Pod Security Policies enforce runtime security constraints, preventing privileged containers and restricting volume types. Network policies provide microsegmentation within the cluster, controlling pod-to-pod communication based on labels and namespaces.

Image security scanning integrates with OCI Container Registry’s vulnerability scanning capabilities, automatically checking container images for known vulnerabilities before deployment.

Automated CI/CD Integration

OKE clusters integrate seamlessly with OCI DevOps service for automated application deployment pipelines. The integration supports GitOps workflows, blue-green deployments, and automated rollback mechanisms.

Pipeline configurations can reference OCI Vault secrets for secure credential management, ensuring sensitive information never appears in deployment manifests or pipeline configurations.

Performance Optimization Strategies

Production OKE deployments benefit from several performance optimization techniques. Node pool configurations should match application requirements, using compute-optimized instances for CPU-intensive workloads and memory-optimized instances for data processing applications.

Pod disruption budgets ensure application availability during cluster maintenance operations. Horizontal Pod Autoscaling automatically adjusts replica counts based on CPU or memory utilization, while Cluster Autoscaling adds or removes worker nodes based on resource demands.

This comprehensive approach to OKE deployment provides enterprise-grade container orchestration with robust security, monitoring, and automation capabilities, enabling organizations to run production workloads confidently in Oracle Cloud Infrastructure.

DELETE All the VCNs in THE OCI Using BASH SCRIPT

Posted on May 27, 2025May 27, 2025 by Osama Mustafa in Cloud, OCI

The script below will allow you to list all VCNs in OCI and delete all attached resources to the COMPARTMENT_OCID.

Note: I wrote the scripts to perform the tasks mentioned below, which can be updated and expanded based on the needs. Feel free to do that and say the source

Complete Resource Deletion Chain: The script now handles the proper order of deletion:

Compute instances first
Clean route tables and security lists
Load balancers
Gateways (NAT, Internet, Service, DRG attachments)
Subnets
Custom security lists, route tables, and DHCP options
Finally, the VCN itself

#!/bin/bash

# ✅ Set this to the target compartment OCID
COMPARTMENT_OCID="Set Your OCID Here"

# (Optional) Force region
export OCI_CLI_REGION=me-jeddah-1

echo "📍 Region: $OCI_CLI_REGION"
echo "📦 Compartment: $COMPARTMENT_OCID"
echo "⚠️  WARNING: This will delete ALL VCNs and related resources in the compartment!"
echo "Press Ctrl+C within 10 seconds to cancel..."
sleep 10

# Function to wait for resource deletion
wait_for_deletion() {
    local resource_id=$1
    local resource_type=$2
    local max_attempts=30
    local attempt=1
    
    echo "    ⏳ Waiting for $resource_type deletion..."
    while [ $attempt -le $max_attempts ]; do
        if ! oci network $resource_type get --${resource_type//-/}-id "$resource_id" &>/dev/null; then
            echo "    ✅ $resource_type deleted successfully"
            return 0
        fi
        sleep 10
        ((attempt++))
    done
    echo "    ⚠️  Timeout waiting for $resource_type deletion"
    return 1
}

# Function to check if resource is default
is_default_resource() {
    local resource_id=$1
    local resource_type=$2
    
    case $resource_type in
        "security-list")
            result=$(oci network security-list get --security-list-id "$resource_id" --query "data.\"display-name\"" --raw-output 2>/dev/null)
            [[ "$result" == "Default Security List"* ]]
            ;;
        "route-table")
            result=$(oci network route-table get --rt-id "$resource_id" --query "data.\"display-name\"" --raw-output 2>/dev/null)
            [[ "$result" == "Default Route Table"* ]]
            ;;
        "dhcp-options")
            result=$(oci network dhcp-options get --dhcp-id "$resource_id" --query "data.\"display-name\"" --raw-output 2>/dev/null)
            [[ "$result" == "Default DHCP Options"* ]]
            ;;
        *)
            false
            ;;
    esac
}

# Function to clean all route tables in a VCN
clean_all_route_tables() {
    local VCN_ID=$1
    echo "  🧹 Cleaning all route tables..."
    
    local RT_IDS=$(oci network route-table list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for RT_ID in $RT_IDS; do
        if [ -n "$RT_ID" ]; then
            echo "    🔧 Clearing routes in route table: $RT_ID"
            oci network route-table update --rt-id "$RT_ID" --route-rules '[]' --force &>/dev/null || true
        fi
    done
    
    # Wait a bit for route updates to propagate
    sleep 5
}

# Function to clean all security lists in a VCN
clean_all_security_lists() {
    local VCN_ID=$1
    echo "  🧹 Cleaning all security lists..."
    
    local SL_IDS=$(oci network security-list list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for SL_ID in $SL_IDS; do
        if [ -n "$SL_ID" ]; then
            echo "    🔧 Clearing rules in security list: $SL_ID"
            oci network security-list update \
                --security-list-id "$SL_ID" \
                --egress-security-rules '[]' \
                --ingress-security-rules '[]' \
                --force &>/dev/null || true
        fi
    done
    
    # Wait a bit for security list updates to propagate
    sleep 5
}

# Function to delete compute instances in subnets
delete_compute_instances() {
    local VCN_ID=$1
    echo "  🖥️  Checking for compute instances..."
    
    local INSTANCES=$(oci compute instance list \
        --compartment-id "$COMPARTMENT_OCID" \
        --query "data[?\"lifecycle-state\" != 'TERMINATED'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for INSTANCE_ID in $INSTANCES; do
        if [ -n "$INSTANCE_ID" ]; then
            # Check if instance is in this VCN
            local INSTANCE_VCN=$(oci compute instance list-vnics \
                --instance-id "$INSTANCE_ID" \
                --query "data[0].\"vcn-id\"" \
                --raw-output 2>/dev/null)
            
            if [[ "$INSTANCE_VCN" == "$VCN_ID" ]]; then
                echo "    🔻 Terminating compute instance: $INSTANCE_ID"
                oci compute instance terminate --instance-id "$INSTANCE_ID" --force &>/dev/null || true
            fi
        fi
    done
}

# Main cleanup function for a single VCN
cleanup_vcn() {
    local VCN_ID=$1
    echo -e "\n🧹 Cleaning resources for VCN: $VCN_ID"
    
    # Step 1: Delete compute instances first
    delete_compute_instances "$VCN_ID"
    
    # Step 2: Clean route tables and security lists
    clean_all_route_tables "$VCN_ID"
    clean_all_security_lists "$VCN_ID"
    
    # Step 3: Delete Load Balancers
    echo "  🔻 Deleting load balancers..."
    local LBS=$(oci lb load-balancer list \
        --compartment-id "$COMPARTMENT_OCID" \
        --query "data[?\"lifecycle-state\" == 'ACTIVE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for LB_ID in $LBS; do
        if [ -n "$LB_ID" ]; then
            echo "    🔻 Deleting Load Balancer: $LB_ID"
            oci lb load-balancer delete --load-balancer-id "$LB_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 4: Delete NAT Gateways
    echo "  🔻 Deleting NAT gateways..."
    local NAT_GWS=$(oci network nat-gateway list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for NAT_ID in $NAT_GWS; do
        if [ -n "$NAT_ID" ]; then
            echo "    🔻 Deleting NAT Gateway: $NAT_ID"
            oci network nat-gateway delete --nat-gateway-id "$NAT_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 5: Delete DRG Attachments
    echo "  🔻 Deleting DRG attachments..."
    local DRG_ATTACHMENTS=$(oci network drg-attachment list \
        --compartment-id "$COMPARTMENT_OCID" \
        --query "data[?\"vcn-id\" == '$VCN_ID' && \"lifecycle-state\" == 'ATTACHED'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for DRG_ATTACHMENT_ID in $DRG_ATTACHMENTS; do
        if [ -n "$DRG_ATTACHMENT_ID" ]; then
            echo "    🔻 Deleting DRG Attachment: $DRG_ATTACHMENT_ID"
            oci network drg-attachment delete --drg-attachment-id "$DRG_ATTACHMENT_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 6: Delete Internet Gateways
    echo "  🔻 Deleting internet gateways..."
    local IGWS=$(oci network internet-gateway list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for IGW_ID in $IGWS; do
        if [ -n "$IGW_ID" ]; then
            echo "    🔻 Deleting Internet Gateway: $IGW_ID"
            oci network internet-gateway delete --ig-id "$IGW_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 7: Delete Service Gateways
    echo "  🔻 Deleting service gateways..."
    local SGWS=$(oci network service-gateway list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for SGW_ID in $SGWS; do
        if [ -n "$SGW_ID" ]; then
            echo "    🔻 Deleting Service Gateway: $SGW_ID"
            oci network service-gateway delete --service-gateway-id "$SGW_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 8: Wait for gateways to be deleted
    echo "  ⏳ Waiting for gateways to be deleted..."
    sleep 30
    
    # Step 9: Delete Subnets
    echo "  🔻 Deleting subnets..."
    local SUBNETS=$(oci network subnet list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for SUBNET_ID in $SUBNETS; do
        if [ -n "$SUBNET_ID" ]; then
            echo "    🔻 Deleting Subnet: $SUBNET_ID"
            oci network subnet delete --subnet-id "$SUBNET_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 10: Wait for subnets to be deleted
    echo "  ⏳ Waiting for subnets to be deleted..."
    sleep 30
    
    # Step 11: Delete non-default Security Lists
    echo "  🔻 Deleting custom security lists..."
    local SL_IDS=$(oci network security-list list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for SL_ID in $SL_IDS; do
        if [ -n "$SL_ID" ] && ! is_default_resource "$SL_ID" "security-list"; then
            echo "    🔻 Deleting Security List: $SL_ID"
            oci network security-list delete --security-list-id "$SL_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 12: Delete non-default Route Tables
    echo "  🔻 Deleting custom route tables..."
    local RT_IDS=$(oci network route-table list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for RT_ID in $RT_IDS; do
        if [ -n "$RT_ID" ] && ! is_default_resource "$RT_ID" "route-table"; then
            echo "    🔻 Deleting Route Table: $RT_ID"
            oci network route-table delete --rt-id "$RT_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 13: Delete non-default DHCP Options
    echo "  🔻 Deleting custom DHCP options..."
    local DHCP_IDS=$(oci network dhcp-options list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for DHCP_ID in $DHCP_IDS; do
        if [ -n "$DHCP_ID" ] && ! is_default_resource "$DHCP_ID" "dhcp-options"; then
            echo "    🔻 Deleting DHCP Options: $DHCP_ID"
            oci network dhcp-options delete --dhcp-id "$DHCP_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 14: Wait before attempting VCN deletion
    echo "  ⏳ Waiting for all resources to be cleaned up..."
    sleep 60
    
    # Step 15: Finally, delete the VCN
    echo "  🔻 Deleting VCN: $VCN_ID"
    local max_attempts=5
    local attempt=1
    
    while [ $attempt -le $max_attempts ]; do
        if oci network vcn delete --vcn-id "$VCN_ID" --force &>/dev/null; then
            echo "    ✅ VCN deletion initiated successfully"
            break
        else
            echo "    ⚠️  VCN deletion attempt $attempt failed, retrying in 30 seconds..."
            sleep 30
            ((attempt++))
        fi
    done
    
    if [ $attempt -gt $max_attempts ]; then
        echo "    ❌ Failed to delete VCN after $max_attempts attempts"
        echo "    💡 You may need to manually check for remaining dependencies"
    fi
}

# Main execution
echo -e "\n🚀 Starting VCN cleanup process..."

# Fetch all VCNs in the compartment
VCN_IDS=$(oci network vcn list \
    --compartment-id "$COMPARTMENT_OCID" \
    --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
    --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)

if [ -z "$VCN_IDS" ]; then
    echo "📭 No VCNs found in compartment $COMPARTMENT_OCID"
    exit 0
fi

echo "📋 Found VCNs to delete:"
for VCN_ID in $VCN_IDS; do
    VCN_NAME=$(oci network vcn get --vcn-id "$VCN_ID" --query "data.\"display-name\"" --raw-output 2>/dev/null)
    echo "  - $VCN_NAME ($VCN_ID)"
done

# Process each VCN
for VCN_ID in $VCN_IDS; do
    if [ -n "$VCN_ID" ]; then
        cleanup_vcn "$VCN_ID"
    fi
done

echo -e "\n✅ Cleanup complete for compartment: $COMPARTMENT_OCID"
echo "🔍 You may want to verify in the OCI Console that all resources have been deleted."

Output example

Regards

Osama Mustafa

Implementing OCI Logging Analytics for Proactive Incident Detection

Posted on April 1, 2025 by Osama Mustafa in Cloud, OCI

Oracle Cloud Infrastructure (OCI) Logging Analytics is a powerful service that helps organizations aggregate, analyze, and act on log data from across their OCI resources. In this guide, we’ll walk through setting up Logging Analytics to detect and alert on suspicious activities, using Terraform for automation and a real-world example for context.

Step 1: Enable OCI Logging Analytics

Navigate to the OCI Console:
Go to Observability & Management > Logging Analytics.

2. Create a Log Group:

oci logging-analytics log-group create \
  --compartment-id <your-compartment-ocid> \
  --display-name "Security-Logs" \
  --description "Logs for security monitoring"

Step 2: Ingest Logs from OCI Audit Service
Configure the OCI Audit service to forward logs to Logging Analytics:

Create a Service Connector:

resource "oci_sch_service_connector" "audit_to_la" {
  compartment_id = var.compartment_ocid
  display_name  = "Audit-to-Logging-Analytics"
  source {
    kind = "logging"
    log_sources {
      compartment_id = var.tenant_ocid
      log_group_id   = oci_logging_log_group.audit_logs.id
    }
  }
  target {
    kind = "loggingAnalytics"
    log_group_id = oci_logging_analytics_log_group.security_logs.id
  }
}

Step 3: Create Custom Detection Rules

Example: Detect repeated failed login attempts (brute-force attacks).

Use OCI Query Language (OCIQL):

SELECT * 
FROM AuditLogs 
WHERE eventName = 'Login' AND action = 'FAIL' 
GROUP BY actorName 
HAVING COUNT(*) > 5

Set Up Alerts:
Configure an OCI Notification topic to trigger emails or PagerDuty alerts when the rule matches.

Step 4: Visualize with Dashboards

Create a dashboard to monitor security events:

Metrics: Failed logins, API calls from unusual IPs.

Enjoy
Osama

Building a Serverless Event-Driven Architecture with AWS EventBridge, SQS, and Lambda

Posted on April 1, 2025 by Osama Mustafa in AWS, Cloud

In this blog, we’ll design a system where:

Events (e.g., order placements, file uploads) are published to EventBridge.
SQS queues act as durable buffers for downstream processing.
Lambda functions consume events and take action (e.g., send notifications, update databases).

Architecture Overview

![EventBridge → SQS → Lambda Architecture]
(Visual: Producers → EventBridge → SQS → Lambda Consumers)

Event Producers (e.g., API Gateway, S3, custom apps) emit events.
EventBridge routes events to targets (e.g., SQS queues).
SQS ensures reliable delivery and decoupling.
Lambda processes events asynchronously.

Step-by-Step Implementation

1. Set Up an EventBridge Event Bus

Create a custom event bus (or use the default one):

aws events create-event-bus --name MyEventBus

2. Define an Event Rule to Route Events to SQS

Create a rule to forward events matching a pattern (e.g., order_placed) to an SQS queue:

aws events put-rule \
  --name "OrderPlacedRule" \
  --event-pattern '{"detail-type": ["order_placed"]}' \
  --event-bus-name "MyEventBus"

3. Create an SQS Queue and Link It to EventBridge

Create a queue and grant EventBridge permission to send messages:

aws sqs create-queue --queue-name OrderProcessingQueue

Attach the queue as a target to the EventBridge rule:

aws events put-targets \
  --rule "OrderPlacedRule" \
  --targets "Id"="OrderQueueTarget","Arn"="arn:aws:sqs:us-east-1:123456789012:OrderProcessingQueue" \
  --event-bus-name "MyEventBus"

4. Write a Lambda Function to Process SQS Messages

Create a Lambda function (process_order.py) to poll the queue and process orders:

import json
import boto3

def lambda_handler(event, context):
    for record in event['Records']:
        message = json.loads(record['body'])
        order_id = message['detail']['orderId']
        
        print(f"Processing order: {order_id}")
        # Add business logic (e.g., update DynamoDB, send SNS notification)
        
    return {"status": "processed"}

5. Configure SQS as a Lambda Trigger

In the AWS Console:

Go to Lambda → Add Trigger → SQS.
Select OrderProcessingQueue and set batch size (e.g., 10 messages per invocation).

6. Test the Flow

Emit a test event to EventBridge:

aws events put-events \
  --entries '[{
    "EventBusName": "MyEventBus",
    "Source": "my.app",
    "DetailType": "order_placed",
    "Detail": "{ \"orderId\": \"123\", \"amount\": 50 }"
  }]'

Verify the flow:

EventBridge routes the event to SQS.
Lambda picks up the message and logs:

Processing order: 123

Use Cases

Order processing (e.g., e-commerce workflows).
File upload pipelines (e.g., resize images after S3 upload).
Notifications (e.g., send emails/SMS for system events).

Enjoy
Thank you
Osama

Real-Time Data Processing with AWS Kinesis, Lambda, and DynamoDB

Posted on March 28, 2025 by Osama Mustafa in AWS, Cloud

Many applications today require real-time data processing—whether it’s for analytics, monitoring, or triggering actions. AWS provides powerful services like Amazon Kinesis for streaming data, AWS Lambda for serverless processing, and DynamoDB for scalable storage.

In this blog, we’ll build a real-time data pipeline that:

Ingests streaming data (e.g., clickstream, IoT sensor data, or logs) using Kinesis Data Streams.
Processes records in real-time using Lambda.
Stores aggregated results in DynamoDB for querying.

Architecture Overview

![AWS Kinesis + Lambda + DynamoDB Architecture]
(Visual: Kinesis → Lambda → DynamoDB)

Kinesis Data Stream – Captures high-velocity data.
Lambda Function – Processes records as they arrive.
DynamoDB Table – Stores aggregated results (e.g., counts, metrics).

Step-by-Step Implementation

1. Set Up a Kinesis Data Stream

Create a Kinesis stream to ingest data:

aws kinesis create-stream --stream-name ClickStream --shard-count 1

Producers (e.g., web apps, IoT devices) can send data like:

{
  "userId": "user123",
  "action": "click",
  "timestamp": "2024-05-20T12:00:00Z"
}

2. Create a Lambda Function to Process Streams

Write a Python Lambda function (process_stream.py) to:

Read records from Kinesis.
Aggregate data (e.g., count clicks per user).
Update DynamoDB.

import json
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('UserClicks')

def lambda_handler(event, context):
    for record in event['Records']:
        payload = json.loads(record['kinesis']['data'])
        user_id = payload['userId']
        
        # Update DynamoDB (increment click count)
        table.update_item(
            Key={'userId': user_id},
            UpdateExpression="ADD clicks :incr",
            ExpressionAttributeValues={':incr': 1}
        )
    return {"status": "success"}

3. Configure Lambda as a Kinesis Consumer

In the AWS Console:

Go to Lambda → Create Function → Python.
Add Kinesis as the trigger (select your stream).
Set batch size (e.g., 100 records per invocation).

4. Set Up DynamoDB for Aggregations

Create a table with userId as the primary key:

aws dynamodb create-table \
    --table-name UserClicks \
    --attribute-definitions AttributeName=userId,AttributeType=S \
    --key-schema AttributeName=userId,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST

5. Test the Pipeline

Send test data to Kinesis:

aws kinesis put-record \
    --stream-name ClickStream \
    --data '{"userId": "user123", "action": "click"}' \
    --partition-key user123

Check DynamoDB for aggregated results:

aws dynamodb get-item --table-name UserClicks --key '{"userId": {"S": "user123"}}'

Output:

{ "userId": "user123", "clicks": 1 }

Use Cases

Real-time analytics (e.g., dashboard for user activity).
Fraud detection (trigger alerts for unusual patterns).
IoT monitoring (process sensor data in real-time).

Enjoy
Thank you
Osama

Building a Scalable Web Application Using AWS Lambda, API Gateway, and DynamoDB

Posted on March 20, 2025March 20, 2025 by Osama Mustafa in AWS, Cloud

s?

Let’s imagine we want to build a To-Do List Application where users can:

Add tasks to their list.
View all tasks.
Mark tasks as completed.

We’ll use the following architecture:

API Gateway to handle HTTP requests.
Lambda Functions to process business logic.
DynamoDB to store task data.

Step 1: Setting Up DynamoDB

First, we need a database to store our tasks. DynamoDB is an excellent choice because it scales automatically and provides low-latency access.

Creating a DynamoDB Table

Open the AWS Management Console and navigate to DynamoDB .
Click Create Table .
- Table Name : TodoList
- Primary Key : id (String)
Enable Auto Scaling for read/write capacity units to ensure the table scales based on demand.

Sample Table Structure

id (Primary Key)	task_name	status
1	Buy groceries	Pending
2	Read a book	Completed

Step 2: Creating Lambda Functions

Next, we’ll create Lambda functions to handle CRUD operations for our To-Do List application.

Lambda Function: Create Task

This function will insert a new task into the TodoList table.

import json
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('TodoList')

def lambda_handler(event, context):
    # Extract task details from the event
    task_name = event['task_name']
    
    # Generate a unique ID for the task
    import uuid
    task_id = str(uuid.uuid4())
    
    # Insert the task into DynamoDB
    table.put_item(
        Item={
            'id': task_id,
            'task_name': task_name,
            'status': 'Pending'
        }
    )
    
    return {
        'statusCode': 200,
        'body': json.dumps({'message': 'Task created successfully!', 'task_id': task_id})
    }

Lambda Function: Get All Tasks

This function retrieves all tasks from the TodoList table.

import json
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('TodoList')

def lambda_handler(event, context):
    # Scan the DynamoDB table
    response = table.scan()
    
    # Return the list of tasks
    return {
        'statusCode': 200,
        'body': json.dumps(response['Items'])
    }

Lambda Function: Update Task Status

This function updates the status of a task (e.g., mark as completed).

import json
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('TodoList')

def lambda_handler(event, context):
    # Extract task ID and new status from the event
    task_id = event['id']
    new_status = event['status']
    
    # Update the task in DynamoDB
    table.update_item(
        Key={'id': task_id},
        UpdateExpression='SET #status = :new_status',
        ExpressionAttributeNames={'#status': 'status'},
        ExpressionAttributeValues={':new_status': new_status}
    )
    
    return {
        'statusCode': 200,
        'body': json.dumps({'message': 'Task updated successfully!'})
    }

Step 3: Configuring API Gateway

Now that we have our Lambda functions, we’ll expose them via API Gateway.

Steps to Set Up API Gateway

Open the AWS Management Console and navigate to API Gateway .
Click Create API and select HTTP API .
Define the following routes:
- POST /tasks : Maps to the “Create Task” Lambda function.
- GET /tasks : Maps to the “Get All Tasks” Lambda function.
- PUT /tasks/{id} : Maps to the “Update Task Status” Lambda function.
Deploy the API and note the endpoint URL.

Step 4: Testing the Application

Once everything is set up, you can test the application using tools like Postman or cURL .

Example Requests

Create a Task

curl -X POST https://<api-id>.execute-api.<region>.amazonaws.com/tasks \
-H "Content-Type: application/json" \
-d '{"task_name": "Buy groceries"}'

Get All Tasks

curl -X GET https://<api-id>.execute-api.<region>.amazonaws.com/tasks

Update Task Status

curl -X PUT https://<api-id>.execute-api.<region>.amazonaws.com/tasks/<task-id> \
-H "Content-Type: application/json" \
-d '{"status": "Completed"}'

Benefits of This Architecture

Scalability : DynamoDB and Lambda automatically scale to handle varying loads.
Cost Efficiency : You only pay for the compute time and storage you use.
Low Maintenance : AWS manages the underlying infrastructure, reducing operational overhead.

Enjoy the cloud 😁
Osama

Setting up a High-Availability (HA) Architecture with OCI Load Balancer and Compute Instances

Posted on February 25, 2025 by Osama Mustafa in Cloud, OCI

Ensuring high availability (HA) for your applications is critical in today’s cloud-first environment. Oracle Cloud Infrastructure (OCI) provides robust tools such as Load Balancers and Compute Instances to help you create a resilient, highly available architecture for your applications. In this post, we’ll walk through the steps to set up an HA architecture using OCI Load Balancer with multiple compute instances across availability domains for fault tolerance.

Prerequisites

OCI Account: A working Oracle Cloud Infrastructure account.
OCI CLI: Installed and configured with necessary permissions.
Terraform: Installed and set up for provisioning infrastructure.
Basic knowledge of Load Balancers and Compute Instances in OCI.

Step 1: Set Up a Virtual Cloud Network (VCN)

A VCN is required to house your compute instances and load balancers. To begin, create a new VCN with subnets in different availability domains (ADs) for high availability.

Terraform Configuration (vcn.tf):

resource "oci_core_virtual_network" "vcn" {
  compartment_id = "<compartment_ocid>"
  cidr_block     = "10.0.0.0/16"
  display_name   = "HA-Virtual-Network"
}

resource "oci_core_subnet" "subnet1" {
  compartment_id      = "<compartment_ocid>"
  vcn_id              = oci_core_virtual_network.vcn.id
  cidr_block          = "10.0.1.0/24"
  availability_domain = "AD-1"
  display_name        = "HA-Subnet-AD1"
}

resource "oci_core_subnet" "subnet2" {
  compartment_id      = "<compartment_ocid>"
  vcn_id              = oci_core_virtual_network.vcn.id
  cidr_block          = "10.0.2.0/24"
  availability_domain = "AD-2"
  display_name        = "HA-Subnet-AD2"
}

Step 2: Provision Compute Instances

Create two compute instances (one in each subnet) to ensure redundancy.

Terraform Configuration (compute.tf):

resource "oci_core_instance" "instance1" {
  compartment_id = "<compartment_ocid>"
  availability_domain = "AD-1"
  shape = "VM.Standard2.1"
  display_name = "HA-Instance-1"
  
  create_vnic_details {
    subnet_id = oci_core_subnet.subnet1.id
    assign_public_ip = true
  }

  source_details {
    source_type = "image"
    source_id = "<image_ocid>"
  }
}

resource "oci_core_instance" "instance2" {
  compartment_id = "<compartment_ocid>"
  availability_domain = "AD-2"
  shape = "VM.Standard2.1"
  display_name = "HA-Instance-2"
  
  create_vnic_details {
    subnet_id = oci_core_subnet.subnet2.id
    assign_public_ip = true
  }

  source_details {
    source_type = "image"
    source_id = "<image_ocid>"
  }
}

Step 3: Set Up the OCI Load Balancer

Now, configure the OCI Load Balancer to distribute traffic between the compute instances in both availability domains.

Terraform Configuration (load_balancer.tf):

resource "oci_load_balancer_load_balancer" "ha_lb" {
  compartment_id = "<compartment_ocid>"
  display_name   = "HA-Load-Balancer"
  shape           = "100Mbps"

  subnet_ids = [
    oci_core_subnet.subnet1.id,
    oci_core_subnet.subnet2.id
  ]

  backend_sets {
    name = "backend-set-1"

    backends {
      ip_address = oci_core_instance.instance1.private_ip
      port = 80
    }

    backends {
      ip_address = oci_core_instance.instance2.private_ip
      port = 80
    }

    policy = "ROUND_ROBIN"
    health_checker {
      port = 80
      protocol = "HTTP"
      url_path = "/health"
      retries = 3
      timeout_in_seconds = 10
      interval_in_seconds = 5
    }
  }
}

resource "oci_load_balancer_listener" "ha_listener" {
  load_balancer_id = oci_load_balancer_load_balancer.ha_lb.id
  name = "http-listener"
  default_backend_set_name = "backend-set-1"
  port = 80
  protocol = "HTTP"
}

Step 4: Set Up Health Checks for High Availability

Health checks are critical to ensure that the load balancer sends traffic only to healthy instances. The health check configuration is included in the backend set definition above, but you can customize it as needed.
Step 5: Testing and Validation

Once all resources are provisioned, test the HA architecture:

Verify Load Balancer Health: Ensure that the backend instances are marked as healthy by checking the load balancer’s health checks.

oci load-balancer backend-set get --load-balancer-id <load_balancer_id> --name backend-set-1

Access the Application: Test accessing your application through the Load Balancer’s public IP. The Load Balancer should evenly distribute traffic across the two compute instances.
Failover Testing: Manually shut down one of the instances to verify that the Load Balancer reroutes traffic to the other instance.