Advanced OCI Container Engine (OKE) with Network Security and Observability

Oracle Cloud Infrastructure Container Engine for Kubernetes (OKE) provides enterprise-grade Kubernetes clusters with deep integration into OCI’s native services. This comprehensive guide explores advanced OKE configurations, focusing on network security policies, observability integration, and automated deployment strategies that enterprise teams need for production workloads.

OKE Architecture Deep Dive

OKE operates on a managed control plane architecture where Oracle handles the Kubernetes master nodes, etcd, and API server components. This design eliminates operational overhead while providing high availability across multiple availability domains.

The service integrates seamlessly with OCI’s networking fabric, allowing granular control over pod-to-pod communication, ingress traffic management, and service mesh implementations. Unlike managed Kubernetes services from other providers, OKE provides native integration with Oracle’s enterprise security stack, including Identity and Access Management (IAM), Key Management Service (KMS), and Web Application Firewall (WAF).

Worker nodes run on OCI Compute instances, providing flexibility in choosing instance shapes, including bare metal, GPU-enabled, and ARM-based Ampere processors. The networking layer supports both flannel and OCI VCN-native pod networking, enabling direct integration with existing network security policies.

Advanced Networking Configuration

OKE’s network architecture supports multiple pod networking modes. The VCN-native pod networking mode assigns each pod an IP address from your VCN’s CIDR range, enabling direct application of network security lists and route tables to pod traffic.

This approach provides several advantages over traditional overlay networking. Security policies become more granular since you can apply network security lists directly to pod traffic. Network troubleshooting becomes simpler as pod traffic flows through standard OCI networking constructs. Integration with existing network monitoring tools works seamlessly since pod traffic appears as regular VCN traffic.

Load balancing integrates deeply with OCI’s Load Balancing service, supporting both Layer 4 and Layer 7 load balancing with SSL termination, session persistence, and health checking capabilities.

Production-Ready Implementation Example

Here’s a comprehensive example that demonstrates deploying a highly available OKE cluster with advanced security and monitoring configurations:

Terraform Configuration for OKE Cluster

# OKE Cluster with Enhanced Security
resource "oci_containerengine_cluster" "production_cluster" {
  compartment_id     = var.compartment_id
  kubernetes_version = var.kubernetes_version
  name              = "production-oke-cluster"
  vcn_id            = oci_core_vcn.oke_vcn.id

  endpoint_config {
    is_public_ip_enabled = false
    subnet_id           = oci_core_subnet.oke_api_subnet.id
    nsg_ids             = [oci_core_network_security_group.oke_api_nsg.id]
  }

  cluster_pod_network_options {
    cni_type = "OCI_VCN_IP_NATIVE"
  }

  options {
    service_lb_subnet_ids = [oci_core_subnet.oke_lb_subnet.id]
    
    kubernetes_network_config {
      pods_cidr     = "10.244.0.0/16"
      services_cidr = "10.96.0.0/16"
    }

    add_ons {
      is_kubernetes_dashboard_enabled = false
      is_tiller_enabled              = false
    }

    admission_controller_options {
      is_pod_security_policy_enabled = true
    }
  }

  kms_key_id = oci_kms_key.oke_encryption_key.id
}

# Node Pool with Mixed Instance Types
resource "oci_containerengine_node_pool" "production_node_pool" {
  cluster_id         = oci_containerengine_cluster.production_cluster.id
  compartment_id     = var.compartment_id
  kubernetes_version = var.kubernetes_version
  name              = "production-workers"

  node_config_details {
    placement_configs {
      availability_domain = data.oci_identity_availability_domains.ads.availability_domains[0].name
      subnet_id          = oci_core_subnet.oke_worker_subnet.id
    }
    placement_configs {
      availability_domain = data.oci_identity_availability_domains.ads.availability_domains[1].name
      subnet_id          = oci_core_subnet.oke_worker_subnet.id
    }
    
    size                    = 3
    nsg_ids                = [oci_core_network_security_group.oke_worker_nsg.id]
    is_pv_encryption_in_transit_enabled = true
  }

  node_shape = "VM.Standard.E4.Flex"
  
  node_shape_config {
    ocpus         = 2
    memory_in_gbs = 16
  }

  node_source_details {
    image_id                = data.oci_containerengine_node_pool_option.oke_node_pool_option.sources[0].image_id
    source_type            = "IMAGE"
    boot_volume_size_in_gbs = 100
  }

  initial_node_labels {
    key   = "environment"
    value = "production"
  }

  ssh_public_key = var.ssh_public_key
}

# Network Security Group for API Server
resource "oci_core_network_security_group" "oke_api_nsg" {
  compartment_id = var.compartment_id
  vcn_id        = oci_core_vcn.oke_vcn.id
  display_name  = "oke-api-nsg"
}

resource "oci_core_network_security_group_security_rule" "oke_api_ingress" {
  network_security_group_id = oci_core_network_security_group.oke_api_nsg.id
  direction                 = "INGRESS"
  protocol                  = "6"
  source                   = "10.0.0.0/16"
  source_type              = "CIDR_BLOCK"
  
  tcp_options {
    destination_port_range {
      max = 6443
      min = 6443
    }
  }
}

# Network Security Group for Worker Nodes
resource "oci_core_network_security_group" "oke_worker_nsg" {
  compartment_id = var.compartment_id
  vcn_id        = oci_core_vcn.oke_vcn.id
  display_name  = "oke-worker-nsg"
}

# Allow pod-to-pod communication
resource "oci_core_network_security_group_security_rule" "worker_pod_communication" {
  network_security_group_id = oci_core_network_security_group.oke_worker_nsg.id
  direction                 = "INGRESS"
  protocol                  = "all"
  source                   = oci_core_network_security_group.oke_worker_nsg.id
  source_type              = "NETWORK_SECURITY_GROUP"
}

# KMS Key for Cluster Encryption
resource "oci_kms_key" "oke_encryption_key" {
  compartment_id = var.compartment_id
  display_name   = "oke-cluster-encryption-key"
  
  key_shape {
    algorithm = "AES"
    length    = 256
  }
  
  management_endpoint = oci_kms_vault.oke_vault.management_endpoint
}

resource "oci_kms_vault" "oke_vault" {
  compartment_id = var.compartment_id
  display_name   = "oke-vault"
  vault_type     = "DEFAULT"
}

Kubernetes Manifests with Network Policies



# Network Policy for Application Isolation
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: webapp-network-policy
namespace: production
spec:
podSelector:
matchLabels:
app: webapp
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
- podSelector:
matchLabels:
app: webapp-frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
- to: []
ports:
- protocol: TCP
port: 443
- protocol: TCP
port: 53
- protocol: UDP
port: 53

---
# Pod Security Policy
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
name: restricted-psp
spec:
privileged: false
allowPrivilegeEscalation: false
requiredDropCapabilities:
- ALL
volumes:
- 'configMap'
- 'emptyDir'
- 'projected'
- 'secret'
- 'downwardAPI'
- 'persistentVolumeClaim'
runAsUser:
rule: 'MustRunAsNonRoot'
seLinux:
rule: 'RunAsAny'
fsGroup:
rule: 'RunAsAny'

---
# Deployment with Security Context
apiVersion: apps/v1
kind: Deployment
metadata:
name: secure-webapp
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app: webapp
template:
metadata:
labels:
app: webapp
spec:
securityContext:
runAsNonRoot: true
runAsUser: 65534
fsGroup: 65534
containers:
- name: webapp
image: nginx:1.21-alpine
ports:
- containerPort: 8080
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 250m
memory: 256Mi
volumeMounts:
- name: tmp-volume
mountPath: /tmp
- name: cache-volume
mountPath: /var/cache/nginx
volumes:
- name: tmp-volume
emptyDir: {}
- name: cache-volume
emptyDir: {}

Monitoring and Observability Integration

OKE integrates natively with OCI Monitoring, Logging, and Logging Analytics services. This integration provides comprehensive observability without requiring additional third-party tools or complex configurations.

The monitoring integration automatically collects cluster-level metrics including CPU utilization, memory consumption, network throughput, and storage IOPS across all worker nodes. Custom metrics can be published using the OCI Monitoring SDK, enabling application-specific dashboards and alerting rules.

Logging integration captures both system logs from Kubernetes components and application logs from pods. The unified logging agent automatically forwards logs to OCI Logging service, where they can be searched, filtered, and analyzed using structured queries.

Security Best Practices Implementation

Enterprise OKE deployments require multiple layers of security controls. Network-level security starts with proper subnet segmentation, placing API servers in private subnets accessible only through bastion hosts or VPN connections.

Pod Security Policies enforce runtime security constraints, preventing privileged containers and restricting volume types. Network policies provide microsegmentation within the cluster, controlling pod-to-pod communication based on labels and namespaces.

Image security scanning integrates with OCI Container Registry’s vulnerability scanning capabilities, automatically checking container images for known vulnerabilities before deployment.

Automated CI/CD Integration

OKE clusters integrate seamlessly with OCI DevOps service for automated application deployment pipelines. The integration supports GitOps workflows, blue-green deployments, and automated rollback mechanisms.

Pipeline configurations can reference OCI Vault secrets for secure credential management, ensuring sensitive information never appears in deployment manifests or pipeline configurations.

Performance Optimization Strategies

Production OKE deployments benefit from several performance optimization techniques. Node pool configurations should match application requirements, using compute-optimized instances for CPU-intensive workloads and memory-optimized instances for data processing applications.

Pod disruption budgets ensure application availability during cluster maintenance operations. Horizontal Pod Autoscaling automatically adjusts replica counts based on CPU or memory utilization, while Cluster Autoscaling adds or removes worker nodes based on resource demands.

This comprehensive approach to OKE deployment provides enterprise-grade container orchestration with robust security, monitoring, and automation capabilities, enabling organizations to run production workloads confidently in Oracle Cloud Infrastructure.

DELETE All the VCNs in THE OCI Using BASH SCRIPT

The script below will allow you to list all VCNs in OCI and delete all attached resources to the COMPARTMENT_OCID.

Note: I wrote the scripts to perform the tasks mentioned below, which can be updated and expanded based on the needs. Feel free to do that and say the source

Complete Resource Deletion Chain: The script now handles the proper order of deletion:

  • Compute instances first
  • Clean route tables and security lists
  • Load balancers
  • Gateways (NAT, Internet, Service, DRG attachments)
  • Subnets
  • Custom security lists, route tables, and DHCP options
  • Finally, the VCN itself
#!/bin/bash

# ✅ Set this to the target compartment OCID
COMPARTMENT_OCID="Set Your OCID Here"

# (Optional) Force region
export OCI_CLI_REGION=me-jeddah-1

echo "📍 Region: $OCI_CLI_REGION"
echo "📦 Compartment: $COMPARTMENT_OCID"
echo "⚠️  WARNING: This will delete ALL VCNs and related resources in the compartment!"
echo "Press Ctrl+C within 10 seconds to cancel..."
sleep 10

# Function to wait for resource deletion
wait_for_deletion() {
    local resource_id=$1
    local resource_type=$2
    local max_attempts=30
    local attempt=1
    
    echo "    ⏳ Waiting for $resource_type deletion..."
    while [ $attempt -le $max_attempts ]; do
        if ! oci network $resource_type get --${resource_type//-/}-id "$resource_id" &>/dev/null; then
            echo "    ✅ $resource_type deleted successfully"
            return 0
        fi
        sleep 10
        ((attempt++))
    done
    echo "    ⚠️  Timeout waiting for $resource_type deletion"
    return 1
}

# Function to check if resource is default
is_default_resource() {
    local resource_id=$1
    local resource_type=$2
    
    case $resource_type in
        "security-list")
            result=$(oci network security-list get --security-list-id "$resource_id" --query "data.\"display-name\"" --raw-output 2>/dev/null)
            [[ "$result" == "Default Security List"* ]]
            ;;
        "route-table")
            result=$(oci network route-table get --rt-id "$resource_id" --query "data.\"display-name\"" --raw-output 2>/dev/null)
            [[ "$result" == "Default Route Table"* ]]
            ;;
        "dhcp-options")
            result=$(oci network dhcp-options get --dhcp-id "$resource_id" --query "data.\"display-name\"" --raw-output 2>/dev/null)
            [[ "$result" == "Default DHCP Options"* ]]
            ;;
        *)
            false
            ;;
    esac
}

# Function to clean all route tables in a VCN
clean_all_route_tables() {
    local VCN_ID=$1
    echo "  🧹 Cleaning all route tables..."
    
    local RT_IDS=$(oci network route-table list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for RT_ID in $RT_IDS; do
        if [ -n "$RT_ID" ]; then
            echo "    🔧 Clearing routes in route table: $RT_ID"
            oci network route-table update --rt-id "$RT_ID" --route-rules '[]' --force &>/dev/null || true
        fi
    done
    
    # Wait a bit for route updates to propagate
    sleep 5
}

# Function to clean all security lists in a VCN
clean_all_security_lists() {
    local VCN_ID=$1
    echo "  🧹 Cleaning all security lists..."
    
    local SL_IDS=$(oci network security-list list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for SL_ID in $SL_IDS; do
        if [ -n "$SL_ID" ]; then
            echo "    🔧 Clearing rules in security list: $SL_ID"
            oci network security-list update \
                --security-list-id "$SL_ID" \
                --egress-security-rules '[]' \
                --ingress-security-rules '[]' \
                --force &>/dev/null || true
        fi
    done
    
    # Wait a bit for security list updates to propagate
    sleep 5
}

# Function to delete compute instances in subnets
delete_compute_instances() {
    local VCN_ID=$1
    echo "  🖥️  Checking for compute instances..."
    
    local INSTANCES=$(oci compute instance list \
        --compartment-id "$COMPARTMENT_OCID" \
        --query "data[?\"lifecycle-state\" != 'TERMINATED'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for INSTANCE_ID in $INSTANCES; do
        if [ -n "$INSTANCE_ID" ]; then
            # Check if instance is in this VCN
            local INSTANCE_VCN=$(oci compute instance list-vnics \
                --instance-id "$INSTANCE_ID" \
                --query "data[0].\"vcn-id\"" \
                --raw-output 2>/dev/null)
            
            if [[ "$INSTANCE_VCN" == "$VCN_ID" ]]; then
                echo "    🔻 Terminating compute instance: $INSTANCE_ID"
                oci compute instance terminate --instance-id "$INSTANCE_ID" --force &>/dev/null || true
            fi
        fi
    done
}

# Main cleanup function for a single VCN
cleanup_vcn() {
    local VCN_ID=$1
    echo -e "\n🧹 Cleaning resources for VCN: $VCN_ID"
    
    # Step 1: Delete compute instances first
    delete_compute_instances "$VCN_ID"
    
    # Step 2: Clean route tables and security lists
    clean_all_route_tables "$VCN_ID"
    clean_all_security_lists "$VCN_ID"
    
    # Step 3: Delete Load Balancers
    echo "  🔻 Deleting load balancers..."
    local LBS=$(oci lb load-balancer list \
        --compartment-id "$COMPARTMENT_OCID" \
        --query "data[?\"lifecycle-state\" == 'ACTIVE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for LB_ID in $LBS; do
        if [ -n "$LB_ID" ]; then
            echo "    🔻 Deleting Load Balancer: $LB_ID"
            oci lb load-balancer delete --load-balancer-id "$LB_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 4: Delete NAT Gateways
    echo "  🔻 Deleting NAT gateways..."
    local NAT_GWS=$(oci network nat-gateway list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for NAT_ID in $NAT_GWS; do
        if [ -n "$NAT_ID" ]; then
            echo "    🔻 Deleting NAT Gateway: $NAT_ID"
            oci network nat-gateway delete --nat-gateway-id "$NAT_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 5: Delete DRG Attachments
    echo "  🔻 Deleting DRG attachments..."
    local DRG_ATTACHMENTS=$(oci network drg-attachment list \
        --compartment-id "$COMPARTMENT_OCID" \
        --query "data[?\"vcn-id\" == '$VCN_ID' && \"lifecycle-state\" == 'ATTACHED'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for DRG_ATTACHMENT_ID in $DRG_ATTACHMENTS; do
        if [ -n "$DRG_ATTACHMENT_ID" ]; then
            echo "    🔻 Deleting DRG Attachment: $DRG_ATTACHMENT_ID"
            oci network drg-attachment delete --drg-attachment-id "$DRG_ATTACHMENT_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 6: Delete Internet Gateways
    echo "  🔻 Deleting internet gateways..."
    local IGWS=$(oci network internet-gateway list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for IGW_ID in $IGWS; do
        if [ -n "$IGW_ID" ]; then
            echo "    🔻 Deleting Internet Gateway: $IGW_ID"
            oci network internet-gateway delete --ig-id "$IGW_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 7: Delete Service Gateways
    echo "  🔻 Deleting service gateways..."
    local SGWS=$(oci network service-gateway list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for SGW_ID in $SGWS; do
        if [ -n "$SGW_ID" ]; then
            echo "    🔻 Deleting Service Gateway: $SGW_ID"
            oci network service-gateway delete --service-gateway-id "$SGW_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 8: Wait for gateways to be deleted
    echo "  ⏳ Waiting for gateways to be deleted..."
    sleep 30
    
    # Step 9: Delete Subnets
    echo "  🔻 Deleting subnets..."
    local SUBNETS=$(oci network subnet list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for SUBNET_ID in $SUBNETS; do
        if [ -n "$SUBNET_ID" ]; then
            echo "    🔻 Deleting Subnet: $SUBNET_ID"
            oci network subnet delete --subnet-id "$SUBNET_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 10: Wait for subnets to be deleted
    echo "  ⏳ Waiting for subnets to be deleted..."
    sleep 30
    
    # Step 11: Delete non-default Security Lists
    echo "  🔻 Deleting custom security lists..."
    local SL_IDS=$(oci network security-list list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for SL_ID in $SL_IDS; do
        if [ -n "$SL_ID" ] && ! is_default_resource "$SL_ID" "security-list"; then
            echo "    🔻 Deleting Security List: $SL_ID"
            oci network security-list delete --security-list-id "$SL_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 12: Delete non-default Route Tables
    echo "  🔻 Deleting custom route tables..."
    local RT_IDS=$(oci network route-table list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for RT_ID in $RT_IDS; do
        if [ -n "$RT_ID" ] && ! is_default_resource "$RT_ID" "route-table"; then
            echo "    🔻 Deleting Route Table: $RT_ID"
            oci network route-table delete --rt-id "$RT_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 13: Delete non-default DHCP Options
    echo "  🔻 Deleting custom DHCP options..."
    local DHCP_IDS=$(oci network dhcp-options list \
        --compartment-id "$COMPARTMENT_OCID" \
        --vcn-id "$VCN_ID" \
        --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
        --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)
    
    for DHCP_ID in $DHCP_IDS; do
        if [ -n "$DHCP_ID" ] && ! is_default_resource "$DHCP_ID" "dhcp-options"; then
            echo "    🔻 Deleting DHCP Options: $DHCP_ID"
            oci network dhcp-options delete --dhcp-id "$DHCP_ID" --force &>/dev/null || true
        fi
    done
    
    # Step 14: Wait before attempting VCN deletion
    echo "  ⏳ Waiting for all resources to be cleaned up..."
    sleep 60
    
    # Step 15: Finally, delete the VCN
    echo "  🔻 Deleting VCN: $VCN_ID"
    local max_attempts=5
    local attempt=1
    
    while [ $attempt -le $max_attempts ]; do
        if oci network vcn delete --vcn-id "$VCN_ID" --force &>/dev/null; then
            echo "    ✅ VCN deletion initiated successfully"
            break
        else
            echo "    ⚠️  VCN deletion attempt $attempt failed, retrying in 30 seconds..."
            sleep 30
            ((attempt++))
        fi
    done
    
    if [ $attempt -gt $max_attempts ]; then
        echo "    ❌ Failed to delete VCN after $max_attempts attempts"
        echo "    💡 You may need to manually check for remaining dependencies"
    fi
}

# Main execution
echo -e "\n🚀 Starting VCN cleanup process..."

# Fetch all VCNs in the compartment
VCN_IDS=$(oci network vcn list \
    --compartment-id "$COMPARTMENT_OCID" \
    --query "data[?\"lifecycle-state\" == 'AVAILABLE'].id" \
    --raw-output 2>/dev/null | jq -r '.[]' 2>/dev/null)

if [ -z "$VCN_IDS" ]; then
    echo "📭 No VCNs found in compartment $COMPARTMENT_OCID"
    exit 0
fi

echo "📋 Found VCNs to delete:"
for VCN_ID in $VCN_IDS; do
    VCN_NAME=$(oci network vcn get --vcn-id "$VCN_ID" --query "data.\"display-name\"" --raw-output 2>/dev/null)
    echo "  - $VCN_NAME ($VCN_ID)"
done

# Process each VCN
for VCN_ID in $VCN_IDS; do
    if [ -n "$VCN_ID" ]; then
        cleanup_vcn "$VCN_ID"
    fi
done

echo -e "\n✅ Cleanup complete for compartment: $COMPARTMENT_OCID"
echo "🔍 You may want to verify in the OCI Console that all resources have been deleted."

Output example

Regards

Automating Block Volume Backups in Oracle Cloud Infrastructure (OCI) using CLI and Terraform

Briefly introduce the importance of block volumes in OCI and why automated backups are essential.Mention that this blog will cover two methods: using the OCI CLI and Terraform for automation.

Automating Block Volume Backups using OCI CLI

Prerequisites:

  • Set up OCI CLI on your machine (brief steps with links).
  • Ensure that you have the right permissions to manage block volumes.

Step-by-step guide:

  • Command to create a block volume
oci bv volume create --compartment-id <your_compartment_ocid> --availability-domain <your_ad> --display-name "MyVolume" --size-in-gbs 50

Command to take a backup of the block volume:

oci bv backup create --volume-id <your_volume_ocid> --display-name "MyVolumeBackup"

Scheduling backups using cron jobs for automation.

  • Example cron job configuration
0 2 * * * /usr/local/bin/oci bv backup create --volume-id <your_volume_ocid> --display-name "ScheduledBackup" >> /var/log/oci_backup.log 2>&1

Automating Block Volume Backups using Terraform

Prerequisites

  1. OCI Credentials: Make sure you have the proper API keys and permissions configured in your OCI tenancy.
  2. Terraform Setup: Terraform should be installed and configured to interact with OCI, including the OCI provider setup in your environment.
Step 1: Define the OCI Block Volume Resource

First, define the block volume that you want to automate backups for. Here’s an example of a simple block volume resource in Terraform:

resource "oci_core_volume" "my_block_volume" {
  availability_domain = "your-availability-domain"
  compartment_id      = "ocid1.compartment.oc1..your-compartment-id"
  display_name        = "my_block_volume"
  size_in_gbs         = 50
}
Step 2: Define a Backup Policy

OCI provides predefined backup policies such as gold, silver, and bronze, which define how frequently backups are taken. You can create a custom backup policy as well, but for simplicity, we’ll use one of the predefined policies in this example. The Terraform resource oci_core_volume_backup_policy_assignment will assign a backup policy to the block volume.

Here’s an example to assign the gold backup policy to the block volume:

resource "oci_core_volume_backup_policy_assignment" "backup_assignment" {
  volume_id       = oci_core_volume.my_block_volume.id
  policy_id       = data.oci_core_volume_backup_policy.gold.id
}

data "oci_core_volume_backup_policy" "gold" {
  name = "gold"
}
Step 3: Custom Backup Policy (Optional)

If you need a custom backup policy rather than using the predefined gold, silver, or bronze policies, you can define a custom backup policy using OCI’s native scheduling.

You can create a custom schedule by combining these elements in your oci_core_volume_backup_policy resource.

resource "oci_core_volume_backup_policy" "custom_backup_policy" {
  compartment_id = "ocid1.compartment.oc1..your-compartment-id"
  display_name   = "CustomBackupPolicy"

  schedules {
    backup_type = "INCREMENTAL"
    period      = "ONE_DAY"
    retention_duration = "THIRTY_DAYS"
  }

  schedules {
    backup_type = "FULL"
    period      = "ONE_WEEK"
    retention_duration = "NINETY_DAYS"
  }
}

You can then assign this policy to the block volume using the same method as earlier.

Step 4: Apply the Terraform Configuration

Once your Terraform configuration is ready, apply it using the standard Terraform workflow:

  1. Initialize Terraform:
terraform init

Plan the Terraform deployment:

terraform plan

Apply the Terraform plan:

terraform apply

This process will automatically provision your block volumes and assign the specified backup policy.



Regards
Osama

AWS CLOUD STORAGE OVERVIEW

There are three types of cloud storage: object, file, and block. Each storage option has a unique combination of performance, durability, cost, and interface.

  • Block storage – Enterprise applications like databases or enterprise resource planning (ERP) systems often require dedicated, low-latency storage for each host. This is similar to direct-attached storage (DAS) or a Storage Area Network (SAN). Block-based cloud storage solutions like Amazon Elastic Block Store (Amazon EBS) are provisioned with each virtual server and offer the ultra-low latency required for high-performance workloads.
  • File storage – Many applications must access shared files and require a file system. This type of storage is often supported with a Network Attached Storage (NAS) server. File storage solutions like Amazon Elastic File System (Amazon EFS) are ideal for use cases such as large content repositories, development environments, media stores, or user home directories.
  • Object storage – Applications developed in the cloud need the vast scalability and metadata of object storage. Object storage solutions like Amazon Simple Storage Service (Amazon S3) are ideal for building modern applications. Amazon S3 provides scale and flexibility. You can use it to import existing data stores for analytics, backup, or archive.

AWS provides you with services for your block, file and object storage needs. Select each hotspot in the image to see what services are available for you to explore to build solutions.

Amazon S3 use cases

  • Backup and restore.
  • Data Lake for analytics.
  • Media storage
  • Static website.
  • Archiving

Buckets and objects

Amazon S3 stores data as objects within buckets. An object is composed of a file and any metadata that describes that file.  The diagram below contains a URL comprised of a bucket and an object key. The object key is the unique identifier of an object in a bucket. The combination of a bucket, key, and version ID uniquely identifies each object. The object is uniquely addressed through the combination of the web service endpoint, bucket name, key, and optionally, a version. 

To store an object in Amazon S3, upload the file into a bucket. When you upload a file, you can set permissions on the object and add metadata. You can have one or more buckets in your account. For each bucket, you control who can create, delete, and list objects in the bucket.

Amazon S3 access control

By default, all Amazon S3 resources—buckets, objects, and related resources (for example, lifecycle configuration and website configuration)—are private. Only the resource owner, an AWS account that created it, can access the resource. The resource owner can grant access permissions to others by writing access policies. 

AWS provides several different tools to help developers configure buckets for a wide variety of workloads. 

  • Most Amazon S3 use cases do not require public access. 
  • Amazon S3 usually stores data from other applications. Public access is not recommended for these types of buckets. 
  • Amazon S3 includes a block public access feature. This acts as an additional layer of protection to prevent accidental exposure of customer data. 

Amazon S3 Event Notifications

Amazon S3 event notifications enable you to receive notifications when certain object events happen in your bucket. Here is an example of an event notification workflow to convert images to thumbnails. To learn more, select each of the three hotspots in the diagram below.

Amazon S3 cost factors and best practices

Cost is an important part of choosing the right Amazon S3 storage solution. Some of the Amazon S3 cost factors to consider include the following:

  • Storage – Per-gigabyte cost to hold your objects. You pay for storing objects in your S3 buckets. The rate you’re charged depends on your objects’ size, how long you stored the objects during the month, and the storage class. There are per-request ingest charges when using PUT, COPY, or lifecycle rules to move data into any S3 storage class.
  • Requests and retrievals – The number of API calls: PUT and GET requests. You pay for requests made against your S3 buckets and objects. S3 request costs are based on the request type, and are charged on the quantity of requests. When you use the Amazon S3 console to browse your storage, you incur charges for GET, LIST, and other requests that are made to facilitate browsing.
  • Data transfer – Usually no transfer fee for data-in from the internet and, depending on the requestor location and medium of data transfer, different charges for data-out. 
  • Management and analytics – You pay for the storage management features and analytics that are enabled on your account’s buckets. These features are not discussed in detail in this course.

S3 Replication and S3 Versioning can have a big impact on your AWS bill. These services both create multiple copies of your objects and you pay for each PUT request in addition to the storage tier charge. S3 Cross-Region Replication also requires data transfer between AWS Regions.

Shared file systems

Using a fully managed cloud shared file system solution removes complexities, reduces costs, and simplifies management. To learn more about shared file systems, select each hotspot in the image below.

Amazon Elastic File System (EFS) 

Amazon EFS provides a scalable, elastic file system for Linux-based workloads for use with AWS Cloud services and on-premises resources. 

You’re able to access your file system across Availability Zones, AWS Regions, and VPCs while sharing files between thousands of EC2 instances and on-premises servers through AWS Direct Connect or AWS VPN. 

You can create a file system, mount the file system on an Amazon EC2 instance, and then read and write data to and from your file system. 

Amazon EFS provides a shared, persistent layer that allows stateful applications to elastically scale up and down. Examples include DevOps, web serving, web content systems, media processing, machine learning, analytics, search index, and stateful microservices applications. Amazon EFS can support a petabyte-scale file system, and the throughput of the file system also scales with the capacity of the file system.

Because Amazon EFS is serverless, you don’t need to provision or manage the infrastructure or capacity. Amazon EFS file systems can be shared with up to tens of thousands of concurrent clients, no matter the type. These could be traditional EC2 instances, containers running in one of your self-managed clusters or in one of the AWS container services, Amazon ECS, Amazon EKS, and Fargate, or in a serverless function running in Lambda.

Use Amazon EFS to lower your total cost of ownership for shared file storage. Choose Amazon EFS One Zone for data that does not require replication across multiple Availability Zones and save on storage costs. Amazon EFS Standard-Infrequent Access (EFS Standard-IA) and Amazon EFS One Zone-Infrequent Access (EFS One Zone-IA) are storage classes that provide price/performance that is cost-optimized for files not accessed every day.

Use Amazon EFS scaling and automation to save on management costs, and pay only for what you use.

Amazon FSx

With Amazon FSx, you can quickly launch and run feature-rich and high-performing file systems. The service provides you with four file systems to choose from. This choice is based on your familiarity with a given file system or by matching the feature sets, performance profiles, and data management capabilities to your needs.

Amazon FSx for Windows File Server

FSx for Windows File Server provides fully managed Microsoft Windows file servers that are backed by a native Windows file system. Built on Windows Server, Amazon FSx delivers a wide range of administrative features such as data deduplication, end-user file restore, and Microsoft Active Directory.

Amazon FSx for Lustre (FSx for Lustre) 

FSx for Lustre is a fully managed service that provides high-performance, cost-effective storage. FSx for Lustre is compatible with the most popular Linux-based AMIs, including Amazon Linux, Amazon Linux 2, Red Hat Enterprise Linux (RHEL), CentOS, SUSE Linux, and Ubuntu.

Amazon FSx for NETapp ONTAP

FSx for NETapp ONTAP provides fully managed shared storage in the AWS Cloud with the popular data access and management capabilities of ONTAP.

Amazon FSx for OpenZFS

Where the road leads, I will go. Along the stark desert, across the wide plains, into the deep forests I will follow the call of the world and embrace its ferocious beauty.

Regards

Osama

Automating AWS IAM User Creation with Terraform: A Step-by-Step Guide

In this post, I will share a Terraform script I developed and uploaded to my GitHub repository, aimed at simplifying and automating the creation of IAM users in AWS. This tool is not just about saving time; it’s about enhancing security, ensuring consistency, and enabling scalability in managing user access to AWS services.

For those who may be new to Terraform, it’s a powerful tool that allows you to build, change, and version infrastructure safely and efficiently. Terraform can manage existing service providers as well as custom in-house solutions. The code I’m about to share represents a practical application of Terraform’s capabilities in the AWS ecosystem.

Whether you are an experienced DevOps professional, a system administrator, or just someone interested in cloud infrastructure management, this post is designed to provide you with valuable insights into automating IAM user creation. Let’s dive into how this Terraform script can streamline your AWS IAM processes, ensuring a more secure and efficient cloud environment.

Github Link Here


Regards

Osama

Harnessing the Power of AWS ECS Fargate with Terraform: A Comprehensive Guide

Welcome to our deep dive into the world of containerization and cloud orchestration! In this blog post, we’re going to explore the innovative realm of AWS ECS Fargate, a game-changer in the world of container management and deployment. AWS ECS Fargate simplifies the process of running containers by eliminating the need to manage servers or clusters, offering a more streamlined and efficient way to deploy your applications.

But that’s not all. We understand the importance of infrastructure as code (IaC) in today’s fast-paced tech environment. That’s why we’re also providing you with a powerful resource – a GitHub repository containing Terraform code, meticulously crafted to help you deploy AWS ECS Fargate services with ease. Terraform, an open-source infrastructure as code software tool, enables you to define and provision a datacenter infrastructure using a declarative configuration language. This integration with Terraform not only automates your deployments but also ensures consistency and reliability in your infrastructure setup.

Whether you’re new to AWS ECS Fargate or looking to enhance your existing knowledge, this post aims to provide you with actionable insights and practical know-how. From setting up your first Fargate service to scaling and managing it effectively, we’ve got you covered. So, gear up as we embark on this journey to harness the full potential of AWS ECS Fargate, supplemented by the power of Terraform automation.

Stay tuned, and don’t forget to check out our GitHub repository linked at the end of this post for the Terraform code that will be your ally in deploying and managing your Fargate services efficiently.

The GitHub Link Here

Regards

Osama

Amazon Kinesis

Amazon Kinesis for data collection and analysis

With Amazon Kinesis, you:

  • Collect, process, and analyze data streams in real time. Kinesis has the capacity to process streaming data at any scale. It provides you the flexibility to choose the tools that best suit the requirements of your application in a cost-effective way.
  • Ingest real-time data such as video, audio, application logs, website clickstreams, and Internet of Things (IoT) telemetry data. The ingested data can be used for machine learning, analytics, and other applications.
  • Can process and analyze data as it arrives, and respond instantly. You don’t have to wait until all data is collected before the processing begins.

Amazon Kinesis Data Streams

To get started using Amazon Kinesis Data Streams, create a stream and specify the number of shards. Each shard is a unit of read and write capacity. Each shard can read up to 1 MB of data per second and write at a rate of 2 MB per second. The total capacity of a stream is the sum of the capacities of its shards. Increase or decrease the number of shards in a stream as needed. Data being written is in the form of a record, which can be up to 1 MB in size.

  • Producers write data into the stream. A producer might be an Amazon EC2 instance, a mobile client, an on-premises server, or an IoT device.
  • Consumers receive the streaming data that the producers generate. A consumer might be an application running on an EC2 instance or AWS Lambda. If it’s on an Amazon EC2 instance, the application will need to scale as the amount of streaming data increases. If this is the case, run it in an Auto Scaling group. 
  • Each consumer reads from a particular shard. There might be more than one application processing the same data. 
  • Another way to write a consumer application is to use AWS Lambda, which lets you run code without having to provision or manage servers. 
  • The results of the consumer applications can be stored by AWS services such as Amazon S3, Amazon DynamoDB, and Amazon RedShift.

Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose starts to process data in near-real time. Kinesis Data Firehose can send records to Amazon S3, Amazon Redshift, Amazon Elasticsearch Service (ES), and any HTTP endpoint owned by you. It can also send records to any of your third-party service providers, including Datadog, New Relic, and Splunk.

Regards

Osama

SQS vs. SNS

Loose coupling with Amazon Simple Queue Service

Amazon Simple Queue Service (SQS) is a fully managed message queuing service that use use to you to decouple and scale microservices, distributed systems, and serverless applications The service works on a massive scale, processing billions of messages per day. It stores all message queues and messages within a single, highly available AWS Region with multiple redundant Availability Zones. This ensures that no single computer, network, or Availability Zone failure can make messages inaccessible. Messages can be sent and read simultaneously.

A loosely coupled workload involves processing a large number of smaller jobs. The loss of one node or job in a loosely coupled workload usually doesn’t delay the entire calculation. The lost work can be picked up later or omitted altogether.

With Amazon SQS, you can decouple pre-processing steps from compute steps and post-processing steps. Building applications from individual components that perform discrete functions improves scalability and reliability. Decoupling components is a best practice for designing modern applications. Amazon SQS frequently lies at the heart of cloud-native loosely coupled solutions.

SQS queue types

Amazon SQS offers two types of message queues


STANDARD QUEUES

Standard queues support at-least-once message delivery and provide best-effort ordering. Messages are generally delivered in the same order in which they are sent. However, because of the highly distributed architecture, more than one copy of a message might be delivered out of order. Standard queues can handle a nearly unlimited number of API calls per second. You can use standard message queues if your application can process messages that arrive repetitively and out of order.

FIFO QUEUES

FIFO (First-In-First-Out) queues are designed to enhance messaging between applications when the order of operations and events is critical or where duplicates can’t be tolerated. FIFO queues also provide exactly-once processing but have a limited number of API calls per second. FIFO queues are designed to enhance messaging between applications when the order of operations and events is critical.

Optimizing your Amazon SQS queue configurations

When creating an Amazon SQS queue, you need to consider how your application interacts with the queue. This information will help you optimize the configuration of your queue to control costs and increase performance.

TUNE YOUR VISIBILITY TIMEOUT

When a consumer receives an SQS message, that message remains in the queue until the consumer deletes it. You can configure the SQS queue’s visibility timeout setting to make that message invisible to other consumers for a period of time. This helps to prevent another consumer from processing the same message. The default visibility timeout is 30 seconds. The consumer deletes the message once it completes processing the message. If the consumer fails to delete the message before the visibility timeout expires, it becomes visible to other consumers and can be processed again. 

Typically, you should set the visibility timeout to the maximum time that it takes your application to process and delete a message from the queue. Setting too short of a timeout increases the possibility of your application processing a message twice. Too long of a visibility timeout delays subsequent attempts at processing a message.

CHOOSE THE RIGHT POLLING TYPE

You can configure an Amazon SQS queue to use either short polling or long polling. Queues with short polling:

  • Send a response to the consumer immediately after receiving a request providing a faster response
  • Increases the number of responses and therefore costs. 

SQS queues with long polling:

  • Do not return a response until at least one message arrives or the poll times out.
  • Less frequent responses but decreases costs.

Depending on the frequency of messages arriving in your queue, many of the responses from a queue using short polling could just be reporting an empty queue. Unless your application requires an immediate response to its poll requests, long polling is the preferable option.

Amazon SNS

Amazon SNS is a web service that makes it easy to set up, operate, and send notifications from the cloud. The service follows the publish-subscribe (pub-sub) messaging paradigm, with notifications being delivered to clients using a push mechanism.

Amazon SNS publisher to multiple SQS queues

Using highly available services, such as Amazon SNS, to perform basic message routing is an effective way of distributing messages to microservices. The two main forms of communications between microservices are request-response, and observer. In the example, an observer type is used to fan out orders to two different SQS queues based on the order type. 

To deliver Amazon SNS notifications to an SQS queue, you subscribe to a topic specifying Amazon SQS as the transport and a valid SQS queue as the endpoint. To permit the SQS queue to receive notifications from Amazon SNS, the SQS queue owner must subscribe the SQS queue to the topic for Amazon SNS. If the user owns the Amazon SNS topic being subscribed to and the SQS queue receiving the notifications, nothing else is required. Any message published to the topic will automatically be delivered to the specified SQS queue. If the owner of the SQS queue is not the owner of the topic, Amazon SNS requires an explicit confirmation to the subscription request.

Amazon SNS and Amazon SQS

FeaturesAmazon SNSAmazon SQS
Message persistenceNoYes
Delivery mechanismPush (passive)Poll (active)
Producer and consumerPublisher and subscriberSend or receive
Distribution modelOne to manyOne to one

Regards

Osama

AWS API GATEWAY

With API Gateway, you can create, publish, maintain, monitor, and secure APIs.

With API Gateway, you can connect your applications to AWS services and other public or private websites. It provides consistent RESTful and HTTP APIs for mobile and web applications to access AWS services and other resources hosted outside of AWS.

As a gateway, it handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls. These include traffic management, authorization and access control, monitoring, and API version management.

API Gateway sample architecture

API Gateway integrates with Amazon CloudWatch by sending log messages and detailed metrics to it. You can activate logging for each stage in your API or for each method. You can set the verbosity of the logging (Error or Info), and if full request and response data should be logged.

The detailed metrics that API Gateway can send to Amazon CloudWatch are:

  • Number of API calls
  • Latency
  • Integration latency
  • HTTP 400 and 500 errors

API Gateway features

Creates a unified API front end for multiple microservices.
Provides DDoS protection and throttling for your backend.
Authenticates and authorizes requests to a backend.
Throttles, meters, and monetizes API usage by third-party developers.

Regards

Osama

VPC Peering

Connecting VPCs with VPC peering

When your business or architecture becomes large enough, you will find the need to separate logical elements for security or architectural needs, or just for simplicity’s sake. 

A VPC peering connection is a one-to-one relationship between two VPCs. There can only be one peering resource between any two VPCs. You can create multiple VPC peering connections for each VPC that you own, but transitive peering relationships are not supported. You will not have any peering relationship with VPCs that your VPC is not directly peered with. You can create a VPC peering connection between your own VPCs, or with a VPC in another AWS account within a single Region.

To establish a VPC peering connection, the owner of the requester VPC (or local VPC) sends a request to the owner of the peer VPC. You or another AWS account can own the peer VPC. It cannot have a Classless Inter-Domain Routing (CIDR) block that overlaps with your requester VPC’s CIDR block. The owner of the peer VPC has to accept the VPC peering connection request to activate the VPC peering connection. 

To permit the flow of traffic between the peer VPCs using private IP addresses, add a route to one or more of your VPC’s route tables that points to the IP address range of the peer VPC. The owner of the peer VPC adds a route to one of their VPC’s route tables that points to the IP address range of your VPC. You might also need to update the security group rules that are associated with your instance to ensure that traffic to and from the peer VPC is not restricted. 

Benefits of VPC peering

Review some of the benefits of using VPC peering to connect multiple VPCs together.

  • bulletBypass the internet gateway or virtual private gateway. Use VPC peering to quickly connect two or more of your networks without needing other virtual appliances in your environment.
  • bulletUse highly available connections. VPC peering connections are redundant by default. AWS manages your connection.
  • bulletAvoid bandwidth bottlenecks. All inter-Region traffic is encrypted with no single point of failure or bandwidth bottlenecks. Traffic always stays on the global AWS backbone, and never traverses the public internet, which reduces threats, such as common exploits, and distributed denial of service (DDoS) attacks.
  • bulletUse private IP addresses to direct traffic. The VPC peering traffic remains in the private IP space.

 VPC peering for shared services

your security team provides you with a shared services VPC that each department can peer with. This VPC allows your resources to connect to a shared directory service, security scanning tools, monitoring or logging tools, and other services.

A VPC peering connection with a VPC in a different Region is present. Inter-Region VPC peering allows VPC resources that run in different AWS Regions to communicate with each other using private IP addresses. You won’t be required to use gateways, virtual private network (VPN) connections, or separate physical hardware to send traffic between your Regions.

full mesh VPC peering

each VPC must have a one-to-one connection with each VPC with which it is approved to communicate. This is because each VPC peering connection is nontransitive in nature and does not permit network traffic to pass from one peering connection to another.

The number of connections required has a direct impact on the number of potential points of failure and the requirement for monitoring. The fewer connections you need, the fewer you need to monitor and the fewer potential points of failure.

Regards

Osama