Day 29 - GitOps Patient Zero — Continuous Delivery on AWS EKS with Argo CD and Kustomize

Today's project replaces manual, ad-hoc cluster provisioning with a production-grade GitOps Continuous Delivery (CD) pipeline built on Argo CD and Kustomize, running on an Amazon Elastic Kubernetes Service (EKS) cluster.

Traditional infrastructure management often suffers from "configuration drift": changes made by hand through the CLI or console accumulate until an environment can no longer be reproduced. With GitOps, my entire application stack is declared in code, and a GitHub repository serves as the single source of truth.

1. The System Architecture Topography

This project draws a clean boundary between provisioning the core cloud infrastructure and delivering the application. Below is the architectural blueprint behind today's deployment:

2. Technical Highlights & Version Pinning Strategy

To avoid fragile dependencies between components, today's deployment pins versions deliberately and keeps the layers decoupled:

Stable Infrastructure Core: Built on the production-mature terraform-aws-modules/eks/aws v20.x module line. This sidestepped the access entry race condition I hit with the v21 module, where worker nodes failed to register with the cluster.
Isolated Networking Baseline: Configured a dedicated AWS VPC with a custom 10.129.0.0/16 CIDR block to avoid address collisions with previous stack runs. Public subnets assign public IPs automatically and are tagged so that standard Elastic Load Balancers for inbound web traffic are placed in them.
Declarative Configuration Engineering: Used Kustomize to manage base configurations, centralized container image tags, and replica counts across all running deployments without a heavy templating engine.
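
The pinning strategy above can be made explicit in a versions.tf file. The following is a minimal sketch rather than the configuration used in this project: the file name, the provider list, and every version constraint are assumptions chosen for illustration. Only the EKS and VPC module constraints are stated in main.tf itself.

```hcl
# versions.tf (hypothetical file; all constraints are illustrative assumptions)
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.0"
    }
  }
}
```

The ~> operator allows only the rightmost version component to increase, so ~> 5.0 accepts any 5.x release but never 6.0. This is the same mechanism that keeps the EKS module on the v20.x line.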

3. Infrastructure as Code: The Complete `main.tf`

The core cloud foundation is provisioned entirely with Terraform. Below is the updated and corrected main.tf used to build the cluster base:

Terraform

# ==============================================================================
# DAY 29: GITOPS PATIENT ZERO - MAIN INFRASTRUCTURE CONFIGURATION
# ==============================================================================

data "aws_availability_zones" "available" {
  state = "available"

  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"] # Excludes Local Zones and other opt-in zones
  }
}

locals {
  name    = "${var.project_name}-${var.environment}"
  eks_azs = slice(data.aws_availability_zones.available.names, 0, 2)

  tags = {
    Project     = var.project_name
    Environment = var.environment
    Owner       = var.owner
    ManagedBy   = "terraform"
    Day         = "29"
  }
}

# ------------------------------------------------------------------------------
# 1. NETWORKING LAYER (VPC, Subnets, and NAT Gateway Routing)
# ------------------------------------------------------------------------------
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "${local.name}-vpc"
  cidr = "10.129.0.0/16"

  azs             = local.eks_azs
  public_subnets  = ["10.129.1.0/24", "10.129.2.0/24"]
  private_subnets = ["10.129.11.0/24", "10.129.12.0/24"]

  enable_nat_gateway      = true
  single_nat_gateway      = true
  map_public_ip_on_launch = true

  public_subnet_tags = {
    "kubernetes.io/role/elb" = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = "1"
  }

  tags = local.tags
}

# ------------------------------------------------------------------------------
# 2. COMPUTE LAYER (EKS Control Plane & Stable Managed Node Groups)
# ------------------------------------------------------------------------------
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0" # Stable baseline bypassing v21 access entry registration bugs

  cluster_name    = "${local.name}-eks"
  cluster_version = var.cluster_version

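  # CRITICAL NETWORK INTERACTION: private access lets worker nodes in the private
  # subnets reach the control plane from inside the VPC; public access keeps the
  # API reachable for kubectl from outside.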
  cluster_endpoint_public_access  = true
  cluster_endpoint_private_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  enable_cluster_creator_admin_permissions = true

  eks_managed_node_groups = {
    gitops_nodes = {
      name = "gitops-ng"

      instance_types = ["t3.medium"]
      capacity_type  = "ON_DEMAND"

      min_size     = 1
      max_size     = 2
      desired_size = 1

      labels = {
        workload = "gitops"
      }
    }
  }

  tags = local.tags
}
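
The configuration above references four input variables that are not shown. A minimal variables.tf sketch follows; the variable names come from main.tf, while the descriptions and default values are placeholders assumed for illustration.

```hcl
# variables.tf (hypothetical; names match main.tf, defaults are placeholders)
variable "project_name" {
  description = "Short project identifier used in resource names and tags"
  type        = string
  default     = "gitops-demo" # placeholder
}

variable "environment" {
  description = "Deployment environment, for example dev or prod"
  type        = string
  default     = "dev" # placeholder
}

variable "owner" {
  description = "Value for the Owner tag"
  type        = string
  default     = "platform-team" # placeholder
}

variable "cluster_version" {
  description = "Kubernetes version for the EKS control plane"
  type        = string
  default     = "1.31" # placeholder; choose a version EKS currently supports
}
```

With these defaults, local.name evaluates to gitops-demo-dev, so the cluster would be named gitops-demo-dev-eks.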

4. Troubleshooting Case Study: Overcoming `NodeCreationFailure`

The most valuable lesson from Day 29 came from debugging a persistent error. During the initial runs, creation of the managed node group hung and then failed with the following message:

last error: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster

The Root Cause Analysis:

The cause was a race condition around EKS access entries, the newer cluster authentication mechanism. Worker nodes booted inside their Auto Scaling group but were blocked from registering with the cluster control plane before the IAM access entries had propagated.

The Solution:

Downgraded the Module Baseline: Dropped from EKS module v21 back to the v20.x line and its stable, well-tested configuration.
Decoupled Execution Model: Implemented a two-phase apply. Targeting the network and compute resources first (terraform apply -target=module.vpc -target=module.eks) let the EKS control plane stabilize completely before the Helm and Kubernetes providers were initialized in the second phase.
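
The reason a two-phase apply helps shows up in how the Helm provider is typically configured: its connection settings come from outputs of module.eks, which do not exist until the cluster does. A sketch follows, assuming Helm provider 2.x block syntax, the AWS CLI installed on the machine running Terraform, and Argo CD installed from the community argo-helm chart. The post does not show this part of the configuration, so the resource names and chart choice are illustrative.

```hcl
# Hypothetical second-phase configuration; resource names are illustrative.
provider "helm" {
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

    # Fetch a short-lived token at run time instead of storing credentials.
    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
    }
  }
}

resource "helm_release" "argocd" {
  name             = "argocd"
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"
  namespace        = "argocd"
  create_namespace = true

  # Wait for the cluster and node group so the Argo CD pods can be scheduled.
  depends_on = [module.eks]
}
```

In this layout, the targeted apply creates the VPC and cluster, and a second, untargeted terraform apply installs the release once the endpoint and certificate data are known.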

5. Screenshots of Success

To document this setup, here are the verification screenshots from my deployment workspace:

Phase 1 Infrastructure Allocation Success

Argo CD Deployment Verification

The Complete GitOps Continuous Sync

Proving Automated Self-Healing & Configuration Drift

To put the pipeline to the test, I changed the application's declared state in version control: I updated my local kustomization.yaml manifest to scale the application's frontend web tier and pushed the change to GitHub. From that moment, the live cluster no longer matched Git.

Instead of needing manual CLI scaling or console clicks, Argo CD’s background reconciliation loop picked up the new commit and detected the divergence on its own.

Below is the chronological event stream demonstrating the real-time self-healing capability of the cluster:

Reading the events in order shows the GitOps loop in action:

The Detection: At 6:32 PM, Argo CD picked up a new Git commit hash and flagged the delta, moving the sync status from Synced -> OutOfSync.
The Self-Heal Execution: The health status moved from Healthy -> Progressing as the controller applied the updated manifests and Kubernetes started the additional frontend pods.
The Resolution: Within 2 seconds, the reconciliation loop finished and the application returned to a green, synchronized state (OutOfSync -> Synced and Progressing -> Healthy).

This shows the delivery loop running without manual intervention: new commits are reconciled automatically, and with self-heal enabled in the sync policy, changes made directly on the cluster are reverted to match the source of truth in Git.
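
The behavior described above corresponds to an Argo CD Application with an automated sync policy. The post does not show how the Application was defined, so the following is a sketch under assumptions: it expresses the manifest through Terraform's kubernetes_manifest resource, and the application name, repository URL, branch, and path are hypothetical placeholders. Only the familytasks-dev namespace comes from the commands in this post.

```hcl
# Hypothetical Application definition. Requires a configured kubernetes provider
# and the Argo CD CRDs to exist in the cluster before `terraform plan` runs.
resource "kubernetes_manifest" "familytasks_app" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"

    metadata = {
      name      = "familytasks-dev" # placeholder
      namespace = "argocd"          # assumes Argo CD runs in this namespace
    }

    spec = {
      project = "default"

      source = {
        repoURL        = "https://github.com/example/familytasks-gitops.git" # placeholder
        targetRevision = "main"                                              # placeholder
        path           = "overlays/dev"                                      # placeholder; holds kustomization.yaml
      }

      destination = {
        server    = "https://kubernetes.default.svc"
        namespace = "familytasks-dev"
      }

      syncPolicy = {
        automated = {
          prune    = true # delete resources that were removed from Git
          selfHeal = true # revert changes made directly on the cluster
        }
      }
    }
  }
}
```

Without the automated block, Argo CD would still report OutOfSync after a commit but would wait for a manual sync before changing the cluster.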

6. Accessing the Live Scaled Application

To verify that traffic was reaching the newly scaled application, I fetched the public entry point using:

$ kubectl get svc frontend -n familytasks-dev
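
The same hostname can also be read from Terraform. This is a sketch, not something the original setup is known to include: it assumes a configured kubernetes provider and that frontend is a Service of type LoadBalancer, which the post implies but does not show. The data source and output names are illustrative.

```hcl
# Hypothetical lookup of the frontend Service deployed by Argo CD.
data "kubernetes_service_v1" "frontend" {
  metadata {
    name      = "frontend"
    namespace = "familytasks-dev"
  }
}

output "frontend_hostname" {
  description = "DNS name of the load balancer in front of the frontend Service"
  value       = data.kubernetes_service_v1.frontend.status[0].load_balancer[0].ingress[0].hostname
}
```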

Navigating to the DNS name of the provisioned AWS Elastic Load Balancer opens the live web deployment running on the cluster:

This confirms that the end-to-end continuous delivery pipeline is functional and that user traffic is routed across the automatically managed pod replicas.

7. Deep-Dive Data Layer Verification (PostgreSQL Testing)

To confirm that the 3-tier FamilyTasks architecture worked beyond a basic web interface check, I ran an interactive database test. I read the database connection settings from the container's environment variables and opened a psql session directly inside the PostgreSQL pod running on the private EKS nodes.

$ kubectl exec -it postgres-5cd99799f5-krb96 -n familytasks-dev -- psql -U appuser -d familytasks

Once connected to the familytasks database as appuser, I ran an end-to-end test cycle:

This test validates three parts of the database setup:

Schema Initialization Capability (CREATE TABLE): Confirms that appuser has permission to create tables and that the backing storage volume is writable, so the application layer can create its schema.
Write Integrity (INSERT 0 1): The INSERT 0 1 command tag reports one row inserted, verifying that writes succeed without hitting volume locks or disk space limits.
Read Path Speed (SELECT *): Confirms that the inserted record is returned immediately, showing that the database engine is stable.

This database test confirms that the end-to-end pipeline isn't just serving static Nginx pages; the persistent data layer is live and responsive.

Video Reference

Jay

Jayanth Katta | Technology, Life, Health & Learning