Day 29 - GitOps Patient Zero — Continuous Delivery on AWS EKS with Argo CD and Kustomize
Today's project replaces manual, ad-hoc cluster provisioning with a GitOps continuous delivery (CD) pipeline built on Argo CD and Kustomize, running on an Amazon Elastic Kubernetes Service (EKS) cluster.
Traditional infrastructure management often suffers from "configuration drift," where manual changes made through the CLI or console make environments hard to reproduce. With GitOps, my entire application stack is declared in code and driven from a GitHub repository as the single source of truth.
1. The System Architecture Topography
This project establishes a clean boundary separating core cloud infrastructure provisioning from the software application delivery lifecycle. Below is the technical structural blueprint behind today's successful deployment:
2. Technical Highlights & Version Pinning Strategy
To eliminate fragile integration dependencies, today's deployment focuses heavily on an intentional, decoupled architecture strategy:
Stable Infrastructure Core: Built on the v20.x line of the terraform-aws-modules/eks/aws module. In my runs this avoided the access entry race condition I hit on the newer v21 major release, where worker nodes failed to register.
Isolated Networking Baseline: Configured a dedicated AWS VPC with a custom 10.129.0.0/16 CIDR block to avoid address collisions with previous stack runs. Public subnets assign public IPs automatically and host the Elastic Load Balancers for inbound web traffic.
Declarative Configuration Engineering: Used Kustomize to manage base configurations, centralized container image tags, and replica counts across all deployments without a heavy templating engine.
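The same pinning idea can be extended from modules to the Terraform CLI and its providers. The following is a minimal sketch, not the project's actual file: the file name versions.tf and every version constraint in it are assumptions, chosen only to be compatible with the VPC ~> 5.0 and EKS ~> 20.0 module lines used in this post.

```hcl
# versions.tf (hypothetical file name): pins the toolchain, not just the modules.
terraform {
  required_version = ">= 1.5.0" # assumption: any recent Terraform 1.x release

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumption: the 5.x provider line
    }
    helm = {
      source  = "hashicorp/helm"
      version = "~> 2.12" # assumption: stays on the 2.x provider line
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.25" # assumption
    }
  }
}
```

Committing the generated .terraform.lock.hcl file alongside these constraints keeps later runs on the same provider builds.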
3. Infrastructure as Code: The Complete main.tf
The core cloud foundation is provisioned entirely with Terraform. Below is the corrected main.tf used to build the cluster base:
# ==============================================================================
# DAY 29: GITOPS PATIENT ZERO - MAIN INFRASTRUCTURE CONFIGURATION
# ==============================================================================
data "aws_availability_zones" "available" {
  state = "available"

  filter {
    name   = "opt-in-status"
    values = ["opt-in-not-required"] # Restricts allocation to primary core AZs only
  }
}

locals {
  name    = "${var.project_name}-${var.environment}"
  eks_azs = slice(data.aws_availability_zones.available.names, 0, 2)

  tags = {
    Project     = var.project_name
    Environment = var.environment
    Owner       = var.owner
    ManagedBy   = "terraform"
    Day         = "29"
  }
}

# ------------------------------------------------------------------------------
# 1. NETWORKING LAYER (VPC, Subnets, and NAT Gateway Routing)
# ------------------------------------------------------------------------------
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "${local.name}-vpc"
  cidr = "10.129.0.0/16"
  azs  = local.eks_azs

  public_subnets  = ["10.129.1.0/24", "10.129.2.0/24"]
  private_subnets = ["10.129.11.0/24", "10.129.12.0/24"]

  enable_nat_gateway      = true
  single_nat_gateway      = true
  map_public_ip_on_launch = true

  public_subnet_tags = {
    "kubernetes.io/role/elb" = "1"
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = "1"
  }

  tags = local.tags
}

# ------------------------------------------------------------------------------
# 2. COMPUTE LAYER (EKS Control Plane & Stable Managed Node Groups)
# ------------------------------------------------------------------------------
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0" # Stable baseline bypassing v21 access entry registration bugs

  cluster_name    = "${local.name}-eks"
  cluster_version = var.cluster_version

  # CRITICAL NETWORK INTERACTION: Forces private worker nodes to talk to control plane locally
  cluster_endpoint_public_access  = true
  cluster_endpoint_private_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  enable_cluster_creator_admin_permissions = true

  eks_managed_node_groups = {
    gitops_nodes = {
      name           = "gitops-ng"
      instance_types = ["t3.medium"]
      capacity_type  = "ON_DEMAND"

      min_size     = 1
      max_size     = 2
      desired_size = 1

      labels = {
        workload = "gitops"
      }
    }
  }

  tags = local.tags
}
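The main.tf above reads four input variables (var.project_name, var.environment, var.owner, and var.cluster_version) that are declared elsewhere. As a rough sketch of what a matching variables.tf could look like, here is one possible set of declarations. The descriptions and all default values are assumptions for illustration, not the values used in this deployment.

```hcl
# variables.tf (sketch): inputs referenced by main.tf.
# All defaults below are hypothetical placeholders.
variable "project_name" {
  description = "Short project identifier used as a prefix for resource names"
  type        = string
  default     = "gitops-patient-zero" # hypothetical
}

variable "environment" {
  description = "Deployment environment, appended to the project name"
  type        = string
  default     = "dev" # hypothetical
}

variable "owner" {
  description = "Value of the Owner tag applied to every resource"
  type        = string
  default     = "platform-team" # hypothetical
}

variable "cluster_version" {
  description = "Kubernetes minor version for the EKS control plane"
  type        = string
  default     = "1.30" # hypothetical: pick a version EKS currently supports
}
```

With these in place, local.name would resolve to gitops-patient-zero-dev, and the cluster would be named gitops-patient-zero-dev-eks.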
4. Troubleshooting Case Study: Overcoming NodeCreationFailure
The most critical engineering value from Day 29 came from debugging a persistent error. During initial runs, the managed node group execution hung and failed with the following message:
last error: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster
The Root Cause Analysis:
I traced this to a timing race in the newer, access-entry-based EKS authentication path. Worker nodes booted inside their Auto Scaling group but could not register with the cluster control plane before the EKS access entries had propagated.
The Solution:
Downgraded the Module Baseline: Dropped from EKS module v21 to the stable v20.x line.
Decoupled Execution Model: Implemented a two-phase apply. By explicitly targeting the network and compute resources first (terraform apply -target=module.vpc -target=module.eks), the EKS control plane was fully up before the Helm and Kubernetes providers were initialized. A second, untargeted terraform apply then rolled out everything that depends on those providers.
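To make the second phase concrete, here is a hedged sketch of how the Kubernetes and Helm providers could be wired to the cluster and used to install Argo CD. It assumes the 2.x Helm provider (which uses a nested kubernetes block), the AWS CLI available on the machine running Terraform, and the community argo-cd chart from the argo-helm repository. The release name and namespace are illustrative choices, not confirmed details of this project.

```hcl
# Phase 2 sketch: these providers need a reachable cluster endpoint, which is
# why module.vpc and module.eks are applied first with -target.
provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  # Fetch a short-lived token through the AWS CLI on every run.
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
  }
}

provider "helm" {
  # Nested-block syntax of the 2.x Helm provider (assumption).
  kubernetes {
    host                   = module.eks.cluster_endpoint
    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

    exec {
      api_version = "client.authentication.k8s.io/v1beta1"
      command     = "aws"
      args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
    }
  }
}

# Installs Argo CD into its own namespace (names are illustrative).
resource "helm_release" "argocd" {
  name             = "argocd"
  namespace        = "argocd"
  create_namespace = true
  repository       = "https://argoproj.github.io/argo-helm"
  chart            = "argo-cd"

  depends_on = [module.eks]
}
```

Because both providers read their connection details from module.eks outputs, they cannot be configured until the cluster exists, which is the dependency the targeted first apply works around.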
5. Visualizing the Screenshots of Success
To document this setup, here are the real-world operational verifications from my deployment workspace:
Phase 1 Infrastructure Allocation Success
Argo CD Deployment Verification
The Complete GitOps Continuous Sync
Proving Automated Self-Healing & Configuration Drift
To put the pipeline to the test, I changed the application's declared state in version control so that the live cluster no longer matched Git. I updated the kustomization.yaml manifest to scale up the application's frontend web tier and pushed the change to GitHub.
Instead of needing manual CLI scaling or console clicks, Argo CD's background reconciliation loop detected the divergence on its own.
Below is the chronological event stream demonstrating the real-time self-healing capability of the cluster:
The event stream shows the GitOps loop in action:
The Detection: At 6:32 PM, Argo CD picked up the new Git commit hash and flagged the delta, moving the sync status from Synced -> OutOfSync.
The Self-Heal Execution: The controller shifted the application's health status from Healthy -> Progressing while it applied the new manifests and the cluster started the additional frontend pods.
The Resolution: Within 2 seconds, the reconciliation loop finished and the application returned to a green, synchronized state (OutOfSync -> Synced and Progressing -> Healthy).
This shows the deployment reconciling itself: new commits, as well as unauthorized changes made directly in the cluster, are automatically brought back in line with the source of truth in Git.
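The behavior described above depends on the Argo CD Application having an automated sync policy. The sketch below expresses such an Application in Terraform with the kubernetes_manifest resource. It is illustrative only: the repository URL, branch, and path are hypothetical, the Application may just as well have been created as plain YAML or through the Argo CD UI, and the resource assumes that the Kubernetes provider is configured and that the Argo CD CRDs are already installed.

```hcl
# Sketch of an Argo CD Application with automated sync and self-heal.
# repoURL, targetRevision, and path are hypothetical placeholders.
resource "kubernetes_manifest" "familytasks_app" {
  manifest = {
    apiVersion = "argoproj.io/v1alpha1"
    kind       = "Application"

    metadata = {
      name      = "familytasks-dev"
      namespace = "argocd" # assumption: namespace where Argo CD runs
    }

    spec = {
      project = "default"

      source = {
        repoURL        = "https://github.com/example-org/familytasks-gitops.git" # hypothetical
        targetRevision = "main"                                                   # hypothetical
        path           = "overlays/dev"                                           # hypothetical Kustomize directory
      }

      destination = {
        server    = "https://kubernetes.default.svc" # the cluster Argo CD runs in
        namespace = "familytasks-dev"
      }

      syncPolicy = {
        automated = {
          prune    = true # delete resources that were removed from Git
          selfHeal = true # revert manual changes made in the cluster
        }
        syncOptions = ["CreateNamespace=true"]
      }
    }
  }
}
```

With selfHeal enabled, a manual kubectl scale on the frontend would be reverted to the replica count declared in Git. Without it, Argo CD would only report the application as OutOfSync.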
6. Accessing the Live Scaled Application
To verify that the entire loop was routing traffic correctly across our newly scaled infrastructure, I fetched the public entry point using:
$ kubectl get svc frontend -n familytasks-dev
Navigating to the provisioned AWS Elastic Load Balancer DNS endpoint opens up our live web deployment running flawlessly across the cluster:
This confirms that the end-to-end continuous delivery pipeline works and that the load balancer routes user traffic across the scaled pod replicas.
7. Deep-Dive Data Layer Verification (PostgreSQL Testing)
To check that the 3-tier FamilyTasks architecture worked beyond a basic web interface ping, I ran an interactive database integration test. Using the Kubernetes API, I read the container's environment parameters and opened a psql session directly inside the PostgreSQL pod running in the private EKS network space.
$ kubectl exec -it postgres-5cd99799f5-krb96 -n familytasks-dev -- psql -U appuser -d familytasks
Once connected to the familytasks database as appuser, I ran an end-to-end transaction test:
This test validates three parts of the database configuration:
Schema Initialization Capability (CREATE TABLE): Confirms that appuser has permission to create tables, so the application layer can initialize its schema.
Write Integrity (INSERT 0 1): Verifies that writes succeed without hitting volume locks or disk space limits.
Read Path (SELECT *): Validates that the inserted record is returned immediately, confirming the database engine is stable.
This database test confirms that the end-to-end pipeline isn't just serving static Nginx pages; the persistent data layer is live and responsive.