Day 30 - Multi-Environment GitOps Drift Detection with Automated Remediation

Introduction

To conclude my 30-Day AWS Terraform Challenge, I tackled a critical problem for modern cloud architectures: managing infrastructure drift across multiple environments. The goal was to build a system where the running state in AWS always reflects the desired state in GitHub, automatically healing any manual, unauthorized changes.

This final project demonstrates my ability to build scalable, highly available, and compliant infrastructure—proving I’m ready for the role of a Cloud Architect.


The Architecture: Highly Available, Highly Isolated

My application, a Dockerized Python web server (built in previous days), runs on AWS within a modular VPC. To prepare for production, I implemented environment isolation, state separation, and a professional multi-availability-zone (multi-AZ) layout.

Below is the high-level architecture diagram. It illustrates how Terraform Multi-Environment configuration, S3 state backend locking, and GitHub Actions work together to enforce compliance.



Architectural Highlights:

  • backend.hcl Files: Kept a separate backend configuration file per environment (backend-dev.hcl or backend-prod.hcl) and had Terraform pick the matching one at init time, keeping dev and prod state data completely separated.

  • VPC Modularity: Used my shared VPC module to stand up a complete network stack (VPC, Public Subnets, Private Subnets, Gateways) for both dev and prod footprints.

  • Load Balancer Isolation: Provisioned a public Application Load Balancer (ALB) in public subnets for both environments to handle high availability across AZs.

  • Security Group Compliance: Put Security Groups, the key problem area for manual changes, under the control of the compliance engine.
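
The per-environment backend selection in the first highlight can be sketched in a few lines. This is a minimal illustration, not the project's actual workflow code: the function name init_command is hypothetical, and it assumes the command is run from the directory that holds the .tf files and the two backend files named above. The -backend-config, -reconfigure, and -input=false options are standard terraform init flags.

```python
from typing import List

# Environment names and the backend-<env>.hcl naming come from the post.
ENVIRONMENTS = ("dev", "prod")


def init_command(env: str) -> List[str]:
    """Build the `terraform init` argv that selects one environment's backend.

    Hypothetical helper: run the result with the Terraform code directory
    as the working directory so the relative .hcl path resolves.
    """
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    return [
        "terraform",
        "init",
        "-input=false",    # never prompt inside a CI job
        "-reconfigure",    # ignore any backend cached from another environment
        f"-backend-config=backend-{env}.hcl",
    ]
```

Passing -reconfigure is a design choice for runners that might reuse a working directory: it stops Terraform from offering to migrate state between the dev and prod backends.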


The Self-Healing Compliance Pipeline

The core of this project is the GitOps Drift Detection & Remediation pipeline built with GitHub Actions. It acts as an automated security guard, constantly scanning the cloud environment.

Here is the exact compliance loop, which has been verified to work for both dev and prod stacks.

| Step | Action | Why It Matters |
| --- | --- | --- |
| 1. Scheduled Scan | A GitHub Actions CRON schedule triggers the workflow to run on a best-effort basis (approximately every 5-15 minutes). | Continuous monitoring ensures security gaps are identified quickly without manual intervention. |
| 2. Core evaluation (plan) | The runner logs into AWS, initializes the remote S3 backend with the environment choice, and runs terraform plan -detailed-exitcode. | This is the test: Terraform compares the current cloud state against the desired code state. We look for a return exit code of 2, which explicitly means "Drift Detected." |
| 3. Issue Creation | If exit code 2 is returned, the workflow uses the native GITHUB_TOKEN and the GitHub Issues REST API to create a detailed Tracking Audit Issue. | In a true architect's workflow, this creates an unchangeable audit trail that documents exactly when compliance was breached. |
| 4. Auto-Remediation (apply) | Once the issue is opened, the workflow immediately executes terraform apply -auto-approve. | The system auto-heals, closing the security gap in AWS and restoring compliance within seconds. |
| 5. Automated Ticket Resolution | After a successful apply, the workflow comment-closes the tracking issue using dynamic API matching. | The audit trail is finalized, showing the loop closed without human intervention. |
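
The control flow of steps 2 to 5 can be modeled as a small Python sketch. The real pipeline is a GitHub Actions workflow that is not reproduced here, so every function name below is hypothetical, and the issue callbacks are placeholders. The exit-code meanings are the documented ones for terraform plan -detailed-exitcode: 0 means no changes, 1 means the plan failed, 2 means the plan succeeded and found changes.

```python
import subprocess
from typing import Callable, Sequence

# Documented exit codes of `terraform plan -detailed-exitcode`.
NO_DRIFT, PLAN_ERROR, DRIFT = 0, 1, 2

Runner = Callable[[Sequence[str]], int]


def run(argv: Sequence[str]) -> int:
    """Default runner: execute a command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def compliance_pass(
    runner: Runner = run,
    open_issue: Callable[[], int] = lambda: 0,            # placeholder callback
    close_issue: Callable[[int], None] = lambda n: None,  # placeholder callback
) -> str:
    """One scan: plan, and on drift open an issue, apply, then close it."""
    code = runner(["terraform", "plan", "-detailed-exitcode", "-input=false"])
    if code == NO_DRIFT:
        return "compliant"
    if code != DRIFT:
        return "plan-error"    # a failed plan is not drift; do not remediate
    issue_number = open_issue()
    if runner(["terraform", "apply", "-auto-approve", "-input=false"]) != 0:
        return "apply-failed"  # leave the issue open for a human to review
    close_issue(issue_number)
    return "remediated"
```

Treating exit code 1 separately matters: a plan that fails because of expired credentials or a locked state file should never trigger an apply, and an apply that fails should leave the tracking issue open.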

Verifying the Architecture: Step-by-Step

To prove this engine works, I ran a strict sequence of validation tests. Along the way I had to work through directory and pathing bugs, which turned out to be excellent architecture lessons!

Here is how I validated the full compliance engine.

1. Directory Structure

Running Terraform within nested subfolders inside a complex mono-repo is a common challenge. Since my workflow file was in the repository root but my .tf code was in a nested subfolder (day-30 - drift-detection/code), the pipeline often lost track of its plan files (tfplan).

The fix was using explicit absolute path routing via ${{ github.workspace }}.
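
The same idea, expressed as a Python sketch: build every path from the workspace root instead of relying on whatever directory a step happens to start in. On a GitHub Actions runner the value of ${{ github.workspace }} is also available as the GITHUB_WORKSPACE environment variable. The helper name plan_file_path is hypothetical; the sub-folder and plan file name are the ones quoted above.

```python
import os
from pathlib import Path
from typing import Optional

# Sub-folder exactly as quoted in the post (note the spaces in the name).
CODE_SUBDIR = "day-30 - drift-detection/code"


def plan_file_path(workspace: Optional[str] = None, name: str = "tfplan") -> Path:
    """Return the plan file location anchored at the workspace root.

    Falls back to the GITHUB_WORKSPACE variable that Actions runners set.
    """
    root = Path(workspace or os.environ["GITHUB_WORKSPACE"])
    return root / CODE_SUBDIR / name
```

Because the folder name contains spaces, any shell step that uses the resulting path needs to quote it.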


2. Injecting Manual Drift

Now, I needed to play the role of an attacker or a careless operator. In the AWS console, I navigated to the Application Load Balancer security group for my production stack (prod-alb-sg) and manually added an unapproved inbound rule for port 22 (SSH).


3. Compliance Breach Caught

Within minutes, the CRON scheduled job woke up. You can see in the log snippet that it initialized with the correct remote backend, ran the evaluation, and output the alert log: Plan: 0 to add, 21 to change, 0 to destroy.
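
If the counts from that summary line are wanted in an alert or an issue body, they can be pulled out of the plan log with a short regular expression. This is an illustrative sketch with hypothetical names, not part of the pipeline described here. The exit code remains the more reliable drift signal, since log wording can change between Terraform versions.

```python
import re
from typing import NamedTuple, Optional


class PlanSummary(NamedTuple):
    add: int
    change: int
    destroy: int


# Matches the counts in Terraform's "Plan: X to add, Y to change, Z to destroy." line.
_SUMMARY = re.compile(r"(\d+) to add, (\d+) to change, (\d+) to destroy")


def parse_plan_summary(log: str) -> Optional[PlanSummary]:
    """Extract the add/change/destroy counts, or None if no summary line exists."""
    match = _SUMMARY.search(log)
    if match is None:
        return None  # e.g. Terraform printed "No changes." instead
    return PlanSummary(*(int(group) for group in match.groups()))
```

For the log line quoted above, parse_plan_summary returns PlanSummary(add=0, change=21, destroy=0).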

It caught the unauthorized rule and a background AWS AMI update, which it also flagged to heal.



4. Audit Trail Created and Healing Protocol Initiated

Right after the breach was detected, my workflow tapped into the GitHub API and automatically opened the official tracking audit issue #15. It contains my custom text alert detailing that the self-healing protocol was initiated.
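
For readers who want to see what such an API call involves, here is a hedged Python sketch that only builds the request for GitHub's "create an issue" REST endpoint (POST /repos/{owner}/{repo}/issues) without sending it. The function name, title, and body wording are assumptions for illustration and are not the text used in the actual tracking issue.

```python
import json
import urllib.request

API = "https://api.github.com"


def create_issue_request(repo: str, token: str, env: str, summary: str) -> urllib.request.Request:
    """Build (but do not send) the request that opens a drift tracking issue.

    `repo` is "owner/name"; `token` would be the workflow's GITHUB_TOKEN.
    Title and body text are illustrative placeholders.
    """
    payload = {
        "title": f"Drift detected in {env}",
        "body": f"terraform plan reported drift in {env}.\n\n{summary}\n\nSelf-healing apply initiated.",
    }
    return urllib.request.Request(
        f"{API}/repos/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Sending it is one call to urllib.request.urlopen(request). In a workflow, the GITHUB_TOKEN needs the issues: write permission for that call to succeed.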


5. Loop Closed: Remediated and Resolved!

Here is the perfect GitOps compliance loop in action! As shown in the finalized logs, the pipeline immediately ran the terraform apply step, wiped out that unauthorized port 22 rule, and commented on the issue to document completion.

Finally, looking at the Issues tab, the bot comment closed the ticket, flipping the status from "Open" to a purple Closed badge without my manual intervention.
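
The comment-then-close step maps onto two GitHub REST calls: POST /repos/{owner}/{repo}/issues/{number}/comments to add the note, and PATCH /repos/{owner}/{repo}/issues/{number} to set the state to closed. The Python sketch below only constructs those two requests. The helper names are hypothetical, and it assumes the issue number is already known, so the lookup of the matching open issue is out of scope.

```python
import json
import urllib.request
from typing import List

API = "https://api.github.com"


def _request(method: str, url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON request without sending it."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


def resolve_issue_requests(repo: str, token: str, number: int, note: str) -> List[urllib.request.Request]:
    """Return the comment request followed by the close request, in send order."""
    base = f"{API}/repos/{repo}/issues/{number}"
    return [
        _request("POST", f"{base}/comments", token, {"body": note}),
        # state_reason "completed" is what the UI shows as a purple Closed badge
        _request("PATCH", base, token, {"state": "closed", "state_reason": "completed"}),
    ]
```

Commenting before closing keeps the remediation note as the last entry a reader sees before the closed event in the issue timeline.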



Refining the GitOps Foundation: Day 30 Wrap-Up

With the pipeline fully functional and the issue-closure loop verified, the multi-environment drift detection engine is complete. This project was a great exercise in moving past basic setups and focusing on the underlying mechanics required to keep a multi-environment cloud ecosystem stable and compliant.

Here is a quick look at the core verification milestones hit during this build:

  • Automated Scheduling: Configured GitHub Actions on a recurring CRON schedule (best-effort timing, roughly every 5-15 minutes) to ensure continuous background verification without manual intervention.

  • State Assessment: Utilized terraform plan -detailed-exitcode to natively identify discrepancies between our Git repository (desired state) and the active cloud footprint (actual state).

  • Self-Healing Loop: Verified that the pipeline smoothly triggers terraform apply to strip out unauthorized configuration changes, bringing both dev and prod footprints back to compliance within the same workflow run.

  • Audit Trail Integration: Leveraged a custom GraphQL lookup within GitHub Script to seamlessly log, comment on, and resolve tracking tickets as soon as the live environment returns to its baseline configuration.

Coming from a Database Administration background, my focus has long been on managing and protecting persistent data state. Transitioning into Cloud Architecture has allowed me to apply that exact same precision to a new domain: ensuring the "persistence of desired infrastructure state" across distributed environments.

This brings the 30-Day AWS Terraform Challenge to a successful close. Next up: running the final terraform destroy cleanups to wind down the active test resources.

Thank you to everyone who followed along with the daily builds and architecture updates over the last month!

Jay
