Day 30 - Multi-Environment GitOps Drift Detection with Automated Remediation
Introduction:
To conclude my 30-Day AWS Terraform Challenge, I tackled a critical problem for modern cloud architectures: managing infrastructure drift across multiple environments. The goal was to build a system where the running state in AWS always reflects the desired state in GitHub, automatically healing any manual, unauthorized changes.
This final project demonstrates my ability to build scalable, highly available, and compliant infrastructure—proving I’m ready for the role of a Cloud Architect.
The Architecture: Highly Available, Highly Isolated
My application, a Dockerized Python web server (built in previous days), runs on AWS within a modular VPC. To prepare for production, I implemented environment isolation, state separation, and a professional multi-availability-zone (multi-AZ) layout.
Below is the high-level architecture diagram. It illustrates how Terraform Multi-Environment configuration, S3 state backend locking, and GitHub Actions work together to enforce compliance.
Architectural Highlights:
backend.hcl Files: Taught Terraform to dynamically choose the correct backend configuration (backend-dev.hcl or backend-prod.hcl) when run, keeping state data completely separated.
VPC Modularity: Used my shared VPC module to stand up a complete network stack (VPC, Public Subnets, Private Subnets, Gateways) for both dev and prod footprints.
Load Balancer Isolation: Provisioned a public Application Load Balancer (ALB) in public subnets for both environments to handle high availability across AZs.
Security Group Compliance: Integrated the key problem area, Security Groups, into the compliance engine.
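To make the backend selection concrete, here is a minimal Python sketch of how a wrapper script could build the init command for each environment. It assumes the backend files are passed through Terraform's -backend-config flag and that the files sit next to the .tf code; the helper name init_command is hypothetical, and the actual workflow may wire this up differently.

```python
from typing import List

# Environments named in this post; anything else is rejected.
ENVIRONMENTS = ("dev", "prod")


def init_command(env: str) -> List[str]:
    """Build the `terraform init` argv for one environment.

    Assumption: each environment has its own partial backend file,
    backend-<env>.hcl, in the same directory as the .tf code.
    """
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    return [
        "terraform",
        "init",
        "-input=false",   # never prompt inside CI
        "-reconfigure",   # ignore any backend cached from the other environment
        f"-backend-config=backend-{env}.hcl",
    ]
```

A runner would then call something like subprocess.run(init_command("prod"), cwd=code_dir, check=True). Keeping the command builder separate from the execution makes the environment switch easy to unit test without touching AWS.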
The Self-Healing Compliance Pipeline
The core of this project is the GitOps Drift Detection & Remediation pipeline built with GitHub Actions. It acts as an automated security guard, constantly scanning the cloud environment.
Here is the exact compliance loop, which has been verified to work for both dev and prod stacks.
| Step | Action | Why It Matters |
|---|---|---|
| 1. Scheduled Scan | A GitHub Actions cron schedule triggers the workflow on a best-effort basis (approximately every 5-15 minutes). | Continuous monitoring ensures security gaps are identified quickly without manual intervention. |
| 2. Core Evaluation (plan) | The runner logs into AWS, initializes the remote S3 backend with the environment choice, and runs terraform plan -detailed-exitcode. | This is the test: Terraform compares the live infrastructure against the configuration in Git. With -detailed-exitcode, exit code 0 means no changes, 1 means an error, and 2 means the plan succeeded with pending changes, which this pipeline treats as "Drift Detected." |
| 3. Issue Creation | If exit code 2 is returned, the workflow uses the native GITHUB_TOKEN and the GitHub Issues REST API to create a detailed Tracking Audit Issue. | This creates a timestamped audit trail that documents exactly when compliance was breached. |
| 4. Auto-Remediation (apply) | Once the issue is opened, the workflow immediately executes terraform apply -auto-approve. | The system auto-heals, closing the security gap in AWS and restoring compliance within seconds. |
| 5. Automated Ticket Resolution | After a successful apply, the workflow looks up the matching tracking issue through the API, comments on it, and closes it. | The audit trail is finalized, showing the loop closed without human intervention. |
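The control flow in the table can be sketched in a few lines of Python. This is an illustration only, not the workflow's actual code: the four step functions are hypothetical callables that a caller would supply (for example, thin wrappers around subprocess and the GitHub API), and the decision to leave the issue open when apply fails is my assumption.

```python
from typing import Callable

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_CLEAN, PLAN_ERROR, PLAN_DRIFT = 0, 1, 2


def classify_plan(exit_code: int) -> str:
    """Map a plan exit code to a pipeline status."""
    if exit_code == PLAN_CLEAN:
        return "clean"
    if exit_code == PLAN_DRIFT:
        return "drift"
    return "error"  # 1, or anything unexpected


def compliance_loop(
    run_plan: Callable[[], int],
    open_issue: Callable[[], int],
    run_apply: Callable[[], int],
    close_issue: Callable[[int], None],
) -> str:
    """Run one scan: plan, then open issue, apply, and close issue on drift."""
    status = classify_plan(run_plan())
    if status != "drift":
        return status                 # nothing to heal, or the plan itself failed
    issue_number = open_issue()       # audit trail first
    if run_apply() != 0:
        return "apply-failed"         # assumption: keep the issue open for a human
    close_issue(issue_number)         # loop closed
    return "remediated"
```

Opening the issue before applying mirrors the order in the table: the breach is recorded even if remediation later fails.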
Verifying the Architecture: Step-by-Step
To prove this engine works, I ran a strict sequence of validation tests. Along the way I had to work through directory and path bugs, which turned out to be excellent architecture lessons!
Here is how I validated the full compliance engine.
1. Directory Structure
Running Terraform within nested subfolders inside a complex mono-repo is a common challenge. Since my workflow file was in the repository root but my .tf code was in a nested subfolder (day-30 - drift-detection/code), the pipeline often lost track of its plan files (tfplan).
The fix was using explicit absolute path routing via ${{ github.workspace }}.
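The same idea can be shown outside of workflow syntax. The sketch below assumes a Python helper running on the runner, where GitHub Actions exposes the checkout root in the GITHUB_WORKSPACE environment variable; the function name plan_file_path is hypothetical, and the subfolder and tfplan names are taken from the description above.

```python
import os
from pathlib import Path
from typing import Optional

# Nested Terraform folder described in this post.
CODE_SUBDIR = Path("day-30 - drift-detection") / "code"


def plan_file_path(workspace: Optional[str] = None, plan_name: str = "tfplan") -> Path:
    """Return one unambiguous location for the plan file.

    Falls back to the current directory when GITHUB_WORKSPACE is not set,
    for example when the script is run locally.
    """
    root = Path(workspace or os.environ.get("GITHUB_WORKSPACE") or os.getcwd())
    return root / CODE_SUBDIR / plan_name
```

Because every step derives the path from the same root, the plan step and the apply step agree on where tfplan lives regardless of each step's working directory. Note that the folder name contains spaces, so the path should always be passed as a single argument rather than interpolated into a shell string.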
2. Injecting Manual Drift
Next, I needed to play the role of an attacker or a careless operator. In the AWS console, I navigated to the Application Load Balancer security group for my production stack (prod-alb-sg) and manually added an unapproved inbound rule for port 22 (SSH).
3. Compliance Breach Caught
Within minutes, the cron-scheduled job woke up. You can see in the log snippet that it initialized with the proper remote backend, ran the evaluation, and printed the alert line: Plan: 0 to add, 21 to change, 0 to destroy.
It caught the unauthorized rule and a background AWS AMI update, which it also flagged to heal.
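If you want the counts from that summary line in a form you can put into an issue body, a small parser is enough. This is a sketch under assumptions: it expects the plain-text plan output (running plan with -no-color avoids escape codes in the line), and the names PlanSummary and parse_plan_summary are mine, not part of Terraform.

```python
import re
from typing import NamedTuple, Optional


class PlanSummary(NamedTuple):
    add: int
    change: int
    destroy: int


# Matches the "Plan: X to add, Y to change, Z to destroy." summary line.
_SUMMARY = re.compile(r"(\d+) to add, (\d+) to change, (\d+) to destroy")


def parse_plan_summary(output: str) -> Optional[PlanSummary]:
    """Extract the resource counts, or None when no summary line is present."""
    match = _SUMMARY.search(output)
    if match is None:
        return None
    return PlanSummary(*(int(group) for group in match.groups()))
```

The exit code remains the source of truth for whether drift exists; the parsed counts are only for human-readable reporting.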
4. Audit Trail Created and Healing Protocol Initiated
Right after the breach was detected, my workflow tapped into the GitHub API and automatically opened the official tracking audit issue #15. It contains my custom text alert detailing that the self-healing protocol was initiated.
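For readers who want to see what that API call looks like, here is a hedged Python sketch that builds (but does not send) the request against the GitHub REST endpoint for creating issues. The title and body text are placeholders rather than my actual alert wording, and build_issue_request is a hypothetical helper; in the workflow the token comes from GITHUB_TOKEN, which needs write permission on issues.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_issue_request(repo: str, token: str, env_name: str) -> urllib.request.Request:
    """Build a POST /repos/{owner}/{repo}/issues request for a drift alert.

    `repo` is "owner/name", the format GitHub Actions provides in
    the GITHUB_REPOSITORY environment variable.
    """
    payload = {
        # Placeholder wording, not the real alert text.
        "title": f"Drift detected in {env_name}",
        "body": f"terraform plan reported pending changes for {env_name}. "
                "Self-healing apply has been initiated.",
    }
    return urllib.request.Request(
        url=f"{API_ROOT}/repos/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Sending it is a single urllib.request.urlopen(request) call; the JSON response includes the new issue's number, which the later close step needs.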
5. Loop Closed: Remediated and Resolved!
Here is the complete GitOps compliance loop in action! As shown in the finalized logs, the pipeline immediately ran the terraform apply step, removed the unauthorized port 22 rule, and commented on the issue to document completion.
Finally, in the Issues tab, the bot's comment closed the ticket, flipping the status from "Open" to a purple Closed badge without any manual intervention on my part.
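The lookup-and-close step can be sketched as follows. My pipeline does the lookup inside GitHub Script; the version below swaps that for plain REST calls in Python purely for illustration, and both function names are hypothetical. The issue dictionaries are assumed to have the number, title, and state fields that the REST issue listing returns.

```python
import json
import urllib.request
from typing import Iterable, List, Optional


def find_tracking_issue(issues: Iterable[dict], title: str) -> Optional[int]:
    """Return the number of the open issue whose title matches exactly."""
    for issue in issues:
        if issue.get("state") == "open" and issue.get("title") == title:
            return issue["number"]
    return None


def build_close_requests(repo: str, token: str, number: int,
                         comment: str) -> List[urllib.request.Request]:
    """Build the comment request, then the request that closes the issue."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "Content-Type": "application/json",
    }
    base = f"https://api.github.com/repos/{repo}/issues/{number}"
    comment_req = urllib.request.Request(
        f"{base}/comments",
        data=json.dumps({"body": comment}).encode("utf-8"),
        method="POST", headers=headers,
    )
    close_req = urllib.request.Request(
        base,
        data=json.dumps({"state": "closed", "state_reason": "completed"}).encode("utf-8"),
        method="PATCH", headers=headers,
    )
    return [comment_req, close_req]
```

Commenting before closing keeps the remediation note as the last entry a reader sees before the status change.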
Refining the GitOps Foundation: Day 30 Wrap-Up
With the pipeline fully functional and the issue-closure loop verified, the multi-environment drift detection engine is complete. This project was a great exercise in moving past basic setups and focusing on the underlying mechanics required to keep a multi-environment cloud ecosystem stable and compliant.
Here is a quick look at the core verification milestones hit during this build:
Automated Scheduling: Configured GitHub Actions on a recurring cron schedule to ensure continuous background verification without manual intervention.
State Assessment: Utilized terraform plan -detailed-exitcode to natively identify discrepancies between our Git repository (desired state) and the active cloud footprint (actual state).
Self-Healing Loop: Verified that the pipeline smoothly triggers terraform apply to strip out unauthorized configuration changes, bringing both dev and prod footprints back to compliance instantly.
Audit Trail Integration: Leveraged a custom GraphQL lookup within GitHub Script to seamlessly log, comment on, and resolve tracking tickets as soon as the live environment returns to its baseline configuration.
Coming from a Database Administration background, my focus has long been on managing and protecting persistent data state. Transitioning into Cloud Architecture has allowed me to apply that exact same precision to a new domain: ensuring the "persistence of desired infrastructure state" across distributed environments.
This brings the 30-Day AWS Terraform Challenge to a successful close. Next up: running the final terraform destroy cleanups to wind down the active test resources.
Thank you to everyone who followed along with the daily builds and architecture updates over the last month!