<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Cherukuri sai on Medium]]></title>
        <description><![CDATA[Stories by Cherukuri sai on Medium]]></description>
        <link>https://medium.com/@cherukurisai?source=rss-64295cea6d86------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*ATgWyxhtOoC2DA5rEairBA@2x.jpeg</url>
            <title>Stories by Cherukuri sai on Medium</title>
            <link>https://medium.com/@cherukurisai?source=rss-64295cea6d86------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 02:48:41 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@cherukurisai/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[I Built an AI That Fixes Pipeline Failures Before Platform or DevSecOps Teams Get the Slack…]]></title>
            <link>https://pub.towardsai.net/i-built-an-ai-that-fixes-pipeline-failures-before-platform-or-devsecops-teams-gets-the-slack-82ff81114175?source=rss-64295cea6d86------2</link>
            <guid isPermaLink="false">https://medium.com/p/82ff81114175</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[ci-cd-pipeline]]></category>
            <category><![CDATA[devops]]></category>
            <dc:creator><![CDATA[Cherukuri sai]]></dc:creator>
            <pubDate>Sun, 22 Feb 2026 18:35:15 GMT</pubDate>
            <atom:updated>2026-02-23T06:15:37.921Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>🤖 I Built an AI That Fixes Pipeline Failures Before Platform or DevSecOps Teams Get the Slack Message</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2VljXKA1W0PSfeMdC5jUdg.jpeg" /></figure><p><strong>📉 The Slack Notification That’s Crushing Your Sprint Velocity</strong></p><p><strong>“Hey, the pipeline failed again. Can you check?”</strong></p><p><em>Whether it’s a 2 AM production call or the middle of business hours, your phone buzzes. You squint at Slack — another developer hasn’t even <strong>looked</strong> at the error logs. They just pinged you. Again. Meanwhile, your actual engineering tasks? Delayed. Your sprint commitments? Slipping.</em></p><p><strong>Sound familiar?</strong></p><p><em>If you’re on a Platform or DevSecOps team, this is your daily reality. Developers don’t read error logs most of the time. They screenshot red X’s and ask, “</em><strong><em>Can you fix this?</em></strong><em>” Meanwhile, you’re playing detective through </em><strong><em>terraform</em></strong><em> traces, </em><strong><em>kubectl</em></strong><em> describes, and </em><strong><em>npm</em></strong><em> error dumps.</em></p><h3><strong>🤔 “But wait, doesn’t my CI/CD tool have AI features now?”</strong></h3><p>Sure. Many platforms are adding AI capabilities. But here’s the problem:</p><p><em>❌ They don’t understand <strong>your</strong> organizational standards and policies</em></p><p><em>❌ You can’t customize them with your company’s best practices</em></p><p><em>❌ They’re generic — trained on public data, not your codebase patterns</em></p><p><em>❌ Vendor lock-in — you’re stuck with whatever they decide to build</em></p><p><em>❌ Can’t inject your known failure patterns and compliance rules</em></p><p><strong>💡 What if you could build something better? 
A pipeline that truly understands YOUR context and diagnoses itself?</strong></p><blockquote><strong>I Got Tired of Being a Human Log Parser</strong></blockquote><p>After the 47th “npm test failed, help!” message (where the error literally said “package.json not found”), I’d had enough.</p><p>So, I built something different: <strong>An AI-powered failure analyzer that tells developers EXACTLY what’s wrong and how to fix it — right in the pipeline output.</strong></p><blockquote>No Slack. No tickets. No context switching. Just instant, actionable answers.</blockquote><h3><strong>Here’s What It Looks Like in Action</strong></h3><p><strong>Example 1: NPM Test Failure<br>Before: </strong><em>The Typical Developer Experience</em></p><pre>❌ Run Tests Failed<br>Error: Process completed with exit code 1<br><br>npm ERR! code ENOENT<br>npm ERR! syscall open<br>npm ERR! path /home/runner/work/project/package.json<br>npm ERR! errno -2<br>npm ERR! enoent ENOENT: no such file or directory, open &#39;/home/runner/work/project/package.json&#39;<br>npm ERR! enoent This is related to npm not being able to find a file.</pre><blockquote><strong>Developer: “I’ll ask DevOps…”</strong> 😕</blockquote><p><strong>After: </strong><em>With AI Analysis</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ztWVbqtFxQccx6XFNvYj1A.png" /></figure><blockquote>Developer: <em>“Oh, I’ll fix that.”</em> ✅</blockquote><p><strong>Example 2: Helm Deployment Failure<br>Before: </strong><em>The Typical Developer Experience</em></p><pre>❌ Helm Install Failed<br>Error: timed out waiting for the condition</pre><blockquote>Developer: <em>“I’ll ask DevOps…”</em></blockquote><p><strong>After: </strong><em>With AI Analysis</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Fax5KWsPzh5YpiZw3Dyz6g.png" /></figure><blockquote>Developer: <em>“Got it, fixing now. 
And I’ll add those kubectl commands to my pipeline for next time.”</em> ✅</blockquote><p><strong>Example 3: Terraform Validation Failure<br>Before: </strong><em>The Typical Developer Experience</em></p><pre>❌ Terraform Validate Failed<br>Error: Error: Unsupported argument</pre><blockquote>Developer: <em>“I’ll ask DevOps…”</em></blockquote><p><strong>After: </strong><em>With AI Analysis</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*95c1rTUz6lsunBT7OdfntQ.png" /></figure><blockquote>Developer: <em>“Makes sense, changing it.”</em> ✅</blockquote><p><strong>Example 4: Git Authentication Failure (Rule-Based)<br>Before: </strong><em>The Typical Developer Experience</em></p><pre>❌ Clone Private Repo Failed<br>Error: Process completed with exit code 1<br><br>Cloning repository...<br>fatal: could not read Username for &#39;https://github.com&#39;: No such device or address<br>Permission denied (publickey).<br>fatal: Could not read from remote repository.</pre><blockquote>Developer: <em>“I’ll ask DevOps…”</em> 😕</blockquote><p><strong>After: </strong><em>With Rule-Based Analysis</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DjP3iHu9Gqzbu4nBXZQqCg.png" /></figure><blockquote>Developer: <em>“Ah, permissions issue. Let me check with the Platform team — they manage our tokens.”</em> ✅</blockquote><p><strong>🔬 The Secret Sauce: Rule-Based + AI Hybrid Approach</strong></p><p>I didn’t want another chatbot. I wanted something that understands <strong><em>pipeline failures</em></strong> specifically.</p><p>So, I built a hybrid system that combines the best of both worlds:</p><p><strong>1. 
Rule-Based Analysis (Standards &amp; Policies)</strong></p><blockquote>Injected our organization’s coding standards and best practices</blockquote><blockquote>Pre-configured common failure patterns</blockquote><blockquote>Fast, deterministic checks for known issues</blockquote><blockquote>Enforces company-specific policies and compliance requirements</blockquote><p><strong>2. AI-Powered Deep Analysis (Custom Model)</strong></p><blockquote>When rules don’t match, custom AI model takes over</blockquote><blockquote>Custom prompts tuned for infrastructure and deployment errors</blockquote><blockquote>Understands context across multiple files and configurations</blockquote><blockquote>Learns from error patterns we haven’t seen before</blockquote><p><strong>3. Smart Context Capture</strong></p><blockquote>Not just “npm failed” but the actual error output</blockquote><blockquote>Pod descriptions for Kubernetes issues</blockquote><blockquote>Terraform validation errors with line numbers</blockquote><blockquote>Relevant configuration files (values.yaml, package.json, etc.)</blockquote><p><strong>4. 
Actionable Intelligence</strong></p><blockquote>“Change X to Y in file Z”</blockquote><blockquote>Not vague suggestions like “check your config”</blockquote><blockquote>Includes confidence level so devs know when to escalate</blockquote><p><strong>🏗️ System Architecture<br></strong>Here’s how it all flows together:</p><pre><br>┌─────────────────────────────────────┐<br>│      GitHub Actions Pipeline       │<br>│    (or any CI/CD platform)         │◄──── Developer pushes code<br>└───────────────┬─────────────────────┘<br>                │<br>                │ ❌ Pipeline Failure<br>                ▼<br>┌─────────────────────────────────────┐<br>│       Error Context Capture         │<br>│  • Build/test logs                  │<br>│  • Config files (values.yaml, etc)  │<br>│  • Pod status (Kubernetes)          │<br>│  • Terraform output                 │<br>└───────────────┬─────────────────────┘<br>                │<br>                ▼<br>┌─────────────────────────────────────────────────────┐<br>│            (Serverless)                             │<br>│                                                     │<br>│  ┌───────────────────────────────────────────┐     │<br>│  │      Rule-Based Analysis Engine           │     │<br>│  │  • Organizational Standards               │     │<br>│  │  • Common Failure Patterns                │     │<br>│  │  • Policy Compliance Checks               │     │<br>│  └───────────────┬───────────────────────────┘     │<br>│                  │                                  │<br>│     Known Issue? 
│                                  │<br>│         ✓        │         ✗ Unknown Issue          │<br>│         │        └──────────────┐                   │<br>│         │                       ▼                   │<br>│         │          ┌─────────────────────────┐      │<br>│         │          │   Custom AI Model       │      │<br>│         │          │  • Custom Prompts       │      │<br>│         │          │  • Context Analysis     │      │<br>│         │          │  • Root Cause Detection │      │<br>│         │          └─────────────────────────┘      │<br>│         │                       │                   │<br>│         └───────────┬───────────┘                   │<br>│                     ▼                               │<br>│  ┌─────────────────────────────────────────┐       │<br>│  │         Response Generation             │       │<br>│  │  • Root Cause Identified                │       │<br>│  │  • Affected File/Location               │       │<br>│  │  • Exact Fix Instructions               │       │<br>│  │  • Confidence Level                     │       │<br>│  └─────────────────────────────────────────┘       │<br>└───────────────┬─────────────────────────────────────┘<br>                │<br>                ▼<br>┌─────────────────────────────────────┐<br>│       Pipeline Output (UI)          │<br>│  🤖 AI FAILURE ANALYSIS             │<br>│  📊 Root Cause + File Location      │<br>│  🔧 Step-by-Step Fix                │<br>│  ✅ Confidence: High/Medium/Low     │<br>└───────────────┬─────────────────────┘<br>                │<br>                ▼<br>┌─────────────────────────────────────┐<br>│         Developer Action            │<br>│  • Reads clear explanation          │<br>│  • Applies fix immediately          │<br>│  • No DevOps interruption needed ✅ │<br>└─────────────────────────────────────┘<br></pre><p><strong>The Flow:</strong></p><p>1. <strong>Pipeline Fails</strong> → Error logs, config files, and context captured<br>2. 
<strong>Rule Engine</strong> → Checks against organizational standards and known patterns<br>3. <strong>AI Analysis</strong> → If rules don’t match, the custom AI model analyzes with tailored prompts<br>4. <strong>Response</strong> → Developer gets exact file location, root cause, and fix steps<br>5. <strong>Resolution</strong> → Developer applies fix or an auto-fix PR is created</p><blockquote><em>The entire process takes 2–5 seconds from failure to actionable recommendation.</em></blockquote><h3><strong>📊 The Impact: Measured in Hours Saved</strong></h3><p><strong>Before:<br></strong>- Average resolution time: 45 minutes<br>- 60% of issues: developers didn’t read logs<br>- Platform team: interrupt-driven firefighting</p><p><strong>After:<br></strong>- 80% of issues: self-service fixes<br>- Developers get answers in seconds<br>- Platform team: focused on actual platform work</p><p><strong>🛠️ You Can Build This Too<br></strong><em>The architecture is straightforward and cloud-agnostic:</em></p><p>✅ <strong>Serverless Function: </strong>(Azure Functions, AWS Lambda, or Google Cloud Functions)<br>✅ <strong>AI Model: </strong>(Azure OpenAI, AWS Bedrock, or Google Vertex AI — deployed as a custom model for this POC)<br>✅ <strong>CI/CD Integration:</strong> (Works with GitHub Actions, Harness, GitLab CI, Jenkins, Azure DevOps, or any pipeline)<br>✅ <strong>Multi-Stack Support:</strong> (npm, Docker, Kubernetes, Terraform, Helm, and any build/deployment tool)</p><blockquote>No vendor lock-in. Choose your cloud provider and AI service. The concept works across all major platforms.</blockquote><p><strong>🚀 The Future: Self-Healing Pipelines?</strong></p><p>Right now, it <strong><em>diagnoses</em></strong>.<br>Next step? <strong>Auto-fix pull requests.<br>Imagine:</strong></p><p>1. Pipeline fails on invalid Kubernetes <strong>CPU</strong> value<br>2. AI detects <strong>cpu: INVALID_VALUE</strong> in values.yaml<br>3. 
Bot creates PR: “<strong>Fix: Change cpu to 100m</strong>”<br>4. Developer reviews and merges<br>5. Pipeline passes</p><p>From failure to fix in 30 seconds. No human parsing logs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*VUn-zOI6rQiX68V19kioGg.png" /></figure><h3><strong>💥 The Bottom Line: This Changes Everything</strong></h3><p>Every platform or DevOps team on the planet faces this problem. From startups to Fortune 500 companies, the cycle is the same:</p><blockquote>Developer breaks pipeline</blockquote><blockquote>DevOps drops everything to read logs</blockquote><blockquote>Manual explanation of obvious error</blockquote><blockquote>Repeat 50 times a day</blockquote><p><strong>I built this AI agent because I got tired of being a log-reading service.</strong></p><p>This isn’t just another automation script. It’s a <strong>fundamental shift</strong> in how we think about developer independence and platform or DevOps team efficiency.</p><p><strong>The Real Impact:</strong></p><blockquote>Your platform or DevOps team stops being interrupt-driven</blockquote><blockquote>Developers solve their own issues in seconds, not hours</blockquote><blockquote>Your organizational knowledge is baked into every analysis</blockquote><blockquote>You own the code — no vendor dictating features or pricing</blockquote><blockquote>It works across every tool in your stack</blockquote><p><strong>What took me 45 minutes to debug now takes developers 45 seconds to fix themselves.</strong></p><p>That’s not just time saved. That’s your platform engineers building the future instead of explaining the past.</p><p>This is what modern platform engineering looks like — <strong>systems that scale knowledge, not just infrastructure.</strong></p><h3><strong>💬 Want to stop being your team’s human error parser?</strong></h3><p>I’ve built the complete architecture — rule engine, custom AI integration, and deployment framework — that’s already saving platform teams hours every week.</p><p><strong>Let’s talk.</strong> Visit my portfolio to get in touch if you’re ready to:</p><blockquote>Cut your team’s pipeline debugging time by 80%</blockquote><blockquote>Give developers self-service failure diagnosis</blockquote><blockquote>Finally focus on building features instead of reading logs</blockquote><p><strong><em>Portfolio: </em></strong><a href="https://www.cherukurisai.com/"><strong><em>Click Here</em></strong></a></p><p>The code isn’t open source, but the conversation is. Let’s build something powerful for your team.</p><p><em>Have a platform engineering or DevOps story? Drop it in the comments — I’d love to hear how other teams are scaling their operations.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=82ff81114175" width="1" height="1" alt=""><hr><p><a href="https://pub.towardsai.net/i-built-an-ai-that-fixes-pipeline-failures-before-platform-or-devsecops-teams-gets-the-slack-82ff81114175">🤖 I Built an AI That Fixes Pipeline Failures Before Platform or DevSecOps Teams Get the Slack…</a> was originally published in <a href="https://pub.towardsai.net">Towards AI</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Beyond Migration: How We Engineered a Secure & Intelligent Delivery Platform with Harness CICD]]></title>
            <link>https://medium.com/devsecops-community/beyond-migration-how-we-engineered-a-secure-intelligent-delivery-platform-with-harness-cicd-6b994077dee4?source=rss-64295cea6d86------2</link>
            <guid isPermaLink="false">https://medium.com/p/6b994077dee4</guid>
            <category><![CDATA[devsecops]]></category>
            <category><![CDATA[harness]]></category>
            <category><![CDATA[architecture]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[cicd]]></category>
            <dc:creator><![CDATA[Cherukuri sai]]></dc:creator>
            <pubDate>Thu, 19 Feb 2026 05:44:39 GMT</pubDate>
            <atom:updated>2026-02-26T16:00:11.022Z</atom:updated>
            <content:encoded><![CDATA[<h4>Our Harness migration became the turning point — not because of the tool, but because of the architecture we built around it.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RBwhyps-gux00f4788Vf2g.png" /></figure><p><strong>TABLE OF CONTENTS</strong></p><ol><li>Introduction</li><li>Executive Results</li><li>Phase 1 — Redesigning Identity</li><li>Phase 2 — Delegate Architecture Redesign</li><li>Phase 3 — Deterministic Execution</li><li>Phase 4 — Governance as Code</li><li>Phase 5 — Immutable Artifact Lifecycle</li><li>Phase 6 — Progressive Delivery and Feature Flags</li><li>Capabilities Most Teams Never Operationalize</li><li>Migration vs Modernization</li><li>Conclusion</li></ol><h4>Introduction:</h4><p><strong>Most organizations treat CI/CD migration as a tooling upgrade.</strong></p><blockquote><strong>Replace Jenkins, TeamCity, GitHub Actions, etc.<br>Adopt Harness.<br>Recreate pipelines.</strong></blockquote><blockquote><em>But migration only upgrades tools.<br>Modernization upgrades architecture.</em></blockquote><p>When we moved to Harness, I knew that simply shifting pipelines would not reduce risk, improve reliability, or strengthen governance. 
Carrying forward our existing trust and execution model would only scale our weaknesses.</p><p>So instead of treating this as a CI/CD replacement, we approached it as <strong>Secure Delivery Platform Engineering</strong> — redesigning identity, governance, execution boundaries, artifact flow, and reliability as first‑class platform concerns.</p><p><strong>CI/CD is not automation.</strong></p><blockquote>It is a privileged control plane.<br>If engineered casually, it scales risk.<br>If engineered intentionally, it scales safety and velocity.</blockquote><h3>🔎 Executive Results</h3><ul><li>🔐 <strong>100% removal of static cloud credentials</strong> — 37 IAM keys eliminated with OIDC</li><li>📉 <strong>~40% reduction in pipeline inconsistencies</strong> through deterministic execution</li><li>🚫 <strong>Zero unapproved production deployments</strong> after policy‑as‑code enforcement</li><li>⚡ <strong>~30% throughput improvement</strong> with delegate segmentation + scaling</li><li>🛡 <strong>~50% reduction in deployment‑related risk</strong> using feature flags &amp; progressive delivery</li><li>📦 <strong>100% artifact traceability</strong> via build‑once, promote‑everywhere</li><li>📊 Stronger audit posture and reduced governance review overhead</li></ul><p><strong><em>These were not cosmetic improvements.<br>They were architectural corrections.</em></strong></p><h3>Phase 1 — Redesigning Identity, Not Just Pipelines</h3><p><strong>Our first challenge</strong>: credential sprawl.</p><ul><li>37 static IAM access keys across pipelines</li><li>Shared service accounts</li><li>Cross‑environment permissions</li></ul><p>We replaced static credentials with <strong>OIDC‑based role assumption</strong>:</p><ul><li>Pipelines assumed short‑lived scoped roles</li><li>Environment‑specific access</li><li>Long‑lived secrets eliminated</li></ul><p><strong>Impact:</strong></p><ul><li>Entire category of credential leakage risk removed</li><li>90% reduction in credential 
rotation</li><li>Stronger audit traceability</li></ul><p>CI/CD became <strong>identity‑aware execution infrastructure</strong>.</p><h3>Phase 2 — Treating Delegates as Privileged Control Plane Infrastructure</h3><p><strong>Delegates perform:</strong></p><ul><li>Infrastructure provisioning</li><li>Cluster operations</li><li>Secret access</li><li>Production deployments</li></ul><blockquote>They are not background agents.<br>They are privileged systems.</blockquote><p>We redesigned delegate architecture:</p><ul><li>Dedicated delegate groups per environment</li><li>Enforced delegate selectors in pipelines</li><li>Production delegates placed in private subnets</li><li>Restricted outbound egress</li></ul><p><strong>Impact:</strong></p><ul><li>Reduced cross‑environment execution risk</li><li>Contained blast radius</li><li>Clear execution boundaries</li></ul><p>Trust became <strong>intentional</strong>, not shared.</p><h3>Phase 3 — Deterministic Execution Using Containerized Toolchains</h3><p>Instead of manual tool installation on delegates, we built versioned CI images containing:</p><ul><li>Terraform, TFLint, Checkov</li><li>kubectl, Helm</li><li>AWS CLI, AZ CLI</li><li>OPA, Cosign</li><li>Internal validation scripts</li></ul><p>Pipelines executed inside these containers.</p><p><strong>Impact:</strong></p><ul><li>Zero delegate drift</li><li>~40% fewer pipeline inconsistencies</li><li>Easy tool upgrades via image versioning</li></ul><p>Tooling became <strong>deterministic and reproducible</strong>.</p><h3>Phase 4 — Governance as Code, Not Process</h3><p>Security guidance without enforcement is optional.</p><p>We enforced governance at platform level:</p><ul><li>Organization‑level reusable templates</li><li>Mandatory scanning and validation steps</li><li>Policy‑as‑code enforcement (OPA)</li><li>Approval logic encoded in pipelines</li><li>Registry restrictions and disallowed “latest” tags</li></ul><p><strong>Impact:</strong></p><ul><li>Zero bypassed production 
governance</li><li>Standardized patterns across teams</li><li>Faster compliance cycles</li></ul><p>Governance became <strong>automated, not manual</strong>.</p><h3>Phase 5 — Immutable Artifact Lifecycle</h3><p>We eliminated rebuild‑per‑environment patterns.</p><p>Instead:</p><ul><li>Build once</li><li>Sign artifact</li><li>Promote Dev → QA → Prod</li><li>Verify signatures before deploy</li></ul><p><strong>Impact:</strong></p><ul><li>100% artifact traceability</li><li>Less drift and fewer surprises</li><li>Strong rollback confidence</li></ul><p>Production became a <strong>promotion environment</strong>, not a rebuild environment.</p><h3>Phase 6 — Progressive Delivery &amp; Feature Flags</h3><p>The biggest risk reduction came from feature flags:</p><ul><li>Canary rollouts</li><li>Gradual traffic exposure</li><li>Instant rollback via flag toggle</li><li>Environment‑based flag policies</li></ul><p><strong>Impact:</strong></p><ul><li>~50% reduction in deployment incidents</li><li>Faster mitigation</li><li>Higher deployment frequency with lower risk</li></ul><p>Deployment and exposure were <strong>decoupled</strong>.</p><h3>Capabilities Most Teams Never Operationalize</h3><p>Most teams adopt Harness.<br>Few operationalize its full platform capabilities.</p><p>Here’s what we embedded:</p><h3>A. Git‑Based Pipeline Change Governance</h3><ul><li>PR‑based updates</li><li>No UI editing</li><li>Full traceability</li></ul><p>Pipelines became <strong>infrastructure‑as‑code</strong>.</p><h3>B. Monitoring‑Driven Automated Rollback</h3><ul><li>Canary vs baseline checks</li><li>Automated anomaly detection</li><li>Auto‑rollback triggers</li></ul><p>Deployments became <strong>self‑validating</strong>.</p><h3>C. Delegate Auto‑Scaling</h3><ul><li>Kubernetes‑based scaling</li><li>Elastic execution</li><li>Reduced idle costs</li></ul><p>CI/CD became <strong>elastic infrastructure</strong>.</p><h3>D. 
Error Budget–Aware Deployment Gating</h3><ul><li>SLO health checks</li><li>Deployment restrictions during instability</li></ul><p>Delivery became <strong>reliability‑aware</strong>.</p><h3>E. Chaos‑Validated Rollbacks</h3><ul><li>Rollback paths tested through chaos engineering</li></ul><p>Resilience became <strong>provable</strong>.</p><h3>F. Centralized Connector Governance</h3><ul><li>No team‑owned connectors</li><li>Centralized authentication patterns</li></ul><p>Credential sprawl dropped significantly.</p><h3>G. Developer Experience Uplift</h3><ul><li>Faster troubleshooting</li><li>Reusable templates</li><li>Safer experimentation</li><li>Predictable deployments</li></ul><p>Developers gained <strong>safe autonomy</strong>.</p><h3>Migration vs. Modernization</h3><blockquote><strong>Migration moves pipelines.</strong><br><strong>Modernization redesigns the delivery platform.</strong></blockquote><p>Modernization means:</p><ul><li>Identity redesign</li><li>Shared‑nothing execution boundaries</li><li>Governance as code</li><li>Deterministic toolchains</li><li>Immutable artifacts</li><li>Progressive delivery</li><li>Reliability‑aware deployment gates</li></ul><blockquote><strong>Many organizations migrate.<br>Few modernize.</strong></blockquote><h3>Conclusion:</h3><p>Harness did not modernize our ecosystem.<br><strong>Architectural intent did.</strong></p><p>By redesigning identity, segmentation, deterministic execution, governance, artifact flow, and reliability, we transformed CI/CD from automation into a <strong>secure delivery platform</strong>.</p><p>CI/CD is not just a pipeline.<br>It is a <strong>privileged control plane</strong>.</p><p>When engineered deliberately, it becomes the foundation of safe, scalable, high‑trust delivery.</p><p><strong>Tools don’t create maturity.<br>Architecture does.<br>Intent does.<br>Design does.</strong></p><blockquote><strong>Harness was the canvas.<br>Secure Delivery Platform Engineering was the 
art.</strong></blockquote><h3>DevSecOps — Community 🚀</h3><p><em>Thank you for being a part of the </em><a href="https://medium.com/devsecops-community/devopsin90days/home"><strong><em>DevSecOps — Community</em></strong></a><em>! Before you go:</em></p><ul><li>Be sure to <strong>clap</strong> 👏 and <strong>follow</strong> the Author</li><li>Follow: <a href="https://medium.com/devsecops-community/newsletters/devsecops"><strong>Newsletter</strong></a> | <a href="https://www.linkedin.com/groups/14547253/"><strong>LinkedIn Groups</strong></a></li><li>More content at <a href="https://medium.com/devsecops-community/devopsin90days/home"><strong>DevSecOps — Community</strong></a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6b994077dee4" width="1" height="1" alt=""><hr><p><a href="https://medium.com/devsecops-community/beyond-migration-how-we-engineered-a-secure-intelligent-delivery-platform-with-harness-cicd-6b994077dee4">Beyond Migration: How We Engineered a Secure &amp; Intelligent Delivery Platform with Harness CICD</a> was originally published in <a href="https://medium.com/devsecops-community">devsecops-community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Part 2 — How to Build Your Own AI Agent (Cloud-Agnostic, Fully Automated, Enterprise-Ready)]]></title>
            <link>https://pub.towardsai.net/part-2-how-to-build-your-own-ai-agent-cloud-agnostic-fully-automated-enterprise-ready-ec3c749570ac?source=rss-64295cea6d86------2</link>
            <guid isPermaLink="false">https://medium.com/p/ec3c749570ac</guid>
            <category><![CDATA[ci-cd-pipeline]]></category>
            <category><![CDATA[cloud-computing]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <category><![CDATA[terraform]]></category>
            <dc:creator><![CDATA[Cherukuri sai]]></dc:creator>
            <pubDate>Mon, 16 Feb 2026 09:16:20 GMT</pubDate>
            <atom:updated>2026-02-23T06:14:32.691Z</atom:updated>
            <content:encoded><![CDATA[<h4><em>From natural-language prompts → to Terraform module → to PR → to CI/CD → to validation</em></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p_M5C9s7tpRyHYx-hnwZHQ.jpeg" /></figure><p>Most AI content explains concepts.<br>This guide helps you <strong>build something real</strong> — a fully functioning AI agent that:</p><ul><li>Understands natural-language infrastructure requests</li><li>Generates <strong>complete Terraform modules</strong> (multi-file)</li><li>Enforces <strong>strict enterprise standards</strong></li><li>Auto-fixes issues via LLM reasoning</li><li>Creates a GitHub branch</li><li>Commits all files</li><li>Opens a pull request</li><li>Triggers GitHub Actions</li><li>Runs <strong>unit tests</strong></li><li>Runs <strong>Terraform init + validate + plan</strong></li><li>Works in <strong>any cloud</strong> (AWS, Azure, GCP, etc.)</li><li>Works in <strong>any pipeline</strong> (GitHub, Harness, GitLab, Jenkins, Azure DevOps, etc.)</li></ul><p>By the end of this article, you’ll have the blueprint of a <strong>Digital DevOps Engineer Agent</strong>.</p><h3>1️⃣ What Makes This Agent Different?</h3><p>It does <em>not</em> assume AWS or Azure or GCP.<br>If your prompt says:</p><blockquote>“Create AWS Lambda module” → Generates AWS Terraform</blockquote><blockquote>“Create Azure Storage Account module” → Generates Azure Terraform</blockquote><blockquote>“Create GCP Cloud Run module” → Generates GCP Terraform</blockquote><p>It adapts automatically because it generates <strong>pure Terraform (HCL)</strong>.</p><h3>✔ Cloud‑Agnostic</h3><p>Works with <strong>ANY Terraform provider</strong>, including:</p><ul><li>AWS</li><li>Azure</li><li>GCP</li><li>OCI</li><li>Cloudflare</li><li>Kubernetes</li><li>DigitalOcean</li><li>VMware vSphere</li><li>Proxmox</li><li>GitHub provider</li><li>And <em>every</em> provider on the Terraform registry</li></ul><h3>✔ CI/CD‑Agnostic</h3><p>Runs in:</p><ul><li>GitHub 
Actions</li><li>Harness</li><li>GitLab CI</li><li>Bitbucket Pipelines</li><li>Jenkins</li><li>Azure DevOps</li><li>CircleCI</li></ul><p>Anywhere Python + Terraform exist — the agent works.</p><h3>✔ Enterprise‑Grade Validation</h3><p>Your standards engine validates:</p><blockquote>Required tags</blockquote><blockquote>snake_case variable names</blockquote><blockquote>IAM least privilege</blockquote><blockquote>Secret detection</blockquote><blockquote>Deprecated syntax detection</blockquote><blockquote>VPC subnet structure</blockquote><blockquote>Provider best practices</blockquote><blockquote>Module reusability</blockquote><blockquote>Terraform version constraints</blockquote><p>And does this for <strong>every .tf file in the module</strong>.</p><h3>✔ Complete Multi‑File Generation</h3><p>Every request generates:</p><pre>main.tf<br>variables.tf<br>outputs.tf<br>providers.tf<br>README.md</pre><h3>2️⃣ Understanding AI Agents (Enterprise Context)</h3><p>An AI Agent is not just an LLM.<br> It is a system made of:</p><p>🔹 <strong>Brain (LLM):</strong> <em>Interprets the user’s request.</em><br>🔹 <strong>Memory (Context, RAG):</strong> <em>Holds standards, patterns, best practices.<br></em>🔹 <strong>Tools (Python, Filesystem, Terraform CLI, GitHub API):</strong> <em>Allows the agent to act, not just talk.<br></em>🔹 <strong>Reasoning Loop:</strong> <em>Plan → Generate → Validate → Fix → Loop<br></em>🔹 <strong>Policy Layer:</strong> <em>Your org’s security, naming, tagging, compliance.<br></em>🔹 <strong>Runtime Environment</strong>: <em>GitHub Actions, Pipelines, Local runner, Cloud VMs.</em></p><p>Together, they form a <strong>Digital DevOps Engineer</strong>.</p><h3>3️⃣ Prerequisites</h3><h3>✔ Skills</h3><ul><li>Python</li><li>Terraform</li><li>GitHub</li><li>CI/CD</li><li>Prompt engineering basics</li></ul><h3>✔ Installations</h3><p>Run:</p><pre>pip install openai langchain python-dotenv PyGithub<br>brew install terraform        # Mac<br>choco install terraform       # 
Windows</pre><h3>✔ Organization Inputs</h3><p>Prepare a standards file:<br><strong>standards.md</strong></p><pre>1. Tags required: environment, owner, cost_center.<br>2. Variables must use snake_case.<br>3. IAM must follow least privilege.<br>4. No hardcoded secrets.<br>5. Modules must be reusable.<br>6. VPC must include public + private subnets.</pre><p>This becomes your <strong>policy engine</strong>.</p><h3><strong>4️⃣ Architecture of What We’re Building</strong></h3><pre>User Provides Prompt<br>        ↓<br>┌───────────────────────────────────────────────────────┐<br>│  Agent Pipeline (agent_real.py)                       │<br>│  ────────────────────────────────────                 │<br>│  1. Plan → Break request into steps (LLM)            │<br>│  2. Generate → Create Terraform module files          │<br>│     • main.tf (resources)                             │<br>│     • variables.tf (inputs with validation)           │<br>│     • outputs.tf (outputs with descriptions)          │<br>│     • providers.tf (provider versions &amp; config)       │<br>│     • README.md (usage documentation)                 │<br>│  3. Validate → Check against standards                │<br>│     • Required tags (environment, owner, cost_center) │<br>│     • snake_case variables                            │<br>│     • No hardcoded secrets                            │<br>│     • Least-privilege IAM                             │<br>│     • No deprecated features                          │<br>│     • Current provider best practices                 │<br>│  4. Fix → Auto-correct issues (LLM or heuristics)     │<br>│  5. 
Loop → Repeat validate/fix until clean            │<br>└───────────────────────────────────────────────────────┘<br>        ↓<br>┌───────────────────────────────────────────────────────┐<br>│  GitHub Integration (agent.py)                        │<br>│  ─────────────────────────                            │<br>│  • Create feature branch: ai/&lt;slug&gt;-YYYYMMDDHHMMSS    │<br>│  • Commit all module files to modules/&lt;branch&gt;/       │<br>│  • Open Pull Request                                  │<br>└───────────────────────────────────────────────────────┘<br>        ↓<br>┌───────────────────────────────────────────────────────┐<br>│  CI/CD Workflows (.github/workflows/)                 │<br>│  ────────────────────────────────────                 │<br>│  • python-tests.yml → Run validation tests            │<br>│  • terraform.yml → init + validate + plan all modules │<br>└───────────────────────────────────────────────────────┘<br>        ↓<br>   PR Ready for Review</pre><h3><strong>5️⃣ Step 1: Build the Python Utility (Terraform Standards Validator)</strong></h3><p><strong>Create: terraform_standards.py<br></strong>This is <strong>your AI enforcer</strong>.</p><pre>import re<br><br>TERRAFORM_STANDARDS = &quot;&quot;&quot;<br>1. Required tags: environment, owner, cost_center.<br>2. IAM roles must have least-privilege policies.<br>3. Use snake_case for variables.<br>4. No hardcoded secrets allowed.<br>5. Modules must be reusable.<br>6. 
VPC must include public + private subnets.<br>&quot;&quot;&quot;<br><br><br>def validate(code: str) -&gt; str:<br>    issues = []<br><br>    # Check for tags presence anywhere<br>    if not re.search(r&quot;\btags\s*=\s*\{&quot;, code):<br>        issues.append(&quot;Missing required tags block (environment, owner, cost_center).&quot;)<br>    else:<br>        # Ensure required tag keys exist in any tags block<br>        tags_blocks = re.findall(r&quot;tags\s*=\s*\{([^}]*)\}&quot;, code, flags=re.S)<br>        for tb in tags_blocks:<br>            if &quot;environment&quot; not in tb or &quot;owner&quot; not in tb or &quot;cost_center&quot; not in tb:<br>                issues.append(&quot;Tags block missing one of environment/owner/cost_center.&quot;)<br>                break<br><br>    # Hardcoded secret detection (common patterns)<br>    if re.search(r&quot;(?i)aws_secret_access_key|aws_access_key_id|secret\s*=|password\s*=|passwd\s*=|\bSECRET_&quot;, code):<br>        issues.append(&quot;Hardcoded secret or credentials detected.&quot;)<br>    # Detect secrets in variable defaults or variable names (e.g., default = &quot;secret123&quot;)<br>    if re.search(r&quot;(?i)default\s*=\s*\&quot;.*(secret|password|passwd).*\&quot;&quot;, code) or re.search(r&quot;(?i)variable\s+\&quot;.*(password|secret|passwd).*\&quot;&quot;, code):<br>        issues.append(&quot;Hardcoded secret detected in variable default or name.&quot;)<br><br>    # Variable naming heuristic: flag variables with uppercase or camelCase<br>    if re.search(r&quot;variable\s+\&quot;.*([A-Z].*|[a-z]+[A-Z].*)\&quot;&quot;, code):<br>        issues.append(&quot;Variables should use snake_case (avoid CamelCase or uppercase).&quot;)<br><br>    # IAM least-privilege heuristic: look for wildcard resources or actions<br>    if re.search(r&quot;aws_iam_policy|aws_iam_role_policy&quot;, code):<br>        if re.search(r&#39;&quot;?Resource&quot;?\s*:\s*\[?\s*&quot;?\*&quot;?&#39;, code) or 
re.search(r&#39;&quot;?Action&quot;?\s*:\s*\[?\s*&quot;?.*\*.*&quot;?&#39;, code):<br>            issues.append(&quot;IAM policy uses wildcard Action or Resource; prefer least-privilege.&quot;)<br><br>    # IAM role existence but missing policy<br>    if &quot;aws_iam_role&quot; in code and not re.search(r&quot;aws_iam_policy|role_policy|policy\s*=&quot;, code):<br>        issues.append(&quot;IAM role present but no inline or attached policy found.&quot;)<br><br>    # VPC subnet check<br>    if re.search(r&quot;resource\s+\&quot;aws_vpc\&quot;&quot;, code) and not re.search(r&quot;resource\s+\&quot;aws_subnet\&quot;.*(public|private)|public_subnet|private_subnet&quot;, code, flags=re.S):<br>        issues.append(&quot;VPC must include both public and private subnets.&quot;)<br><br>    # Module reusability: prefer modules rather than repeating resources<br>    if re.search(r&quot;resource\s+\&quot;aws_vpc\&quot;.*resource\s+\&quot;aws_vpc\&quot;&quot;, code, flags=re.S):<br>        issues.append(&quot;Duplicate VPC resources detected; prefer reusable modules.&quot;)<br><br>    return &quot;OK&quot; if not issues else &quot;\n&quot;.join(issues)<br><br><br>if __name__ == &quot;__main__&quot;:<br>    # quick local smoke test<br>    sample = &#39;&#39;&#39;<br>resource &quot;aws_vpc&quot; &quot;main&quot; {<br>  cidr_block = &quot;10.0.0.0/16&quot;<br>}<br>&#39;&#39;&#39;<br>    print(validate(sample))</pre><h3><strong>6️⃣ Step 2: Build the Agent (Brain + Tools)</strong></h3><p><strong>This file does the magic:</strong></p><ul><li><em>Uses OpenAI (or Mock LLM for free testing)</em></li><li><em>Generates 5 Terraform files with </em><em>### FILE: markers</em></li><li><em>Parses multi-file output</em></li><li><em>Validates each </em><em>.tf file</em></li><li><em>Fixes issues using LLM or fallback 
heuristics</em></li><li><em>Repeats validation (max iterations)</em></li><li><em>Returns a clean, production-ready module</em></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6ZZDlwD1ODbwKDICdtNNew.png" /></figure><p>Create: <strong>agent_real.py</strong></p><pre>import os<br>import re<br>import json<br>from dotenv import load_dotenv<br><br>load_dotenv()<br><br>from terraform_standards import validate, TERRAFORM_STANDARDS<br><br>OPENAI_MODEL = os.getenv(&quot;OPENAI_MODEL&quot;, &quot;gpt-4o-mini&quot;)<br><br><br>PROMPT_TEMPLATE = &quot;&quot;&quot;<br>You are an expert Terraform generator. Follow these org standards exactly:<br>{standards}<br><br>User request:<br>{request}<br><br>Produce a complete, production-ready, reusable Terraform module with proper enterprise structure.<br><br>Generate the following files with clear separators:<br><br>### FILE: main.tf<br>&lt;main resource definitions with proper tags blocks in resource blocks only&gt;<br><br>### FILE: variables.tf<br>&lt;all input variables with descriptions, types, defaults, and validation&gt;<br><br>### FILE: outputs.tf<br>&lt;all outputs with value and description only - NO tags, NO other arguments&gt;<br><br>### FILE: providers.tf<br>&lt;required provider versions, terraform version constraints, and provider configurations&gt;<br><br>### FILE: README.md<br>&lt;module documentation with description, usage example, inputs table, outputs table, requirements&gt;<br><br>CRITICAL Terraform Syntax Rules:<br>- Tags ONLY go inside resource blocks, NOT in output/variable/provider blocks<br>- Output blocks ONLY support: value, description, sensitive, depends_on<br>- Variable blocks ONLY support: type, description, default, validation, sensitive, nullable<br>- Provider blocks do NOT have tags - use default_tags in AWS provider if needed<br>- Always close ALL braces properly - verify each opening brace has a closing brace<br>- Use proper HCL syntax - check for missing commas, quotes, 
and braces<br><br>Resource-Specific Requirements:<br>- AWS Lambda: MUST specify exactly ONE of: filename (with default=&quot;lambda.zip&quot;), s3_bucket+s3_key, or image_uri<br>- For reusable Lambda modules, use s3_bucket + s3_key as required variables (most common pattern)<br>- IAM roles: must have assume_role_policy with proper JSON<br>- Security groups: must have at least one ingress or egress rule<br>- VPCs: should include both public and private subnets<br>- CloudWatch alarms: require comparison_operator, evaluation_periods, metric_name, namespace, period, statistic, threshold<br><br>Ensure:<br>- Include required tags: environment, owner, cost_center in RESOURCE blocks only<br>- Use snake_case for all variable names<br>- No hardcoded secrets or credentials<br>- Least-privilege IAM policies (avoid Resource = &quot;*&quot;)<br>- Proper variable validation and constraints<br>- Clear descriptions for all variables and outputs<br>- Module should be reusable across environments<br>- Include usage examples in README<br>- Use ONLY current, non-deprecated resource types and arguments<br>- Follow latest provider best practices (check documentation)<br>- Use required_providers block with version constraints<br>- Specify minimum terraform version in providers.tf<br>- Use terraform.workspace or variables for environment-specific values<br>- Avoid deprecated syntax (e.g., use for_each over count when appropriate)<br><br>IMPORTANT: Return ONLY raw Terraform code with ### FILE: separators.<br>DO NOT wrap code in markdown fences like ```hcl or ```terraform.<br>DO NOT include any code block markers or backticks.<br>Return clean, parseable Terraform code only.<br>&quot;&quot;&quot;<br><br><br>class MockLLM:<br>    @staticmethod<br>    def generate(prompt: str) -&gt; str:<br>        # Keep the previous deterministic example for local testing<br>        from agent import MockLLM as OldMock<br>        return OldMock.generate(prompt)<br><br><br>def call_openai(prompt: str) -&gt; 
str:<br>    try:<br>        from openai import OpenAI<br><br>        client = OpenAI(api_key=os.getenv(&quot;OPENAI_API_KEY&quot;))<br>        resp = client.chat.completions.create(<br>            model=OPENAI_MODEL,<br>            messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: prompt}],<br>            max_tokens=1500,<br>        )<br>        return resp.choices[0].message.content<br>    except Exception as e:<br>        print(f&quot;OpenAI call failed: {e}&quot;)<br>        return MockLLM.generate(prompt)<br><br><br>def llm(prompt: str) -&gt; str:<br>    if os.getenv(&quot;OPENAI_API_KEY&quot;):<br>        return call_openai(prompt)<br>    return MockLLM.generate(prompt)<br><br><br>def generate_tf(request: str) -&gt; dict:<br>    &quot;&quot;&quot;Generate Terraform module files. Returns dict of {filename: content}&quot;&quot;&quot;<br>    prompt = PROMPT_TEMPLATE.format(standards=TERRAFORM_STANDARDS, request=request)<br>    result = llm(prompt)<br>    return parse_multi_file_response(result)<br><br><br>def parse_multi_file_response(response: str) -&gt; dict:<br>    &quot;&quot;&quot;Parse LLM response with ### FILE: separators into dict of files&quot;&quot;&quot;<br>    files = {}<br>    pattern = r&quot;###\s*FILE:\s*([\w\.-]+)\s*\n(.*?)(?=###\s*FILE:|$)&quot;<br>    matches = re.findall(pattern, response, re.DOTALL | re.IGNORECASE)<br>    <br>    if matches:<br>        for filename, content in matches:<br>            # Strip markdown code fences if present<br>            cleaned_content = content.strip()<br>            # Remove opening code fence (```hcl, ```terraform, ```)<br>            cleaned_content = re.sub(r&#39;^```(?:hcl|terraform)?\s*\n&#39;, &#39;&#39;, cleaned_content)<br>            # Remove closing code fence<br>            cleaned_content = re.sub(r&#39;\n```\s*$&#39;, &#39;&#39;, cleaned_content)<br>            <br>            # Fix common LLM mistakes in outputs.tf<br>            if filename.strip() == 
&#39;outputs.tf&#39;:<br>                cleaned_content = fix_output_syntax(cleaned_content)<br>            <br>            files[filename.strip()] = cleaned_content.strip()<br>    else:<br>        # Fallback: treat entire response as main.tf<br>        content = response.strip()<br>        content = re.sub(r&#39;^```(?:hcl|terraform)?\s*\n&#39;, &#39;&#39;, content)<br>        content = re.sub(r&#39;\n```\s*$&#39;, &#39;&#39;, content)<br>        files[&quot;main.tf&quot;] = content.strip()<br>    <br>    return files<br><br><br>def fix_output_syntax(content: str) -&gt; str:<br>    &quot;&quot;&quot;Remove invalid arguments from output blocks (e.g., tags)&quot;&quot;&quot;<br>    # Remove tags blocks from output definitions<br>    # Pattern: find output blocks and remove tags = { ... } from them<br>    def remove_invalid_output_args(match):<br>        output_block = match.group(0)<br>        # Remove tags blocks<br>        output_block = re.sub(r&#39;\s*tags\s*=\s*\{[^}]*\}\s*&#39;, &#39;\n&#39;, output_block, flags=re.DOTALL)<br>        # Remove other invalid args (type, default, validation, etc.)<br>        output_block = re.sub(r&#39;\s*type\s*=\s*[^\n]+\n&#39;, &#39;\n&#39;, output_block)<br>        output_block = re.sub(r&#39;\s*default\s*=\s*[^\n]+\n&#39;, &#39;\n&#39;, output_block)<br>        return output_block<br>    <br>    # Match output blocks<br>    content = re.sub(<br>        r&#39;output\s+&quot;[^&quot;]+&quot;\s*\{[^}]*\}&#39;,<br>        remove_invalid_output_args,<br>        content,<br>        flags=re.DOTALL<br>    )<br>    <br>    return content<br><br><br>def run_terraform_validate(files: dict) -&gt; str:<br>    &quot;&quot;&quot;Run actual terraform validate on generated files&quot;&quot;&quot;<br>    import tempfile<br>    import subprocess<br>    import shutil<br>    from pathlib import Path<br>    <br>    # Create temp directory<br>    temp_dir = tempfile.mkdtemp(prefix=&quot;tf-validate-&quot;)<br>    try:<br>        # Write all 
files<br>        for filename, content in files.items():<br>            file_path = Path(temp_dir) / filename<br>            file_path.write_text(content)<br>        <br>        # Run terraform init<br>        init_result = subprocess.run(<br>            [&quot;terraform&quot;, &quot;init&quot;, &quot;-backend=false&quot;],<br>            cwd=temp_dir,<br>            capture_output=True,<br>            text=True,<br>            timeout=60<br>        )<br>        <br>        if init_result.returncode != 0:<br>            return f&quot;Terraform init failed:\n{init_result.stderr}&quot;<br>        <br>        # Run terraform validate<br>        validate_result = subprocess.run(<br>            [&quot;terraform&quot;, &quot;validate&quot;, &quot;-json&quot;],<br>            cwd=temp_dir,<br>            capture_output=True,<br>            text=True,<br>            timeout=30<br>        )<br>        <br>        if validate_result.returncode != 0:<br>            # Parse JSON output for better error messages<br>            try:<br>                import json<br>                result = json.loads(validate_result.stdout)<br>                if not result.get(&quot;valid&quot;, False):<br>                    errors = []<br>                    for diag in result.get(&quot;diagnostics&quot;, []):<br>                        severity = diag.get(&quot;severity&quot;, &quot;error&quot;)<br>                        summary = diag.get(&quot;summary&quot;, &quot;&quot;)<br>                        detail = diag.get(&quot;detail&quot;, &quot;&quot;)<br>                        errors.append(f&quot;{severity.upper()}: {summary}\n{detail}&quot;)<br>                    return &quot;\n&quot;.join(errors)<br>            except:<br>                pass<br>            return f&quot;Terraform validation failed:\n{validate_result.stderr}&quot;<br>        <br>        return &quot;OK&quot;<br>    <br>    except subprocess.TimeoutExpired:<br>        return &quot;Terraform validation timed 
out&quot;<br>    except Exception as e:<br>        return f&quot;Terraform validation error: {str(e)}&quot;<br>    finally:<br>        # Cleanup<br>        try:<br>            shutil.rmtree(temp_dir)<br>        except:<br>            pass<br><br><br>def validate_tf(files: dict) -&gt; str:<br>    &quot;&quot;&quot;Validate all .tf files in the module&quot;&quot;&quot;<br>    issues = []<br>    <br>    # First, validate individual files against standards<br>    for filename, content in files.items():<br>        if filename.endswith(&#39;.tf&#39;):<br>            result = validate(content)<br>            if result != &quot;OK&quot;:<br>                issues.append(f&quot;{filename}: {result}&quot;)<br>    <br>    # Cross-file validation: check for undefined variable references<br>    defined_vars = set()<br>    if &#39;variables.tf&#39; in files:<br>        var_matches = re.findall(r&#39;variable\s+&quot;([^&quot;]+)&quot;&#39;, files[&#39;variables.tf&#39;])<br>        defined_vars = set(var_matches)<br>    <br>    # Check all .tf files for var. references<br>    for filename, content in files.items():<br>        if filename.endswith(&#39;.tf&#39;):<br>            var_refs = re.findall(r&#39;var\.(\w+)&#39;, content)<br>            for var_ref in var_refs:<br>                if var_ref not in defined_vars:<br>                    issues.append(f&quot;{filename}: References undeclared variable &#39;var.{var_ref}&#39;&quot;)<br>    <br>    # Run actual terraform validate (most comprehensive check)<br>    tf_issues = run_terraform_validate(files)<br>    if tf_issues != &quot;OK&quot;:<br>        issues.append(f&quot;Terraform validation: {tf_issues}&quot;)<br>    <br>    return &quot;OK&quot; if not issues else &quot;\n&quot;.join(issues)<br><br><br>def fix_tf(files: dict, issues: str) -&gt; dict:<br>    &quot;&quot;&quot;Fix issues in Terraform files&quot;&quot;&quot;<br>    if os.getenv(&quot;OPENAI_API_KEY&quot;):<br>        # Use LLM to fix issues<br>        files_str = &quot;\n\n&quot;.join([f&quot;### FILE: {name}\n{content}&quot; for name, content in files.items()])<br>        fix_prompt = f&quot;&quot;&quot;The following Terraform module has validation issues that MUST be fixed:<br><br>ISSUES:<br>{issues}<br><br>CURRENT MODULE FILES:<br>{files_str}<br><br>FIX INSTRUCTIONS:<br>1. Fix ALL issues listed above<br>2. If a variable is referenced but not declared, add it to variables.tf with proper type and description<br>3. If tags are in output blocks, remove them (outputs only support: value, description, sensitive)<br>4. If provider uses undefined variables, add them to variables.tf<br>5. Ensure all braces are properly closed<br>6. 
Keep all existing working code intact<br><br>IMPORTANT: Return the COMPLETE corrected module with ### FILE: separators.<br>DO NOT use markdown code fences (no ```hcl or ```).<br>Return raw Terraform code only.<br>&quot;&quot;&quot;<br>        result = llm(fix_prompt)<br>        if result:<br>            return parse_multi_file_response(result)<br>    <br>    # Heuristic fallback: fix main.tf only<br>    import agent<br>    if &quot;main.tf&quot; in files:<br>        files[&quot;main.tf&quot;] = agent.fix_tf(files[&quot;main.tf&quot;], issues)<br>    return files<br><br><br>def full_pipeline(user_request: str, max_iterations: int = 3) -&gt; dict:<br>    &quot;&quot;&quot;Run the full agent pipeline. Returns dict of {filename: content}&quot;&quot;&quot;<br>    print(f&quot;\n Starting pipeline for: {user_request}&quot;)<br>    <br>    # Step 1: Generate initial code<br>    print(&quot;\n Generating Terraform module...&quot;)<br>    files = generate_tf(user_request)<br>    <br>    if not files:<br>        print(&quot; Failed to generate initial code&quot;)<br>        return {}<br>    <br>    print(f&quot; Generated {len(files)} files: {&#39;, &#39;.join(files.keys())}&quot;)<br>    <br>    # Step 2: Validate<br>    print(&quot;\n Validating module...&quot;)<br>    issues = validate_tf(files)<br>    <br>    if issues == &quot;OK&quot;:<br>        print(&quot; Validation passed!&quot;)<br>        return files<br>    <br>    print(f&quot;⚠  Validation issues found:\n{issues}&quot;)<br>    <br>    # Step 3: Auto-fix loop<br>    iter_count = 0<br>    while issues != &quot;OK&quot; and iter_count &lt; max_iterations:<br>        iter_count += 1<br>        print(f&quot;\n🔧 Auto-fix iteration {iter_count}/{max_iterations}...&quot;)<br>        <br>        fixed_files = fix_tf(files, issues)<br>        if fixed_files == files:<br>            print(&quot;  No changes made by fix attempt&quot;)<br>            break<br>        <br>        files = fixed_files<br>        issues = 
validate_tf(files)<br>        <br>        if issues == &quot;OK&quot;:<br>            print(f&quot; Validation passed after {iter_count} iteration(s)!&quot;)<br>        else:<br>            print(f&quot;  Still have issues:\n{issues}&quot;)<br>    <br>    if issues != &quot;OK&quot;:<br>        print(f&quot;\n Could not fix all issues after {max_iterations} iterations&quot;)<br>        print(&quot;Returning module with remaining issues.&quot;)<br>    <br>    return files<br><br><br>def create_pr(branch_name: str, files: dict, module_path: str = None) -&gt; str:<br>    &quot;&quot;&quot;Reuse the create_pr from agent.py to avoid duplication&quot;&quot;&quot;<br>    import agent<br>    return agent.create_pr(branch_name, files, module_path)<br><br><br>if __name__ == &quot;__main__&quot;:<br>    sample = &quot;Create an AWS VPC module with public and private subnets and a basic IAM role.&quot;<br>    print(full_pipeline(sample))</pre><blockquote><em>Note: Using OpenAI in the Example — But This Agent Works With Any LLM</em></blockquote><p><strong>Example swap: OpenAI → Gemini</strong></p><pre># Instead of OpenAI:<br>from openai import OpenAI<br>client = OpenAI()<br><br># Use Gemini:<br>import google.generativeai as genai<br>genai.configure(api_key=os.getenv(&quot;GEMINI_API_KEY&quot;))<br>model = genai.GenerativeModel(&quot;gemini-1.5-pro&quot;)<br><br>response = model.generate_content(prompt)</pre><h3>7️⃣ Step 3 — Local Build &amp; Testing (Optional but Highly Recommended)</h3><p><em>(Test the agent locally before sending code to GitHub)</em></p><p>Before integrating the agent into a CI/CD pipeline or letting it create PRs, you may want to test it locally. 
This step lets you:</p><ul><li>Validate the module generation</li><li>Run standards checks</li><li>See auto‑fixes happen in real‑time</li><li>Inspect generated files</li><li>Debug issues faster</li><li>Avoid unnecessary PR noise</li></ul><p>You can skip this and rely fully on GitHub Actions — <br><strong>but local testing gives you a faster feedback loop, especially during development.</strong></p><p><strong>Run the agent locally with:</strong></p><pre>python run_example.py --prompt &quot;Create an AWS Lambda with CloudWatch monitoring&quot;</pre><p>This produces:</p><ul><li>main.tf</li><li>variables.tf</li><li>outputs.tf</li><li>providers.tf</li><li>README.md</li></ul><p>All validated and auto‑fixed before being written to the out/&lt;branch-name&gt;/ directory.</p><h3>8️⃣ Step 4 — Project Folder Structure</h3><p>Before we move into CI/CD automation, here is the <strong>full directory structure</strong> of the Terraform Agent you just built.</p><p>This structure is intentionally modular, testable, and enterprise‑ready.</p><p>📁 <strong>Project Structure</strong></p><pre>terraform-agent/<br>├── agent_real.py                # Main AI agent (LLM + validation + fixes)<br>├── agent.py                     # PR creation + mock LLM utilities<br>├── terraform_standards.py       # Org standards + validation engine<br>├── run_example.py               # Local execution entrypoint<br>├── modules/                     # Auto-created PR modules<br>├── out/                         # Local generated modules (no PR)<br>├── tests/                       # Full unit test suite<br>│   ├── test_agent_pipeline.py<br>│   └── test_terraform_standards.py<br>└── .github/<br>    └── workflows/<br>        ├── python-tests.yml     # Unit tests run on every PR<br>        ├── terraform.yml        # Terraform validate + plan workflow<br>        └── e2e_pr.yml           # End-to-end PR generation workflow</pre><h3><strong>9️⃣ </strong>Step 5 — Running the Workflow with a Prompt (CI/CD Automation)</h3><p>Now that the codebase is structured 
correctly, you can run the entire agent and pipeline <strong>with a single prompt</strong> — either locally or directly inside GitHub.</p><p>You now have <strong>two ways</strong> to run the pipeline:</p><blockquote><strong>Option A:</strong> Run the Agent Locally (Fast Feedback Loop)</blockquote><p>If you want to test code generation before opening a PR:</p><pre>python run_example.py --prompt &quot;Create an Azure Storage Account module&quot;<br></pre><h4>This will:</h4><ol><li>Generate all module files</li><li>Validate them</li><li>Auto-fix issues</li><li>Write output to:</li></ol><pre>out/&lt;branch-name&gt;/</pre><p>5. (Optional) Create a PR if you pass --create-pr</p><pre>python run_example.py --prompt &quot;Create a GCP Cloud Run module&quot; --create-pr</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AoIaKrmJ3FqmRLqHAoX1EQ.png" /></figure><blockquote><strong>Option B</strong>: Run the Entire Agent Inside GitHub Actions</blockquote><p>You can trigger the <strong>e2e_pr.yml</strong> workflow from the GitHub UI.</p><pre>name: E2E PR (manual)<br><br>on:<br>  workflow_dispatch:<br>    inputs:<br>      prompt:<br>        description: &#39;Terraform module prompt for the agent&#39;<br>        required: false<br>        default: &#39;Create a reusable Terraform module that creates a Lambda function with a health check and CloudWatch alarm&#39;<br>        type: string<br><br>permissions:<br>  contents: write<br>  pull-requests: write<br><br>jobs:<br>  e2e:<br>    runs-on: ubuntu-latest<br>    steps:<br>      - uses: actions/checkout@v3<br>      - name: Setup Python<br>        uses: actions/setup-python@v4<br>        with:<br>          python-version: &#39;3.10&#39;<br>      - name: Install deps<br>        run: |<br>          python -m pip install --upgrade pip<br>          pip install -r requirements.txt<br>      - name: Run example (create PR)<br>        env:<br>          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}<br>          # use GH_TOKEN if 
provided, otherwise fall back to the Actions-provided GITHUB_TOKEN<br>          GITHUB_TOKEN: ${{ secrets.GH_TOKEN || secrets.GITHUB_TOKEN }}<br>          GITHUB_REPO: ${{ github.repository }}<br>        run: |<br>          python run_example.py --prompt &quot;${{ github.event.inputs.prompt }}&quot; --create-pr</pre><p>1. Go to<br><strong>GitHub → Actions → E2E PR (manual)<br></strong>2. Click “Run Workflow”<br>3. Enter your natural-language Terraform request:<br>Example prompt:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gM53oWhh7-QQ_-hzohzzVA.png" /></figure><h3>4. Click Run Workflow</h3><p>This workflow will:</p><p>✔ Run the full agent<br>✔ Generate the module (main.tf, variables.tf, outputs.tf, providers.tf, README.md)<br>✔ Validate with your standards file<br>✔ Auto-fix any issues<br>✔ Create an ai/&lt;slug&gt;-timestamp branch<br>✔ Commit the module<br>✔ Open a Pull Request<br>✔ Trigger <strong>python-tests.yml</strong><br>✔ Trigger <strong>terraform.yml</strong><br>✔ Show terraform init/validate/plan output right in the pipeline</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Wstj4Xb4xJ79hBoslKWk9w.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qR6qJAE4eJl_sRsUKnC-Tw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WOsCwOo5edZjqPpvNisUCg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9wLbJEP1tAFqm6jQ67lTFw.png" /></figure><h3>What Happens Automatically After PR Creation?</h3><p>When the agent opens a PR, <strong>GitHub Actions takes over</strong>.</p><p><strong>✔ Workflow #1 — Unit Tests</strong></p><p>File: .github/workflows/python-tests.yml</p><pre>name: Python Tests<br><br>on:<br>  push:<br>    branches: [&quot;main&quot;, &quot;master&quot;]<br>  pull_request:<br><br>jobs:<br>  test:<br>    runs-on: ubuntu-latest<br>    steps:<br>      - uses: actions/checkout@v3<br>      - name: Setup Python<br>        uses: 
actions/setup-python@v4<br>        with:<br>          python-version: &#39;3.10&#39;<br>      - name: Install dependencies<br>        run: |<br>          python -m pip install --upgrade pip<br>          pip install -r requirements.txt<br>      - name: Run tests<br>        env:<br>          PYTHONPATH: ${{ github.workspace }}<br>        run: |<br>          pytest -q</pre><pre>requirements.txt<br><br>openai&gt;=1.0.0<br>PyGithub&gt;=1.59.0<br>python-dotenv&gt;=1.0.0<br>pytest&gt;=7.0.0</pre><p>This runs:</p><ul><li>Standards tests</li><li>Validation tests</li><li>Agent pipeline tests</li><li>Mock LLM &amp; real LLM behavior tests</li></ul><p><strong>✔ Workflow #2 — Terraform Validation</strong></p><p>File: .github/workflows/terraform.yml<br>This workflow:</p><pre>name: Terraform Plan<br><br>on:<br>  pull_request:<br><br>jobs:<br>  terraform:<br>    runs-on: ubuntu-latest<br><br>    steps:<br>    - uses: actions/checkout@v3<br><br>    - name: Setup Terraform<br>      uses: hashicorp/setup-terraform@v3<br>      with:<br>        terraform_wrapper: false<br><br>    - name: Find and Validate All Modules<br>      run: |<br>        echo &quot;🔍 Finding all Terraform modules...&quot;<br>        <br>        # Find all directories containing .tf files<br>        MODULE_DIRS=$(find . 
-type f -name &quot;*.tf&quot; -exec dirname {} \; | sort -u)<br>        <br>        if [ -z &quot;$MODULE_DIRS&quot; ]; then<br>          echo &quot;ℹ️  No Terraform modules found in this PR&quot;<br>          echo &quot;This is normal for PRs that don&#39;t include Terraform code&quot;<br>          exit 0<br>        fi<br>        <br>        echo &quot;Found modules:&quot;<br>        echo &quot;$MODULE_DIRS&quot;<br>        echo &quot;&quot;<br>        <br>        # Validate each module<br>        for dir in $MODULE_DIRS; do<br>          echo &quot;&quot;<br>          echo &quot;📂 Module: $dir&quot;<br>          cd &quot;$dir&quot;<br>          <br>          echo &quot;⚙️  Running terraform init...&quot;<br>          if terraform init -backend=false; then<br>            echo &quot;✅ Init successful&quot;<br>            <br>            echo &quot;📋 Running terraform validate...&quot;<br>            if terraform validate; then<br>              echo &quot;✅ Validation successful&quot;<br>            else<br>              echo &quot;❌ Validation failed&quot;<br>              exit 1<br>            fi<br>            <br>            echo &quot;📊 Running terraform plan...&quot;<br>            if terraform plan -input=false -out=tfplan 2&gt;&amp;1 | tee plan.log; then<br>              echo &quot;✅ Plan successful&quot;<br>              echo &quot;&quot;<br>              echo &quot;📄 Plan output:&quot;<br>              terraform show tfplan<br>            else<br>              PLAN_EXIT=$?<br>              echo &quot;⚠️  Plan failed (exit code: $PLAN_EXIT)&quot;<br>              <br>              # Check if it&#39;s just missing variables (expected for reusable modules)<br>              if grep -q &quot;No value for required variable&quot; plan.log; then<br>                echo &quot;&quot;<br>                echo &quot;ℹ️  This is a reusable module that requires input variables.&quot;<br>                echo &quot;This is EXPECTED behavior. 
The module syntax is valid.&quot;<br>                echo &quot;&quot;<br>                echo &quot;Missing variables:&quot;<br>                grep &quot;variable \&quot;&quot; plan.log | head -10<br>                echo &quot;&quot;<br>                echo &quot;✅ Module validation passed (plan failure due to missing vars is OK)&quot;<br>              else<br>                echo &quot;&quot;<br>                echo &quot;❌ Plan failed with actual errors:&quot;<br>                cat plan.log<br>                exit 1<br>              fi<br>            fi<br>          else<br>            echo &quot;❌ Init failed&quot;<br>            exit 1<br>          fi<br>          <br>          cd - &gt; /dev/null<br>        done<br>        echo &quot;&quot;<br>        echo &quot;✅ All modules validated successfully!&quot;</pre><ul><li>Discovers modules in the PR</li><li>Runs:</li></ul><pre>terraform init<br>terraform validate</pre><h3>Final Conclusion — You Just Built a Digital DevOps Engineer</h3><p>This is not just an AI demo.</p><p>You now have a <strong>fully operational, cloud‑agnostic, Terraform‑agnostic Agentic DevOps system</strong>, capable of:</p><ul><li>Understanding natural language</li><li>Generating Terraform modules</li><li>Enforcing standards</li><li>Auto-fixing code</li><li>Creating GitHub PRs</li><li>Passing unit tests</li><li>Running terraform validate + plan</li><li>Operating across AWS / Azure / GCP / any provider</li><li>Running in any pipeline</li></ul><p>This is the future of platform engineering:</p><ul><li>Consistent</li><li>Automated</li><li>Secure</li><li>Extensible</li><li>Agentic</li></ul><p>And you’ve built the <strong>first working version</strong>.</p><p>Now extend it:</p><ul><li>Add tfsec / tflint security scanning</li><li>Add Infracost cost intelligence</li><li>Add policy-as-code (OPA/Rego)</li><li>Add Slack/Jira approvals</li><li>Add RAG with your internal playbooks</li><li>Add multi-cloud capabilities</li><li>Add Terratest 
integration</li></ul><blockquote>Your agent is no longer theory — <strong>it’s an operational teammate.<br>Welcome to Agentic DevOps. 🚀</strong></blockquote><h3>🛠️ <strong>What You Can Build Next (Beyond Terraform)</strong></h3><p>The pattern you built in Part 2 is not limited to Terraform.</p><p><em>By swapping the generation prompt and your validation logic, you can build additional agents that safely accelerate development across your entire organization:</em></p><h3>🧩 Helm Agent</h3><p>Generate:</p><ul><li>Chart.yaml</li><li>values.yaml</li><li>templates</li><li>non‑privileged containers</li><li>resource limits</li><li>required annotations</li><li>org‑approved patterns</li></ul><h3>🧩 Kubernetes Manifest Agent</h3><p>Generate Deployments, Services, Ingress, HPA, RBAC with:</p><ul><li>Policy checks</li><li>OPA/Conftest validation</li><li>Security constraints</li><li>Label/annotation standards</li></ul><h3>🧩 CI/CD Pipeline Agent</h3><p>Generate GitHub Actions / GitLab CI / Jenkins pipelines using:</p><ul><li>Org‑standard workflows</li><li>Security gating</li><li>Approval flows</li></ul><h3>🧩 Policy‑as‑Code Agent</h3><p>Generate:</p><ul><li>OPA/Rego rules</li><li>Gatekeeper constraints</li><li>Governance or compliance templates</li></ul><h3>🧩 Any Config Agent</h3><ul><li>Dockerfiles</li><li>API Gateway configs</li><li>Monitoring dashboards</li><li>Secrets templates</li><li>CloudFormation</li><li>Kustomize</li></ul><blockquote>Everything becomes <strong>automatable</strong> with your organizational rules.</blockquote><blockquote>This means your teams can drastically reduce development time while staying <strong>secure</strong>, <strong>consistent</strong>, and <strong>aligned with internal standards</strong>.</blockquote><blockquote>Your Terraform Agent is simply the <em>first example</em> of what’s possible with Agentic DevOps.</blockquote><h3>🔜 What’s Coming in Part 3 — Automated Copilot PR Reviews</h3><p>Now that the agent can generate Terraform modules, validate 
them, auto‑fix issues, create PRs, and run full CI/CD checks, the next logical step is improving your <strong>review workflow</strong>.</p><hr><p><a href="https://pub.towardsai.net/part-2-how-to-build-your-own-ai-agent-cloud-agnostic-fully-automated-enterprise-ready-ec3c749570ac">📘 Part 2 — How to Build Your Own AI Agent: (Cloud-Agnostic, Fully Automated, Enterprise-Ready)</a> was originally published in <a href="https://pub.towardsai.net">Towards AI</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ Gen AI vs Agentic AI vs Traditional AI]]></title>
            <link>https://medium.com/generative-ai-revolution-ai-native-transformation/gen-ai-vs-agentic-ai-vs-traditional-ai-051a5dc1e0f7?source=rss-64295cea6d86------2</link>
            <guid isPermaLink="false">https://medium.com/p/051a5dc1e0f7</guid>
            <category><![CDATA[ai-for-beginners]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[Cherukuri sai]]></dc:creator>
            <pubDate>Mon, 16 Feb 2026 02:29:42 GMT</pubDate>
            <atom:updated>2026-02-16T09:31:31.472Z</atom:updated>
            <content:encoded><![CDATA[<blockquote><strong><em>Part 1: What They Are, How They Work, and When to Use Which</em></strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*__UNfiUmCPNjh18tOZyfrQ.jpeg" /></figure><p>Artificial Intelligence is evolving so fast that even experts sometimes struggle to keep up. New terms appear every month: <strong>Gen AI</strong>, <strong>Agentic AI</strong>, <strong>Predictive AI</strong>, <strong>Foundational Models</strong>, and many more.<br>For someone starting their AI journey — or even someone already working in tech — it’s easy to feel overwhelmed.</p><p><strong><em>We hear terms like:</em></strong></p><ul><li>Generative AI</li><li>Agentic AI</li><li>Autonomous AI</li><li>AI Agents</li><li>LLMs</li><li>Machine Learning</li><li>RAG</li><li>Copilots</li></ul><blockquote>But most people don’t truly understand:<br> 👉 How they actually work<br> 👉 What makes them different<br> 👉 When to use which one<br> 👉 And what to learn first</blockquote><p><strong><em>Let’s simplify everything.</em></strong></p><h3>A. Predictive AI (Traditional Machine Learning)</h3><p>This is the AI most companies have been using for years. 
It answers very specific questions:</p><ul><li>Will the customer churn?</li><li>What is the credit risk?</li><li>What is the predicted demand?</li></ul><blockquote>Predictive AI is good at <strong>classification</strong>, <strong>regression</strong>, and <strong>pattern recognition</strong>, but it cannot generate new content or act on its own.</blockquote><p><strong>Best for:</strong> Finance, analytics, forecasting, retail, operations.</p><h4>How It Works</h4><ol><li><em>Collect labeled data</em></li><li><em>Train a model</em></li><li><em>Validate it</em></li><li><em>Deploy it</em></li><li><em>Model predicts output</em></li></ol><p><strong>It does not create new content.</strong><br><em>It only predicts based on patterns it learned.</em></p><h4>When to Use It</h4><p>✅ When you need prediction<br>✅ When you have structured data<br>✅ When outcomes are measurable</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/713/1*9uIX2aySNoTFOQ3UdP5vRw.png" /></figure><h3>B. Generative AI (Gen AI)</h3><p><em>This is what most people refer to when they say “AI” today.<br></em><strong>Gen AI creates new things:</strong></p><ul><li>Text</li><li>Images</li><li>Code</li><li>Reports</li><li>Designs</li></ul><p>Models like GPT, Llama, Claude, etc., are all examples of Gen AI.</p><p><strong>What Gen AI does well:</strong></p><ul><li>Summarizing</li><li>Explaining complex topics</li><li>Brainstorming ideas</li><li>Writing code</li><li>Drafting emails and documentation</li></ul><blockquote><strong>But note</strong>:<br>Gen AI is still <strong>reactive</strong> — it waits for your instructions. It doesn’t take initiative.</blockquote><p><strong>Best for:</strong> Creators, analysts, students, developers, business teams.</p><pre>Massive Data → Train Foundation Model → User Prompt → AI Processing → Generated Content</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/608/1*Wpc31uD6BbR2jfjoLW9nsA.png" /></figure><h3>C. 
Agentic AI (AI Agents)</h3><p>Agentic AI is the next big leap. Unlike Gen AI, which only responds to prompts, <strong>Agentic AI can take action</strong>.<br>Think of an AI intern or digital employee that can:</p><ul><li>Plan tasks</li><li>Make decisions</li><li>Execute steps autonomously</li><li>Use tools or software</li><li>Monitor progress</li><li>Correct itself</li></ul><p><strong>Examples:</strong></p><ul><li>An AI agent that books your flights</li><li>An agent that runs testing workflows</li><li>Agents that analyze documents, then update dashboards, then notify teams</li><li>Agents that manage customer support tickets end‑to‑end</li></ul><p>Agentic AI = <strong>Autonomous, goal-driven, action-taking AI</strong>.</p><blockquote>This is where the future is heading.</blockquote><p><strong>Best for:</strong> Automation, operations, DevOps, QA, business workflows, enterprise systems.</p><pre>User Goal → Agent Reasoning → Plan → Use Tools → Execute → Final Result</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/927/1*2K8vRoj5xVD95FKkA9uxXg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/933/1*al84RhdJUiJNZVfj-jEimg.png" /></figure><h3>2. 
How to Know Which AI to Choose</h3><p>A common question people ask is:<br><strong>“Which AI should I choose when building a product or learning a new skill?”</strong><br> Here’s a simple rule of thumb.</p><blockquote><strong>•<em> Choose Predictive AI if</em>:<br></strong>You want <strong>numbers</strong>, <strong>probabilities</strong>, or <strong>forecasts</strong>.<br><strong>Examples</strong>: risk scoring, time-series forecasting, anomaly detection.</blockquote><pre>Historical Data → Feature Engineering → ML Model → Prediction Score → Dashboard / Alert</pre><blockquote><strong>• <em>Choose Gen AI if</em>:<br></strong>You want AI to <strong>generate content</strong> or provide <strong>knowledge-driven insights</strong>.<br><strong>Examples</strong>: customer replies, documentation, email drafting, coding help.</blockquote><pre>User Question → LLM → Knowledge Base (RAG) → Generated Response → User</pre><blockquote><strong>• <em>Choose Agentic AI if</em>:<br></strong>You want AI to <strong>take actions</strong>, not just respond.<br><strong>Examples</strong>: autonomous testing, workflow automation, CRM updates, financial reconciliation.</blockquote><pre>User Goal → Agent (LLM Brain) → Planning → Tool Usage (API / DB / Browser) → Execution → Result</pre><h3>3. 
How These AIs Actually Work (A Simple Breakdown)</h3><h3>Predictive AI (ML)</h3><ul><li>Learns patterns from structured data</li><li>Maps input → output</li><li>Doesn’t “understand” meaning</li><li>Cannot generate new content</li></ul><h3>Gen AI</h3><ul><li>Trained on massive text, code, or image datasets</li><li>Learns relationships between words, sentences, or pixels</li><li>Uses statistical patterns to generate new content</li><li>Can reason “as if” it understands context</li></ul><h3>Agentic AI</h3><ul><li>Uses Gen AI as a “brain”</li><li>Adds memory, tools, decision logic, and feedback loops</li><li>Can connect to apps, APIs, databases</li><li>Can plan, act, evaluate, and improve itself</li></ul><p>In short: <br><strong>Predictive AI = analysis</strong><br><strong>Gen AI = creation</strong><br><strong>Agentic AI = action</strong></p><h3>4. For Beginners: How Should You Start Learning?</h3><p>If you’re new to AI, don’t jump directly into advanced agent frameworks.<br><strong>Start with a foundation.</strong></p><blockquote><strong><em>Step 1: Understand the fundamentals</em></strong></blockquote><blockquote>What is ML?</blockquote><blockquote>What is Gen AI?</blockquote><blockquote>What problem is each model solving?</blockquote><blockquote><strong><em>Step 2: Learn to use Gen AI tools (hands-on)</em></strong></blockquote><blockquote>ChatGPT</blockquote><blockquote>Gemini</blockquote><blockquote>Claude</blockquote><blockquote>Llama</blockquote><blockquote>GitHub Copilot</blockquote><blockquote><strong><em>This builds intuition.</em></strong></blockquote><blockquote><strong><em>Step 3: Learn Prompt Engineering</em></strong></blockquote><blockquote>This helps you interact with AI systems effectively.</blockquote><blockquote><strong><em>Step 4: Learn Applied AI Skills</em></strong></blockquote><blockquote>Vector databases</blockquote><blockquote>RAG (Retrieval-Augmented Generation)</blockquote><blockquote>Embeddings</blockquote><blockquote>Model 
evaluation</blockquote><blockquote><strong><em>Step 5: Move into Agentic AI</em></strong></blockquote><blockquote>Once comfortable, explore:</blockquote><blockquote>LangGraph</blockquote><blockquote>AutoGen</blockquote><blockquote>CrewAI</blockquote><blockquote>OpenAI Agents</blockquote><blockquote>Microsoft Autogenics (when available)</blockquote><p><strong>This is where future jobs will be.</strong></p><h3>5. For Professionals: How to Decide What to Build</h3><p>If you’re already working with AI or building AI tools, use this strategy:<br><strong>Ask yourself these questions:</strong></p><ol><li>Do I just need insights? → <strong>Predictive AI</strong></li><li>Do I need content or explanation? → <strong>Gen AI</strong></li><li>Do I need automation and actions? → <strong>Agentic AI</strong></li><li>Do I need domain expertise embedded? → <strong>Fine-tuned models</strong></li><li>Do I need the AI to learn from company knowledge? → <strong>RAG system</strong></li></ol><p>This framework helps avoid confusion and prevents overengineering.</p><h3>Conclusion:</h3><p>AI is evolving faster than ever, but the truth is simple: not all AI is the same, and not every AI solves the same problem. Predictive AI helps you <em>analyze</em>, Generative AI helps you <em>create</em>, and Agentic AI helps you <em>act</em>. Once you understand these three pillars, the entire AI landscape becomes clearer, and choosing the right approach stops being confusing.</p><p>If you’re just starting, begin with the basics — learn how Gen AI works and how LLMs think. If you’re already in the field, focus on choosing the right AI based on the problem, not the hype. 
And if you’re building for the future, prepare for Agentic AI, because that’s where real automation, intelligence, and impact are heading.</p><p>In the next part, we’ll go deeper into the future of AI — <strong>how to actually build your own agent</strong>, how tools, memory, and reasoning loops work, and why understanding these systems will soon become as essential as learning to code.</p><p>The world is moving toward intelligent workflows and autonomous systems. With the right foundation, <strong>you won’t just follow that future — you’ll help build it</strong>.</p><hr><p><a href="https://medium.com/generative-ai-revolution-ai-native-transformation/gen-ai-vs-agentic-ai-vs-traditional-ai-051a5dc1e0f7">🚀 Gen AI vs Agentic AI vs Traditional AI</a> was originally published in <a href="https://medium.com/generative-ai-revolution-ai-native-transformation">Agentic AI &amp; GenAI Revolution</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Terraform Blueprint (2026): How to Structure, Scale & Secure Your Infrastructure‑as‑Code]]></title>
            <link>https://awstip.com/the-terraform-blueprint-2026-how-to-structure-scale-secure-your-infrastructure-as-code-b35c9e637c80?source=rss-64295cea6d86------2</link>
            <guid isPermaLink="false">https://medium.com/p/b35c9e637c80</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[cloud]]></category>
            <category><![CDATA[terraform]]></category>
            <category><![CDATA[infrastructure-as-code]]></category>
            <category><![CDATA[cloud-security]]></category>
            <dc:creator><![CDATA[Cherukuri sai]]></dc:creator>
            <pubDate>Sun, 15 Feb 2026 08:36:36 GMT</pubDate>
            <atom:updated>2026-02-23T19:34:38.933Z</atom:updated>
            <content:encoded><![CDATA[<p><em>By — Cloud | DevSecOps | SRE | Platform Engineering</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*ATssnvB95EYglqykqDSvMQ.avif" /></figure><h3>Table of Contents</h3><ol><li>Introduction</li><li>Repository Architecture</li><li>Module Design Principles</li><li>Environment &amp; State Structure</li><li>IaC Security &amp; Scanning</li><li>Secure Terraform Execution</li><li>Drift Detection &amp; Observability</li><li>Terraform Maturity Framework</li><li>Conclusion</li></ol><h3>1. Introduction</h3><blockquote>Terraform has evolved from a simple provisioning tool into the backbone of modern cloud infrastructure. Today, teams rely on it to manage thousands of resources across AWS, Azure, and GCP — yet scaling Terraform successfully is harder than most engineering leaders expect.</blockquote><blockquote>Misconfigured modules, unmanaged drift, weak pipelines, and security gaps often become hidden liabilities inside cloud environments. The good news? These problems are completely avoidable with the right structure, patterns, and guardrails.</blockquote><blockquote>This guide distills the <strong>core principles, best practices, and real‑world patterns</strong> that high‑performing cloud, DevOps, and SRE teams use to keep Terraform secure, predictable, and scalable.</blockquote><blockquote>From repository design to IaC scanning, CI/CD hardening, drift detection, and multi‑environment state strategy — this blueprint gives you everything you need to build Terraform the <em>right</em> way.</blockquote><blockquote>Whether you’re improving an existing Terraform setup or building a new foundation, this framework will help you deliver <strong>reliable IaC with confidence.</strong></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*s1u1PZSf2wQXwDYoKJiepw.png" /></figure><h3>2. 
Repository Architecture for Scalable IaC</h3><blockquote><strong>2.1 Mono‑Repo<br>Strengths:</strong><br> • Centralized governance<br> • Consistent patterns<br> • Easier module management<br><strong>Weaknesses:</strong><br> • Can slow down independent teams</blockquote><blockquote><strong>2.2 Service‑Scoped Repos<br>Strengths:</strong><br> • Fast iteration per team<br> • Clear ownership boundaries<br><strong>Weaknesses:</strong><br> • Duplication<br> • Harder to enforce universal standards</blockquote><blockquote><strong>2.3 Hybrid (Recommended)<br></strong>The <strong>core‑infra → module‑registry → app‑repo</strong> model ensures:<br> • cross‑team consistency<br> • ability to scale<br> • module reusability<br> • minimal duplication</blockquote><pre>Example for Hybrid: <br>repo-root/<br> ├── core-infra/<br> ├── modules/<br> │    ├── vpc/<br> │    ├── eks/<br> │    ├── iam/<br> ├── application-services/<br> │    ├── service-a/<br> │    ├── service-b/</pre><h3>3. Module Design Principles That Prevent Chaos</h3><blockquote><strong>3.1 What a Good Module Looks Like<br></strong>A strong module is:<br> • small, composable, and reusable<br> • predictable (inputs/outputs documented)<br> • version‑pinned<br> • never environment‑specific</blockquote><pre>module &quot;storage&quot; {<br>  source = &quot;git::ssh://example.com/storage.git?ref=v1.3.0&quot;<br>  name   = var.name<br>  region = var.region<br>}</pre><blockquote><strong>3.2 Semantic Versioning</strong></blockquote><pre>MAJOR (2.0.0) → breaking changes<br>MINOR (1.1.0) → backward-compatible enhancements<br>PATCH (1.1.1) → bug fixes</pre><blockquote>Pin versions like this:</blockquote><pre>module &quot;vpc&quot; {<br>  source  = &quot;git::ssh://example.com/vpc.git?ref=v1.2.3&quot;<br>}</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/479/1*X-Y-YhlH4cQzAW3TGb-vUA.png" /></figure><h3>4. 
Environment Isolation &amp; State Structure</h3><blockquote>State mistakes are one of the fastest ways to break production.<br><strong>4.1 Isolate Everything<br></strong>Use a separate state per:<br> • dev<br> • qa<br> • staging<br> • prod<br>Never mix them.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/535/1*Ybc1tLvdB2HAW3uI0ATJQA.png" /></figure><blockquote><strong>4.2 Backend Best Practices<br>AWS: S3 backend + native S3 state locking<br>Azure</strong>: Storage Account + blob locking<br><strong>GCP</strong>: GCS + lock management through CI pipeline</blockquote><pre>terraform {<br>  backend &quot;s3&quot; {<br>    bucket       = &quot;mycompany-terraform-prod&quot;<br>    key          = &quot;network/terraform.tfstate&quot;<br>    region       = &quot;us-east-1&quot;<br>    use_lockfile = true<br>  }<br>}</pre><h3>5. IaC Security, Scanning &amp; Policy Enforcement</h3><blockquote>This is one of the most important parts of the entire blueprint.<br><strong><em>5.1 Pre‑Commit Hooks (Local)</em><br></strong>Run automatically before committing:</blockquote><pre>terraform fmt<br>terraform validate<br>tflint<br>tfsec<br>checkov</pre><blockquote><strong><em>5.2 Security Scanners</em></strong></blockquote><blockquote><strong>Tfsec</strong><br> • IAM issues<br> • Network exposure<br> • Missing encryption</blockquote><pre>ERROR: aws-s3-enable-bucket-encryption<br>S3 Bucket encryption is not enabled.</pre><blockquote><strong>Checkov<br> </strong>• Compliance rules<br> • Cloud‑specific misconfigurations<br> • Data leakage prevention</blockquote><blockquote><strong><em>5.3 Policy as Code<br></em></strong>Recommended engines:<br> • <strong>OPA / Conftest</strong><br> • <strong>HashiCorp Sentinel<br></strong>Example: deny untagged resources</blockquote><pre>deny[msg] {<br>  input.resource.tags == {}<br>  msg = &quot;All resources must include tags.&quot;<br>}</pre><h3>6. 
Secure Terraform Execution</h3><blockquote>❌ Never Run Terraform Locally<br>Local terraform apply introduces:<br> • drift<br> • audit gaps<br> • privilege risks<br> • shadow infrastructure</blockquote><blockquote>✅ Always Use CI/CD Runners</blockquote><h3>6.1 Secure Execution Identity</h3><blockquote><strong>Best practices</strong>:<br> • Federated identities (OIDC)<br> • No static credentials<br> • Least‑privilege roles<br> • Short‑lived tokens</blockquote><blockquote><strong><em>6.2 Recommended Pipeline</em></strong></blockquote><pre><br><br>+-----------------------------+<br>|        Developer           |<br>|      Git Commit/PR         |<br>+-------------+--------------+<br>              |<br>              v<br>+-------------+--------------+<br>|      Harness CI Stage      |<br>|----------------------------|<br>| - Checkout                 |<br>| - Build &amp; Test             |<br>| - SAST/DAST                |<br>| - Build Docker Image       |<br>| - Push to Artifact Repo    |<br>+-------------+--------------+<br>              |<br>              v<br>+-------------+--------------+<br>| Harness CD: Terraform IaC  |<br>|----------------------------|<br>| terraform init             |<br>| terraform fmt/validate     |<br>| tflint / tfsec / checkov   |<br>| terraform plan             |<br>| Approval Step              |<br>| terraform apply            |<br>+-------------+--------------+<br>              |<br>              v<br>+-------------+--------------+<br>|  Harness Deploy Stage      |<br>|----------------------------|<br>| Helm / K8s |<br>| Health Checks              |<br>| Feature Flags (optional)   |<br>+-------------+--------------+<br>              |<br>              v<br>+-------------+--------------+<br>| Observability &amp; Governance |<br>|----------------------------|<br>| Logs, Metrics, Traces      |<br>| Drift Detection            |<br>| Notifications              |<br>+----------------------------+</pre><h3>7. 
Drift Detection &amp; Observability</h3><blockquote><strong><em>7.1 Automated Drift Checks<br></em></strong>Run daily/weekly:</blockquote><pre><br>terraform plan -detailed-exitcode<br>if [ $? -eq 2 ]; then<br>  echo &quot;Drift Detected!&quot;<br>fi<br></pre><blockquote><strong>Send output to</strong>:<br> • Slack<br> • Teams<br> • Jira<br> • GitHub issues</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/715/1*mDGvg4vkh2MLElRA0U1sHg.png" /></figure><blockquote><strong><em>7.2 Observability Alignment</em></strong><br>Integrate Terraform outputs with:<br> • Datadog<br> • Prometheus<br> • New Relic<br> • CloudWatch<br> • Azure Monitor<br> • GCP Monitoring</blockquote><blockquote>This enables cross‑visibility between configuration and runtime signals.</blockquote><h3>8. Terraform Maturity Framework</h3><blockquote>A simple view of Terraform maturity:<br><strong>Level 1 — Basic:</strong><br> • Local applies<br> • Zero scanning<br><strong>Level 2 — Standardized:</strong><br> • Structured repos<br> • Versioned modules<br><strong>Level 3 — Governed:</strong><br> • Scanning enforced<br> • Policy‑as‑code<br> • Drift detection<br><strong>Level 4 — Operational Excellence:</strong><br> • Multi‑cloud consistency<br> • IaC observability<br><strong>Level 5 — Intelligent Automation:</strong><br> • Predictive analysis<br> • AI‑assisted Terraform plans<br> • Automated remediation</blockquote><h3>9. Conclusion</h3><blockquote>A well‑designed Terraform ecosystem isn’t just about writing modules or running plans — it’s about building a foundation that teams can trust as they scale. When your IaC is structured, secure, reviewed, tested, and continuously validated through automation, it becomes a force multiplier for every cloud initiative that follows.</blockquote><blockquote>The patterns in this blueprint are not theoretical. 
They’re the practices that consistently separate stable, predictable cloud environments from ones held together by tribal knowledge and luck. Whether you’re modernizing legacy infrastructure or enabling a high‑velocity platform team, adopting these principles will help you eliminate drift, reduce risk, and accelerate delivery with confidence.</blockquote><blockquote>Infrastructure‑as‑Code should empower teams, not slow them down. With the right pipelines, security checks, and execution workflows in place, Terraform becomes a strategic advantage — unlocking a cloud environment that is reproducible, scalable, and aligned with the operational rigor today’s engineering landscape demands.</blockquote><blockquote>If you’re investing in Terraform today, invest in doing it right. Your future platform will thank you.</blockquote><p>If you found this guide helpful, follow me here on Medium for more deep‑dives on Cloud Architecture, DevSecOps, Terraform, Kubernetes, SRE practices, CI/CD pipelines, and Platform Engineering.</p><p>I publish hands‑on insights, real implementation patterns, and practical frameworks to help engineers and leaders build secure, scalable, and reliable cloud platforms.</p><p>➡️ Follow for more content like this.<br>➡️ Share this article if it helped you.</p><hr><p><a href="https://awstip.com/the-terraform-blueprint-2026-how-to-structure-scale-secure-your-infrastructure-as-code-b35c9e637c80">The Terraform Blueprint (2026): How to Structure, Scale &amp; Secure Your Infrastructure‑as‑Code</a> was originally published in <a href="https://awstip.com">AWS Tip</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>