Can AI Automate Cloud Development to a New Level?

Introducing CloudAI — The Next Leap in Cloud Infrastructure Automation

Jeremy Wang
5 min read · Jan 2, 2024

TL;DR

  • CloudAI places an LLM agent at the helm of each key role (PM, DevOps, and SRE) and tool (CircleCI, Jenkins, Splunk, Datadog, etc.) in the infrastructure management lifecycle.
  • These agents communicate with each other in natural language, breaking down silos and automating processes that were previously manual and complex.
  • CloudAI is estimated to reduce development cycles by up to 70% and save up to 50% of the time typically spent on coding and error analysis.
  • Demo: click me

The challenge in modern DevOps

In the modern landscape of DevOps and cloud infrastructure deployment, we have an array of specialized tools at our disposal, each catering to a different stage of the deployment pipeline. From Terraform for IaC and Checkov for code quality to Harness for CI/CD and Splunk for monitoring, these tools excel at automating specific tasks. However, a critical gap remains in the automation continuum, particularly in the information flow from downstream to upstream processes.

Although our current systems efficiently automate the linear progression of tasks — for instance, when code changes are committed, they are automatically picked up and deployed — they falter when it comes to reverse information flow. When CI/CD tools encounter errors, or when monitoring tools flag issues, the onus falls on DevOps professionals and Site Reliability Engineers (SREs) to interpret these errors, which are often communicated in human-readable logs. They must then engage in logical reasoning to trace the source of the problem and update the code accordingly. This process is both time-consuming and error-prone, requiring manual intervention and expert knowledge.

CloudAI: A Synergistic Enhancement to DevOps Ecosystems

CloudAI is designed to mimic essential roles such as DevOps and SRE, and interface with various tools, such as Jenkins, Harness, New Relic, and Splunk.

It achieves this by positioning an LLM agent in charge of each role and tool, thereby enabling the exchange of information in a natural language format. This unique approach doesn’t seek to disrupt or replace the existing ecosystem; rather, it enriches it, making the interactions between various tools and roles more intuitive and efficient.

The full architecture of CloudAI is intricate enough to deserve its own detailed write-up, so let’s start with the first two agents in the framework:

Code Agent: Focused on managing your Infrastructure as Code (IaC) repository, particularly for Terraform, this agent is responsible for creating and updating code. It can also review code for optimization and compliance with best practices.

Deployment Agent: This agent works in concert with key deployment tools like Jenkins, GitLab CI, and AWS CodePipeline. Its role extends beyond mere deployment; it actively monitors CI/CD pipeline logs, identifies errors or warnings, and communicates these findings back to the Code Agent. The Code Agent then updates the code, triggering another round of deployment. This feedback loop repeats until a successful deployment.
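To make the feedback loop concrete, here is a minimal Python sketch of it. CloudAI's actual implementation isn't public, so every name here (DeployResult, feedback_loop, fake_deploy, fake_revise) is hypothetical, and the two agents are replaced by plain stub functions rather than LLM calls. The retry budget (max_rounds) is also my assumption; the loop as described above simply repeats until deployment succeeds.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeployResult:
    ok: bool   # did the pipeline run succeed?
    log: str   # pipeline output handed back to the Code Agent


def feedback_loop(
    code: str,
    deploy: Callable[[str], DeployResult],
    revise: Callable[[str, str], str],
    max_rounds: int = 5,
) -> Optional[str]:
    """Deploy; on failure pass the log to the reviser and try again.

    Returns the code that deployed successfully, or None if the
    retry budget runs out.
    """
    for _ in range(max_rounds):
        result = deploy(code)
        if result.ok:
            return code
        code = revise(code, result.log)
    return None


# Stub stand-ins for the Deployment Agent and the Code Agent (hypothetical).
def fake_deploy(code: str) -> DeployResult:
    if "max_size = 5" in code:
        return DeployResult(True, "Apply complete")
    return DeployResult(False, "Error: max_size must be >= desired_size")


def fake_revise(code: str, log: str) -> str:
    return code.replace("max_size = 3", "max_size = 5")


final_code = feedback_loop("desired_size = 5\nmax_size = 3", fake_deploy, fake_revise)
print(final_code)
```

In a real system the two callables would wrap LLM calls and a CI/CD API; the loop structure itself stays the same.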

Let me demonstrate a real example below.

A Demo of CloudAI’s Code Agent and CI/CD (Deployment) Agent Synergy

Let me talk through each response in this demo:

In this demonstration, we start with a simple request: ‘Can you update the desired_size in my EKS nodegroup to 5?’ Notice how there’s no mention of specific files or code segments. That’s because our code agent doesn’t need them. It has already embedded the entire codebase in a vector database, so it can retrieve the relevant code from the query alone. This matters because a request coming from another agent, rather than a human, often won’t include file names or line references.
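Here is a toy Python sketch of that retrieval step, to show the idea rather than the real thing. It assumes a bag-of-words vector and cosine similarity in place of a learned embedding model and an actual vector database, and the file paths and Terraform snippets are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, chunks: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank code chunks by similarity to the query, best match first."""
    q = embed(query)
    scored = [(path, cosine(q, embed(path + " " + text))) for path, text in chunks.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical repository contents.
chunks = {
    "vpc/main.tf": 'resource "aws_vpc" "main" {\n  cidr_block = "10.0.0.0/16"\n}',
    "eks/nodegroup.tf": (
        'resource "aws_eks_node_group" "workers" {\n'
        "  scaling_config {\n"
        "    desired_size = 2\n"
        "    max_size     = 3\n"
        "    min_size     = 1\n"
        "  }\n"
        "}"
    ),
    "s3/bucket.tf": 'resource "aws_s3_bucket" "logs" {\n  bucket = "example-logs"\n}',
}

results = search("Can you update the desired_size in my EKS nodegroup to 5?", chunks)
print(results[0][0])
```

The query never names a file, yet the node group definition ranks first because it shares the most vocabulary with the request.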

The code agent swiftly identifies the relevant section and makes the necessary update, creating a git diff file. This diff is not just a change log; it’s a bridge allowing for the smooth merging of the code agent’s updates back into the original repository, facilitating automatic commits, and triggering subsequent actions in the deployment pipeline.
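As a rough illustration of what such a diff looks like, the sketch below uses Python's standard difflib module to produce a unified diff for the desired_size change. The file path and Terraform snippet are hypothetical, and this is only my assumption about the shape of the output, not CloudAI's actual code.

```python
import difflib


def make_diff(path: str, before: str, after: str) -> str:
    """Build a unified diff between two versions of one file."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )


# Hypothetical file contents before and after the Code Agent's edit.
before = (
    "scaling_config {\n"
    "  desired_size = 2\n"
    "  max_size     = 3\n"
    "  min_size     = 1\n"
    "}\n"
)
after = before.replace("desired_size = 2", "desired_size = 5")

patch = make_diff("eks/nodegroup.tf", before, after)
print(patch)
```

Emitting a patch instead of a whole rewritten file keeps the change small and reviewable, and patch-applying tools can merge it back into the repository.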

However, automation isn’t infallible. The pipeline halts with an error: the max_size must be equal to or greater than the desired_size. Here’s where the CI/CD agent (the Deployment Agent described earlier) shines: it interprets the pipeline logs, distills the error message, and communicates the issue back to the code agent for correction.
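The distilling step can be pictured as a filter that keeps only the error lines and a little surrounding context. The Python sketch below is a deliberately simple, regex-based stand-in; in CloudAI an LLM agent does the interpreting, and the log lines, the keyword list, and the distill_log name are all hypothetical.

```python
import re

# Keywords treated as failure signals (an assumption for this sketch).
ERROR_PATTERN = re.compile(r"\b(error|failed|fatal)\b", re.IGNORECASE)


def distill_log(log: str, context: int = 1) -> str:
    """Keep lines that look like errors, plus `context` lines around each."""
    lines = log.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return "\n".join(lines[i] for i in sorted(keep))


# Hypothetical pipeline log.
pipeline_log = """\
[12:00:01] terraform init
[12:00:05] terraform plan
[12:00:09] Plan: 0 to add, 1 to change, 0 to destroy.
[12:00:10] terraform apply -auto-approve
[12:00:14] Error: max_size must be equal to or greater than desired_size
[12:00:14] Pipeline step 'apply' exited with code 1"""

summary = distill_log(pipeline_log)
print(summary)
```

A pre-filter like this could also shrink the log before it is handed to an LLM, which keeps the prompt short when pipelines produce thousands of lines.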

Understanding the feedback, the code agent analyzes the log summary and adjusts the code again, this time ensuring the max_size meets the new desired_size.

Introducing the Remaining Key Agents

Having explored the Code and Deployment Agents, let’s now complete the picture of CloudAI by introducing the three additional agents that play pivotal roles in this innovative framework:

Human Proxy Agent: This agent serves as the central hub of coordination, managing the flow of communication among the various agents. It’s uniquely designed to also accept human input when necessary. In scenarios where the debugging process veers off course, or when crucial steps require human approval for continuation, the Human Proxy Agent steps in. This agent ensures that human oversight and decision-making remain integral parts of the process, blending automated efficiency with human judgment.
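One way to picture the Human Proxy Agent is as a message router with an approval gate. The Python sketch below is my own simplification: the Message and HumanProxy types, the needs_approval flag, and the ask_human callback are all hypothetical, and the "human" is simulated by a lambda so the example runs unattended.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    sender: str
    recipient: str
    body: str
    needs_approval: bool = False  # set for steps that require human sign-off


@dataclass
class HumanProxy:
    ask_human: Callable[[Message], bool]  # returns True if the human approves
    delivered: List[Message] = field(default_factory=list)
    blocked: List[Message] = field(default_factory=list)

    def route(self, msg: Message) -> bool:
        """Forward a message, pausing for human approval when it is flagged."""
        if msg.needs_approval and not self.ask_human(msg):
            self.blocked.append(msg)
            return False
        self.delivered.append(msg)
        return True


# Simulated human: rejects anything that touches prod.
proxy = HumanProxy(ask_human=lambda m: "prod" not in m.body)

proxy.route(Message("code", "deployment", "apply diff to staging"))
proxy.route(Message("code", "deployment", "apply diff to prod", needs_approval=True))

print(len(proxy.delivered), len(proxy.blocked))
```

Routing everything through one hub means there is a single place to add approval rules, audit logs, or a kill switch.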

Agile Board Agent: Specializing in project management, this agent interfaces with systems like Jira for efficient project tracking. When a project manager inputs a new feature request or a ticket into the sprint board, the Agile Board Agent springs into action. It interprets and reformulates these requests, and then communicates them to the Code Agent to initiate the development cycle. This agent transforms project management from a manual task into an automated, streamlined process.

Monitoring Agent: Integrated with state-of-the-art monitoring tools such as Datadog, Splunk, and Prometheus, this agent is the vigilant guardian of your system. It is triggered by monitoring tool alarms. Upon detecting an issue, it collaborates with the Agile Board Agent to create a ticket, ensuring the issue is tracked and addressed. It then works in tandem with the Code Agent and Deployment Agent, persisting in trial-and-error cycles until the alarm is resolved. This agent is key to maintaining system health and proactively addressing potential issues.
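The alarm-driven cycle can be sketched in a few lines of Python. Everything here is hypothetical: the Alarm and Ticket types, the handle_alarm function, and the callbacks standing in for the Code, Deployment, and Agile Board Agents. The attempt cap and the "needs_human" escalation are my assumptions, in the spirit of the Human Proxy Agent's role described earlier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alarm:
    name: str
    detail: str


@dataclass
class Ticket:
    title: str
    status: str = "open"
    attempts: int = 0


def handle_alarm(
    alarm: Alarm,
    attempt_fix: Callable[[Alarm], None],
    alarm_cleared: Callable[[Alarm], bool],
    max_attempts: int = 3,
) -> Ticket:
    """Open a ticket, then retry fixes until the alarm clears or attempts run out."""
    ticket = Ticket(title=f"[auto] {alarm.name}: {alarm.detail}")
    while ticket.attempts < max_attempts:
        ticket.attempts += 1
        attempt_fix(alarm)
        if alarm_cleared(alarm):
            ticket.status = "resolved"
            return ticket
    ticket.status = "needs_human"  # escalate instead of looping forever
    return ticket


# Simulated system: the alarm clears once there are at least 3 replicas.
state = {"replicas": 1}


def scale_up(alarm: Alarm) -> None:
    state["replicas"] += 1


ticket = handle_alarm(
    Alarm("HighLatency", "p99 above threshold"),
    attempt_fix=scale_up,
    alarm_cleared=lambda a: state["replicas"] >= 3,
)
print(ticket.status, ticket.attempts)
```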

Conclusion

CloudAI breaks down traditional communication barriers and streamlines processes while remaining compatible with your current tools and setup.

With its innovative use of LLMs, CloudAI is poised to redefine the boundaries of automation, making cloud infrastructure management more efficient, intelligent, and responsive to human input.

Conservatively estimated, CloudAI has the potential to cut development cycles by up to 70% and to reduce the time employees spend writing code and analyzing logs by at least 50%.

CloudAI isn’t public yet, but if you’re interested in being an early user, feel free to contact me:

linkedin.com/in/jeremyw90/

https://www.cloud-ai.biz/
