<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Nilay Saraiya on Medium]]></title>
        <description><![CDATA[Stories by Nilay Saraiya on Medium]]></description>
        <link>https://medium.com/@nilay.saraiya?source=rss-d34b790fc68e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*Pg63ccl_tq4i9XZHgcuBaQ.jpeg</url>
            <title>Stories by Nilay Saraiya on Medium</title>
            <link>https://medium.com/@nilay.saraiya?source=rss-d34b790fc68e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 10:12:42 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@nilay.saraiya/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[From Alerts to Action: Building an Agentic AI Monitor for JBoss EAP on OpenShift]]></title>
            <link>https://medium.com/@nilay.saraiya/from-alerts-to-action-building-an-agentic-ai-monitor-for-jboss-eap-on-openshift-179c41987ff0?source=rss-d34b790fc68e------2</link>
            <guid isPermaLink="false">https://medium.com/p/179c41987ff0</guid>
            <category><![CDATA[sre]]></category>
            <category><![CDATA[automation]]></category>
            <category><![CDATA[platform-engineering]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <category><![CDATA[red-hat-openshift]]></category>
            <dc:creator><![CDATA[Nilay Saraiya]]></dc:creator>
            <pubDate>Sun, 12 Apr 2026 06:42:07 GMT</pubDate>
            <atom:updated>2026-04-12T06:47:53.484Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*A14QcbyofWoNkC4C5oxYzQ.png" /></figure><p>Most monitoring systems are good at telling us that something is wrong.</p><p>They are much less good at turning that signal into a clear, operator-ready explanation of what probably happened, what to check next, and how to fix it.</p><p>That gap is where on-call time disappears. A pod crashes, an alert fires, a health check fails, or an application log starts throwing errors. The monitoring stack does its job and raises its hand. Then a human still has to collect context from OpenShift, inspect logs, connect the symptoms to the platform and application runtime, decide whether this is a JBoss issue, a Kubernetes issue, a resource issue, or a deployment issue, and finally create a ticket someone else can act on.</p><p>I built a project to explore what happens when that triage loop becomes agentic.</p><p>The project is called <strong>JBoss AI Monitor</strong>. It is a Python-based monitoring agent that runs inside OpenShift, watches a JBoss EAP 8 workload, uses Claude to generate structured root-cause analysis, and automatically creates JIRA tickets with the diagnostic context and recommended resolution steps.</p><p>GitHub repository: <a href="https://github.com/nnsaraiya/jboss-ai-agent-monitor.git">https://github.com/nnsaraiya/jboss-ai-agent-monitor.git</a></p><h3><strong>The Problem I Wanted to Solve</strong></h3><p>In many enterprise environments, JBoss EAP and WildFly workloads are still business-critical. 
They often sit behind modern platform layers like OpenShift, Prometheus, AlertManager, JIRA, and incident-management tooling.</p><p>That tooling gives teams visibility, but the first few minutes of diagnosis are still repetitive:</p><p>- Which pod or container failed?</p><p>- Was it an OOM kill, a crash loop, an image pull issue, or a health-check failure?</p><p>- Are there useful log lines near the failure?</p><p>- Is this related to JVM heap sizing, OpenShift resource limits, readiness probes, deployment configuration, or an application exception?</p><p>- What should the ticket say so the next engineer does not have to rediscover the same context?</p><p>The goal was not to replace engineers. The goal was to remove the low-value translation step between “an alert fired” and “there is a useful ticket with enough context to begin remediation.”</p><h3><strong>How the Agent Works</strong></h3><p>JBoss AI Monitor runs as an in-cluster Python service. Every 60 seconds, it executes a monitoring cycle across four layers:</p><p>- <strong>Pod state monitoring</strong> for OOMKilled containers, CrashLoopBackOff, Error, ImagePullBackOff, CreateContainerError, and restart spikes.</p><p>- <strong>Log monitoring</strong> for JBoss and Java error patterns such as OutOfMemoryError, WFLYCTL errors, deployment exceptions, stack traces, deadlocks, and severe runtime failures.</p><p>- <strong>Alert monitoring</strong> for Prometheus or AlertManager alerts related to WildFly, JBoss, EAP, or adjacent workloads.</p><p>- <strong>Health monitoring</strong> for HTTP endpoints exposed by the JBoss service.</p><p>When an issue is detected, the agent normalizes it into a shared issue model with severity, type, namespace, pod/container information, relevant raw diagnostic data, and log excerpts where available.</p><p>It then performs deduplication using a stable fingerprint and a configurable time window. 
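</p><p>As a sketch, that fingerprint-and-window logic could look like this (field names and values are illustrative assumptions, not the repository’s actual code):</p>

```python
import hashlib
import time

DEDUP_WINDOW_SECONDS = 3600  # configurable window; value assumed for illustration

_seen: dict = {}  # fingerprint -> last time the issue was reported

def fingerprint(issue: dict) -> str:
    """Stable fingerprint: the same symptom on the same pod hashes identically."""
    key = "|".join([
        issue["type"],            # e.g. "OOMKilled" or "CrashLoopBackOff"
        issue["namespace"],
        issue["pod"],
        issue.get("container", ""),
    ])
    return hashlib.sha256(key.encode()).hexdigest()[:16]

def is_new(issue: dict, now=None) -> bool:
    """Return True only if this fingerprint has not fired inside the window."""
    now = time.time() if now is None else now
    fp = fingerprint(issue)
    last = _seen.get(fp)
    if last is not None and now - last < DEDUP_WINDOW_SECONDS:
        return False
    _seen[fp] = now
    return True
```

<p>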
That matters operationally because a broken pod can generate the same symptom every minute. The agent should not flood JIRA with identical tickets.</p><p>Only new issues are sent to the resolution layer.</p><h3><strong>Where AI Fits</strong></h3><p>The AI component is deliberately narrow.</p><p>Instead of asking a model to “chat” about an incident, the project sends a structured issue context to Claude and forces a tool-call response. The model must return a predictable object containing:</p><p>- Root-cause analysis</p><p>- Resolution steps</p><p>- Prevention tips</p><p>- References</p><p>- Confidence level</p><p>This structure is important. In automation, free-form prose is hard to trust and hard to integrate. A structured response can be mapped cleanly into downstream systems, validated, rendered into a JIRA ticket, and improved over time.</p><p>The system prompt also narrows the operating domain: the model is asked to behave like an SRE and Red Hat JBoss/WildFly/EAP specialist working inside OpenShift. That keeps the analysis focused on platform and runtime realities instead of generic troubleshooting advice.</p><h3><strong>Turning Analysis Into a Ticket</strong></h3><p>Once the resolution is generated, the agent creates a JIRA Cloud issue using the REST API.</p><p>The ticket includes:</p><p>- Issue type and severity</p><p>- Namespace, pod, and container metadata</p><p>- Detection timestamp</p><p>- Human-readable description</p><p>- Relevant log excerpt</p><p>- AI-generated root-cause analysis</p><p>- Numbered resolution steps</p><p>- Prevention tips</p><p>- References</p><p>- Raw diagnostic data</p><p>The ticket is not meant to be treated as unquestionable truth. It is an operational starting point. The JIRA description explicitly marks the analysis as AI-generated and includes a confidence level so an engineer can calibrate how much verification is needed.</p><p>That distinction matters. 
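</p><p>As a sketch, the forced tool-call contract from “Where AI Fits” might be declared and parsed like this (schema and function names are illustrative, not the repository’s actual code):</p>

```python
# Tool schema the model is forced to call, so every analysis comes back
# as a predictable object instead of free-form prose. Field names mirror
# the article's list; the real schema in the repository may differ.
RESOLUTION_TOOL = {
    "name": "report_resolution",
    "description": "Return structured root-cause analysis for a detected issue.",
    "input_schema": {
        "type": "object",
        "properties": {
            "root_cause": {"type": "string"},
            "resolution_steps": {"type": "array", "items": {"type": "string"}},
            "prevention_tips": {"type": "array", "items": {"type": "string"}},
            "references": {"type": "array", "items": {"type": "string"}},
            "confidence": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["root_cause", "resolution_steps", "confidence"],
    },
}

def extract_resolution(response_blocks: list) -> dict:
    """Pull the tool-call payload out of a model response, failing safely.

    If no tool call is present, return a low-confidence fallback so a
    ticket can still be created without AI analysis.
    """
    for block in response_blocks:
        if block.get("type") == "tool_use" and block.get("name") == RESOLUTION_TOOL["name"]:
            return block["input"]
    return {"root_cause": "AI analysis unavailable",
            "resolution_steps": [],
            "confidence": "low"}
```

<p>With the Anthropic Messages API, passing <code>tools=[RESOLUTION_TOOL]</code> together with <code>tool_choice={"type": "tool", "name": "report_resolution"}</code> compels the model to answer through this schema rather than in prose.</p><p>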
In SRE workflows, the right role for AI is often not “autonomous fixer.” A more realistic and useful first step is “high-quality triage assistant that packages context quickly and consistently.”</p><h3><strong>The Architecture</strong></h3><p>At a high level, the flow looks like this:</p><p>1. JBoss EAP 8 runs on OpenShift through the Red Hat EAP/WildFly operator.</p><p>2. The monitoring agent runs in a separate namespace.</p><p>3. Kubernetes RBAC grants the agent read access to pods, logs, services, and endpoints in the JBoss namespace.</p><p>4. The agent runs periodic checks across pod state, logs, alerts, and health endpoints.</p><p>5. New issues are deduplicated and sent to Claude for structured analysis.</p><p>6. The resulting resolution is written into a JIRA ticket.</p><p>The GitHub repository includes Kubernetes manifests for deployment, service account and RBAC setup, ConfigMap-based configuration, a UBI9 Python Dockerfile, and a smoke-test script for validating the stack:</p><p><a href="https://github.com/nnsaraiya/jboss-ai-agent-monitor.git">https://github.com/nnsaraiya/jboss-ai-agent-monitor.git</a></p><p>Configuration is environment-driven, including:</p><p>- JBoss namespace and label selector</p><p>- Check interval</p><p>- Health-check URLs</p><p>- JIRA project and issue type</p><p>- Claude model and token limits</p><p>- Deduplication window</p><p>- Per-monitor enable/disable flags</p><p>That makes the agent portable across local CRC experiments and full-scale OpenShift environments.</p><h3><strong>What I Learned</strong></h3><p>The most interesting part of this project was not wiring one API to another. 
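</p><p>To make that concrete: the wiring itself can stay small. A sketch of the 60-second cycle, with the collaborators passed in as plain callables (names are illustrative, not the repository’s actual code):</p>

```python
import time

CHECK_INTERVAL_SECONDS = 60  # the cycle length described in the article

def run_cycle(monitors, is_new, analyze, create_ticket):
    """One monitoring cycle: detect, deduplicate, analyze, open a ticket."""
    issues = []
    for monitor in monitors:          # pod state, logs, alerts, health checks
        issues.extend(monitor())
    for issue in issues:
        if not is_new(issue):         # fingerprint + time-window dedup
            continue
        resolution = analyze(issue)       # structured AI analysis
        create_ticket(issue, resolution)  # JIRA REST call

def run_forever(monitors, is_new, analyze, create_ticket):
    """The in-cluster service: one cycle every CHECK_INTERVAL_SECONDS."""
    while True:
        run_cycle(monitors, is_new, analyze, create_ticket)
        time.sleep(CHECK_INTERVAL_SECONDS)
```

<p>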
The real work was deciding where to put boundaries.</p><p>AI is powerful, but production operations need guardrails:</p><p>- The model should receive curated context, not unlimited raw data.</p><p>- The model should produce structured output, not arbitrary prose.</p><p>- The automation should deduplicate aggressively.</p><p>- The system should fail safely if AI analysis is unavailable.</p><p>- Tickets should preserve diagnostic data so humans can verify the recommendation.</p><p>- The agent should be configurable enough to adapt to real platform differences.</p><p>This project reinforced a simple idea: agentic systems become more useful when their autonomy is bounded by clear contracts.</p><p>Here, the contract is:</p><p>Detect known operational signals. Package the context. Ask for structured expert analysis. Create a ticket. Let an engineer verify and act.</p><p>That is a smaller ambition than “AI runs production for us.” It is also much more useful.</p><h3><strong>Why This Matters</strong></h3><p>Infrastructure teams already live with too many disconnected signals. Logs are in one place, alerts in another, cluster state somewhere else, tickets somewhere else again.</p><p>An agentic monitoring workflow can connect those layers.</p><p>For JBoss and OpenShift teams, that means a crash loop or health failure can move quickly from raw symptom to actionable ticket. For platform engineers, it means repetitive triage can become a paved path. For SRE teams, it means less time spent assembling context and more time spent making good decisions.</p><p>I do not think this pattern is limited to JBoss. 
The same architecture could apply to databases, message brokers, batch workloads, integration platforms, or any system where failures have recognizable signals and remediation depends on combining platform context with domain knowledge.</p><h3><strong>Final Thought</strong></h3><p>Agentic AI in operations does not have to start with self-healing production systems.</p><p>It can start with something simpler and safer:</p><p>When something breaks, create a better first ticket.</p><p>That alone can save real time.</p><p>And for many teams, it is the difference between another noisy alert and an engineer beginning with useful context already in hand.</p>]]></content:encoded>
        </item>
    </channel>
</rss>