Stories by Peng Xie on Medium

Peng Xie — Wed, 20 Aug 2025 18:50:19 GMT

Building an AI-Powered Data Lake Monitor: How We Automated Failure Detection for Houston’s Wastewater Infrastructure

How we built an intelligent monitoring system that uses Claude 3.5 to automatically detect, analyze, and report data processing failures in AWS

The Challenge: Monitoring Critical Data Infrastructure

In the world of municipal infrastructure, data isn’t just about analytics — it’s about public safety, regulatory compliance, and operational efficiency. At the Houston Water Department, our Wastewater Infrastructure Program (WWIP) processes terabytes of sensor data daily through AWS Glue jobs and Lambda functions. When these processes fail, the consequences can range from delayed regulatory reporting to missed critical infrastructure alerts.

The Problem: Traditional monitoring approaches left us with:

Manual error investigation consuming hours of engineering time
Delayed failure detection leading to cascading data quality issues
Complex error logs requiring specialized knowledge to interpret
No automated way to prioritize and summarize failures for stakeholders

The Solution: We built an AI-powered monitoring system that automatically detects, analyzes, and reports failures using Claude 3.5 Sonnet on Amazon Bedrock.

Architecture Overview: Serverless Intelligence

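Our solution leverages AWS’s serverless ecosystem to create a fully automated monitoring pipeline: a monitoring Lambda discovers WWIP resources, collects failures from the Glue API and CloudWatch Logs, summarizes them with Claude 3.5 on Amazon Bedrock, and emails an HTML report through Amazon SES.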

Key Design Principles

Serverless First: No infrastructure to manage, automatic scaling
AI-Enhanced: Claude 3.5 provides intelligent error summarization
Zero Configuration: Automatically discovers WWIP-related resources
Professional Reporting: HTML-formatted emails with actionable insights

The Technical Implementation

1. Dynamic Resource Discovery

Instead of hardcoding job names, our system automatically discovers all WWIP-related resources:

import boto3


def get_job_names():
    """
    Get the names of Glue jobs and Lambda functions that start with 'wwip'.
    Returns a tuple of two lists: (glue_job_names, lambda_function_names).
    """
    glue_job_names = load_glue_job()
    # load_lambda_function() applies the same prefix filter to Lambda functions
    lambda_function_names = load_lambda_function()

    return glue_job_names, lambda_function_names


def load_glue_job() -> list:
    glue_client = boto3.client("glue")
    job_names = []

    # get_jobs is paginated, so walk every page rather than only the first
    paginator = glue_client.get_paginator("get_jobs")
    for page in paginator.paginate():
        for job in page["Jobs"]:
            job_name = job["Name"]
            # only select jobs whose name starts with 'wwip' (case-insensitive)
            if job_name.lower().startswith("wwip"):
                job_names.append(job_name)

    return job_names

This approach eliminates maintenance overhead and ensures we never miss new resources.
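The listing above calls load_lambda_function() without showing it. Here is a minimal sketch of what the Lambda counterpart might look like, assuming it mirrors load_glue_job(): it pages through the Lambda list_functions API and keeps names with the 'wwip' prefix. The lambda_client and prefix parameters are additions for illustration, so the function can be exercised without AWS access. The original presumably creates its own client and takes no arguments.

```python
def load_lambda_function(lambda_client, prefix="wwip") -> list:
    """Return names of Lambda functions whose name starts with `prefix` (case-insensitive).

    `lambda_client` is expected to be a boto3 Lambda client, e.g. boto3.client("lambda").
    Passing it in (instead of creating it here) is an assumption made for this sketch.
    """
    function_names = []
    # list_functions is paginated, just like Glue's get_jobs
    paginator = lambda_client.get_paginator("list_functions")
    for page in paginator.paginate():
        for function in page["Functions"]:
            name = function["FunctionName"]
            if name.lower().startswith(prefix.lower()):
                function_names.append(name)
    return function_names
```

In the monitoring Lambda this would be called as load_lambda_function(boto3.client("lambda")).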

2. Comprehensive Error Collection

Our GlueErrorFetcher and LambdaErrorFetcher classes provide deep error analysis:

class GlueErrorFetcher:
    def fetch_failed_runs(self, job_name):
        """Fetch failed Glue job runs and their CloudWatch logs for a single job."""
        failed_runs_info = {}
        response = self.glue.get_job_runs(JobName=job_name, MaxResults=50)

        for run in response['JobRuns']:
            if run['StartedOn'] >= self.start_time and run['JobRunState'] == 'FAILED':
                run_id = run['Id']
                started_on = run['StartedOn'].strftime('%Y-%m-%d %H:%M:%S %Z')
                glue_error_message = run.get('ErrorMessage', 'No error message available')

                # Find the correct CloudWatch log stream
                streams = self.logs_client.describe_log_streams(
                    logGroupName=self.error_log_group_name,
                    logStreamNamePrefix=f"{run_id}"
                )['logStreams']

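                # Default to an empty string so detailed_errors is always defined;
                # otherwise a run without a log stream raises NameError or reuses
                # the previous run's logs
                detailed_errors = ""
                if streams:
                    log_stream_name = streams[0]['logStreamName']
                    events = self.logs_client.filter_log_events(
                        logGroupName=self.error_log_group_name,
                        logStreamNames=[log_stream_name]
                    )
                    detailed_errors = "\n".join(e['message'] for e in events['events'])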

                failed_runs_info[run_id] = {
                    "started_on": started_on,
                    "glue_error_message": glue_error_message,
                    "detailed_errors": detailed_errors
                }

        return failed_runs_info

AWS Lambda logs → Stored in CloudWatch Logs under the log group:

/aws/lambda/{function_name}

AWS Glue job logs → Stored in CloudWatch Logs under:

/aws-glue/python-jobs/error

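This distinction matters when troubleshooting: people often expect both services to use the same log group structure, but Glue has dedicated log groups. The path above is the default error group for Glue Python shell jobs; Glue Spark jobs write their errors to /aws-glue/jobs/error by default.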
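The Lambda side of error collection is not shown in this article. The following is a rough sketch of how it could work, assuming errors are pulled from the function's default log group with a CloudWatch Logs filter. The function names, the injected logs_client parameter, and the "ERROR" filter pattern are all illustrative assumptions, not the actual LambdaErrorFetcher implementation.

```python
from datetime import datetime


def lambda_log_group(function_name: str) -> str:
    """Default CloudWatch log group for a Lambda function."""
    return f"/aws/lambda/{function_name}"


def fetch_lambda_errors(logs_client, function_name, start_time: datetime, pattern="ERROR"):
    """Collect log messages matching `pattern` since `start_time`.

    `logs_client` is expected to be boto3.client("logs"). The "ERROR" filter
    pattern is an assumption; adjust it to whatever your functions actually log.
    """
    paginator = logs_client.get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName=lambda_log_group(function_name),
        startTime=int(start_time.timestamp() * 1000),  # CloudWatch expects epoch milliseconds
        filterPattern=pattern,
    )
    # Flatten all pages into a single list of message strings
    return [event["message"] for page in pages for event in page["events"]]
```

Using the paginator avoids silently dropping events when a busy function produces more than one page of results.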

3. AI-Powered Error Analysis

The heart of our system is Claude 3.5’s ability to transform raw error logs into actionable insights:

import json

import boto3

bedrock_client = boto3.client("bedrock-runtime")


def summarize_errors_with_llm(errors_dict, job_type):
    """
    Send the combined errors to Amazon Bedrock Claude 3.5 for summarization.
    """
    # The Messages API carries the role in the request body, so the prompt
    # itself needs no "Human:" / "Assistant:" markers.
    prompt_text = f"""Please summarize the causes of the following {job_type} job failures:

{json.dumps(errors_dict, indent=2)}

You don't have to provide the solution, just summarize the causes of the failures."""

    native_request = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "temperature": 0.5,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt_text}],
            }
        ],
    }

    response = bedrock_client.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps(native_request),
    )

    result = json.loads(response["body"].read().decode("utf-8"))
    # "content" is a list of blocks; guard against it being missing or empty
    content = result.get("content") or []
    summary = content[0].get("text", "").strip() if content else ""

    return summary

For more details on how to call the Claude 3.5 API through Amazon Bedrock, see:

https://docs.aws.amazon.com/bedrock/latest/userguide/bedrock-runtime_example_bedrock-runtime_InvokeModel_AnthropicClaude_section.html

4. Professional Email Reporting

We generate HTML-formatted emails that provide clear, actionable information:

import html

import boto3


def format_error_message_and_send_emails(errors, list_of_recipients):
    # Simple, unstyled markup; add inline CSS for color-coded sections
    html_lines = []
    html_lines.append("""
    <html>
      <body>
        <h2>🚨 AWS Job Failure Report 🚨</h2>
    """)

    # Glue errors
    if errors.get("glue_errors"):
        html_lines.append("<h3>Glue Job Errors</h3>")
        for job, message in errors["glue_errors"].items():
            # Escape log-derived text so a stray '<' or '&' cannot break the markup
            html_lines.append(f"<p><b>Job: {html.escape(job)}</b><br>{html.escape(message)}</p>")
        html_lines.append("<hr>")

    # Lambda errors
    if errors.get("lambda_errors"):
        html_lines.append("<h3>Lambda Job Errors</h3>")
        for job, message in errors["lambda_errors"].items():
            html_lines.append(f"<p><b>Job: {html.escape(job)}</b><br>{html.escape(message)}</p>")
        html_lines.append("<hr>")

    html_lines.append("""
        <p>This is an automated message from AWS SES.</p>
      </body>
    </html>
    """)

    email_body_html = "\n".join(html_lines)

    # Send using AWS SES (the Source address must be a verified SES identity)
    ses = boto3.client("ses", region_name="us-east-1")
    response = ses.send_email(
        Source="xphn1985@gmail.com",
        Destination={"ToAddresses": list_of_recipients},
        Message={
            "Subject": {"Data": "AWS Error Report", "Charset": "UTF-8"},
            "Body": {"Html": {"Data": email_body_html, "Charset": "UTF-8"}}
        }
    )

    return email_body_html

The Workflow: From Detection to Action

1. Automated Resource Discovery

Every execution starts by scanning AWS for WWIP-related resources:

Glue Jobs: All jobs with names starting with “wwip”
Lambda Functions: All functions with names starting with “wwip”

2. Failure Detection & Collection

For each resource, we collect comprehensive error information:

Job Metadata: Execution timestamps, error messages, run states
CloudWatch Logs: Detailed error logs and stack traces
Temporal Filtering: Configurable lookback periods (1–2 days by default)
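To make the temporal filter concrete, here is a small sketch of how the lookback window could be computed and applied to Glue job runs. The helper names are hypothetical. One detail worth noting: boto3 returns StartedOn as a timezone-aware datetime, so the window start should be timezone-aware too, otherwise the comparison raises a TypeError.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def lookback_start(days: int = 1, now: Optional[datetime] = None) -> datetime:
    """Return the timezone-aware (UTC) start of the lookback window."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=days)


def is_recent_failure(run: dict, start_time: datetime) -> bool:
    """True if a Glue job run dict describes a failure inside the window."""
    return run["JobRunState"] == "FAILED" and run["StartedOn"] >= start_time
```

The `now` parameter exists only so the window can be pinned to a fixed time when testing.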

3. AI-Powered Analysis

Claude 3.5 processes the raw error data to:

Identify Root Causes: Distinguish between configuration, data, and infrastructure issues
Remove Duplicates: Consolidate similar errors across multiple runs
Provide Context: Explain the business impact of failures
Prioritize Issues: Highlight critical vs. non-critical failures

4. Professional Reporting

Automated email reports include:

Visual Hierarchy: Color-coded sections for different service types
Actionable Content: AI-summarized insights for quick understanding
Professional Formatting: HTML emails that work across all clients

Business Impact: Measurable Results

Operational Efficiency

90% Reduction in manual error investigation time
Immediate Detection of failures (vs. hours of delay)
Automated Prioritization of issues requiring attention
Consistent Reporting format for all stakeholders

Data Quality Improvements

Proactive Monitoring prevents cascading data quality issues
Faster Resolution of critical infrastructure problems
Reduced MTTR (Mean Time To Resolution) for data processing failures
Enhanced Visibility into system health and performance

Cost Optimization

Preventive Maintenance avoids costly data processing failures
Resource Efficiency through intelligent error analysis
Reduced Operational Overhead through automation
Scalable Solution that grows with infrastructure needs

Lessons Learned: Building Production AI Systems

1. Prompt Engineering Matters

Our initial prompts were too generic. We learned to:

Be Specific: Ask for causes, not solutions
Provide Context: Include job type and failure patterns
Set Boundaries: Limit response length and focus
Iterate Quickly: Test prompts with real error data

2. Error Handling is Critical

AI systems can fail in unexpected ways:

Graceful Degradation: Fall back to raw error messages if AI fails
Timeout Management: Set appropriate limits for AI processing
Error Logging: Capture AI failures for continuous improvement
Retry Logic: Handle transient AI service issues
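The retry and graceful-degradation points can be sketched together. This is an illustration of the pattern, not our production code: `summarize` stands for any callable that returns a summary string (for example, a thin wrapper around the Bedrock call shown earlier), and the retry count, delays, and fallback wording are arbitrary choices.

```python
import json
import time


def summarize_with_fallback(errors_dict, summarize, retries=2, base_delay=1.0, sleep=time.sleep):
    """Try `summarize(errors_dict)`; fall back to the raw errors if every attempt fails."""
    for attempt in range(retries + 1):
        try:
            return summarize(errors_dict)
        except Exception as exc:  # broad on purpose: any AI failure should degrade, not crash
            print(f"LLM summarization failed (attempt {attempt + 1}): {exc}")
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, ...
    # Graceful degradation: report the raw error data instead of a summary
    return "AI summary unavailable. Raw errors:\n" + json.dumps(errors_dict, indent=2, default=str)
```

With this wrapper, a Bedrock outage still produces an email; it simply contains the raw errors instead of a summary.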

3. Security and Privacy

When processing error logs with AI:

Data Sanitization: Remove sensitive information before AI processing
Access Controls: Limit who can trigger AI analysis
Audit Logging: Track all AI interactions for compliance
Encryption: Ensure data is encrypted in transit and at rest
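As an example of the data sanitization step, the sketch below redacts a few common secret shapes from log text before it is sent to the model. The patterns are illustrative assumptions and far from exhaustive; a real deployment would extend them for whatever actually shows up in its logs.

```python
import re

# Illustrative patterns only; extend them for the secrets that appear in your own logs.
_REDACTIONS = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[REDACTED_AWS_KEY_ID]"),
    # Email addresses
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[REDACTED_EMAIL]"),
    # key=value or key: value pairs for obviously sensitive keys
    (re.compile(r"(?i)\b(password|secret|token)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
]


def sanitize(text: str) -> str:
    """Apply each redaction pattern in turn and return the cleaned text."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Running sanitize() over each log message before building the prompt keeps the redaction logic in one place.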

4. Monitoring the Monitor

Our monitoring system needs its own oversight:

Self-Monitoring: Track the health of our monitoring Lambda
Performance Metrics: Monitor AI processing times and costs
Accuracy Validation: Periodically review AI-generated summaries
Feedback Loops: Incorporate user feedback to improve prompts

Future Enhancements: Scaling Intelligence

Planned Features

Slack Integration: Real-time notifications for critical failures
Web Dashboard: Visual monitoring interface with historical trends
Predictive Alerts: Use AI to predict potential failures before they occur
Custom Filters: Allow users to define their own error detection rules
Multi-Region Support: Monitor resources across multiple AWS regions

Advanced AI Capabilities

Trend Analysis: Identify patterns in recurring failures
Root Cause Prediction: Suggest likely causes based on error patterns
Automated Remediation: Suggest or execute fixes for common issues
Natural Language Queries: Allow stakeholders to ask questions about system health

Setting up AWS Glue for Local Development with PyCharm and Docker on Windows

Peng Xie — Thu, 14 Aug 2025 15:21:12 GMT

As a data engineer, I sometimes need to develop AWS PySpark applications. While it’s possible to do this directly in the AWS console, it’s far less convenient than developing locally. It took me many hours to figure out how to set up a local environment with Docker to run PySpark. This guide walks you through configuring AWS Glue locally using Docker with PyCharm on a Windows machine. The setup uses AWS Glue version 5.0, the latest version at the time of writing.

This guide follows the official AWS documentation and adds several crucial steps that the official documentation for Glue version 5.0 does not cover.

Prerequisites

Before starting, ensure you have the following installed and configured:

Docker
PyCharm Professional Version
An AWS account with an IAM user or role configured

High-Level Setup Steps

The configuration process involves five main steps:

1. Pulling the AWS Glue Docker image.

2. Configuring the Docker daemon settings.

3. Configuring the Docker-based Python interpreter in PyCharm.

4. Mounting your AWS credentials into the container.

5. Configuring the AWS connection in PyCharm.

Detailed Setup Instructions

Step 1: Pull the AWS Glue 5.0 Docker Container Image

Open your command line and execute the following command to pull the AWS Glue 5.0 Docker image:

docker pull public.ecr.aws/glue/aws-glue-libs:5

Step 2: Configure Docker Daemon Settings

1. Right-click the Docker application icon and navigate to Settings.

2. Ensure that the option “Expose Daemon on TCP localhost:2375 without TLS” is selected. Note: This step is specifically required for Windows machines and may not be necessary for Mac.

3. If you enable this option, click Apply & Restart to restart Docker.

Step 3: Configure PyCharm Python Interpreter to Leverage Docker

1. In PyCharm, go to File > Settings.

2. Under your project settings, select Python Interpreter.

3. Click Add Interpreter and choose Docker.

4. Select Pull or use existing.

5. In the Image field, type and select the image you pulled in Step 1: public.ecr.aws/glue/aws-glue-libs:5.

6. Click Next, then Next again.

7. Ensure System interpreter is selected, then click Create.

Step 4: Edit Docker Container Settings for Credential Files

This step ensures your AWS credentials are available within the Docker container.

1. Open the Run/Debug Configurations window (Run > Edit Configurations) and locate the Docker container settings (usually represented by a folder icon).

2. You will see Host path and Container path under Volume binding.

3. You need to add a new volume binding to map your local AWS credential file to the container.

◦ For the Host path, specify the exact location of your AWS credentials directory on your local machine (e.g., C:\Users\YourUser\.aws).

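◦ For the Container path, set it to /root/.aws. If the container runs as the image's default hadoop user instead of root, use /home/hadoop/.aws, which is the path the AWS documentation for Glue 5.0 mounts credentials to.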
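A quick way to confirm the volume binding works is to run a short script with the Docker interpreter and check which profiles it can see. This is a sketch rather than part of the official setup: list_aws_profiles is a hypothetical helper, and it relies only on the fact that the AWS credentials file uses INI syntax.

```python
import configparser
from pathlib import Path


def list_aws_profiles(aws_dir) -> list:
    """Return profile names found in `<aws_dir>/credentials`, or [] if the file is missing."""
    credentials_file = Path(aws_dir) / "credentials"
    if not credentials_file.is_file():
        return []
    parser = configparser.ConfigParser()
    parser.read(credentials_file)
    # Each [section] in the credentials file is one profile, e.g. [default]
    return parser.sections()
```

Calling print(list_aws_profiles("/root/.aws")) inside the container should list the same profiles you have on the host; an empty list suggests the host or container path in the binding is wrong.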

Step 5: Configure AWS Connection in PyCharm

1. In the Run/Debug Configurations window, go to the AWS Connection tab.

2. By default, it might be set to None.

3. Choose Other credentials profile / region.

4. Select the correct AWS profile that has the necessary credentials for your AWS account.

5. Ensure your IAM role has the required permissions to interact with data in AWS, such as s3:GetObject and s3:ListBucket for accessing S3 data.

6. Set your AWS region.

7. Click Apply and then OK.

Bonus: Debugging in PyCharm with Pandas DataFrames:

◦ PySpark DataFrames are not directly supported for viewing in PyCharm’s data viewer at the time of writing.

◦ To view data, convert your PySpark DataFrame to a Pandas DataFrame using .toPandas().

◦ Set a breakpoint and run the script in debugger mode.

◦ Once the debugger stops at your breakpoint, you can click on the Pandas DataFrame variable and select View DataFrame to inspect the data directly within PyCharm.

How to Publish Docker Images to AWS ECR from a Windows System Successfully

Peng Xie — Wed, 23 Jul 2025 16:20:39 GMT

Background

Recently, I transitioned my AWS work environment from Linux to Windows, a move that significantly disrupted several of my data pipelines. One major challenge was publishing Docker images to AWS Elastic Container Registry (ECR), as the change in operating systems brought unexpected compatibility hurdles and necessitated workflow adjustments. In this article, I’ll share my journey of successfully publishing Docker images to ECR from a Windows environment, along with practical tips for overcoming common issues and streamlining the process.

Publishing a Docker Image to AWS ECR: General Step-by-Step Procedure

To publish a Docker image to AWS Elastic Container Registry (ECR) from a Windows system, follow these key steps. First, ensure you have the AWS CLI and Docker Desktop installed and configured. Create an ECR repository using the AWS Management Console or CLI command:

aws ecr create-repository --repository-name <repository-name>

Next, authenticate Docker to your ECR registry by running

aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com

Build your Docker image locally with

docker build -t <image-name> .

Then tag it for ECR using

docker tag <image-name>:latest <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:latest

Finally, push the image to ECR with

docker push <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<repository-name>:latest

For more details, see “Pushing a Docker image to an Amazon ECR private repository” in the Amazon ECR documentation.
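All of the commands in this procedure are built from the same four values: the AWS account ID, the region, the repository name, and the local image name. The sketch below shows how they fit together. The helper names are hypothetical and it only assembles command strings; it does not run anything.

```python
def ecr_registry(account_id: str, region: str) -> str:
    """Registry host name for a private ECR registry."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_push_commands(account_id, region, repository, local_image, tag="latest") -> list:
    """Return the login, build, tag, and push commands as strings, in order."""
    registry = ecr_registry(account_id, region)
    remote = f"{registry}/{repository}:{tag}"
    return [
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        f"docker build -t {local_image}:{tag} .",
        f"docker tag {local_image}:{tag} {remote}",
        f"docker push {remote}",
    ]
```

For example, ecr_push_commands("123456789012", "us-east-1", "my-repo", "my-image") yields the four commands with every placeholder filled in; the account ID, repository, and image names here are made-up sample values.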

Potential Issue: Docker Manifest V1 or V2

AWS ECR supports the Docker Image Manifest V2, but older Docker images or tools might default to the deprecated Manifest V1, leading to push failures with errors like “manifest invalid” or “unsupported media type.”

The screenshot below displays information about a Docker image uploaded to AWS ECR, featuring a V1 manifest format.

[Figure: An example of a Docker image using a V1 manifest in AWS ECR]

The artifact type detail indicates the use of a V1 manifest format, with an additional image listed at 0 KB alongside the pushed image. This is likely due to the container image being uploaded as a multi-architecture image component, referred to as a manifest list or image index. For further insights, see the explanation in “Why are there extra untagged ‘images’ in Amazon ECR after doing docker push?” on Stack Overflow.

Solution:

Ensure Docker Desktop is updated to a recent version that defaults to Manifest V2, Schema 2.
Change the docker build command to:

docker build --platform linux/amd64 --provenance=false -t docker-image:test .

--platform linux/amd64: Specifies the target platform for the image, ensuring it’s built for the Linux AMD64 architecture. This is particularly useful on Windows systems, because most AWS services that pull from ECR, such as ECS or EKS, expect Linux-based images.
--provenance=false: Disables the generation of provenance metadata, a BuildKit feature available in newer Docker versions. Provenance metadata provides build attestations; turning it off is useful if ECR or your deployment environment doesn’t support it, or if you want to reduce image metadata size.