Querying Unstructured JSON Logs with AWS Glue ETL Job and Athena: A Step-by-Step Guide
In the previous guide, we explored the ease of querying structured JSON logs with AWS Glue Crawlers and Athena. It was all neat and organized, like opening a book where every chapter was perfectly titled. But what happens when the story becomes a little chaotic? Unstructured logs enter the picture, with unpredictable fields, shifting formats, and missing values. These logs, often generated by observability tools like New Relic & Datadog, present an entirely new challenge.
You can’t simply throw these unstructured logs into a Glue Crawler and expect to get a clean table for querying. No — these logs need to be tamed, transformed, and molded into something manageable before you can extract value from them. Enter AWS Glue ETL, a powerful tool designed to handle the most unruly of data sets.
Let’s dive deeper into the steps required to transform and query unstructured logs effectively.
Architecture:
Steps:
1. Store the Unstructured JSON File in S3
In this guide, we’re working with New Relic infrastructure logs, which are generated by the New Relic Infrastructure agent running on an EC2 instance. These logs are unstructured, lacking a consistent format, as shown in the image below. They contain various fields, making it impractical to use a Glue Crawler to process them effectively.
Upload your unstructured JSON log file to the bucket.
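If you prefer to script the upload instead of using the console, a minimal boto3 sketch looks like the following. The bucket and file names match the ones used later in the ETL script; substitute your own.
import boto3
# Bucket and key used in the ETL script below -- replace with your own names
BUCKET_NAME = "logquerybucketblog"
LOG_FILE = "Dynamic_Logs.logs"
s3 = boto3.client("s3")
# Upload the local log file to the S3 bucket
s3.upload_file(LOG_FILE, BUCKET_NAME, LOG_FILE)
print(f"Uploaded {LOG_FILE} to s3://{BUCKET_NAME}/{LOG_FILE}")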
2. Create a Glue Database
Next, create a database in AWS Glue to hold the Data Catalog table for your logs.
a. Go to AWS Glue → Databases.
b. Click Add Database.
c. Provide a name for the database and click Create Database.
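Alternatively, the database can be created programmatically. Here is a minimal boto3 sketch, using the same database name that appears in the ETL script later in this guide:
import boto3
glue = boto3.client("glue")
# Create the Glue database that will hold the catalog table for the logs
glue.create_database(
    DatabaseInput={
        "Name": "log_query_database",
        "Description": "Catalog database for unstructured New Relic logs"
    }
)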
3. Create an IAM Role
To run the AWS Glue ETL job, create an IAM role with the necessary permissions to access your S3 buckets.
Attach the managed policy AWSGlueServiceRole to grant necessary permissions.
Define an inline policy with permissions to access your S3 buckets, as shown in the image below.
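If you prefer to create the role with code, a minimal boto3 sketch is shown below. The role name is a placeholder, and the bucket ARNs assume the input and output buckets used in the ETL script.
import json
import boto3
iam = boto3.client("iam")
ROLE_NAME = "glue-etl-log-query-role"  # placeholder role name
# Trust policy allowing AWS Glue to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}
iam.create_role(RoleName=ROLE_NAME, AssumeRolePolicyDocument=json.dumps(trust_policy))
# Attach the AWS managed Glue service policy
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
)
# Inline policy granting access to the input and output buckets
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::logquerybucketblog",
            "arn:aws:s3:::logquerybucketblog/*",
            "arn:aws:s3:::output-logqueryblog",
            "arn:aws:s3:::output-logqueryblog/*"
        ]
    }]
}
iam.put_role_policy(RoleName=ROLE_NAME, PolicyName="s3-log-access", PolicyDocument=json.dumps(s3_policy))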
4. Create Glue ETL Job
Since the JSON logs are unstructured, we need to use an ETL job to transform the data and extract key fields.
If you examine the log file, there are four key fields that appear frequently: context, msg, timestamp, and level. These fields vary in structure, so we can use regular expressions in the ETL job to extract them and store the results in the Glue Data Catalog table.
Regular expressions:
context_pattern = r'"context":(\{[^}]+\})'
msg_pattern = r'"msg":"([^"]+)"'
time_pattern = r'"timestamp":"([^"]+)"'
level_pattern = r'"level":"([^"]+)"'
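To sanity-check these patterns before wiring them into the job, you can run them locally with Python's re module. The sample log line below is purely illustrative; real New Relic agent output varies from entry to entry.
import re
# Illustrative log line -- not taken from the actual file
sample = '{"context":{"component":"AgentService","service":"newrelic-infra"},"level":"info","msg":"Agent started","timestamp":"2024-01-01T10:00:00Z"}'
context_pattern = r'"context":(\{[^}]+\})'
msg_pattern = r'"msg":"([^"]+)"'
time_pattern = r'"timestamp":"([^"]+)"'
level_pattern = r'"level":"([^"]+)"'
for name, pattern in [("Context", context_pattern), ("Message", msg_pattern),
                      ("Timestamp", time_pattern), ("Level", level_pattern)]:
    match = re.search(pattern, sample)
    # Group 1 holds the captured field value; a missing field simply yields no match
    print(name, "->", match.group(1) if match else "(not found)")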
Steps to create a Glue ETL job:
1. Navigate to the AWS Glue dashboard and select ETL Jobs.
2. Click Create job and choose the Blank graph option.
3. In the Script section, add the provided ETL script. Make sure to specify the S3 log file URI, output bucket, database name, and table name.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import regexp_extract, col, trim
# @params: [JOB_NAME] -- the S3 paths, database name, and table name are hardcoded below; update them for your environment
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read the raw JSON log file from S3 as plain text, one log entry per row in the "value" column
df = spark.read.text("s3://logquerybucketblog/Dynamic_Logs.logs")
df.show()
# Regular expressions for the four key fields
context_pattern = r'"context":(\{[^}]+\})'
msg_pattern = r'"msg":"([^"]+)"'
time_pattern = r'"timestamp":"([^"]+)"'
level_pattern = r'"level":"([^"]+)"'
# Extract each field into its own column
final_df = df.withColumn("Message", regexp_extract("value", msg_pattern, 1)) \
    .withColumn("Timestamp", regexp_extract("value", time_pattern, 1)) \
    .withColumn("Level", regexp_extract("value", level_pattern, 1)) \
    .withColumn("Context", regexp_extract("value", context_pattern, 1))
# Drop the raw text column and keep only rows where at least one field was extracted
final_df = final_df.drop('value')
filtered_df = final_df.filter(
    (trim(col("Message")) != "") |
    (trim(col("Timestamp")) != "") |
    (trim(col("Level")) != "") |
    (trim(col("Context")) != "")
)
filtered_df.show()
# Convert to a DynamicFrame and write to S3, creating/updating the Data Catalog table
dynamic_frame = DynamicFrame.fromDF(filtered_df, glueContext, "dynamic_frame")
sink = glueContext.getSink(
    path="s3://output-logqueryblog/",
    connection_type="s3",
    updateBehavior="LOG",
    partitionKeys=[],
    compression="snappy",
    enableUpdateCatalog=True,
    transformation_ctx="sink"
)
sink.setCatalogInfo(catalogDatabase="log_query_database", catalogTableName="dynamic_log_query_database")
sink.setFormat("json")
sink.writeFrame(dynamic_frame)
job.commit()
4. In the Job details tab, add a job name and assign the IAM role that was created earlier.
5. Save the job.
6. Ensure that the output bucket’s policy allows access by the IAM role.
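If you manage the bucket policy as code, a minimal boto3 sketch for granting the role access to the output bucket might look like this. The account ID and role name in the ARN are placeholders; use the ARN of the role you created earlier.
import json
import boto3
s3 = boto3.client("s3")
# Placeholder account ID and role name -- substitute the ARN of your Glue ETL role
ROLE_ARN = "arn:aws:iam::123456789012:role/glue-etl-log-query-role"
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": ROLE_ARN},
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::output-logqueryblog",
            "arn:aws:s3:::output-logqueryblog/*"
        ]
    }]
}
s3.put_bucket_policy(Bucket="output-logqueryblog", Policy=json.dumps(bucket_policy))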
5. Run the AWS Glue ETL Job
To run the job, click the Run button.
You can track its progress in the Runs tab; the job takes a few minutes to complete.
Once the job is finished, a new table will be generated in the Glue Data Catalog using the database name you provided in the code.
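You can also trigger the job and poll its status with boto3. A minimal sketch is shown below; the job name is a placeholder, so use the name you entered in the console.
import time
import boto3
glue = boto3.client("glue")
JOB_NAME = "log-query-etl-job"  # placeholder -- use your job's name
# Start the job and poll until it reaches a terminal state
run_id = glue.start_job_run(JobName=JOB_NAME)["JobRunId"]
while True:
    state = glue.get_job_run(JobName=JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"]
    print("Job state:", state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)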
6. Use Athena to Query the Data
To query the data:
a. In the AWS Glue console, go to Database → Table and select View Data.
b. Click Proceed on the pop-up that appears.
c. You will be redirected to the Athena dashboard.
From there, you can run your SQL queries and view the results displayed in a table format.
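If you prefer to run queries outside the console, the same SQL can be submitted through the Athena API. Here is a minimal boto3 sketch against the table created by the ETL job; the results bucket is a placeholder.
import boto3
athena = boto3.client("athena")
# Query the catalog table produced by the ETL job
response = athena.start_query_execution(
    QueryString='SELECT level, "timestamp", message FROM dynamic_log_query_database LIMIT 10',
    QueryExecutionContext={"Database": "log_query_database"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"}  # placeholder bucket
)
print("Query execution ID:", response["QueryExecutionId"])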
And there you have it — unstructured logs transformed into a clean, queryable format using AWS Glue ETL. With Athena, you can now easily sift through the data and extract valuable insights. From chaos to clarity, this process shows the power of turning even the messiest logs into something meaningful.
Happy learning!