How to read compressed files from an Amazon S3 bucket using AWS Glue without decompressing them

Jay Jain
May 10 · 3 min read

Introduction to AWS Glue

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.

The Problem

From certain sources, we were receiving data in compressed format directly in our S3 bucket. The challenge was to read the content of these compressed files without decompressing them first.

The Approach

The first step is to identify whether the file (an object in S3) is zip or gzip. We determine this from the path of the file and then read it using the Boto3 S3 resource Object.
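As a minimal sketch of that first step, the check can be isolated in a small helper. The function name `compression_type` is hypothetical, and lowercasing the path before matching is an assumption added here so that keys such as `DATA.ZIP` are also recognised; it looks only at the suffix and never inspects the file contents.

```python
def compression_type(s3_path):
    """Return 'zip', 'gzip', or None based on the suffix of the S3 path."""
    lowered = s3_path.lower()  # assumption: treat suffixes case-insensitively
    if lowered.endswith(".zip"):
        return "zip"
    if lowered.endswith(".gz"):
        return "gzip"
    return None  # not a format handled in this article


print(compression_type("s3://my-bucket/incoming/report.zip"))
print(compression_type("s3://my-bucket/incoming/events.json.gz"))
print(compression_type("s3://my-bucket/incoming/plain.csv"))
```

Returning `None` for anything else lets the caller decide whether to skip the object or raise an error.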

The Code

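  1. Function to return the bucket and key name from an S3 path:

    def split_s3_path(s3_path):
        # "s3://bucket/folder/file.zip" -> ("bucket", "folder/file.zip")
        path_parts = s3_path.replace("s3://", "").split("/")
        bucket = path_parts.pop(0)
        key = "/".join(path_parts)
        return bucket, key

  2. Read the object according to its extension and write the extracted content back to S3:

    import gzip
    import zipfile
    from io import BytesIO
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3_resource = boto3.resource("s3")
    config = TransferConfig()        # assumed: default transfer settings
    output_location = "newfolder/"   # assumed: key prefix for the extracted files

    # path is the s3:// URI of the compressed object, supplied to the job
    bucket, key_name = split_s3_path(path)

    if path.endswith(".zip"):
        zip_obj = s3_resource.Object(bucket_name=bucket, key=key_name)
        # ZipFile needs a seekable file object, so load the archive into memory
        buffer = BytesIO(zip_obj.get()["Body"].read())
        with zipfile.ZipFile(buffer) as z:
            for filename in z.namelist():
                # Stream each archive member straight back to S3
                with z.open(filename) as member:
                    s3_resource.meta.client.upload_fileobj(
                        member,
                        Bucket=bucket,
                        Key=output_location + filename,
                        Config=config,
                    )
    elif path.endswith(".gz"):
        obj = s3_resource.Object(bucket_name=bucket, key=key_name)
        # GzipFile decompresses the response body as it is read
        with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
            content = gzipfile.read()
        s3_resource.meta.client.put_object(
            Body=content, Bucket=bucket, Key="newfolder/new_filename.txt"
        )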
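To try the same zip and gzip handling without an AWS account, the sketch below applies it to bytes held in memory instead of an S3 object. The helper names `extract_members` and `make_zip` are hypothetical and are not part of Boto3 or Glue. The sample archives are built on the fly, and skipping directory entries and dropping the `.gz` suffix from the output name are choices made here for illustration.

```python
import gzip
import zipfile
from io import BytesIO


def extract_members(data, key):
    """Return {member_name: content_bytes} for zip or gzip bytes held in memory."""
    if key.endswith(".zip"):
        with zipfile.ZipFile(BytesIO(data)) as archive:
            return {
                name: archive.read(name)
                for name in archive.namelist()
                if not name.endswith("/")  # skip directory entries
            }
    if key.endswith(".gz"):
        with gzip.GzipFile(fileobj=BytesIO(data)) as gz:
            # A gzip stream holds a single file; name it after the key
            return {key[:-3]: gz.read()}
    raise ValueError(f"unsupported compression for key: {key}")


def make_zip(files):
    """Build a small zip archive in memory for the demonstration."""
    buf = BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buf.getvalue()


sample_zip = make_zip({"a.txt": b"hello", "b.txt": b"world"})
sample_gz = gzip.compress(b"hello gzip")

print(extract_members(sample_zip, "data/archive.zip"))
print(extract_members(sample_gz, "data/notes.txt.gz"))
```

In a Glue job, the `data` argument would be the bytes returned by `Object.get()["Body"].read()`, and each returned entry could then be written back with `put_object`.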

The Conclusion

As we have seen, we can easily read compressed files (zip/gzip) with the help of Python modules and run the code in AWS Glue without worrying about time constraints. We can also easily write the content of the compressed files to a text file in S3, without decompressing them in a separate step.

CodeX
