How to read compressed files from an Amazon S3 bucket using AWS Glue without decompressing them

Jay Jain · Published in CodeX · May 10, 2021

Introduction to AWS Glue

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.

You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. AWS Glue supports Spark jobs (in PySpark and Scala) as well as Python shell jobs.

Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.

The Problem

From certain sources we were receiving data in compressed form directly in our S3 bucket. The challenge was to read the content of this compressed data without first decompressing the files.

All the compressed files present in S3 were either zip or gzip, and we had to handle both variants in a single script.

Have no fear, there is a solution: we used the Python boto3 library, along with the zipfile and gzip modules.

The Approach

The first step is to identify whether the file (the object in S3) is zip or gzip. For this we use the object's path (via the Boto3 S3 resource Object).

This can be achieved with Python's endswith() string method. The provided path can also be split into the bucket name and key name, which are needed later as well.

Based on the identified compression format, the respective block for zip or gzip (or any other format) is executed in the script (here, a simple if/elif).

Once the file format is identified, we read the content of the zip file. This is done by loading it into a BytesIO buffer object and iterating over each entry in the archive with the namelist() method; to write the extracted entries back to S3 we can use either meta.client.upload_fileobj() or client.put_object().

Similarly, to read the content of the gzip file I used the gzip.GzipFile constructor.
It reads the S3 object's body with Python's read() function, and the decompressed content is then written as a text file to the destination with Boto3's put_object().

I used an AWS Glue Python shell job to execute the following code. AWS Lambda could be used for the same purpose, but Lambda has an execution time limit of 15 minutes, so larger files might become a problem.
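
For reference, a Glue Python shell job for such a script can also be created with boto3. This is only a minimal sketch; the job name, role ARN, and script location below are placeholders, not real resources.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="decompress-s3-objects",                       # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder IAM role ARN
    Command={
        "Name": "pythonshell",                          # Python shell job, not a Spark job
        "ScriptLocation": "s3://my-scripts-bucket/decompress.py",  # placeholder script path
        "PythonVersion": "3",
    },
    MaxCapacity=0.0625,                                 # smallest DPU setting for Python shell jobs
)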

The Code
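
The snippets below share a few names (s3_resource, s3, client, config, output_location) that are defined once at the top of the script. Here is a minimal setup sketch; the output_location value is a placeholder for your own destination prefix.

import gzip
import zipfile
from io import BytesIO

import boto3
from boto3.s3.transfer import TransferConfig

s3_resource = boto3.resource("s3")
s3 = s3_resource                      # alias used in the gzip snippet below
client = boto3.client("s3")

config = TransferConfig(multipart_threshold=1024 * 1024 * 64)  # tune chunking as needed
output_location = "extracted/"        # placeholder destination prefix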

1. Function to return the bucket and key name from an S3 path:

def split_s3_path(s3_path):
    # "s3://bucket/prefix/file.zip" -> ("bucket", "prefix/file.zip")
    path_parts = s3_path.replace("s3://", "").split("/")
    bucket = path_parts.pop(0)
    key = "/".join(path_parts)
    return bucket, key

This returns the bucket name and key name of the provided path, both as strings.
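
For example, with a placeholder path:

path = "s3://my-bucket/incoming/archive.zip"   # placeholder source object
bucket, key_name = split_s3_path(path)
# bucket   -> "my-bucket"
# key_name -> "incoming/archive.zip"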

2. Identify and extract the contents of the zip file to another location:

if path.endswith('.zip'):
    zip_obj = s3_resource.Object(bucket_name=bucket, key=key_name)
    buffer = BytesIO(zip_obj.get()["Body"].read())   # read the whole archive into memory
    z = zipfile.ZipFile(buffer)
    for filename in z.namelist():                    # iterate over every entry in the archive
        file_info = z.getinfo(filename)
        s3_resource.meta.client.upload_fileobj(
            z.open(filename),                        # stream the uncompressed entry
            Bucket=bucket,
            Key=output_location + filename,
            Config=config
        )

Alternatively, one can use the zipfile.is_zipfile() function instead of path.endswith(), as sketched below.
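
is_zipfile() checks the content rather than the name, which helps when the extension is missing or wrong. A small sketch of that variant, reading the object into the same kind of buffer first:

buffer = BytesIO(s3_resource.Object(bucket, key_name).get()["Body"].read())
if zipfile.is_zipfile(buffer):        # content-based check; accepts a file-like object
    buffer.seek(0)                    # rewind before handing the buffer to ZipFile
    z = zipfile.ZipFile(buffer)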

3. Lastly, we check for a gzip object as well:

elif path.endswith('.gz'):
    obj = s3.Object(bucket_name=bucket, key=key_name)
    with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
        content = gzipfile.read()                    # decompressed bytes
    client.put_object(Body=content, Bucket=bucket, Key='newfolder/new_filename.txt')
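
For large gzip objects, holding the whole decompressed content in memory can be avoided by passing the GzipFile object straight to upload_fileobj, which only needs a readable file-like object. A sketch, with a placeholder destination key:

obj = s3.Object(bucket_name=bucket, key=key_name)
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    s3_resource.meta.client.upload_fileobj(
        gzipfile,                                  # streamed, decompressed on the fly
        Bucket=bucket,
        Key='newfolder/new_filename.txt',          # placeholder destination key
        Config=config
    )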

With this we can easily read the zip and gzip files present in S3.

The Conclusion

Thus, as we have seen, we can easily read compressed files (zip/gzip) with the help of Python modules and execute the script in AWS Glue without worrying about time constraints. We can also write out the content of the compressed files as text files without ever decompressing them to disk.

You can even add blocks to read bz2, tar, and many other compressed formats in the same way, branching on endswith(); a sketch for bz2 follows.
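
For instance, an extra bz2 branch might look like this (add `import bz2` at the top of the script; the destination key is again a placeholder):

elif path.endswith('.bz2'):
    obj = s3.Object(bucket_name=bucket, key=key_name)
    content = bz2.decompress(obj.get()["Body"].read())   # decompress entirely in memory
    client.put_object(Body=content, Bucket=bucket, Key='newfolder/new_filename.txt')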

I hope this was helpful for developers and data engineers. Do let me know your thoughts and any possible optimizations.

Jay Jain is a Senior Data Engineer at Exponentia AI | AWS Certified Solution Architect | BI Tool | ETL | Spark | AWS Glue | Data Warehouse | Big Data