How to read compressed files from an Amazon S3 bucket using AWS Glue without decompressing them
Introduction to AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics.
You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue at your data stored on AWS, and AWS Glue discovers it and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. AWS Glue supports Spark (PySpark and Scala) jobs as well as Python shell jobs.
Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.
From certain sources we were receiving data in compressed form directly in our S3 bucket. The challenge was to read the content of this compressed data without decompressing the files.
All the compressed files present in S3 were either zip or gzip, and we had to handle both variants of the file format in a single script.
The first step is to identify whether the file (or object) in S3 is zip or gzip, for which we will use the file's path (via the Boto3 S3 resource Object).
This can be achieved using Python's endswith function. With the help of the path provided, you can also split it into the bucket name and key name, which will be used later as well.
Based on how the compressed object is identified, the respective block for zip or gzip (or any other format) is called in the script (here, I have used if/else).
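As an illustration, the identification step can be sketched as a small helper. The name `detect_compression` and the example paths are my own, not from the original script:

```python
def detect_compression(s3_path):
    # Identify the compression format from the object key's suffix,
    # using Python's endswith function as described above.
    if s3_path.endswith(".zip"):
        return "zip"
    elif s3_path.endswith(".gz"):
        return "gzip"
    return "unknown"

# Hypothetical paths, for illustration only:
print(detect_compression("s3://my-bucket/incoming/data.zip"))     # zip
print(detect_compression("s3://my-bucket/incoming/data.csv.gz"))  # gzip
```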
Once the file format is identified, we have to read the content of the zip file. This can be done by reading it into a BytesIO buffer object and then iterating over each entry in the archive using the namelist method. Finally, to write the output to S3, we can use either the meta.client.upload_fileobj method or client.put_object.
Similarly, to read the content of the gzip file, I have used the gzip.GzipFile constructor.
It reads the content of the S3 object using Python's read function and then, with the help of the put_object Boto3 command, dumps this content as a text file into your destination.
I have used an AWS Glue Python shell job to execute the following code. We could even use AWS Lambda for the same, but the problem with AWS Lambda is its execution time limit, which is a maximum of 15 minutes. So if your file size is large, it might cause a problem.
1. Function to return the bucket name and key name from an S3 path:

```python
def split_s3_path(s3_path):
    # "s3://my-bucket/folder/file.zip" -> ("my-bucket", "folder/file.zip")
    path_parts = s3_path.replace("s3://", "").split("/", 1)
    bucket, key = path_parts[0], path_parts[1]
    return bucket, key
```

This returns the bucket name and key name of the path provided, both as strings.
2. To identify and extract the contents of the zip file to another location:

```python
import zipfile
from io import BytesIO

zip_obj = s3_resource.Object(bucket_name=bucket, key=key_name)
buffer = BytesIO(zip_obj.get()["Body"].read())

z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)  # per-entry metadata, if needed
    # Write each entry back to S3 without extracting to local disk
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key=output_location + filename,
    )
```
Alternatively, one can also use the is_zipfile() function instead of path.endswith().
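A quick, self-contained sketch of that alternative: is_zipfile() inspects the file's magic bytes rather than its name, so it also catches zip archives with unexpected extensions. The in-memory archive below is purely illustrative and stands in for the S3 object's bytes:

```python
import io
import zipfile

# Build a tiny zip archive in memory to stand in for the downloaded object.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello")

print(zipfile.is_zipfile(buf))                       # True
print(zipfile.is_zipfile(io.BytesIO(b"not a zip")))  # False
```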
3. Lastly, we need to handle the gzip object as well:

```python
import gzip

obj = s3_resource.Object(bucket_name=bucket, key=key_name)
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    content = gzipfile.read()

s3_resource.meta.client.put_object(
    Body=content, Bucket=bucket, Key="newfolder/new_filename.txt"
)
```
With this, we can easily read the zip and gzip files present in S3.
Thus, as we have seen, we can easily read compressed files (zip/gzip) with the help of Python modules and execute the script in AWS Glue without worrying about any time constraints. We can also easily create a text file of the content present in the compressed files without decompressing them.
You can even add blocks to read bz2, tar, and many other compressed file formats just by using the endswith function.
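For instance, a bz2 branch would mirror the gzip one above. In this sketch, an in-memory BytesIO stands in for the S3 object's Body, and the payload is made up for illustration:

```python
import bz2
import io

# Made-up payload, compressed the way a source system might deliver it.
compressed = bz2.compress(b"some,csv,data\n")

# bz2.open accepts a file-like object, just like gzip.GzipFile(fileobj=...).
with bz2.open(io.BytesIO(compressed), "rb") as f:
    content = f.read()

print(content)  # b'some,csv,data\n'
```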
I hope this was helpful for developers and data engineers. Do let me know your thoughts on this and/or any possible optimizations.