Consuming Large Files from S3 at Airtel Digital

Parshantverma · Published in Airtel Digital · Oct 16, 2023

In this blog post, we will delve into the process of consuming and downloading large files from S3. At Airtel Ads, we regularly handle very large files for a variety of use cases.

Problems faced when we used the AWS S3 library methods

We initially used the S3 library to process files from S3. However, when dealing with large files, some in the hundreds of gigabytes, we encountered connection issues such as ‘Connection Break’ or ‘Reset Error.’ The S3 library processes files in chunks and maintains a persistent TCP connection with the S3 servers. When processing large files from S3, a broken connection can be problematic, as it requires either reprocessing the file from the beginning or resuming from the last byte already processed. We also attempted to adjust parameters to increase the TCP connection’s stability but still faced connection issues.
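
For context, here is a minimal sketch of the straightforward approach we started with (AWS SDK for Java v1; bucketName, objectKey and the per-line handler are placeholders):

S3Object s3Object = amazonS3.getObject(new GetObjectRequest(bucketName, objectKey));
// The whole object is streamed over one long-lived connection; for objects of
// hundreds of gigabytes, a dropped connection means re-reading from the start.
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(s3Object.getObjectContent()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        handleLine(line); // hypothetical per-line handler
    }
}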

Solution for processing large files

We needed to process files in customized byte ranges, which led us to delve into the AWS library to understand how it interacts with S3 to retrieve object content. We then utilized the same objects to manually make S3 calls with specific byte ranges, following these steps:

Step 1: Establish a TCP connection with Amazon S3.

Step 2: Send requests to S3 with the object key and the desired byte size ranges for chunk processing.

Step 3: Receive responses from Amazon S3.

Step 4: Close the TCP connection.

Step 5: Repeat the above four steps for subsequent chunks if they are available.
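
To illustrate steps 1–4, a single ranged request with the AWS SDK for Java v1 looks roughly like this (bucketName and objectKey are placeholders):

// Fetch only the first 10 MB of the object; S3 answers with 206 Partial Content.
GetObjectRequest request = new GetObjectRequest(bucketName, objectKey);
request.setRange(0, 10L * 1024 * 1024 - 1);
S3Object chunk = amazonS3.getObject(request);
byte[] bytes = IOUtils.toByteArray(chunk.getObjectContent());
chunk.close(); // Step 4: release the TCP connection before requesting the next range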

Here’s a code snippet to process very large files.

FileMetadata: a simple holder for the file’s bucket name, object key, file name and content length.

Get the S3 object once to read its content length, then close it:

long contentLength = s3Object.getObjectMetadata().getContentLength();
s3Object.close();

chunkSizeInMB is configurable; we used 10 MB, but it can be set to any value:

private int chunkSizeInMB = 10; // chunk size in MB; can be updated to any value
// Uses the AWS SDK for Java v1 (com.amazonaws.services.s3.*) and com.amazonaws.util.IOUtils.
public void processFile(FileMetadata fileMetadata) {
    final long chunkSize = (long) chunkSizeInMB * 1024 * 1024;

    Enumeration<InputStream> enumeration = new Enumeration<InputStream>() {
        long currentPosition = 0L;
        final long totalSize = fileMetadata.getContentLength();

        @Override
        public boolean hasMoreElements() {
            return currentPosition < totalSize;
        }

        @Override
        public InputStream nextElement() {
            byte[] byteArray;
            try {
                S3Object s3Object = getS3ObjectByRange(fileMetadata.getBucketName(),
                        fileMetadata.getObjectKey(), currentPosition,
                        currentPosition + chunkSize - 1);
                byteArray = IOUtils.toByteArray(s3Object.getObjectContent());
                currentPosition += chunkSize;
                s3Object.close(); // Close the connection once the chunk data has been read
            } catch (Exception ex) {
                log.error("Error in reading file : {}, ex :{}",
                        fileMetadata.getFileName(), ex);
                throw new SegmentationException(ErrorEnum.FILE_READ_FAILURE);
            }
            return new ByteArrayInputStream(byteArray);
        }
    };

    // SequenceInputStream asks for the next chunk automatically whenever
    // hasMoreElements() reports that more data is available.
    SequenceInputStream sequenceInputStream = new SequenceInputStream(enumeration);
    try (BufferedReader br = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(sequenceInputStream)))) { // In our case the object was gzip-compressed
        String currLine = br.readLine();
        while (currLine != null) {
            System.out.println(currLine);
            currLine = br.readLine();
        }
    } catch (IOException ex) {
        log.error("Error while processing file : {}", fileMetadata.getFileName(), ex);
        throw new SegmentationException(ErrorEnum.FILE_READ_FAILURE);
    }
}
The method below fetches the S3 object content for a given byte range.

public S3Object getS3ObjectByRange(String bucketName, String objectKey,
                                   long startIndex, long endIndex) {
    try {
        GetObjectRequest getObjectRequest = new GetObjectRequest(bucketName, objectKey);
        getObjectRequest.setRange(startIndex, endIndex); // S3 clamps the end index to the object's last byte
        return amazonS3.getObject(getObjectRequest);
    } catch (Exception ex) {
        log.error("Exception occurred while fetching s3 object : {}", objectKey, ex);
        // Rethrow so the caller can retry fetching this chunk from S3 if needed.
        throw new SegmentationException(ErrorEnum.FILE_READ_FAILURE);
    }
}
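
Putting it together, a hypothetical caller might look like the following (the FileMetadata setters are assumptions based on the getters used above):

FileMetadata fileMetadata = new FileMetadata();
fileMetadata.setBucketName("my-bucket");            // placeholder bucket
fileMetadata.setObjectKey("exports/large-file.gz"); // placeholder object key
fileMetadata.setFileName("large-file.gz");
fileMetadata.setContentLength(contentLength);       // value read from getObjectMetadata() earlier
processFile(fileMetadata);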

Conclusion

Using this method, we can successfully process or download files that are hundreds of gigabytes in size without encountering any connection issues. Additionally, AWS offers S3 Select, which can retrieve just a subset of an object’s data using a SQL expression (see the reference below).
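
If only a subset of the data is needed, S3 Select can push filtering down to S3 itself. Below is a rough sketch with the SDK for Java v1, assuming a gzip-compressed CSV object (bucket, key and SQL expression are placeholders; please validate the options against the linked documentation):

SelectObjectContentRequest request = new SelectObjectContentRequest();
request.setBucketName("my-bucket");          // placeholder bucket
request.setKey("exports/large-file.csv.gz"); // placeholder object key
request.setExpression("SELECT * FROM S3Object s WHERE s._1 = 'some-value'");
request.setExpressionType(ExpressionType.SQL);

InputSerialization inputSerialization = new InputSerialization();
inputSerialization.setCsv(new CSVInput());
inputSerialization.setCompressionType(CompressionType.GZIP);
request.setInputSerialization(inputSerialization);

OutputSerialization outputSerialization = new OutputSerialization();
outputSerialization.setCsv(new CSVOutput());
request.setOutputSerialization(outputSerialization);

SelectObjectContentResult result = amazonS3.selectObjectContent(request);
try (InputStream records = result.getPayload().getRecordsInputStream()) {
    // Only the rows matching the SQL expression are streamed back.
    new BufferedReader(new InputStreamReader(records)).lines().forEach(System.out::println);
}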

Reference

https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html
