Troubleshooting an ETL pipeline on AWS

Jan Wilamowski
5 min read · Apr 27, 2023


In the first part of this series, we created an Extract, Transform, Load (ETL) pipeline on AWS with the help of Terraform. Then, I explained how to test this pipeline both manually and automatically. Now, it’s time to have a look at some possible errors you might encounter along the way. We will also compare different resource configurations to find a balance between runtime performance and costs.

Permissions

Even for our relatively simple ETL pipeline, we have to set up a number of access rights. These come in various shapes: IAM roles (attached to users), predefined and custom policies, as well as simple permissions. If you misconfigure or forget any of these, code execution can fail silently, even if logging is set up. You may encounter timeouts (more on those later), but for security reasons, AWS will not always tell you directly that one service is not allowed to access another. If the pipeline is fairly simple, it may be worthwhile to create it manually in the console instead of through Terraform. This removes one layer of complexity, and you may be able to rely on the official tutorials to set things up correctly.
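
One thing that helps is to catch and log access errors explicitly in the Lambda code instead of letting the function die quietly. Below is a minimal sketch using boto3; the function, bucket and key names are placeholders, not the actual names from this series:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def load_source_file(bucket, key):
    # Fetch an object and surface permission problems in the logs.
    try:
        return s3.get_object(Bucket=bucket, Key=key)['Body'].read()
    except ClientError as error:
        code = error.response['Error']['Code']
        if code == 'AccessDenied':
            # The function's IAM role is missing s3:GetObject for this bucket/key.
            print(f'Permission error reading s3://{bucket}/{key}: {error}')
        else:
            print(f'Unexpected client error ({code}): {error}')
        raise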

Timeouts

Timeouts, i.e. your code taking longer than allowed to execute, can happen for a number of reasons. As mentioned above, missing permissions are one of them. Lambda functions in particular have a configurable timeout, 3 seconds by default. It is worth increasing that as a first measure if you run into problems. If the code simply takes a little longer but does eventually complete, you at least know that there isn’t a permissions problem.
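
To see how close you are to the limit, the context object passed to every Lambda handler exposes the remaining execution time. A minimal sketch (the handler name and the work it does are made up for illustration):

def handler(event, context):
    # Log how much of the configured timeout is left before and after the work.
    print(f'{context.get_remaining_time_in_millis()} ms remaining at start')
    process_records(event)  # hypothetical processing step
    print(f'{context.get_remaining_time_in_millis()} ms remaining at end')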

Especially when communicating with other services, you could run into throughput limits. For example, the Free Tier for AWS DynamoDB allows for 25 provisioned Read and 25 provisioned Write Capacity Units. If you try to write many entries at once, you will quickly run into request throttling, causing the calling Lambda function to wait until its own timeout forces it to quit. Luckily, such a problem is easy to spot since CloudWatch tracks throttled requests, and you can set up alarms to notify you of the failures.
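
If you write to DynamoDB from a Lambda function, batching the writes and logging the throttling error explicitly makes the failure visible instead of just timing out. A minimal sketch with boto3; the table name and item shape are made up for illustration:

import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('etl-records')  # hypothetical table name

def store_records(records):
    try:
        # batch_writer buffers items and automatically retries unprocessed ones
        with table.batch_writer() as batch:
            for record in records:
                batch.put_item(Item=record)
    except ClientError as error:
        if error.response['Error']['Code'] == 'ProvisionedThroughputExceededException':
            print('DynamoDB throttled the writes, consider raising the provisioned capacity')
        raise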

Your code may also simply need much more time than you would expect. Even processing just a few thousand CSV records, as we did in part one of this series, can take more than five seconds with the default hardware configuration. I will talk more about balancing runtime performance with costs below because this is a fairly complex topic. One simple thing you can do is to create a timing wrapper in your Python code that will log execution times:

import time
from contextlib import contextmanager

@contextmanager
def timer(name):
    start_time = time.time()
    try:
        yield
    finally:
        print(f'{name} took {time.time()-start_time:.3f}s')

print('running extraction...')
with timer('extracting'):
    extract_data()

This will create log entries that look like this:

running extraction...
extracting took 0.472s

Source archiving

We used Terraform’s archive_file data source to pack up our application code and deploy it to AWS Lambda. While this works fairly well, there are a few quirks you may run into. For one, we can configure archive_file to zip the whole src folder. It will, however, not include the folder itself in the archive, which means the entry point in the aws_lambda_function configuration (the handler) must not include the src/ part either. This is an open issue that is currently (2023) being addressed. When I encountered this, one suggested solution was to create an intermediate folder. Since I didn’t want to change the directory structure (that would affect the unit tests), I symlinked the source folder from one level below. However, it turns out that archive_file also does not properly handle symbolic links, which has been a long-standing known issue. Luckily, that one has also seen some recent activity.

Performance

When running a test case from the Lambda console (as described in part 2 of this series), AWS will tell you how much memory was used. If that happens to be exactly the amount you configured, your code likely needs more, and AWS will highlight that fact. However, even if you provide more than the strictly required amount, you may not get optimal performance. That’s because with every increase in memory, AWS also scales up the available CPU resources. So even if you don’t need additional RAM, you will likely benefit from the added computing power. This is well documented, but your actual runtime performance and costs will depend on your workload. For this reason, I tested my setup with different memory (and thus CPU) settings, from the default value (128 MB) all the way up to the maximum allowed in my region (3008 MB). Unsurprisingly, spending more money (measured in gigabyte-seconds, below in red) gets you more power (lower runtime, in blue):

[Figure: Runtime vs. cost comparison across memory settings]

As the above graph shows, the relationship is not linear. Depending on one’s needs and preferences, a memory setting in the range of 512 to 1024 MB may be a good balance here. With only about a 25% cost increase over the baseline, the execution time can be sped up by a factor of 3 to 5. Doubling the cost can give up to 10x more performance! All of these measurements exhibit some variation, of course. For the above, I took the median value out of three runs for each configuration. One thing to keep in mind is that the first invocation on a new setup will always be slower since AWS needs to provision and spin up new containers.
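
To compare configurations yourself, you can compute the billed cost in gigabyte-seconds from the memory setting and the measured duration. A small sketch with made-up durations (not my actual measurements):

def gb_seconds(memory_mb, duration_s):
    # Billed compute is the allocated memory in GB multiplied by the duration in seconds.
    return memory_mb / 1024 * duration_s

# hypothetical runtimes for the same workload at two memory settings
baseline = gb_seconds(128, 12.0)   # 1.5 GB-s
upgraded = gb_seconds(1024, 1.8)   # 1.8 GB-s
print(f'cost increase: {upgraded / baseline - 1:.0%}, speedup: {12.0 / 1.8:.1f}x')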

Keep in mind that it is not obvious how much your code can be improved. When I measured the execution times for the different parts of the ETL pipeline, even the presumably I/O-bound parts (that is, the communication with S3) were significantly sped up with higher memory/CPU settings. The takeaway is that you should run representative inputs for your pipeline on different hardware configurations, plot the results and decide based on those numbers. Most likely, you will want to increase the memory size beyond the default.
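
As a starting point for such a comparison, a few lines of matplotlib are enough. The numbers below are placeholders showing the shape of the script, not my actual results:

import matplotlib.pyplot as plt

# hypothetical median runtimes (seconds) per memory setting (MB)
memory = [128, 512, 1024, 2048, 3008]
runtime = [12.0, 3.5, 1.8, 1.1, 0.9]
cost = [m / 1024 * r for m, r in zip(memory, runtime)]  # GB-seconds

fig, ax1 = plt.subplots()
ax1.plot(memory, runtime, color='blue', label='runtime (s)')
ax1.set_xlabel('memory (MB)')
ax1.set_ylabel('runtime (s)')
ax2 = ax1.twinx()
ax2.plot(memory, cost, color='red', label='cost (GB-s)')
ax2.set_ylabel('cost (GB-s)')
fig.legend()
plt.show()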

Summary

I hope I was able to clarify some possible problems with AWS-managed ETL pipelines. Access rights and performance tuning in particular can be tricky. Be wary of assumptions about your code’s runtime behavior: always measure first, then analyze and improve, and finally measure again.

Good luck and thank you for reading!
