Python for Data Engineers
Advanced ETL techniques for beginners
In this story I will talk about advanced data engineering techniques in Python. Python is arguably the most popular programming language for data work. During my almost twelve-year career in data engineering, I have run into many situations where code had issues. This story is a brief summary of how I resolved them and learned to write better code. I will show a few techniques that make ETL code faster.
List comprehensions
Imagine you are looping through a list of tables. Typically, we would do this:
data_pipelines = ['p1','p2','p3']
processed_tables = []
for table in data_pipelines:
    processed_tables.append(table)
Instead, we could use a list comprehension. Not only is it usually faster than calling append in a loop, it also makes the code more concise:
processed_tables = [table for table in data_pipelines]
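A comprehension pays off more once it also transforms or filters the items instead of just copying them. Here is a minimal sketch, assuming the same illustrative pipeline names and a hypothetical `staging_` prefix (neither comes from a real system):

```python
# Illustrative pipeline names, as in the example above.
data_pipelines = ['p1', 'p2', 'p3']

# Transform every item: add a hypothetical staging prefix.
staging_tables = [f"staging_{table}" for table in data_pipelines]

# Filter and transform in one pass: skip 'p2', upper-case the rest.
active_tables = [table.upper() for table in data_pipelines if table != 'p2']

print(staging_tables)
print(active_tables)
```

The `if` clause at the end replaces the `if` statement you would otherwise nest inside the loop body.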
The same pattern works when you loop through a very large file and transform (ETL) each row:
import json

def etl(item):
    # Do some data transformation here
    return json.dumps(item)

json_data = [{"id": 1}, {"id": 2}]  # illustrative sample; any iterable of records works
data = "\n".join(etl(item) for item in json_data)
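Note that the expression passed to `join` is a generator expression, so no intermediate list is written out in the code, but the final string still holds the whole output in memory. For a file that is truly large, one option is to write each transformed row as soon as it is read. The sketch below assumes newline-delimited JSON input; `stream_etl`, the `processed` flag, and the in-memory buffers are hypothetical stand-ins, not part of the original pipeline:

```python
import io
import json

def etl(item):
    # Hypothetical transformation: flag each record as processed.
    item["processed"] = True
    return json.dumps(item)

def stream_etl(source, sink):
    """Read newline-delimited JSON from source and write transformed rows to sink."""
    count = 0
    for line in source:  # file-like objects yield one line at a time
        line = line.strip()
        if not line:  # skip blank lines
            continue
        sink.write(etl(json.loads(line)) + "\n")
        count += 1
    return count

# In-memory stand-ins for real files opened with open(...).
source = io.StringIO('{"id": 1}\n{"id": 2}\n')
sink = io.StringIO()
rows = stream_etl(source, sink)
```

With real files you would pass two handles from `open(...)` instead of the `io.StringIO` buffers. Only one row is held in memory at a time.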