Tips and tricks for a high-performance data pipeline

Daniele · Published in Data Tech Tips · Nov 23, 2019 · 2 min read

One thing a good data engineer must do every once in a while is maintain the overall performance of their ETL tools.
Not only do you need to keep your database in good health by:

  • Creating indexes
  • Dropping unused tables to save space
  • Optimizing search queries
  • Using temporary tables for CPU-heavy operations
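
As a concrete illustration, here is a minimal sketch of those four checks using Python's built-in sqlite3 module; the table and column names (staging_events, user_id, old_snapshot) are hypothetical, and your own engine's syntax may differ:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
cur = conn.cursor()

# Demo table so the sketch runs end to end (hypothetical schema).
cur.execute("CREATE TABLE IF NOT EXISTS staging_events (user_id INTEGER, payload TEXT)")

# 1. Create an index on a column you filter on frequently.
cur.execute("CREATE INDEX IF NOT EXISTS idx_events_user ON staging_events(user_id)")

# 2. Drop a table that is no longer used, reclaiming space.
cur.execute("DROP TABLE IF EXISTS old_snapshot")

# 3. Inspect how a search query is executed before optimizing it.
cur.execute("EXPLAIN QUERY PLAN SELECT * FROM staging_events WHERE user_id = ?", (42,))
print(cur.fetchall())

# 4. Stage a heavy aggregation in a temporary table instead of recomputing it.
cur.execute(
    "CREATE TEMP TABLE user_totals AS "
    "SELECT user_id, COUNT(*) AS n FROM staging_events GROUP BY user_id"
)

conn.commit()
conn.close()
```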

You also need your ETL software to run at maximum speed. How can we achieve this?
First of all, do an analysis of what your software should do and how it is evolving. You know that the "Mk.1" version you write won't be the same as the "Mk.15", so you need to re-check its functionality:

  • Do a functional analysis of your software
  • Predict how it will evolve, covering as many future needs as you can foresee
  • Avoid writing useless functions and keep it as simple as possible (KISS principle, anyone?), as illustrated in the sketch below
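
As a hedged illustration of that last point, compare these two hypothetical versions of the same cleaning step; the "flexible" one anticipates needs nobody has asked for yet:

```python
# Over-engineered: speculative knobs for a job that only ever trims names.
# (The unused encoding/locale parameters are exactly the problem.)
def clean_field(value, strategies=("strip", "lower"), encoding="utf-8", locale=None):
    for method in strategies:
        value = getattr(value, method)()
    return value

# KISS: do exactly what today's pipeline needs, nothing more.
def clean_name(value: str) -> str:
    return value.strip().lower()
```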

Once the analysis is done, let's get to the dirty work:

  • Rewrite your code to make it lighter and more efficient, without changing any logic behind it.
  • Try to keep your operations at linear complexity (see the first sketch after this list).
  • Offload CPU-heavy operations to the GPU if your programming language allows it (see the GPU sketch below).
  • If there is any DB-bound operation, write a stored procedure and call it whenever you can. Don't run a query that pulls the entire result set into your RAM! (See the stored-procedure sketch below.)
  • Avoid unnecessary operations like renaming columns, and keep the logic as simple as you can (KISS ftw).
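
To make the complexity point concrete, here is a minimal sketch; the orders and valid_ids data are hypothetical stand-ins for real ETL inputs:

```python
orders = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
valid_ids = [1, 3, 4]

# O(n * m): every order scans the whole list of ids.
slow = [o for o in orders if o["id"] in valid_ids]

# O(n + m): build the set once, then each lookup is O(1) on average.
valid = set(valid_ids)
fast = [o for o in orders if o["id"] in valid]
```

For the GPU point, a tiny hedged sketch assuming the third-party CuPy library and a CUDA-capable GPU are available:

```python
import numpy as np
import cupy as cp  # assumption: CuPy is installed and a GPU is present

values = np.random.rand(10_000_000)
gpu_values = cp.asarray(values)            # move the data to GPU memory
result = float(cp.sqrt(gpu_values).sum())  # the heavy math runs on the GPU
```

And for the DB-bound point, a hedged sketch assuming a PostgreSQL database reached through psycopg2; the aggregate_daily_sales procedure is a hypothetical example that would live in your own schema:

```python
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect("dbname=etl user=etl")
with conn, conn.cursor() as cur:
    # The aggregation runs entirely inside the database engine,
    # instead of streaming every raw row into application RAM.
    cur.execute("CALL aggregate_daily_sales(%s)", ("2019-11-23",))
conn.close()
```

Only the aggregated result (or nothing at all) crosses the network; the heavy lifting stays next to the data.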

These are some of the operations you need to perform to keep your software as light and young as possible. And obviously, if needed, rewrite your entire codebase in a new, more suitable programming language.

Do you have any advice? Am I mistaken about anything?
Write it in the comments.
Updates are on the way!
