Tips and tricks for a high-performance data pipeline

Daniele · Published in Data Tech Tips · Nov 23, 2019 · 2 min read

One thing a good data engineer must do every once in a while is maintain the overall performance of their ETL tools.
Not only do you need to keep your database in good health by:

  • Creating indexes
  • Dropping unused tables to save space
  • Optimizing search queries
  • Using temporary tables for CPU-heavy operations
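
As a concrete illustration, here is a minimal sketch of those four checks using Python's built-in sqlite3 module; the table and column names (staging_events, user_id, old_snapshot) are hypothetical, and your own engine's syntax may differ:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
cur = conn.cursor()

# Demo table so the sketch runs end to end (hypothetical schema).
cur.execute("CREATE TABLE IF NOT EXISTS staging_events (user_id INTEGER, payload TEXT)")

# 1. Create an index on a column you filter on frequently.
cur.execute("CREATE INDEX IF NOT EXISTS idx_events_user ON staging_events(user_id)")

# 2. Drop a table that is no longer used, reclaiming space.
cur.execute("DROP TABLE IF EXISTS old_snapshot")

# 3. Inspect how a search query is executed before optimizing it.
cur.execute("EXPLAIN QUERY PLAN SELECT * FROM staging_events WHERE user_id = ?", (42,))
print(cur.fetchall())

# 4. Stage a heavy aggregation in a temporary table instead of recomputing it.
cur.execute(
    "CREATE TEMP TABLE user_totals AS "
    "SELECT user_id, COUNT(*) AS n FROM staging_events GROUP BY user_id"
)

conn.commit()
conn.close()
```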

You also need your ETL software to run at maximum speed. How can we achieve this?
First of all, do an analysis of what your software should do and how it is evolving. You know that the "Mk.1" version you write won't be the same as the "Mk.15", so you need to re-check its functionality:

  • Do a functional analysis of your software
  • Predict how it will evolve, covering as many future needs as you can foresee
  • Avoid writing useless functions and keep it as simple as possible (KISS principle, anyone?), as illustrated in the sketch below
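
As a hedged illustration of that last point, compare these two hypothetical versions of the same cleaning step; the "flexible" one anticipates needs nobody has asked for yet:

```python
# Over-engineered: speculative knobs for a job that only ever trims names.
# (The unused encoding/locale parameters are exactly the problem.)
def clean_field(value, strategies=("strip", "lower"), encoding="utf-8", locale=None):
    for method in strategies:
        value = getattr(value, method)()
    return value

# KISS: do exactly what today's pipeline needs, nothing more.
def clean_name(value: str) -> str:
    return value.strip().lower()
```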

Once the analysis is done, let's get to the dirty work:

  • Rewrite your code to make it lighter and more efficient, without changing any logic behind it.
  • Try to keep your operations at linear complexity (see the first sketch after this list).
  • Offload CPU-heavy operations to the GPU if your programming language allows it (see the GPU sketch below).
  • If there is any DB-bound operation, write a stored procedure and call it whenever you can. Don't run a query that pulls the entire result set into your RAM! (See the stored-procedure sketch below.)
  • Avoid unnecessary operations like renaming columns, and keep the logic as simple as you can (KISS ftw).
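
To make the complexity point concrete, here is a minimal sketch; the orders and valid_ids data are hypothetical stand-ins for real ETL inputs:

```python
orders = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
valid_ids = [1, 3, 4]

# O(n * m): every order scans the whole list of ids.
slow = [o for o in orders if o["id"] in valid_ids]

# O(n + m): build the set once, then each lookup is O(1) on average.
valid = set(valid_ids)
fast = [o for o in orders if o["id"] in valid]
```

For the GPU point, a tiny hedged sketch assuming the third-party CuPy library and a CUDA-capable GPU are available:

```python
import numpy as np
import cupy as cp  # assumption: CuPy is installed and a GPU is present

values = np.random.rand(10_000_000)
gpu_values = cp.asarray(values)            # move the data to GPU memory
result = float(cp.sqrt(gpu_values).sum())  # the heavy math runs on the GPU
```

And for the DB-bound point, a hedged sketch assuming a PostgreSQL database reached through psycopg2; the aggregate_daily_sales procedure is a hypothetical example that would live in your own schema:

```python
import psycopg2

# Connection details are placeholders.
conn = psycopg2.connect("dbname=etl user=etl")
with conn, conn.cursor() as cur:
    # The aggregation runs entirely inside the database engine,
    # instead of streaming every raw row into application RAM.
    cur.execute("CALL aggregate_daily_sales(%s)", ("2019-11-23",))
conn.close()
```

Only the aggregated result (or nothing at all) crosses the network; the heavy lifting stays next to the data.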

These are some of the operations you need to perform to keep your software as light and young as possible. And obviously, if needed, rewrite your entire codebase in a new, more suitable programming language.

Do you have any advice? Am I mistaken about anything?
Write it in the comments.
Updates are on the way!
