2 Silent PySpark Mistakes You Should Be Aware Of
Small mistakes can lead to severe consequences when working with large datasets.
In programming, when we make a mistake, we don’t always get an error. The code runs, doesn’t throw an exception, and we assume everything is fine. Mistakes that don’t cause our script to fail are difficult to notice and debug.
It’s even more challenging to catch such mistakes in data science, because we don’t usually get a single output we can verify at a glance.
Let’s say we have a dataset with millions of rows, and we make a mistake in calculating the sales quantities. Then, we create aggregate features based on the sales quantities, such as the weekly total, the moving average of the last 14 days, and so on. These features are used in a machine learning model that predicts demand for the next week.
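For concreteness, here is a minimal sketch of how such features might be built in PySpark. The DataFrame, its column names (date, store_id, qty), and the toy data are assumptions for illustration, not an actual pipeline.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sales data: one row per store per day
sales = spark.createDataFrame(
    [("2023-01-01", 1, 5), ("2023-01-02", 1, 3), ("2023-01-03", 1, 7)],
    ["date", "store_id", "qty"],
).withColumn("date", F.to_date("date"))

# Weekly total sales per store
weekly_total = sales.groupBy(
    "store_id", F.weekofyear("date").alias("week")
).agg(F.sum("qty").alias("weekly_total"))

# 14-day moving average: a range-based window over the date expressed
# in seconds, covering the previous 13 days plus the current day
def days(n):
    return n * 86400

w = (
    Window.partitionBy("store_id")
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    .rangeBetween(-days(13), 0)
)
sales = sales.withColumn("qty_ma_14d", F.avg("qty").over(w))
```

If the qty column is miscalculated upstream, every feature derived from it here silently inherits the error.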
We evaluate the predictions and find that the accuracy is not good enough. Then, we spend a lot of time trying to improve it with strategies such as feature engineering or hyperparameter tuning. These strategies have little impact on the accuracy, because the real problem is in the data.
This is a scenario that we may encounter when working with large datasets. In this article, we’ll go over two specific PySpark mistakes that might cause unexpected…