
Silent Heroes of Analytics: Data Preprocessing 101 — Data Cleaning

We have all heard the age-old saying, “Garbage in, garbage out.” It captures a core truth of data analytics: the quality of your insights goes hand in hand with the quality of your data. Regardless of how sophisticated your model is or how optimized your report may seem, if the foundation (your data) is flawed, your insights will be too. At first this might not sound too bad; the cost of amending the data while still in the development phase is probably not giving you a heart attack. Now imagine how those costs escalate once reports fueled by flawed data are used organization-wide, and how much effort any amendment to the data will then require. The solution? Rigorous data cleaning.

Below is my checklist, with simple Python template scripts, that helps me, and hopefully will help you, to seamlessly preprocess data and make it ready for modeling.

1. Step-by-step workflow

1.1. Understanding Your Data

Before diving into any analytic task, take a moment to familiarize yourself with your data. A thorough initial assessment can illuminate the path ahead and help you get rid of fluff while keeping all the relevant nuances in place.

The questions below always help me:

  • How, when, and where was the data collected? (Any potential data collection bias?)
  • What does it represent (rows, column names, etc.)?
  • What might be missing?
  • How raw is the data, truly? Has it already undergone any preprocessing?

Answering these questions, applying your domain knowledge, and reaching out to a subject matter expert when in doubt will help you pick the most appropriate preprocessing tools.
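A quick first look in pandas answers most of these questions. Below is a minimal sketch, assuming the dataset sits in a file called dataset.csv:

import pandas as pd
df = pd.read_csv('dataset.csv')

#Peek at the first rows and the column names
print(df.head())
print(df.columns.tolist())

#Check data types, non-null counts, and memory usage
df.info()

#High-level summary statistics for numeric columns
print(df.describe())

#Count missing values per column
print(df.isna().sum())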

1.2. Removing the Unwanted

Clutter is the enemy of clarity and distracts you from the most essential data. Get rid of duplicates and redundancies: this will help you keep your dataset relevant and crisp.

import pandas as pd
df = pd.read_csv('dataset.csv')

#To remove all the duplicated rows
df_nd = df.drop_duplicates()

#You might want to apply it only to specific columns
df_nd = df.drop_duplicates(subset=['column1', 'column2'])

#Or enhance it with additional conditioning (specific_value is a placeholder for a value of your choice)
df_nd = df[~df.duplicated(subset=['column1', 'column2']) | (df['column3'] != specific_value)]

1.3. Handling Missing Data

Missing values are usually a hiccup in my datasets. There are a few techniques for handling these gaps efficiently. Keep in mind, though, that the right approach depends on the cause of the missing data:

MCAR: Data is missing completely at random

  • The probability of missingness is the same for all observations.
  • Imagine a survey where some respondents accidentally skipped questions purely by chance, with no relation to any characteristic of the respondents — that is an example of MCAR.
  • Removing MCAR cases does not bias the results, although the smaller sample size may reduce the power of statistical tests.
  • One solution is to drop the rows with missing data:

#dropna returns a new DataFrame, so remember to assign the result
#To remove all rows containing any NaN values
df = df.dropna()

#Instead you can remove columns that contain any NaN values
df = df.dropna(axis=1)

#Similarly to how we removed duplicates, you might want to specify columns
df = df.dropna(subset=['column1', 'column2'])
  • Simple imputation, for example with the mean
#To replace all NaN values in column1 with the mean of the non-NaN values
df['column1'] = df['column1'].fillna(df['column1'].mean())

#To replace NaN with a constant value (here 0), filling at most 3 missing values
df['column1'] = df['column1'].fillna(value=0, limit=3)

#To fill NaN with the value from the previous row (change the method to 'bfill' to use the next row instead)
df['column1'] = df['column1'].fillna(method='ffill')
  • Linear interpolation: Linear interpolation allows you to estimate missing values by finding a point (x, y) between two already given points (x₁, y₁), (x₂, y₂). It comes with an assumption that the change is linear though (there are other interpolation methods to explore: polynomial, spline, time).
#For linear interpolation
df_interpolated = df.interpolate(method='linear')

#You can apply linear interpolation to part of the dataset and a different method to the rest
#Assuming 'column1' is the column you're interested in
df = df.sort_values('column1', ascending=False)

# Apply interpolation to the top 40 rows
df.iloc[:40] = df.iloc[:40].interpolate(method='linear')

# Fill the rest with the minimum value of the column
min_value = df['column1'].min()
df.iloc[40:] = df.iloc[40:].fillna(value=min_value)
  • KNN imputation: The KNN method is based on the principle that a point with missing values can be imputed using information from its ‘k’ closest neighbors in the data space.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=3)

# KNN finds neighbors in the feature space, so fit it on several related numeric columns
# and then write the imputed values back into the original DataFrame
imputed_data = imputer.fit_transform(df[['column1', 'column2']])
df[['column1', 'column2']] = imputed_data

MAR: Data is missing at random

  • The probability of missingness depends on the observed data, not on the unobserved. If you can predict the missingness based on other observed variables, the data is likely MAR.
  • For example, imagine that older respondents to a survey are less likely to reveal their income but more likely to reveal their age. If you know their age, you can model the probability that they will disclose their income. In this case, age is tracked, but income might be missing. The missingness of income data is related to the observed age data but is not related to the actual (unobserved) income values.
  • Removing MAR cases can lead to biased results unless the model accounts for the variables causing the missingness.
  • Here you might need to look for more advanced solutions that capture the relationships in the data: weighted estimating equations or model-based imputation, for example a regression model (see the sketch below).
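As a hedged sketch of model-based imputation, scikit-learn's IterativeImputer models each feature with missing values as a function of the other features (a regression-style approach). The column names below follow the age/income example and are placeholders:

#IterativeImputer is still experimental, so it needs this explicit enabling import
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

#Model each column with missing values as a function of the other observed columns
imputer = IterativeImputer(max_iter=10, random_state=0)
df[['age', 'income']] = imputer.fit_transform(df[['age', 'income']])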

MNAR: Data is missing not at random

  • The missingness depends on the unobserved data itself. This is the trickiest mechanism because it implies that there is a systematic reason for the missing values, directly related to the values themselves.
  • For example, consider a survey where respondents with higher incomes are less likely to disclose their earnings. Here, the missingness of the income data is directly related to the income values that are not observed.
  • Handling MNAR is challenging because you cannot determine the missingness mechanism using only the observed data. Additional modeling or advanced domain knowledge might be needed, and regardless of the method, it still leaves a lot of room for assumptions.

Even if you do not handle missing values with any of the solutions above, a good practice is to flag any missing data in the dataset and introduce an explicit 'missing' category.
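A minimal sketch of such flagging, with placeholder column names:

#Add an indicator column that records where values were originally missing
df['column1_was_missing'] = df['column1'].isna()

#For categorical data, replace NaN with an explicit 'missing' category
df['category_column'] = df['category_column'].fillna('missing')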

1.4. Correcting Outliers

Outliers can be a result of human input errors or experimental errors.

  • Capping: Handle outliers by setting thresholds; any data values below or above them are considered outliers and get replaced with a specified minimum or maximum value. To determine the thresholds you might either rely on domain knowledge (the range of valid data values may already be known) or use a statistical method, for example the IQR.

The Interquartile Range (IQR) method is a technique to detect outliers, especially in datasets that are not normally distributed. The IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data. In a nutshell, the method uses this measure of statistical dispersion to treat the middle 50% of the data as the reference range and considers points beyond the boundaries (typically 1.5 × IQR below Q1 or above Q3, as in the code below) as outliers.


q1 = df['values'].quantile(0.25)
q3 = df['values'].quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

#Cap the values: anything above upper_bound becomes upper_bound, anything below lower_bound becomes lower_bound
df['values'] = df['values'].clip(lower=lower_bound, upper=upper_bound)

1.5. Data Consistency

This is non-negotiable. Having ‘test_test_test’ as a column name might be tempting at the moment, but fast forward a few weeks and its meaning might be lost. Consistent naming conventions are crucial. They empower not just you but the entire organization by facilitating seamless data navigation. Check whether there are organization-wide standards that you should apply to your data; for more, take a look at this article.
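As an illustration, one common convention (an assumption here, not an organization-wide standard) is lowercase snake_case column names; the rename target below is just an example:

#Normalize column names: trim whitespace, lowercase, replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

#Rename columns whose meaning would otherwise get lost
df = df.rename(columns={'test_test_test': 'customer_segment'})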

1.6. Error Correction and Text Data Cleaning

As a data analyst, you will frequently encounter textual discrepancies — special characters that need removal, words that require stemming, and more. Encoding text data is also pivotal for certain analytic operations.

Next week I will share some linguistic nuances that I consider before and during the analysis of textual data (whether it is an exploratory data analysis or an NLP case), but for the time being, here are some universal tricks and suggestions to follow:

  • Convert to lowercase
#To convert text in a specific column to lowercase
df['column1'] = df['column1'].str.lower()

#Or to capitalize the first letter of each word
df['column1'] = df['column1'].str.title()
  • Remove leading and trailing whitespace: It is usually a result of human input, but as always, review your dataset and take extra care when modifying it.
#To remove leading whitespaces
df['column_name'] = df['column_name'].str.lstrip()

#To remove trailing whitespaces
df['column_name'] = df['column_name'].str.rstrip()
  • Filter out unwanted characters: The list is highly dependent on the data. Some characters might be irrelevant for one dataset but crucial for another.
# To replace any character that is not a number or a letter with a space
df['column_name'] = df['column_name'].str.replace('[^a-zA-Z0-9]', ' ', regex=True)

#Or to completely remove these characters
df['column_name'] = df['column_name'].str.replace('[^a-zA-Z0-9]', '', regex=True)
  • Remove non-ASCII characters: To avoid encoding or compatibility issues, consider getting rid of non-ASCII characters.

#To remove non-ASCII characters from a string
text = text.encode("ascii", errors="ignore").decode()

#To remove non-ASCII characters from a specific column in a DataFrame
df['column1'] = df['column1'].apply(lambda x: x.encode('ascii', errors='ignore').decode() if isinstance(x, str) else x)

#You can also use unidecode module to turn non-ASCII characters into the closest representation in ASCII
from unidecode import unidecode
df['column1'] = df['column1'].apply(lambda x: unidecode(x) if isinstance(x, str) else x)
  • Tokenize and stem: Splitting text into tokens and reducing words to their stems is a common preparation step for further text analysis.
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

#Split the text into tokens and reduce each token to its stem
def tokenize_and_stem(text):
    tokens = word_tokenize(text)
    stemmed = [stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed)

df['stemmed_text'] = df['column1'].apply(tokenize_and_stem)

1.7. Validation

Always make sure your data conforms to specified formats or constraints. There are a few steps that I follow to ensure I did not miss any nuances along the way:

  • Descriptive statistics: I generate high-level summary statistics (mean, median, mode, standard deviation) and compare these to the original data.
  • Distribution analysis: I like to keep things visual and tend to spot anomalies quicker on a visualization, so I usually create an additional histogram or boxplot using matplotlib or seaborn (see the sketch after this list).
  • Recheck for missing values: I ensure once again that the whole dataset is reviewed and all discrepancies are handled.
  • Domain-specific validation: I consult any last concerns with an SME or, if possible, share data for domain-specific checks.
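Building on the first two checks, here is a minimal sketch that compares summary statistics and plots a distribution; it assumes the cleaned data lives in a DataFrame called df_clean next to the original df:

import matplotlib.pyplot as plt

#Compare high-level summary statistics before and after cleaning
print(df['column1'].describe())
print(df_clean['column1'].describe())

#Visual check of the distribution after cleaning
df_clean['column1'].plot(kind='hist', bins=30, title='column1 after cleaning')
plt.show()

#Recheck for missing values
print(df_clean.isna().sum())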

2. Remarks

No preprocessing method is fail-safe. Taking some additional steps along the way allows me to preprocess the data without the fear that I will mess it up (and that fear is always my fun-killer).

2.1. Choosing the Right Methods

Not all of the data cleaning techniques will fit every situation. Always keep in mind the purpose of your analysis and context.

2.2. Backups are your best friends

Storing previous versions of the data saves me lots of energy (and sanity). Especially if the data cleaning process ends up being tedious and time-consuming, make sure to keep track of old versions and document the amendments you make to the data.
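One simple way to do this is to write a timestamped snapshot before each major cleaning step; a sketch with a placeholder file name:

from datetime import datetime

#Save a timestamped snapshot of the current state before the next round of cleaning
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
df.to_csv(f'dataset_backup_{timestamp}.csv', index=False)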

2.3. Maintaining Raw Data

Remember never to discard the original, unfiltered dataset. It often serves as a time capsule, providing context, which might come in handy in the future for reference.
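In pandas terms, this can be as simple as keeping an untouched copy and doing all cleaning on a working copy; a minimal sketch:

import pandas as pd

#Keep the raw data untouched and clean a copy instead
df_raw = pd.read_csv('dataset.csv')
df = df_raw.copy()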

2.4. Re-evaluation Over Time

The dynamic nature of data mandates regular check-ins and updates to the cleaning process. What works well for your data now, might not in the future.

2.5. Automation, but with Care

While automation can be a time-saver, proceeding with caution is key. Regularly checking for anomalies post-automation is a must. It helps not only improve the cleaning pipeline but also spot nuances, which can then lead to unexpected observations.
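One lightweight way to check for anomalies after an automated run is a handful of assertions on the output; the column names and bounds below are assumptions for illustration (for example, the IQR bounds computed earlier):

#Basic sanity checks on the cleaned dataset
assert df['column1'].notna().all(), 'column1 still contains missing values'
assert df['values'].between(lower_bound, upper_bound).all(), 'values fall outside the expected range'
assert not df.duplicated(subset=['column1', 'column2']).any(), 'duplicate rows slipped through'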

In the grand tapestry of data analytics, data cleaning might seem like a humble thread. But it is these individual threads that hold the masterpiece together. Cleaner data means clearer insights.

Stay tuned for the next chapter on data transformation.

— — — —

If you feel like discussing this further, drop me a line at:

So much fun to talk to other data geeks!

Iza Stań

Passionate about data and languages, undecided which I love more. Days aren't complete without coffee and flight deals.