As a data scientist I’ve noticed that it is often easy to get enthusiastic about all of the fascinating opportunities available to express a story through data. Even when we set aside options for modeling, the possible visual explanations (multi-colored scatterplots, simple yet immensely informative boxplots, perfectly shaded overlapping histograms, carefully fine-tuned heatmaps, etc.) are endless.
Seamless code and truly effective visualizations, however, are generally the product of a lot of hard work behind the scenes before the initial analysis and modeling can even begin. It brings to mind the home cleaning and decoration approaches that pop up every few years, such as the art of feng shui, minimalism, Shoukei Matsumoto’s “tidy house, tidy mind” outlook, and more recently the KonMari method, made famous by Marie Kondo’s The Life-Changing Magic of Tidying Up (which has even been spun off into its own Netflix series). When you walk into a gorgeous home and immediately feel a sense of peace and calm, that calm is usually the end result of meticulous planning and deliberate choices about what to keep and where to place things.
As data scientists we have to respect our belongings, understand that nostalgia is not our friend, and be open to rediscovery with our datasets.
The first step is deciding what to do with the empty (aka null) values. Whatever the reason (data entry error, user misunderstanding, lack of time for the research team to follow up, etc.), these are cells that were left blank and therefore won’t help us later in the evaluation and modeling steps. Think of these like the unpaired socks, items with holes you’re never actually going to sew back together, or that super cozy sweater that is nowhere to be found.
For the purposes of interpreting the coding examples below, we’ll use ‘data’ to represent our dataset. First, we can just grab the total number of null values that exist throughout the entire dataset:
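A minimal sketch of that check, using a tiny made-up frame in place of the real dataset (the column names and values here are illustrative):

```python
import pandas as pd
import numpy as np

# toy stand-in for the real dataset; 'data' is just the name used in this article
data = pd.DataFrame({
    'price': [10.0, np.nan, 30.0],
    'city': ['Austin', 'Boston', None],
})

# .isna() marks each empty cell True; the first .sum() counts per column,
# the second .sum() adds those counts into one grand total
total_nulls = data.isna().sum().sum()
print(total_nulls)  # 2
```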
If this number is anything but 0, we want to figure out where they’re coming from. We then can use the following code to return a list of only the names of the columns that contain null values with the specific number pertaining to each. From there we can start to notice any trends and pinpoint if there are particular columns with a disproportionate amount of null values.
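One common way to do this in pandas is to take the per-column null counts and keep only the nonzero entries; a sketch with illustrative column names:

```python
import pandas as pd
import numpy as np

# toy frame with nulls concentrated in one column
data = pd.DataFrame({
    'age': [25, np.nan, 40, np.nan],
    'name': ['Ana', 'Ben', 'Cara', 'Dev'],
})

null_counts = data.isna().sum()              # nulls per column
null_columns = null_counts[null_counts > 0]  # keep only columns with any nulls
print(null_columns)
```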
After inspecting each column, we can choose to either delete the columns or rows with null values (more on this below), or simply fill those cells in with a filler value. That filler can be the actual number 0, a new string such as “Unknown” or “N/A”, or any other value we deem appropriate. We can test this out first without inplace=True; once we include it, the change is made permanently on the dataset.
data_replace = data_original[['column_name_2', 'column_name_3']]
for column in data_replace:
    data[column].fillna(0, inplace=True)
We could also run additional loops for different columns and replace with other values as many times as needed:
data_replace_string = data_original[['column_name_22', 'column_name_23']]
for column in data_replace_string:
data[column].fillna('not null anymore!!', inplace=True)
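Put together, the replacement loops might look like this on a small made-up frame (plain assignment is used instead of inplace=True to sidestep chained-assignment warnings in newer pandas):

```python
import pandas as pd
import numpy as np

# hypothetical frame standing in for the article's dataset; names are illustrative
data = pd.DataFrame({
    'column_name_2': [1.0, np.nan, 3.0],
    'column_name_22': ['a', None, 'c'],
})

# fill numeric nulls with 0 and string nulls with a label
for column in ['column_name_2']:
    data[column] = data[column].fillna(0)
for column in ['column_name_22']:
    data[column] = data[column].fillna('not null anymore!!')

print(data.isna().sum().sum())  # 0
```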
It is a good idea to re-run the total null count after every replacement to make sure the changes are being implemented…plus it’s super satisfying when that number finally hits 0!
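That re-check is the same total null count as before; a sketch on toy values:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'score': [np.nan, 7.0, np.nan]})

data['score'] = data['score'].fillna(0)  # one replacement pass

remaining = data.isna().sum().sum()      # re-run the total null count
print(remaining)  # 0
```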
Observing/Converting Data Types
There are other factors to consider when cleaning up a home or a dataset. For example, we don’t want dresses folded up in a drawer where they’ll get wrinkled, and it doesn’t do us much good to try and store socks on hangers. It’s also a good rule of thumb to keep things organized in the same sections, such as pajamas in one area and professional attire in another. Data is similar: everything is much more cohesive later on in our functions if all the values of each column are of the same type. This can be tricky because there are times when it looks like all the values are integers, but the column is really stored as strings. Sometimes we can leave it be and it will never cause an issue…until 72 lines down in our code we get an error that a function can’t be performed on a string because it expected an integer, sending us back to the data cleaning process (like hitting a slide in Chutes and Ladders that drops you to the beginning of the board).
In Pandas, the following code can be used to display the data type of each column:
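That display is the .dtypes attribute; a sketch on a small example frame:

```python
import pandas as pd

data = pd.DataFrame({
    'price': [9.99, 14.50],
    'count': [3, 7],
    'label': ['a', 'b'],
})

# one dtype per column, e.g. float64, int64, object (strings)
print(data.dtypes)
```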
The Pandas data types are:
- float (numbers with a decimal point)
- integer (whole numbers)
- datetime (dates and times)
- object (strings)
Note: If there is a combination of different types within the same column, the dtype for that column will show as “object.” To convert all the values of a column into a particular data type (for example, from object to integer), we can use the following method:
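The conversion itself is typically done with .astype(); a sketch assuming a column of whole numbers that arrived as strings:

```python
import pandas as pd

# ages read in as strings even though every value is a whole number
data = pd.DataFrame({'age': ['25', '40', '31']})

data['age'] = data['age'].astype(int)  # convert the entire column to integers
print(data['age'].dtype)
```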
Finally, we can check all our efforts towards correcting null values and data types with one simple tool:
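That tool is .info(); the sketch below writes the summary to an in-memory buffer so it can be inspected, though calling data.info() directly prints the same table:

```python
import io
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'price': [9.99, np.nan, 4.50],
    'city': ['Austin', 'Boston', 'Chicago'],
})

buffer = io.StringIO()
data.info(buf=buffer)      # column names, non-null counts, and dtypes in one table
summary = buffer.getvalue()
print(summary)
```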
This will again generate a list of every column of the dataset, showing both the non-null count and the data type for each.
It’s also important to look at outliers in our data. They don’t always indicate an error, but they should be analyzed properly and taken into consideration, since they can skew the mean and other calculated values we use to make predictions. This could be compared to a fancy dress that you know you don’t wear very often, but it is still a valid piece in your wardrobe that you make an intentional decision to keep. On the other hand, it could be 7 scarves that you’ve never seen before and can’t even explain how they got there…presumably thanks to your sister who visited a few years ago, but we’ll never really know.
A great way to identify outliers is with the .describe() tool. For every numeric column it returns the count, mean, standard deviation, and minimum/maximum values. It is a helpful way to catch values that should never be negative (price, age, area, number of bathrooms, etc.) and others that are obviously incorrect (such as a person’s age of 200+, someone’s DOB as a future date, an entry of “7” on a survey question intended to be ranked 1–5, etc.).
We can also use the following code to extract and observe the describe() values for individual columns:
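Both the full-frame and single-column forms of .describe() are sketched below, on made-up values that include deliberately suspicious outliers:

```python
import pandas as pd

data = pd.DataFrame({
    'price': [120.0, 95.0, -40.0, 130.0],  # a negative price is obviously wrong
    'age': [34, 29, 210, 41],              # so is an age of 210
})

stats = data.describe()                 # count, mean, std, min/max per numeric column
price_stats = data['price'].describe()  # the same summary for a single column
print(stats)
```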
Removing Columns and Rows
After investigating the origin and causes of null values and outliers, a data scientist is often faced with the decision to exclude certain data from the analysis completely because it would cause more harm or distraction than good. For some, this is a difficult part and can cause some anxiety: knowing when to just let things go. It’s like that scarf you haven’t actually worn in almost a decade but keep around “just in case,” when in reality it’s crowding up your closet and even causes an emotional reaction because it was a present from your ex and you’re convinced he would be so hurt if he knew you got rid of it…but deep down you know it’s best to move on and focus on the items you do wear on a regular basis. There is a wide variety of techniques for deleting columns and rows; here are some of the most common:
To drop any rows that have at least one remaining null value:
data.dropna()
To drop any columns that have at least one remaining null value:
data.dropna(axis='columns')
To drop a specific column:
data.drop(['column_a', 'column_b'], axis=1)
OR
data.drop(columns=['column_a', 'column_b'])
To drop a specific row, we can use its index number:
data.drop(7)
OR multiple:
data.drop([7, 11, 14])
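The drop variants above can be seen side by side on a toy frame (all names and index values are illustrative):

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'column_a': [1.0, 2.0, np.nan, 4.0],
    'column_b': [5, 6, 7, 8],
})

no_null_rows = data.dropna()                # drops the row at index 2
no_null_cols = data.dropna(axis='columns')  # drops column_a entirely
fewer_cols = data.drop(columns=['column_a'])  # drop a column by name
fewer_rows = data.drop([0, 3])              # drop rows by index label
```

Each call returns a new DataFrame and leaves data itself untouched, so the variants can be compared safely before committing to one.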
Saving New Dataframes
This is the fun part: you get to save your brand new sparkling dataset and rename it completely so that none of your efforts up to this point can be undone. It’s like a new dresser or bin or shelf (or just a box on a shelf!) that has its own designated purpose going forward…with an airtight seal that actively blocks extra clutter from getting in.
clean_data = pd.DataFrame(data)
An additional step would be to save that dataframe as an entirely new csv file itself that can be imported for future notebooks or projects. The index=False is used to avoid creating an ‘Unnamed’ column with the dataset’s index values as the new csv will already have an index.
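A sketch of that save step, writing to an in-memory buffer instead of a real file path (the frame contents are made up; with a real dataset you would pass a filename like 'clean_data.csv'):

```python
import io
import pandas as pd

clean_data = pd.DataFrame({'price': [9.99, 4.5], 'city': ['Austin', 'Boston']})

buffer = io.StringIO()                  # stands in for a path such as 'clean_data.csv'
clean_data.to_csv(buffer, index=False)  # index=False prevents the 'Unnamed: 0' column
print(buffer.getvalue())
```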
Some KonMari exceptions…
I would say we can go ahead and skip the step of taking a moment of silence to introduce ourselves to the dataset and express our gratitude to a Jupyter Notebook for allowing us the space to run our code before reading in our files or other data. It’s also probably unnecessary (not to mention an example of poor time management skills) to look at each column and row individually while hugging our laptops and sensing if each one brings us joy to determine if we throw it out or not (after thanking it for its contribution to our lives of course). However, there is something to be said about approaching each new data science project with a fresh mindset and a clean slate. Not all datasets are created equally — some need just a little TLC while others have stubborn and hard-to-reach stains you have to just keep digging for and scrubbing until your mind is about to explode. It can be a grueling process, but in the end it’s worth it because it will have a positive impact on the quality of your work overall. As Matsumoto put it,
“If you see cleaning as a chore and something to avoid if you can, then it’s something that’s not good for you in your mind. But if you can change your recognition of cleaning from something that’s negative to something that’s positive, then your quality of life will improve.”
— Shoukei Matsumoto, A Monk’s Guide to a Clean House and Mind
Tidying up our data is an inevitable, crucial step in the data science process, and the more we learn to accept, or dare I say enjoy, that piece of the puzzle, the more our patience and the quality of our projects will improve. Perhaps we could implement a new routine of intermittently repeating the mantra “…tidy dataset, tidy mind…” throughout the data cleaning process while burning some sage and sipping on hot tea. The good news (as with all things) is that the more practice you get, the better your instincts will be about which approach to take and which features to keep. More importantly, your grit and determination will pay off with fewer errors in your code and visual displays, enabling you to find connections and solve real-world problems…and that’s the magic that truly sparks a data scientist’s pure JOY!