
For this micro overview of cleaning data using pandas we will be assuming some basic knowledge of python, methods, and regular expressions.
Dealing With Missing Data
We will often have to deal with missing data. Two common ways to do this are to fill the missing data with some number (mean, median, mode) or to drop the rows with the missing data altogether.
- dataframe.fillna()
- dataframe.dropna()
Dealing With Strings
When cleaning strings we work with series data instead of the data frame. Each column of the data frame, when accessed separately is treated as a series data type. We can then use methods like replace and contains along with regular expressions to do deal with complex formatting issues. You can also set the expand parameter to True for the split method in order to return a data frame with the columns for each split.
Formating Strings
- dataframe.column_name.str.lower()
- dataframe.column_name.strip()
Cleaning Strings
- dataframe.column_name.replace()
- dataframe.column_name.str.contains()
- dataframe.columns_name.str.split()
Dealing With Dates
Similarly for dates we must first start with a series to do things like get day name from a date.
- dataframe.column_name.dt.days()
- dataframe.column_name.dt.day_name()
As you can see, manipulating data in pandas is really straight forward if you understand python and numpy. This is just a very brief sampling of what you can do using pandas. The documentation is always the best place to find more information.
But what if I just want to split strings?
I’ve come to love Tableau not just for rapid visualization, but also for simple string splits. Instead of needing to use SQL or python, we can do this directly in Tableau.
Tableau will try to guess what type of split you want if you click the drop down menu next to a column on the Data Source tab. Let’s do this for a case number that starts with a letter representing case type, followed by numbers generated for each case.


We get something kind of off here. Tableau seems to have picked up on the most common format, but fails for our second row because of that. Let’s try choosing the custom split instead.
We will split all of the case number components into new rows


This looks much better! If this is the type of split we were looking for we’d be all set. But what if we really just wanted to know the case type. We’d want to split on just the first dash. Let’s try that.


This turned out to be exactly what we were looking for!
We should always be on the lookout for faster and easier ways to work with data. While using open source tools like python and R are often the best way to deal with really messy data, and perform complex analysis with the benefit of flexibility, there are also a lot of software options that can make our jobs simpler.
