Data-Sci Basics: Pandas
For this entry, we will discuss some key points of the data cleaning process, including some important considerations, and then practice on a real world dataset from New York City’s 311 hotline.
WHAT IS DATA CLEANING?
Data cleaning is the actual process of removing any anomalies and inconsistencies from your data and preparing it to be processed. This means normalizing all of the values in your Series,
(If a Series should only have a two-word string for its values, then you make sure all of the values match this ‘norm’. Keep in mind, each Series will have its own norm based on what it is describing, so Series’ for zip codes, longitudes, names, stock indexes, etc may all have a different norm.)
and overall just making sure that everything in your data set makes sense. If values don’t make sense, you try to programmatically convert them to sensible data. As a last-case scenario, you get rid of them, but this should always be the last case.
The first step in data cleaning is to figure out what to do with “bad” data. When we say bad data, what we actually mean is data that is invalid for some reason. This is most easily expressed as a ‘Null’, or ‘NaN’ value (from numpy), but can also mean any type of data that is just the wrong kind or doesn’t pass that Series norm. For example, if you have:
- a series that is supposed to describe latitude, every valid point will be a floating-point with 5 decimal places. ‘Bad’ data would include non-float integers, floats with only 3 decimal places, or any kind of string value
- a series that describes Zip-code as a string of five numbers, bad data would include: a 5-value integer (type == int), any floating-point value, the string “JFK”, etc
- And of course, in any data point, ‘NaN’ indicates an invalid or missing point
Though we describe this data as ‘bad’, keep in mind that we do not just want to indiscriminately delete all “bad” data; some poorly formatted data may still contain valuable information. For instance, values of the wrong type can be converted. the string ‘JFK’ in zip codes could indicate JFK airport in New York, so that is still a convertible, valuable value.
That being said, values that are not good, that really cannot be worked with, we will turn into a universal ‘bad value type’; this is where the Numpy ‘Not a number’ value comes in. Any data-point whose value has been determined meaningless and non-convertible can be assigned the value ‘np.NaN’, (called a ‘nan’ value). This will be very helpful later, as it will allow you to drop rows that contain ‘nan’ values for cleaner analysis.
TIME AND DATE
When working with any data that is date or time based, we want to convert that data into a ‘datetime’ object. This way, we can use all the normal arithmetic and comparisons that we are used to against the newly created datetime object (this is simply not possible with normal python strings, which dates are very often stored as).
We will do this using ‘Lambda’. Lambda functions are exactly like normal functions, except that they are ‘anonymous’; this just means that they are defined, and exist, only during the one use in which they are called.
import datetime as dt
data[‘Created Date’] = data[‘Created Date’].apply(lambda x:dt.datetime.strptime(x, ‘%m/%d/%Y %I:%M%S %p’))
Note, that datetime is using a series of regular expressions to EXACTLY mimic the format of the datetime string contained in ‘Created Date’. WHENEVER you use the datetime library, you MUST be sure to use regex to mimic THE EXACT FORMAT of how the string is written.
Now, once you have done the conversion, if you call ‘data.info()’, you will see that ‘Created Date’ contains ‘datetime64’ objects, instead of just ‘object’
A CASE STUDY
For this case study, I will be working with data from New York’s 311 hotline records. The 311 hotline is a general purpose, non-emergency hotline for everything from noise complaints to heating and cooling complaints. This particular data set contains the information about every call made to 311 in the last 7 years (millions of rows of data).
If you are not familiar with data cleaning, I highly recommend downloading the dataset. The first link above is very large (several gigabytes), but the second link is a much smaller subset, containing only the year 2015.
Understanding the Context:
The data set that we are working with contains raw, unaltered data entered by various social workers, police officers, park rangers, etc (the data for each point is entered by different individuals across varying departments and even cities). This is important to know; the variety of methods and individuals entering data means that there is likely a high variability in our data set.
This data is structured with the columns being each of the Series of information, and the rows being unique complaints, described by the information in each Series.
While the dataset contains 53 different Series of data (zip codes, lat/longs, addresses, agency receiving complaint, time of day, etc etc), many of these are redundant, only pertain to a handful of complaints, or contain way too many ‘NaN’ values. We will be constructing a view (not a copy) which only contains a handful of the series, and cleaning up this view.
We do not just want to indiscriminately delete all “bad” data; some poorly formatted data may still contain valuable information.
Our procedure will be to:
- open the csv file into a pandas DataFrame
- create a view of the DataFrame containing the Series we are interested in
- apply a self made function to clean all zip code values
— make sure all zip code values are strings of exactly 5 integers between 10000 and 19999 (presumably the range of NYC zip codes)
— note that some of the zip codes contain 4 digit extensions, we will remove those
- apply a self made function to clean date and time values from strings to datetime objects
— make the strings into datetime objects
- create two new Series, ‘processing_time’ (total time b/n complaint creation and complete resolution) and and ‘start_time_window’ (hour that a complaint was logged)
REMOVING NULL VALUES
To run analytic tests and use visualizations on data, it is sometimes necessary to remove any of the values that do not contain some kind of useful data. In most cases, you will have already cleaned up all bad data into useful data, or replaced the associated values with ‘np.NaN’ values. To remove all of the rows that contain null values in a given Series:
Next, to clean up the other null values throughout your other Series, run the ‘.notnull()’ function on each of the Series, and use a logical ‘and’ statement to limit your return value to only the notnulls across all of these val. (remember, in Pandas, and = ‘&’ , or = ‘|’, and not = ‘!’) If you did this correctly, ‘data.info()’ should return the same incidence count for each Series:
data[data[‘Incident Zip’].notnull() & data[‘Agency’].notnull()]
NOTE: dropping of null values should only be done on views of the original data, and only where necessary. For example, if I want to look at the times of creation for all complaints and then find when most complaints occur, I only need to clean the ‘NaN’ values out of the ‘Creation Date’ Series; in fact I shouldn’t drop ‘NaN’ rows anywhere else or I will lose valuable ‘Creation Date’ data.
A CLEANING CHALLENGE FOR YOU
Go through the original, uncleaned data again. Look for points that are still descriptive and valuable in the ‘Incident Zip’ Series, but that do not follow our norm (5 digit string). Things like ‘JFK’, ‘Coney Island’, etc still contain very valuable information. See if you can discover any other trends in the data that are valuable and can be cleaned to properly preserve data.
Afterwards, try and build a function that will take all of the zip codes available and sum up all of the zips wherein the most complaints occur, and then find which agency’s receive the most complaints.
For this, you may need to learn a bit more about Pandas. Below are some links to the Pandas manual pages:
Simplified, condensed, new-user friendly, in-line examples have been inserted where possible to augment the Stack…pandas.pydata.org
As always, I will be adding to this post, and please feel free to message me if you have questions or corrections (especially if you have trouble with any of the code!) Cheers!