A Look into San Francisco’s Economy via Data Cleaning + EDA

It’s inevitable that some data will be messier than others, and trying to generate meaningful findings from a messy dataset can feel like a horror story, or an unsolvable Rubik’s cube. Fear not: there are tools that can make untidy datasets more manageable for EDA, and today I intend to show them off using messy San Francisco salary data. I will be using a 12MB dataset from Kaggle consisting of SF city salaries between 2011–2014. To begin, we will open a new Python file and import all of our data visualization and curation tools.
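A minimal setup sketch; pandas and NumPy handle the curation work, while Matplotlib and seaborn cover the visualizations:

```python
# Data-curation libraries
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
```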

Next, we will load up our data to get a visual overview of what we will be working with.
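Something along these lines works, assuming the Kaggle file is saved locally as “Salaries.csv” (the file name is an assumption):

```python
# Load the SF salaries dataset (file name assumed from the Kaggle download)
df = pd.read_csv('Salaries.csv')

# Preview the first few rows to see what we're working with
print(df.head())
```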

We can see that columns such as “Benefits”, “Notes”, and “Status” commonly have missing values. To get a better sense of how many missing values we have throughout this dataset, we will run “.isnull().sum()” to get a direct count.
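For example:

```python
# Count the missing values in each column
print(df.isnull().sum())
```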

From what we can see above, a large majority of the data is missing from the “Status”, “Notes”, and “Benefits” columns. “BasePay” also has around 600+ missing values, so from here we will use the “del”, “.isnull”, and “.fillna” functions on our dataset to clean up the missing values.
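One way to apply those functions; note that filling the pay and benefits gaps with 0 is a judgment call, and if “BasePay” was read in as text you may need pd.to_numeric first:

```python
# Drop the columns that are mostly empty and not needed later
del df['Notes']
del df['Status']

# Replace remaining missing values instead of dropping rows;
# 0 is an assumed fill value for the pay columns
df['BasePay'] = df['BasePay'].fillna(0)
df['Benefits'] = df['Benefits'].fillna(0)
```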

After applying these functions, we can run “.isnull().sum()” on our dataset again to see whether it worked. If all goes correctly, you should see either a dramatic decrease in missing values or their complete disappearance from the troublesome columns we removed.
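That check is just the same count as before:

```python
# Re-count missing values to confirm the cleanup worked
print(df.isnull().sum())
```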

As we can see from our second “.isnull” count, all of the missing values have been removed, replaced, or otherwise cleaned up. From here, we should look at an overview of our dataset to review the changes we have made before moving further into the cleaning process.
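A quick overview can come from pandas’ built-in summaries:

```python
# Column types, non-null counts, and memory usage
df.info()

# Summary statistics for the numeric pay columns
print(df.describe())
```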

This is a good indication that we are moving in the right direction: most of the “NaN” values have been cleaned up, and removing the “Notes” and “Status” columns has gotten rid of unnecessary missing data that we will not need for our final visualizations. We are not finished yet, though, as there is another common error in data analytics to fix. When these types of datasets have open-ended entry for job titles and other columns, duplicates are inevitable.

For example, you can see that there are two different entries for the same job: one is registered in all uppercase, while the other is treated as a different job purely because of the difference in capitalization. This can be a big problem further down the line if we were to end our cleaning process here and start visualizing, as the duplicate job titles would misrepresent the data. We will look at an example of this below.
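A sketch of how such a plot can be produced; plot_top_titles is a helper of my own, and the top-twenty cutoff is an assumption:

```python
def plot_top_titles(data, title):
    # Tally the twenty most frequent job titles (case-sensitive,
    # so capitalization variants count separately)
    counts = data['JobTitle'].value_counts().head(20)

    # Horizontal bar chart: counts on x, titles on y
    sns.barplot(x=counts.values, y=counts.index)
    plt.xlabel('Number of employees')
    plt.ylabel('Job title')
    plt.title(title)
    plt.tight_layout()
    plt.show()

plot_top_titles(df, 'Top 20 job titles (before standardizing case)')
```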

Notice “SPECIAL NURSE” & “Special Nurse”, and “TRANSIT OPERATOR” & “Transit Operator”.

From the plot above, it is easy to see how this simple data-entry error can create misrepresentations. To fix it, we need to convert all job titles into a single standardized form, so that Python reads each title as one title instead of treating capitalized duplicates as separate jobs. I found some resources on Stack Overflow to help me convert all of the job titles to uppercase.
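The conversion itself is a one-liner with pandas string methods:

```python
# Uppercase every job title so capitalization variants collapse
# into a single standardized title
df['JobTitle'] = df['JobTitle'].str.upper()
```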

After standardizing the duplicated job titles, we can finally move on to visualizing our data. We will start with the same “JobTitle” value-count sns.barplot, comparing against the previous plot to ensure that there are no more duplicates in the dataset.
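Re-running the same helper from before on the standardized titles:

```python
# Same top-20 bar chart, now with case-duplicates merged
plot_top_titles(df, 'Top 20 job titles (after standardizing case)')
```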

Finally, we have a clean visualization! From what this plot displays, transit operators make up the largest share of jobs throughout SF, with special nurse as the second most common title. I am not completely sure of the difference between a Police Officer I, Police Officer II, and Police Officer III, although they appear in similar numbers and are distributed throughout the city. The least common title in the top twenty I used was senior clerk, which makes sense, as there does not appear to be especially high demand for that role compared with jobs such as transit operator. Lastly, I’d like to look at the most popular jobs in the city of San Francisco across all available years in the dataset.
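A sketch of one way to break the counts down by year; restricting to the five most common titles overall is my own choice:

```python
# Find the five most common titles overall (assumed cutoff)
top_titles = df['JobTitle'].value_counts().head(5).index

# Count how often each of those titles appears in each year
yearly = (df[df['JobTitle'].isin(top_titles)]
          .groupby(['Year', 'JobTitle'])
          .size()
          .reset_index(name='Count'))

# Grouped bars: one cluster per year, one color per job title
sns.barplot(data=yearly, x='Year', y='Count', hue='JobTitle')
plt.title('Most common job titles by year, 2011-2014')
plt.tight_layout()
plt.show()
```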

The trend throughout all four years in this dataset shows little variation across the top cumulative job titles, although you can see that the number of special nurses appears to be growing within the city. Most other jobs also see small increases, but this could simply be due to population growth over the years. Overall, cleaning untidy data and removing duplicates can let you find new connections and trends in messy datasets, and this example hopefully highlights how these simple processes make a dataset more manageable and the resulting analysis broader and more accurate.
