
Data Cleaning 101

Jeffrey Ng · The Startup · Dec 8, 2020


Every day, data comes at us fast and furious from infinite directions in the form of ads, people, and media. Whether it's under a rock or in the sky (figuratively), data for data science comes in all forms, often messy, in pieces, and unusable. Today I will walk through several data cleaning techniques and steps I perform to prepare my data for analysis. It is by no means a reference or even a guide, just something I put together for fun.

I will take you through a dataset preparation that began with web scraping using Selenium. The data was then preprocessed so it could finally be used in a pandas data frame for EDA. The data set I scraped is from Trulia.com, which contains real estate housing data for the entire country. For our purposes, I searched under the term New York, NY. And tada! Some 20,000 entries were returned across 334 pages, and I intend to use all of them.

Importing Selenium, I decide to use CSS selectors corresponding to the elements I want to scrape, but there are other ways to identify an element, such as by tag, XPath, or ID. I suggest taking an HTML/CSS course to get acquainted with the coding necessary to select the right elements on a web page. I then pass the selector into an instance of the Selenium driver. The other parameter I include is the URL with a {} at the end so I can .format() it inside a for loop and step through the pages. Since Selenium is a dynamic web scraping tool, all the work is done for me; I just sit back and relax while it scrapes the 334 pages of real estate housing data. NOTE: To be a good net citizen, Selenium offers an implicit wait (implicitly_wait); please refer to the documentation. I set it to 3 so the driver waits up to 3 seconds for elements to appear before giving up on each lookup. This keeps the scraper from crashing or failing when a page has not finished loading.
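Here is a minimal sketch of that scraping loop. The URL template, CSS selector, and variable names below are illustrative assumptions, not the exact ones I used:

```python
# A minimal sketch of the Selenium loop, assuming Chrome; the URL template
# and CSS selector are placeholders, not Trulia's real ones.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(3)  # wait up to 3 seconds for elements to appear

url_template = "https://www.trulia.com/NY/New_York/{}_p/"  # hypothetical page URL
raw_listings = []

for page in range(1, 335):  # 334 result pages
    driver.get(url_template.format(page))
    # each card is one listing; the selector here is a placeholder
    cards = driver.find_elements(By.CSS_SELECTOR, "div.listing-card")
    # .text returns the card's visible text joined by newlines
    raw_listings.append([card.text for card in cards])

driver.quit()
```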

Now that I have my data, I examine it and realize it is somewhat rough around the edges. I examine its structure. The data is organized as a list of lists of strings. I call the first element of the list and see that it is made of 60 strings, each holding the price, living area, address, etc., separated by '\n' in one entire string. This is definitely unusable at the moment.
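A quick inspection (continuing from the sketch above, where the scrape landed in raw_listings) shows the shape of the problem:

```python
# Peek at the raw structure: a list (pages) of lists (listings) of strings.
print(len(raw_listings))      # 334 pages
print(len(raw_listings[0]))   # roughly 60 listings on the first page
print(raw_listings[0][0])     # one listing: '$950,000\n2bd\n2ba\n...' (illustrative)
```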

I import the re module and use re.split('\n', string), which breaks the string into individual elements. Regular expressions are an invaluable and necessary tool for string splitting and cleaning. I do the same thing for the rest of the 334 elements in the list, iterating through with a nested for loop. I call the finished object z.
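Sketched out, the nested loop looks something like this, again building on the hypothetical raw_listings from above:

```python
import re

# Split every listing string on '\n' so each field becomes its own element.
z = []
for page in raw_listings:                  # 334 pages
    for listing in page:                   # ~60 listing strings per page
        z.append(re.split('\n', listing))  # e.g. ['$950,000', '2bd', '2ba', ...]
```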

I import pandas and build the frame with df = pd.DataFrame(z[i] for i in range(len(z))) (pd.DataFrame(z) is equivalent). I then rename the columns with appropriate column names. I call the data frame again and see my rows and columns configured neatly.
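Something like the following, with hypothetical column names standing in for the real ones:

```python
import pandas as pd

# pd.DataFrame(z) is equivalent to pd.DataFrame(z[i] for i in range(len(z)))
df = pd.DataFrame(z)

# Rename the positional columns to something readable; these names are illustrative.
df = df.rename(columns={0: "price", 1: "beds", 2: "baths", 3: "address"})
df.head()
```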

The next step of data cleaning involves observing your data. Unless your data comes from an organized database or a curated dataset, extensive data cleaning is often necessary. Even organized data may need another pass, such as replacing NaN values or handling outliers. We may impute these values with a specific category, or use the mean, median, or mode. We may drop them from the data completely if we aren't losing too much information. If the data frame has columns that aren't required for EDA, we may drop those too. We may also choose to engineer certain columns to better organize our data. Feature engineering is a final step that folds into data pre-processing, and this stage requires skill and experience from a data scientist.
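A few of those moves, sketched on the hypothetical df from the previous step:

```python
import pandas as pd

# Drop a column that isn't needed for EDA (hypothetical column name).
df = df.drop(columns=["badge"], errors="ignore")

# Convert '$950,000'-style strings to numbers so we can impute and plot.
df["price"] = pd.to_numeric(
    df["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Impute missing prices with the median, a common middle-ground choice.
df["price"] = df["price"].fillna(df["price"].median())

# Drop exact duplicate rows rather than imputing them.
df = df.drop_duplicates().reset_index(drop=True)
```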

To sum up this brief blog: data can come to us in a variety of ways, and when we aggregate it, it can be messy and unusable. Converting our data starts with examining its structure. It can be a .json file, a .csv, or just a list of lists, a list of strings, or a list of dictionaries. Some tools we can use are built-in Python string methods and regex.

Correctly organizing the data takes experience, knowledge of data types, and general coding know-how. For EDA, we usually must get our data into a pandas data frame. Once it is there, we need to examine it to see if it makes sense. We replace NaN values and outliers, making small tweaks such as imputing the mean, median, or mode for certain missing values. We may also drop columns or rows, or rename columns with more informative descriptions. Engineering columns or features that may help our model can be done as well, but this comes toward the end of data cleaning and is sometimes mixed in with data pre-processing.

This post was written more for me than for you, the reader. However, I hope readers can enjoy my insight into data cleaning. It is often stated that a data scientist spends 50–60% of their time on data cleaning and pre-processing. There is no exact method, and no two data sets are built alike! Each is unique, with its own identity and challenges. Happy coding everyone!
