Sales Customer Call List — Data Cleaning Using Pandas — Guided Project
This is my first time using Pandas to clean a dataset. In my past work experience, I’ve always used Microsoft Excel, which I am the most familiar with. As an aspiring Data Analyst, using more technical skills when manipulating data is very daunting. But here we are doing our best to learn as much as possible.
Sources:
- Youtube (Guided-Project): Data Cleaning in Pandas | Python Pandas Tutorials
- Dataset: Alex The Analyst — Github
- Checkout my cleaned data: Maria’s Data Cleaning Repository — Github
Scenario:
- This is a Customer Contact List, whereby, the sales team would like it cleaned up and standardised for their cold calls
- For those names that opt for NO CONTACT, we have to remove them from the list
- Remove any duplicated names and details
- Explore the dataset and check each column and row — make it consistent and standardized
Here are the steps I took:
Step 1
Create a dataframe (df1) and then call your raw dataset (excel, csv, JSON etc.). As you can notice my path is pretty long because I didn’t use the file name instead, even though it is saved inside my Jupyter Notebook Folder.
Lesson learned.
Alternative Way: You can simply save the raw dataset inside your Jupyter Notebook folder and paste over the name of the file including the type of file (eg. ‘.csv’, ‘.xlsx’), instead of using the path.
Step 2
Data exploration and observation. I look through each column and row and see the discrepancies, missing data, or formatting issues. From here, we would know what we are working with.
Keep in mind, this is a small dataset, hence, it is easy to see the discrepancies. It’ll be more challenging for bigger datasets.
Step 3
Remove duplicated rows, as seen on the dataset. It’s a smaller dataset, hence, it’s easy to eyeball if there are any duplicated rows.
Alternative Way: You can also use the below to find out if there are any duplicated rows in the dataset. It’s very helpful if you have a bigger dataset.
Step 4
Remove unnecessary column(s), such as the ‘Not_Useful_Column’. Removing unwanted columns these solely based on the gathered requirements from your end-users/stakeholders. For this instance, the ‘Not_Useful_Column’ is not needed when the sales team does the call.
Step 5
Clean and standardize each column. Check each column if their formatting is correct and no unwanted characters in each string.
a. Clean ‘Last_Name’ column
There are dots, slashes, or underscores in some of the strings within the column. We have to strip those out.
Shorter Version: This is by combining the different characters inside the strip bracket.
Longer Version: This is by doing the remove one by one.
b. Cleaning “Phone_Number” Column (Difficult)
This part got me a little confused because there are a few steps shown in the video to remove ALL characters that are non-numeral or alpha. Thereafter, adding in the hyphen ( — ) between the phone numbers.
i) Removing the characters from the strings under the Phone_Number EXCEPT a-z,A-Z,0–9. As shown in the (^a-zA-Z0–9).
ii) Converting the column into a string first then adding the hyphen between the contract numbers using lambda and indexing.
Step 6
Remove the NA — — or nan- — from the Phone Number Column by replacing them with a BLANK. So that It’ll look neater and cleaner.
Step 7
Splitting the Address Column into 3 separate columns (Street_Address, State, Zip_Code). Here, we’ve looked for commas within each cell, if there is a comma, then Pandas will split the address accordingly into the stated (n = 2).
If ‘expand = True’ is not added in, then it will only show the address in a List format.
If there is ‘expand = True’ it will shown in a Table format.
Step 8
Replace the ‘Yes’ with ‘Y’ etc. to make each string to be consistent for ‘Paying_Customer’ column.
Step 9
Replace the ‘Yes’ with ‘Y’ etc. to make each string to be consistent for ‘Do_Not_Contact’ column
Step 10
Remove all the N/a, NaN, and None items within the table.
i) As for N/a, it is a string, hence, we will have to replace it with an EMPTY string.
ii) For NaN or None, we have to fill them in with an EMPTY string.
Step 11
Remove all the rows that end-user/stakeholders cannot call. In this case, we will have to use For Loop to go to each item and upon reaching the column ‘Do_Not_Contact’ if they see ‘Y’ then drop the entire row.
As you can see below, the ‘Y’ rows are removed and left behind rows that are ’N’ and Empty cells.
Step 12
Remove all the rows that do not have any phone number under Phone_Number column. This is quite obvious as our end-user won’t be able to call them without any phone number.
Similar way as Step 11, using For Loop to go through each item and upon reaching Phone_Number column, if the cell is EMPTY. Drop the entire row.
Step 13
Reset the Index Column numbers to make it consecutive. After removing the unwanted rows, it kinda messes up with the index numbering. Hence, we have to fix this.
Step 14
Export our final customer contact list into our chosen format. For this project, I’ve chosen to export it into Microsoft Excel but you can export it in other formats as well.
Other Formats:
Conclusions
With the final Customer Contact List that has been cleaned and standardized, the end-users are now able to utilize the list to conduct their cold calls.
Using Pandas library has opened my eyes to wider possibilities of working with bigger and more complex datasets. There are far more Pandas features that I’ve not touched on yet but hopefully shortly, I’d be able to learn more about it.
Next step / Action
- I will do another data-cleaning project that is Unguided
- Use bigger and/or complex datasets from Kaggle or other sources
- Use other Pandas features when cleaning the datasets
“Maria”