Data Wrangling (English)

Published in

The Startup

9 min readFeb 4, 2021

As you know, the accuracy data is really important before doing analysis and other thing that need the data in the process. If the data that you use not accurate, then the bad thing maybe will happen, like over predict or under predict, etc. It’s really dangerous and of course everyone don’t want this thing happen. Therefore, data wrangling is really important so the data be an accurate data. Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time [1]. We can increase our data quality with data wrangling. I just learned about data wrangling from DQLab and i will share all what i got here. Data that i use here is “dirty” data customer. The detail of the columns of data is below:

a. Kode Pelanggan (Customer Code)

The code for every customer, it’s unique.

The code consists of 2 letters its K and D and 5 numbers with separator “-“. Here the sample of the code

Let’s start the process of data wrangling for this column. First, we need to see the pattern for this column, is there any different pattern ? If yes, we have to fix it. To get the pattern, we can use basic pattern analysis (bpa) function, bpa is one of the tools in R to make the process of data wrangling more easy. Therefore we need to install bpa package first to use the bpa function

Output from bpa function is below:

· Every capital letter A — Z will be represented by “A”

· Every letter a — z will be represented by “a”

· Every number 0–9 will be represented by “9”

· Space and tab will be represented by “w”

· All symbol will be represented by itself. Example: minus symbol (-) will be represented by “-“

· Missing value (NA) will be represented by “NA”

· NaN (Not a Number) will be represented by “AaA”

The format for use bpa function in r is bpa(data that you use for bpa). The result that will you get is data with their own pattern, if there are 100 data so the result is 100 patterns. If you only want to know what pattern inside and also the frequency, just add parameter unique_only = TRUE like code below

Evidently, there is a data that only have 4 numbers and its different with the other patterns in Kode Pelanggan column. This is what we have to fix in data wrangling process

After being seen, the data that have different pattern with the others is lack of number 0 at the front, so the correct data it must be KD-00047. To change it from KD-0047 to KD-00047, there are a lot of way to change it, but i use this way to change it, see the picture below

b. Nama Lengkap (Full Name)

The full name of customer

You can see at picture above that there is still nick name like “Ibu”, degree, space more than 1, etc. The things that you have do for clean the Nama column is, first find out the data that contains symbol (Besides letter, space, point [.], comma [,]) then change it to be “” (this is one of the way to remove something). The second thing that you have to do is remove the nickname like “bapak” and “ibu” and fix the degree at Nama column. The last thing is remove the space that no longer used. Maybe there is more than 1 space at the middle of the name or maybe at the beginning or end of the name. Heres the code that i use for clean Nama column

The meaning of “\\b” at the beginning and the end is we want to find something that exactly same with the writing between that. For example “\\bbapak\\b” it means we want to find the writing of “bapak” not “bapake” or “kebapak”, exactly same.

c. Alamat (Address)

The data of address for every customer

You can see that the address still not have same pattern, there are Jln. Jl. Or Jalan. We can fix it using gsub function like code below to replace Jln. And Jl. to be Jalan

d. Tanggal Lahir (Date of Birth)

Date of birth of customer. Tanggal Lahir column is the column that has a lot of different pattern in this data. You can see the different from the picture below

You can see that there is date using name of the month, the number of the month, the pattern is ddmmyy and also mmddyyyy with separator “-“ or “/”. The first thing that you have to do for clean this column is find out is there any different pattern

We got a lot of pattern for this column, therefore the next step is equalize the other pattern to be pattern from the most pattern, its 99–99–9999. Before equalize the pattern, lets see the data of pattern 99–99–9999

Based on the picture above, the data of pattern 99–99–9999 has format date dd-mm-yyyy, so we have to change the other pattern to be dd-mm-yyyy. As you know, there is data that using name of the month and also the number of the month. First, let’s see the pattern for the data that contain the name of the month

Evidently, the pattern is “day[space]name of month[space]full year”, from this pattern we only need to change the name of the month to be the number of the month. Before change it, we have to find out the name that exist in the data. The thing that you have do is just take the data that contain letters, then remove all numbers there and find the unique data. Here the codes that i use to do that all

Next, let’s change the name to be number, of course there are many way to change it, but this is what i use to change the name

Based on result that we got from basic pattern analysis, there is no longer data that contain letters, and pattern 99–99–9999 got increase, that means the data that we just change already have same pattern with the pattern that we want. The last thing that we have to do is equalize the pattern that contain “/”, there are “99/99/99” and “99/99/9999”. To change this, it will be very easy if you use format function in R

Data that we will change <- format ( data that we will change[make sure the data has date format, if not, use as.Date (data that will change to be date format, format = the format date from the data like %m/%d/%y) , the format date that you want )

Heres the date format in R

Source: https://www.statmethods.net/input/dates.html

For example, a data from Tanggal Lahir column is 1/12/1964, to change this to be date format, you have to use this as.Date(data that will change to be date, format = “%d/%m/%Y). If the data is 12/1/64, the code is as.Date(data that will change to be date, format = “%m/%d/%y). Heres the code that i use to equalize the pattern

e. Aktif (Active)

Contains information whether the customer active or not. Data for Aktif column it should be contains 1 for active and 0 for not active only, but there is some different patterns like picture below

First let’s see the patterns for Aktif column using basic pattern analysis

The most patterns is “9” it means the data is only contain one number. Now see the data for other patterns, for pattern “A” maybe there is writing mistake like it should be “1” but in data its “I” and “0” but “O”. For pattern “AAAA” the data is “TRUE”, you have to change it to be “1” and pattern “AAAAA” with data is “FALSE” you have to change it to 0. Heres the code that i use for equalize the patterns

f. Kode Pos (Postal Code)

The postal code from customer address, consists of 6 numbers. Let’s do the same thing like what we do for other columns, find out the patterns for Kode Pos column

From the result that we got, it likely writing mistake from number be letter. To change letter to be number, we can use gsub function

g. No. Telepon (Telephone Number)

Customer telephone number that can be contacted. This column also has a lot of different patterns like Tanggal Lahir column. There is data that start with +62, 0, etc. You can see the data for No. Telepon in picture below

Patterns for No. Telepon column

The most pattern is +9999999999999999 where No. Telepon start by +62. If you see the data for other patterns you will see that all data start by 0 and there is only 1 pattern that start by 62 its 9999999999999999. Now we have to change 0 at the beginning to be +62 using gsub function and add “+” for data that start by 62 using paste function

h. Nilai Belanja Setahun (Value of Spending in a Year)

Value of spending in the last one year. There is some missing value in this column. There are 2 options to fix this, there are remove the data of missing value or fill it. Usually people will fill it with mean, median or modus. Fill it with mean if the data is numeric and the data is normal (no outlier), if there is outlier then fill it with median, and modus is for categorical data. Or you can fill it with anything what you want. For this, we will fill it with mean, but before that let’s see is there any missing value using summary function