Cleaning an SMS Dataset

Rachma Chrysanti
Women in Technology
3 min read · Jun 21, 2023

If you are here to see how an SMS dataset can be cleaned, I will share the step-by-step process I use. Before we start, we need to know a little about SMS and about the kind of dataset I use here.

So, what is SMS? SMS, or Short Message Service, has been well known for many years because of its benefits. It was originally created to help people send messages and exchange news with one another. Today SMS is rarely used for exchanging news; instead it serves as a channel for promotions and for important notifications from applications or institutions. So what I am going to do is clean SMS messages that contain advertising, promotions, and important notifications.

PART 1

The first thing to do is collect the data. Choose which dates you want to analyze.

Picture 1 SMS Dataset

The dataset in Picture 1 includes the SID (sender ID), response keyword, date, msisdn (phone number), message, and count of records (how many SMS were sent to that phone number).

PART 2

Get ready for coding time. This time I will use Google Colaboratory to write my code, with Python as the programming language.

2.1 Import the libraries

We need to import some libraries, mount Google Drive, and load the CSV file before running the rest of the code.

import pandas as pd
import numpy as np
from scipy import stats
from google.colab import drive
import nltk
import string
import re

# Mount Google Drive so that Colab can read the dataset file
drive.mount('/content/drive')

# Load the CSV file into a DataFrame
dataset_path = '/content/drive/MyDrive/Tesis2023/dataset/dataset-3.csv'
df = pd.read_csv(dataset_path)

2.2 Use DataFrame

We work with the data as a pandas DataFrame. pd.read_csv already returns a DataFrame, so the line below is optional and only makes that explicit.

df = pd.DataFrame(df)

2.3 Change the column names

The column names are too long, so I think it is better to make them simple and understandable.

df.columns = df.columns.str.replace("values of sid.keyword", "sid")
df.columns = df.columns.str.replace("values of response.keyword", "response")
df = df.rename(columns=lambda x: x.replace('+', ''))
df = df.rename(columns=lambda x: x.replace('.', ''))
df = df.rename(columns=lambda x: x.replace(' ', ''))
df.columns = df.columns.str.replace("valuesoftanggalkeyword2others", "tanggal")
df.columns = df.columns.str.replace("Top10000valuesofmsisdnkeyword", "msisdn")
df.columns = df.columns.str.replace("valuesofsmskeyword", "sms")
df.columns = df.columns.str.replace("Countofrecords", "total_records")
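The chain of replacements above can also be expressed as a single mapping passed to rename. The following is a minimal sketch on a toy DataFrame; the long column names in it are hypothetical stand-ins, since the exact headers of the exported file are not shown in this article.

```python
import pandas as pd

# Hypothetical long headers, standing in for the exported column names.
toy = pd.DataFrame({
    "values of sid.keyword": ["BANK-X"],
    "values of sms.keyword": ["Your OTP is 1234"],
    "Count of records": [3],
})

# One dictionary documents every old -> new name in a single place.
column_map = {
    "values of sid.keyword": "sid",
    "values of sms.keyword": "sms",
    "Count of records": "total_records",
}
toy = toy.rename(columns=column_map)

print(list(toy.columns))
```

A mapping like this only renames exact matches, so a typo in a key leaves that column untouched instead of raising an error. It is worth printing the columns afterwards to confirm.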

2.4 Case Folding — Change to lower case

Case folding is one of the important steps in cleaning.

df['sms']=df['sms'].str.lower()

2.5 Case Folding — Change symbol to space

df['sms'] = df['sms'].str.replace(r'[^\w\s]', ' ', regex=True)
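To see what steps 2.4 and 2.5 do to a message, here is a small sketch on made-up SMS texts (the messages are invented for illustration, not taken from the dataset). Passing regex=True explicitly matters because from pandas 2.0 onward str.replace treats the pattern as a literal string by default.

```python
import pandas as pd

# Invented sample messages, not taken from the real dataset.
sms = pd.Series(["PROMO! Get 50% off, today only.", "Your OTP: 1234"])

# Step 2.4: lower-case every message.
lowered = sms.str.lower()

# Step 2.5: replace every character that is neither a word character
# nor whitespace with a single space.
cleaned = lowered.str.replace(r"[^\w\s]", " ", regex=True)

print(cleaned.tolist())
```

Because each symbol becomes a space, runs of two spaces and trailing spaces can appear in the result.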

2.6 Clean the date column

df['tanggal'] = df['tanggal'].str.replace(r'[^\w\s]', ' ', regex=True)
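As an illustration of what this replacement does, the sketch below applies it to two invented date strings. The real format of the tanggal column is not shown in this article, so the "%Y %m %d" format is an assumption, and parsing with pd.to_datetime is an optional extra step rather than part of the original workflow.

```python
import pandas as pd

# Invented date strings; the actual format in the dataset may differ.
tanggal = pd.Series(["2023-06-21", "2023/06/22"])

# Separators such as '-' and '/' become spaces.
cleaned = tanggal.str.replace(r"[^\w\s]", " ", regex=True)

# Optional: parse the cleaned text into timestamps (assumed format).
parsed = pd.to_datetime(cleaned, format="%Y %m %d")

print(cleaned.tolist())
print(parsed.dt.day.tolist())
```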

2.7 Convert msisdn to string by adding a new column

df['msisdn.keyword'] = df['msisdn'].astype(str)

2.7.1 Delete the "-" symbol and spaces in the new msisdn column

df['msisdn.keyword'] = df['msisdn.keyword'].str.replace('-', '', regex=False)
df['msisdn.keyword'] = df['msisdn.keyword'].str.replace(' ', '', regex=False)

2.7.2 Delete the old msisdn column

df = df.drop('msisdn', axis=1)

2.7.3 Rename the new msisdn column

df = df.rename(columns={'msisdn.keyword': 'msisdn'})
df
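The sub-steps of 2.7 can also be written as one chained expression that overwrites the column in place, without a temporary column. This is a sketch on invented phone numbers, offered as an alternative and not as the exact code used above.

```python
import pandas as pd

# Invented numbers for illustration only.
toy = pd.DataFrame({"msisdn": ["62 812-3456-7890", "6281234567891"]})

toy["msisdn"] = (
    toy["msisdn"]
    .astype(str)                          # make sure every value is text
    .str.replace("-", "", regex=False)    # drop hyphens
    .str.replace(" ", "", regex=False)    # drop spaces
)

print(toy["msisdn"].tolist())
```

Using regex=False here makes it clear that "-" and " " are plain characters, not patterns.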

2.8 Save the results of case folding

Picture 2 shows the results.

Picture 2 Case Folding Results
sms_casefolding = df
sms_casefolding.to_csv("sms-result.csv", encoding='utf-8', index=False)
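One thing worth checking after saving: when the CSV is read back, pandas infers the column types again, so an all-digit msisdn column becomes a number unless a dtype is given. The sketch below uses an in-memory buffer and invented values instead of the real sms-result.csv file.

```python
import io
import pandas as pd

# Invented row for illustration only.
toy = pd.DataFrame({"msisdn": ["081234567890"], "sms": ["your otp  1234"]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without a dtype, the digits are parsed as a number and the leading zero is lost.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# With dtype=str for msisdn, the value survives unchanged.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"msisdn": str})

print(inferred["msisdn"].iloc[0], as_text["msisdn"].iloc[0])
```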

Step 2.8 is the last step for this article. I hope it is useful for you. In the next article, I will continue the pre-processing steps with the same dataset. If you have any questions, please don't hesitate to comment below and discuss. Thank you for reading.
