Predicting Chronic Kidney Disease with Machine Learning — Pandas Data Mining Part 1

Alex
7 min read · Feb 20, 2023


Image credit: CDC

Chronic kidney disease (CKD) is a serious issue worldwide: as of 2017, more than 800 million people were estimated to be affected by it. Its effects on patients are severe, including financial burden, disability, and ultimately loss of life. Discovering the disease at an early stage is therefore essential so the patient can receive the necessary treatment. In this article, we perform predictions of CKD based on different criteria using a machine learning model, following the CRISP-DM framework. You can also get the full code from my GitHub: https://github.com/R3AlL3nGz3i/kidney_disease_identification

Steps to Perform

The purpose of the model is to determine whether a person is suffering from CKD. The dataset we are using originates from India, where the authors collected information on 26 attributes from 400 people. The dataset can be obtained from Kaggle. After obtaining the dataset, there are a few steps that need to be done before we can deploy our model:

1. Data Understanding

It is essential to know our data well: it lets us gain more insight into the data and helps in the later analysis and preparation steps. We can learn the amount of data, the types of data, and the coding schemes used. For example, some fields may be stored as 'Yes' or 'No', while others may be stored as 1 or 0. Understanding the data therefore makes the later steps more efficient. We can build this understanding by visualizing the data with plots and graphs.

2. Data Preparation

The next step is data preparation. In fact, data scientists spend about half of their time in this phase: cleaning the data, visualizing it, and repeating these steps until they reach an ideal training dataset. One of the primary objectives of data preparation is to process the raw data so that the outcome is valid. Raw data often contains inaccurate or missing values and inconsistently formatted fields, which cause errors during the data modelling phase. Data preparation also ensures that only the necessary data is used to build the machine learning model: irrelevant attributes are removed and the contributing attributes are selected in this step. Moreover, data preparation helps unveil the relationships between attributes, and we can discover unstructured patterns in this phase, allowing us to build a more informative and reliable machine learning model. Several methods can be applied to obtain the finalized dataset, including transforming rows and columns, blending in new datasets, detecting and removing outliers, changing field formats, and filling null values.

3. Data Modelling

After we have our finalized dataset, we can start to develop our ML model. There are two main types of model we can develop, depending on the requirement: regression and classification models. Regression models predict continuous values (a regression problem), while classification models approximate a mapping function to predict a discrete output. We will develop a classification model in this project, as the outcome is either "ckd" or "not ckd".

4. Evaluation and Deployment

Lastly, we need to evaluate the model's performance using evaluation metrics. It is important to assess the efficacy and performance of the model so that it can provide reliable results in a real-life scenario. A few metrics can be used to evaluate a classification model, including accuracy, the confusion matrix, precision, log-loss, and AUC, as sketched below.
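For reference, here is roughly how these metrics can be computed with scikit-learn. This is only a sketch: clf, X_test and y_test are placeholder names for a fitted classifier and a held-out test split, which we will only build in the modelling phase.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, log_loss, roc_auc_score)

# clf, X_test, y_test are assumed to come from the modelling phase
y_pred = clf.predict(X_test)                # hard class predictions
y_proba = clf.predict_proba(X_test)[:, 1]   # probability of the positive class

print('accuracy :', accuracy_score(y_test, y_pred))
print('confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('log-loss :', log_loss(y_test, y_proba))
print('AUC      :', roc_auc_score(y_test, y_proba))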

[Image: Confusion matrix]

Data Understanding

We will use the dataset, which consists of 26 attributes from 400 people. The target attribute of the dataset is "classification": if the person has the disease, he is classified as "ckd", otherwise as "not ckd". The other attributes in the dataset are 'id', 'age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad', 'appet', 'pe' and 'ane'.
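Before inspecting anything, we load the data with pandas. This is a minimal sketch, assuming the CSV downloaded from Kaggle is saved as kidney_disease.csv (adjust the path to your own copy):

import pandas as pd
import numpy as np

# load the CKD dataset downloaded from Kaggle
# (assumed filename; adjust to your own path)
df = pd.read_csv('kidney_disease.csv')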

# data frame info
print('data shape: ', df.shape)
df.info()  # prints the summary itself; wrapping it in print() would add a stray 'None'
print(df.columns)  # to print out all the attributes
data shape: (400, 26)
[Image: Information and shape of the initial dataset]
[Image: Attributes in the dataset]

The initial dataset contains 26 columns and 400 rows. Based on the dataset information, only two attributes, 'id' and 'classification', have no null values, while all the other attributes do. The dataset also contains three data types: int64, float64 and object.

print(df.head()) # to print out top 5 rows of the attributes
[Image: Top 5 rows of the dataset]

We can also check the values contained in each column by using df.head(). It allows us to view the top 5 rows of each column.

# checking unique values in each column
for i in df.columns:
    print('unique values in "{}":\n'.format(i), df[i].unique())
[Image: Unique values contained in each column]

After that, we check the unique values contained in each column. This helps us identify the errors present in each column before we eliminate them. The results show that the dataset contains errors such as the unnecessary tab character "\t", the question mark symbol "?", and "nan" values. For instance, there are nan values in the attributes "bp, al, su, rbc, pc, pcc, ba, bgr, ane", the "\t" issue in "rc, dm, cad, pcv, wc, classification", and question marks in "rc, dm, cad, pcv, wc". We will remove these errors in the data preparation process.

Data Preparation

In this section, we will start our data preparation process.

# dropping the id column
# it doesn't contribute to the prediction
df = df.drop(['id'], axis=1)

Firstly, we drop the 'id' column, since it does not contribute to the classification of a CKD patient. We remove it using drop, with axis=1 to drop it as a column.

# removing the tab characters in each cell
probColumns1 = ['rc','dm','cad','pcv','wc','classification']

for i in probColumns1:
    df[i] = df[i].str.replace('\t','')

# 'dm' also contains a leading space in ' yes'
df['dm'] = df['dm'].str.replace(' yes','yes')

After that, we eliminate the errors that exist in each column. We first store the affected columns in a list, then remove the unnecessary tab characters using replace. The 'dm' attribute undergoes the same technique separately to eliminate the leading blank in ' yes'.

# replacing the question mark with the mean value of the column
probColumns2 = ['rc','dm','cad','pcv','wc']

for i in probColumns2:
    df[i] = df[i].replace('?', np.mean(pd.to_numeric(df[i], errors='coerce')))

We also replace the question marks with the mean value of each column. If a value in a cell cannot be parsed as a number, it is converted to NaN instead, so it is ignored when the mean is computed.
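As a quick toy illustration of this coercion behaviour (not part of the dataset):

import pandas as pd

s = pd.Series(['5.2', '?', '4.8'])
print(pd.to_numeric(s, errors='coerce'))         # 5.2, NaN, 4.8
print(pd.to_numeric(s, errors='coerce').mean())  # 5.0, the NaN is ignored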

# forward filling the na values (at most one row)
df = df.fillna(method="ffill", limit=1)

# replace the remaining na values with the mode of each column
df_clean = df.apply(lambda x: x.fillna(x.value_counts().index[0]))

The presence of nulls causes problems during model training. Therefore, we have to eliminate them using methods such as dropping the rows with nulls, replacing them with 0, forward or backward filling, filling with the mean, mode or median, or imputing with machine learning. In this model, we forward-fill by at most one row to avoid overfitting the model, and the remaining null values are filled with the mode of each column. A sketch of some of the alternatives follows below.
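For reference, here is roughly what the alternatives mentioned above look like; this is a sketch of options we did not use, with scikit-learn's KNNImputer standing in for imputation with machine learning:

from sklearn.impute import KNNImputer

df_num = df.select_dtypes(include='number')  # imputers need numeric input

dropped = df.dropna()                 # drop rows containing nulls
zeros = df.fillna(0)                  # replace nulls with 0
means = df_num.fillna(df_num.mean())  # fill numeric columns with their mean
bfilled = df.fillna(method='bfill')   # backward fill

# model-based imputation: estimate each null from the 5 most similar rows
imputed = KNNImputer(n_neighbors=5).fit_transform(df_num)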

# checking null value
df_clean.isnull().sum()
[Image: Checking null values]

We check the null values again to ensure they have been eliminated completely from our dataset.

# getting the categorical (object dtype) columns
catCols = df.select_dtypes(include='object').columns
print(list(catCols))

After that, we obtain the list of columns with the object data type. Pandas provides select_dtypes for exactly this, so we can pull the object-typed columns out directly instead of subtracting the numeric columns from the full column list.

# label encoding the categorical data
from sklearn.preprocessing import LabelEncoder

lab = LabelEncoder()
catColumns = ['appet','ba','pc','ane','pe','cad','rbc','dm','pcc','htn','classification']

for i in catColumns:
    df_clean[i] = lab.fit_transform(df_clean[i])

# converting the numeric-looking object columns to float type
df_clean['pcv'] = df_clean['pcv'].astype(float)
df_clean['wc'] = df_clean['wc'].astype(float)
df_clean['rc'] = df_clean['rc'].astype(float)
[Image: Categorical variables encoded into numerical type]

Machine learning models cannot train on categories stored as strings, so we need to convert the data into integers before feeding it into model training. Label encoding handles these labels by transforming them into numerical values, and it is especially suitable for attributes with only two values. For example, the "appet" attribute, containing the values "good" and "poor", is encoded into 0 and 1 respectively. Additionally, we convert the "pcv", "wc" and "rc" attributes into the float data type, as they were initially stored as strings.
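A quick toy check of what LabelEncoder does; its classes_ attribute stores the original labels in their encoded order:

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
print(enc.fit_transform(['good', 'poor', 'good']))  # [0 1 0]
print(enc.classes_)                                 # ['good' 'poor']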

# check again the data
print(df_clean.dtypes)

Lastly, we check the data types of the attributes again to ensure the data is clean before we use it to train our model.

We will continue with the data visualization and modeling phases in the next article. Thank you.

Did you find this article useful? Share your views in the comments section below. Your kind support will definitely inspire me to come out with more quality content.


Alex

AI engineer from Asia Pacific University in Malaysia. Currently working on computer vision and data science projects.