Hepatitis C virus

In this blog, I am going to predict the stage of hepatitis C in infected patients using machine learning models. This is a classification problem, so only classification algorithms are used.

What is Hepatitis C?

Hepatitis C is a liver infection that can lead to serious liver damage. It’s caused by the hepatitis C virus. The virus is spread by contact with contaminated blood; for example, from sharing needles or from unsterile tattoo equipment. Most people have no symptoms. Those who do develop symptoms may have fatigue, nausea, loss of appetite and yellowing of the eyes and skin.

Now let's see how machine learning can help predict the stage.

Reading Dataset

Importing pandas for reading dataset
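
As a minimal sketch of this step, the snippet below reads a CSV into a pandas data frame with `pd.read_csv`. The helper name `load_dataset`, the in-memory sample, and the `Age` and `BMI` column names are assumptions made for illustration. Only the target column name, Baselinehistological staging, comes from this post. With the real data you would pass the path of your CSV file instead.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read a CSV (file path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# Tiny made-up sample standing in for the real file (illustration only).
sample_csv = io.StringIO(
    "Age,BMI,Baselinehistological staging\n"
    "56,35,2\n"
    "46,29,4\n"
    "57,33,1\n"
)

df = load_dataset(sample_csv)
print(df.shape)   # number of rows and columns
print(df.head())  # first few rows
```

Checking `df.shape` and `df.head()` right after loading is a quick way to confirm that the file was parsed as expected.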

Data Preprocessing

I will perform feature selection to improve accuracy. The dataset is very small relative to the number of values it has to cover, so accuracy is low in either case; without feature selection it is similar, but slightly lower.

We are going to use the Pearson coefficient to find the correlation between the target column and each of the other columns.

This gives us a list of correlation values ranging between -1 and 1. From these, we build a new data frame that keeps only the columns with a positive correlation to the target, which should improve the accuracy further.
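
The selection described above could look roughly like the sketch below. It uses `DataFrame.corr(method="pearson")` and keeps the columns whose correlation with the target is greater than zero. The function name `select_positive_features`, the demo data, and the `feat_a` and `feat_b` columns are hypothetical. The original code may have differed.

```python
import pandas as pd

TARGET = "Baselinehistological staging"


def select_positive_features(df, target=TARGET):
    """Keep only columns whose Pearson correlation with the target is > 0."""
    # Correlation of every column with the target, excluding the target itself.
    corr = df.corr(method="pearson")[target].drop(target)
    keep = corr[corr > 0].index.tolist()
    return df[keep + [target]], corr


# Made-up demo data: feat_a rises with the target, feat_b falls.
demo = pd.DataFrame({
    "feat_a": [1, 2, 3, 4, 5, 6],
    "feat_b": [6, 5, 4, 3, 2, 1],
    TARGET:   [1, 1, 2, 3, 3, 4],
})

selected, corr = select_positive_features(demo)
print(corr)                      # one correlation value per feature
print(selected.columns.tolist()) # columns kept for modelling
```

Note that filtering on the sign alone also discards strongly negative correlations, which can carry predictive signal too. Filtering on the absolute value is a common alternative.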

After making a new data frame with the relevant columns and the target column (Baselinehistological staging), it's time to split the dataset into a training set and a testing set.

Splitting dataset into train and test set

This is done using the sklearn package. We will also import a few more of its modules for measuring accuracy and for the algorithms themselves.
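
A minimal sketch of the split, assuming scikit-learn's `train_test_split`. The demo frame and its `feat_a` and `feat_b` columns are made up, and `random_state=42` is an arbitrary choice that only makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET = "Baselinehistological staging"

# Made-up data frame with ten rows (illustration only).
demo = pd.DataFrame({
    "feat_a": range(10),
    "feat_b": range(10, 20),
    TARGET:   [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
})

X = demo.drop(columns=[TARGET])  # feature columns
y = demo[TARGET]                 # target column

# test_size=0.3 holds out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```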

Implementing Machine Learning Algorithms (models)

We have experimented with different sizes of testing and training sets. These are expressed as fractions (0.3 means that 30% of the dataset is used for testing and 70% for training, and so on).

Decision tree classifier (CART)
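
As an illustration of this experiment, the sketch below trains scikit-learn's `DecisionTreeClassifier` on several test fractions and records the accuracy for each. It runs on synthetic data from `make_classification`, not on the hepatitis C dataset. The fractions 0.2 and 0.4 and the `random_state` values are assumptions, so the printed numbers say nothing about the results of this post.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class data standing in for the real features and stages.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4,
    n_classes=4, random_state=0,
)

results = {}
for test_size in (0.2, 0.3, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0
    )
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    # Accuracy on the held-out rows for this split.
    results[test_size] = accuracy_score(y_test, model.predict(X_test))

for test_size, acc in results.items():
    print(f"test_size={test_size}: accuracy={acc:.3f}")
```

The same loop works for any other scikit-learn classifier by swapping the `model` line.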

Summary

We conclude that each algorithm gives a different accuracy. Since accuracy stays low for every algorithm even after feature selection, the dataset itself appears to be somewhat inconsistent. Every algorithm applied here is a classifier, because the target is a class label (the stage); regression algorithms are not suited to this kind of target.
