EDA (Python)

Aysen Çeliktaş
8 min read · Oct 31, 2023

In this article, I will show how I applied some simple functions to perform a beginner-level EDA (Exploratory Data Analysis) in Python on the “Cirrhosis Patient Survival Prediction” data set that I downloaded from Kaggle. For those who want to review the theoretical background on machine learning, my previous articles are: “Supervised Machine Learning for Beginners”, “Unsupervised Machine Learning for Beginners”, and “Deep Learning for Beginners”.

[prepared by the author in Canva]

When studying data science, you must first master the data set you are working on. Once you have the relevant information about the data set, you can take targeted steps without missing important points and get the results you want from your model. As the analysis progresses, your knowledge of the data set grows rapidly, and it becomes clearer which steps to take next. This process can be broadly listed as Exploratory Data Analysis, Preprocessing, Modeling, Evaluation, and Hyperparameter Tuning.

First of all, when performing EDA, it is very important to build comprehensive knowledge of the data, because the more information is obtained about the data set, the more accurately the features to be used can be selected, and the more soundly the options of imputing or dropping missing values can be evaluated. A well-done analysis shows how to approach the scaling and encoding methods needed before feeding the data into the model, the regularization methods to be applied in the presence of multicollinearity, and the treatment of outliers when discarding them from the data is not preferred; it also affects the choice of metrics depending on whether the data is balanced or imbalanced.

In this data set downloaded from Kaggle, a total of 424 PBC patients were eligible to participate in the trial, but 112 of them did not take part, and 6 of these 112 patients could not be followed up. For the patients who did not participate in the trial but whose records were kept, only basic measurements and survival information are available. Survival status was recorded as ‘D’: death, ‘C’: censored, and ‘CL’: censored due to liver transplantation. There are 17 features and 3 classes here.

First of all, the work started by importing libraries. The numpy and pandas libraries, which are indispensable for an EDA, were imported, along with the seaborn library, which is used for visualization, is built on top of matplotlib, and works closely with pandas data structures.

[from the author’s notebook]
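Since the notebook cells are shown only as images, a minimal sketch of the imports this step relies on (names follow common convention) could look like this:

```python
# Core libraries for a beginner-level EDA
import numpy as np
import pandas as pd

# Visualization: seaborn builds on matplotlib and works well with pandas
import matplotlib.pyplot as plt
import seaborn as sns
```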

Then, the file in which the data was recorded was read. The data file here is a .csv, that is, a comma-separated values file; other commonly used formats include HTML and JSON. Next, the first five rows of the patient records were inspected using head(). You can also specify in parentheses how many rows you want to examine. Reviewing every row is not feasible, especially in data sets containing millions of records, and Python offers real convenience in this sense. Just as the first five rows of the data set can be examined, it is also very important to be able to examine the last rows, to know how many rows and columns it has (shape), and to know which columns contain missing values and how many.

[from the author’s notebook]
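A rough sketch of this step, assuming the Kaggle file is saved locally as cirrhosis.csv (the actual file name is not given in the text):

```python
# Read the comma-separated file into a DataFrame
df = pd.read_csv("cirrhosis.csv")

# First five rows; head(10) would show ten, tail() shows the last rows
df.head()

# Number of rows and columns
df.shape
```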

The info() method was used to find out the types of the variables and how many rows and columns there are. Information about missing values can also be obtained here. There are 10 float, 3 int, and 7 object columns, with 418 rows and 20 columns in total. duplicated() was used to check whether there were any duplicate rows.

[from the author’s notebook]
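In code, this check might look roughly as follows:

```python
# Column dtypes, non-null counts and memory usage
df.info()

# Number of fully duplicated rows (0 means there are no duplicates)
df.duplicated().sum()
```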

Afterwards, the output of describe() was examined. Here, the min and max values of the features, their standard deviations, and the quartiles between min and max were inspected. Based on this, it can be predicted which features may contain outliers. For example, when looking at the min-max range of ‘Alk_Phos’, a jump between the third quartile and the max can be observed.

[from the author’s notebook]
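A possible version of this step:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for numerical columns
df.describe()

# The object columns can be summarized separately if needed
df.describe(include="object")
```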

Additionally, when working on a classification problem, one of the first things to do is to check whether the data is balanced or imbalanced. In the imbalanced case, if the class of interest is the larger one, the effect may be small; otherwise, it will affect the choice of evaluation metric.

[from the author’s notebook]
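Assuming the target column is named Status, as in the Kaggle dataset, the class balance could be checked like this:

```python
# Absolute and relative class frequencies of the target
df["Status"].value_counts()
df["Status"].value_counts(normalize=True)

# Quick visual check of the class balance
sns.countplot(x="Status", data=df)
plt.show()
```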

Here, the ‘ID’ column representing the patient numbers and 5 object features that were not thought to create a meaningful difference for the analysis were removed. One object feature, ‘Drug’, was left; it would be converted from categorical to numerical data in the preprocessing stage with one of the Label/Ordinal/OneHot Encoding types, as needed, before being fed into the model. The target is also still categorical. This time, the remaining data was observed over the last five rows using tail(). It can easily be seen that there are missing values (NaN, Not a Number). Whether to remove NaN values from the data set or to fill them with a chosen value such as the mean or median should be decided by taking into account how much the feature affects the classification.

[from the author’s notebook]
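The text names only ‘ID’ and ‘Drug’ explicitly; assuming the five dropped object features are Sex, Ascites, Hepatomegaly, Spiders and Edema, the step might look like this:

```python
# Drop the patient identifier and the object features assumed not to add
# a meaningful signal for this analysis (the column list is an assumption)
cols_to_drop = ["ID", "Sex", "Ascites", "Hepatomegaly", "Spiders", "Edema"]
df = df.drop(columns=cols_to_drop)

# Last five rows; the NaN values are easy to spot here
df.tail()
```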

Following this, it was checked how many missing values the data set had and which columns they belonged to. While 134 NaN values are observed in ‘Cholesterol’, 106 are observed in ‘Drug’.

[from the author’s notebook]
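A one-liner is enough for this check:

```python
# Count of missing values per column, sorted from most to least
df.isnull().sum().sort_values(ascending=False)
```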

Above, it was shown in practice how to drop the columns corresponding to features not considered critical for the analysis. Here, rows were dropped (dropna) based on the ‘Drug’ column. Using tail(), it can be verified that the 106 rows with missing values in the ‘Drug’ column, out of 418 patients, were successfully dropped.

[from the author’s notebook]
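A sketch of dropping the rows with a missing ‘Drug’ value:

```python
# Drop the rows where 'Drug' is missing (106 of the 418 patients)
df = df.dropna(subset=["Drug"])

# Confirm the new row count and inspect the last rows
print(df.shape)
df.tail()
```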

Now the target is placed at the end of the dataframe as the last column, and it is the only remaining object column. This is given as an example for those who want to change the position of columns.

[from the author’s notebook]
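One simple way to move the target to the end, again assuming it is called Status:

```python
# Remove the target column and append it back as the last column
target = df.pop("Status")
df["Status"] = target

df.head()
```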

Now, the distributions of the features will be examined on histograms, and boxplots will be examined on a class basis. Histograms show how a variable is distributed within the data set over certain intervals. While a normal distribution of the variables is important for parametric tests, the observations here also suggest which transformation technique could be used for a feature that is not normally distributed.

[from the author’s notebook]
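The histograms can be produced directly from the DataFrame:

```python
# Histograms of all numerical features to judge their distributions
df.hist(figsize=(14, 10), bins=30)
plt.tight_layout()
plt.show()
```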

A boxplot is a very useful tool that shows the range over which the data is spread. Boxplots should be viewed per target class, because the overall distribution and the within-class distributions may differ. Here, the fact that the classes fall into largely the same ranges will make classification difficult. For example, in this analysis, it can be predicted that logistic regression will not yield good results.

[from the author’s notebook]
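A sketch of class-based boxplots, with Status assumed to be the target column:

```python
# One boxplot per numerical feature, grouped by the target class
num_cols = df.select_dtypes(include="number").columns
fig, axes = plt.subplots(nrows=len(num_cols), figsize=(6, 3 * len(num_cols)))
for ax, col in zip(axes, num_cols):
    sns.boxplot(x="Status", y=col, data=df, ax=ax)
plt.tight_layout()
plt.show()
```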

Also, by dropping the ‘Drug’ column completely, the NaN values in the remaining data were filled with column means using fillna(). This time, the situation was examined via a pairplot, as shown in the application in Python.

[from the author’s notebook]
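A sketch of this alternative handling of the missing values (errors="ignore" covers the case where ‘Drug’ was already removed earlier):

```python
# Drop the 'Drug' column entirely and fill the remaining numerical NaN
# values with the column means
df = df.drop(columns=["Drug"], errors="ignore")
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Pairwise scatter plots, colored by the target class
sns.pairplot(df, hue="Status")
plt.show()
```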

It can also be clearly seen on the pairplot that the classes are intertwined. This shows that the features do not reveal clear differences that would separate the target classes.

[from the author’s notebook]

A correlation analysis should also be performed to evaluate the correlation between the features and between the features and the target. Here, the use of a heatmap, which conveys the correlations visually, is shown. Ideally, the correlation among the features should be low while the correlation between the features and the target should be high; high correlation between features may cause multicollinearity. Looking at the result, the features do not have high correlations among themselves, but there is also no feature that has a high correlation with the target.

[from the author’s notebook]
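A sketch of the correlation heatmap; note that corr() only uses numerical columns, so the categorical target would need to be encoded first to appear in the matrix:

```python
# Correlation matrix of the numerical features as a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df.select_dtypes(include="number").corr(),
            annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# To include the target, it could be label-encoded first, for example:
# df["Status_enc"] = df["Status"].map({"C": 0, "CL": 1, "D": 2})
```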

In such cases, outliers can be reduced depending on the size of the data set, and RobustScaler can be preferred as the scaling method in the preprocessing stage. Reducing the number of classes might also help. By evaluating different models and different regularization methods, the metrics can be improved.

Such a data set was chosen to illustrate how important it is to analyze the data before feeding it into a model. Real-life data often do not come with features that are ready to be fed into a model; in such cases, the data must be made more suitable with various techniques. In addition, it is not possible to examine one by one a data set in which millions of records are stored. Considering all this, data science is a whole with all its stages, and each stage requires detailed examination.
