Explore the H2O package in R by implementing classification using logistic regression.

In many datasets, the dependent variable is nominal in nature, meaning that it represents distinct classes or categories such as male or female, smoker or non-smoker, and so on.

Here, let's consider the case where the dependent variable takes only two categories, i.e. it is dichotomous and coded as a binary variable. Linear regression models can be used for the analysis in such cases, but they ignore the categorical nature of the response. So we use specialised methods, namely generalized linear models (GLMs), that perform classification and also describe the relationship between the classes and the independent variables in the data.

Logistic regression follows from the underlying assumptions of GLMs, which we discuss in the next section. We will look at two ways of carrying out logistic regression in R:

  1. The standard method using the glm function from the stats package that ships with base R.
  2. The h2o.glm function from the h2o package in R.

We will also see how the accuracy improves from the first model to the second.

What are Generalized linear models and how do they differ from the classical linear models?

We already know that the error term in linear models is assumed to follow a normal distribution. But when the response variable has binary classes, this assumption no longer holds; instead, the error term is assumed to follow the logistic distribution, whose cumulative distribution function (CDF) is

F(x) = 1 / (1 + e^(-x))

Hence the term logistic regression. Plotted, this CDF gives the familiar S-shaped curve known as the sigmoid function, whose output always lies between 0 and 1.
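Since the original plot is not reproduced here, a quick base-R sketch of the sigmoid illustrates the same idea (the variable names are illustrative only):

# Define the logistic (sigmoid) function and plot it over a range of inputs
sigmoid <- function(x) 1 / (1 + exp(-x))

x <- seq(-6, 6, by = 0.1)
plot(x, sigmoid(x), type = "l",
     main = "Logistic (sigmoid) function",
     xlab = "x", ylab = "F(x)")
abline(h = c(0, 1), lty = 2)  # the curve never leaves the interval (0, 1)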

For the analysis, we'll use an example dataset and follow these steps:

  1. Reading the data
  2. Splitting the data into training and testing sets
  3. Applying glm on the training set
  4. Prediction using the test data
  5. Calculating the accuracy

The dataset considered here contains 21 variables and 3168 observations, where the label variable indicates whether the recorded voice belongs to a male or a female. Before we proceed with the analysis, some pre-processing is required: we first subset the data to keep only the variables relevant to our analysis, and then convert the label variable into a factor with levels 1 and 2 representing female and male respectively.

The data looks somewhat like this:

# Read the voice dataset and inspect the first few rows
data <- read.csv("voice.csv")
head(data)

# Convert the response into a factor and inspect the structure
data$label <- factor(data$label)
str(data)
names(data)

Along with label, we have 20 other variables: the descriptive statistics of each voice sample, upon which our response variable depends.

Now we partition the data into training and testing sets. For this we use the caret package, one of the most widely used R packages for machine learning.
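A minimal sketch of steps 2 to 5 is given below. The 70/30 split, the seed and the 0.5 probability cutoff are illustrative assumptions, not necessarily the settings used in the full article:

library(caret)

# Step 2: split the data into training and testing sets (70/30, illustrative)
set.seed(123)
train_index <- createDataPartition(data$label, p = 0.7, list = FALSE)
train <- data[train_index, ]
test  <- data[-train_index, ]

# Step 3: fit a logistic regression (binomial family, logit link) on the training set
model <- glm(label ~ ., data = train, family = binomial)

# Step 4: predicted probabilities on the test set, converted to class labels
prob <- predict(model, newdata = test, type = "response")
pred <- factor(ifelse(prob > 0.5, levels(data$label)[2], levels(data$label)[1]),
               levels = levels(data$label))

# Step 5: accuracy via caret's confusion matrix
confusionMatrix(pred, test$label)

The confusion matrix printed at the end reports the overall accuracy along with sensitivity and specificity, which gives the baseline that the h2o model is compared against.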

Boosted Accuracy from the h2o Package

Before we proceed to the analysis, it is necessary to understand what this package is and what it does.

H2O is a leading open source platform that makes it easy for financial services, insurance companies, and healthcare companies to deploy AI and deep learning to solve complex problems. To make it easier for non-engineers to create complete analytic workflows, H2O’s platform includes interfaces for R, Python and many more tools.

The steps followed here differ somewhat from the previous case (a sketch of the corresponding code follows the list):

  1. Initialize the H2O package.
  2. Read in the data.
  3. Pre-process the data if required.
  4. Convert the data into an H2O-readable format.
  5. Split the data into training and testing sets.
  6. Fit the model with h2o.glm on the training set.
  7. Check the accuracy of the model on the test data.
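The sketch below follows these steps under a few assumptions: the data object from the earlier pre-processing is reused, the 70/30 split ratio and seed are illustrative, and the object names are my own rather than those in the full article:

library(h2o)

# Step 1: initialise a local H2O cluster
h2o.init()

# Steps 2-4: reuse the pre-processed data frame and convert it to an H2OFrame
data_h2o <- as.h2o(data)

# Step 5: split into training and testing sets (70/30 is an illustrative choice)
splits <- h2o.splitFrame(data_h2o, ratios = 0.7, seed = 123)
train_h2o <- splits[[1]]
test_h2o  <- splits[[2]]

# Step 6: fit a binomial GLM (logistic regression) on the training set
predictors <- setdiff(names(data), "label")
h2o_model <- h2o.glm(x = predictors, y = "label",
                     training_frame = train_h2o,
                     family = "binomial")

# Step 7: evaluate the model on the test data
perf <- h2o.performance(h2o_model, newdata = test_h2o)
perf
h2o.accuracy(perf)  # accuracy at a range of probability thresholds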

To check the prediction made for each observation in the test data, and how strong the probability behind each prediction is, we use the following function:
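The source article does not reproduce the call, but in the h2o package this is typically h2o.predict, which returns the predicted class together with the per-class probabilities for every test observation (object names follow the sketch above):

# Predicted class plus per-class probabilities for every test observation
predictions <- h2o.predict(h2o_model, newdata = test_h2o)
head(predictions)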

So the prediction made for the first observation is male, with a probability of 0.99966, which is quite high, and so on for the remaining observations.

And this is how you can use two different methods of carrying out logistic regression on the same dataset.

Hit the link https://stepupanalytics.com/h20-package-classification-using-logistic-regression/ to read the whole article.

Do you share the same enthusiasm for Data Science, ML, Deep Learning and collaborative learning? Go ahead and fill in your details here and we will add you as a writer on our Medium publication and StepUp Analytics. Happy writing!

And of course, don't forget to spread the word about our publication!

Scale Up Your Skills with StepUp Analytics.

“Keep Learning, Keep Practicing”
