Understanding Logistic Regression w/ Apache Spark & Python

Kirill Fuchs
Published in Fuzz
Mar 9, 2017

In this article, I will try to give a fundamental understanding of logistic regression by using simplified examples and staying away from complex equations. If you have visited the Wikipedia page, you know that can prove to be difficult ;).

Logistic regression predicts whether a dependent variable is true or false (dichotomous) given one or more independent variables. Here is a short explanation, followed by a contrived example using Apache Spark + Python.

Logistic regression is better understood with a simple example. For instance, take gambling and the definition of “odds”. We look at craps, where rolling a 7 with two dice is the winning roll.

There are 6 ways to roll a 7 out of the 36 possible combinations of two dice, so the probability of rolling a 7 is one out of six. The probability of rolling anything else is five out of six. Odds are defined as the ratio of your chance of winning to your chance of losing.

Odds are the (chance of success) / (chance of failure). In this case that is:

(1/6) / (5/6) = 1/5

Note, this can also be expressed as (chance of success) / (1-chance of success).
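To make the arithmetic concrete, here is a tiny Python sketch of the same calculation (the variable names are just for illustration):

# Probability of rolling a 7 with two dice: 6 of the 36 combinations win
p_success = 6 / 36          # 1/6
p_failure = 1 - p_success   # 5/6

# Odds = chance of success / chance of failure
odds = p_success / p_failure
print(odds)                 # 0.2, i.e. 1/5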

Let’s write the chance of success as a function g, so the odds are g(x) / (1 - g(x)).

The first step is to take the natural logarithm of the odds:

z = ln(g(x) / (1 - g(x)))

This is called the “log odds” (or logit).

Then the probability of winning is

1 / (1 + e^-z)

Where e is Euler’s number. The reason you take the logarithm and then invert it with this formula (the logistic, or sigmoid, function) is to get a number that always lies between 0 and 1. So you can say that 0 means no chance and 1 means a 100% chance of winning.

In logistic regression, the output must be 0 (false) or 1 (true). The convention is that if the probability is greater than 50%, the logistic regression output is 1 (true). Otherwise, it is 0.
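Here is a small Python sketch of that round trip from probability to log odds and back, with the 0.5 cutoff (the function names are my own, just for illustration):

import math

def log_odds(p):
    # z = ln(p / (1 - p)), the logit of a probability p
    return math.log(p / (1 - p))

def sigmoid(z):
    # 1 / (1 + e^-z), maps any real z back to a value between 0 and 1
    return 1 / (1 + math.exp(-z))

p = 1 / 6                              # chance of rolling a 7
z = log_odds(p)                        # about -1.609
print(sigmoid(z))                      # about 0.167, recovering p
print(1 if sigmoid(z) > 0.5 else 0)    # 0: predict "no win"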

Side Note: Multinomial Logistic Regression can be used when the output is more than just 0 or 1 (not dichotomous).
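For reference, a hedged sketch of how that would look with the MLlib API used below: LogisticRegressionWithLBFGS.train accepts a numClasses parameter for the multinomial case (the data and variable names here are hypothetical):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# Hypothetical: labeled_points would hold LabeledPoint rows with labels 0, 1, or 2
# multi_lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(labeled_points), numClasses=3)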

Logistic Regression using Apache Spark:

Now let’s see how we apply this with Apache Spark and Python. We will put our observations into an array and then feed that into Spark MLlib’s LogisticRegressionWithLBFGS. Then we use the predict() function to predict whether we have a positive or negative outcome.

We want to see how much time our salespeople need to spend with customers to make a sale. Our sample data (hours spent with the customer, and whether a sale was made): 1 hour: no sale, 2 hours: no sale, 3 hours: sale, 4 hours: sale.

As you can see, there is an obvious correlation between hours spent with the customer and the chance of a sale. Logistic regression assumes a linear relationship between the independent variable (hours) and the log odds of the dependent variable (sale). Otherwise, the model is not suited for the data.

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# Each observation: [label (0 = no sale, 1 = sale), [hours spent with the customer]]
clientTime = [[0, [1]], [0, [2]], [1, [3]], [1, [4]]]

def labelPt(label, points):
    return LabeledPoint(label, points)

# Build the list of labeled observations
a = []
for label, hours in clientTime:
    a.append(labelPt(label, hours))

# Train the model on the parallelized observations
lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(a))

lrm

This gives us the following coefficient (weight) and intercept.

(weights=[0.287699552825], intercept=0.0)

The weight 0.287 is on the log-odds scale: each additional hour you spend with the customer increases the log odds of making a sale by 0.287, which multiplies the odds of a sale by e^0.287, roughly 1.33.
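A quick sketch of that odds interpretation in plain Python, using the weight printed above:

import math

weight = 0.287699552825
print(math.exp(weight))   # about 1.33: each extra hour multiplies the odds of a sale by ~1.33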

Now we can use the model and run some predictions. Obviously, based on the sample data, if we spend 4 hours with the customer we should expect to make a sale:

lrm.predict([4])
1

And no sale with only 1 hour spent with the customer:

lrm.predict([1])
0
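If you want the underlying probability rather than a hard 0/1 answer, MLlib lets you clear the model’s threshold so predict() returns the raw score (a sketch; setThreshold restores the default behavior):

lrm.clearThreshold()
print(lrm.predict([4]))   # raw probability-like score of a sale after 4 hours
lrm.setThreshold(0.5)     # back to 0/1 predictions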

Kirill Fuchs is a passionate developer at Fuzz Productions in Brooklyn, NY. He builds APIs and data-driven applications for clients such as CBS and Anheuser-Busch. Fuzz is a New York-based mobile app development company that specializes in designing and developing iOS, Android, and data-driven applications. PS: Fuzz is hiring :)
