Logistic Regression with PySpark
In this post, we will build a machine learning model to predict whether the patients in the dataset have diabetes or not. This time, we will use the Spark ML library in PySpark.
Data Information
We will use the Pima Indians Diabetes Database, which is available on Kaggle. Certain diagnostic measurements are included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients are females at least 21 years old of Pima Indian heritage.
Content
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
- Pregnancies : Number of times pregnant
- Glucose : Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin : 2-Hour serum insulin (mu U/ml)
- BMI : Body mass index (weight in kg/(height in m)²)
- DiabetesPedigreeFunction : Diabetes pedigree function
- Age : Age (years)
- Outcome : Class variable (0 or 1); 268 of the 768 records are 1, the others are 0
Load Data
We can use the read method, similar to pandas, to read data in CSV format. We can manually specify the options:
- header : If the dataset has column headers, the header option is set to “True”.
- sep : Use the sep option to set a single character as the separator for each field and value. Our file is separated by “,”.
- inferSchema : inferSchema automatically infers the data type of each variable. If we specify “True”, it will read a sample of data from the file to extract the schema. If it is set to “False”, all the columns will be read as “string”, and we will have to specify the column data types by hand.
Let’s display the DataFrame in tabular format, like a pandas DataFrame, using the show() function.
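The original loading code isn’t shown in this extract; a minimal sketch, assuming the Kaggle file is saved locally as “diabetes.csv”, looks like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("diabetes").getOrCreate()

# "diabetes.csv" is an assumed local path for the Kaggle file
df = (spark.read
      .option("header", "true")
      .option("sep", ",")
      .option("inferSchema", "true")
      .csv("diabetes.csv"))

df.show(5)
```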
Data Exploration
Run the printSchema() function on the dataset to see the data types of the columns in the table.
df.printSchema()
root
|-- Pregnancies: integer (nullable = true)
|-- Glucose: integer (nullable = true)
|-- BloodPressure: integer (nullable = true)
|-- SkinThickness: integer (nullable = true)
|-- Insulin: integer (nullable = true)
|-- BMI: double (nullable = true)
|-- DiabetesPedigreeFunction: double (nullable = true)
|-- Age: integer (nullable = true)
|-- Outcome: integer (nullable = true)
Use the describe() function to extract the statistics of the numerical columns in the table.
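The exact call isn’t shown here; a one-liner along these lines does the job:

```python
# Summary statistics (count, mean, stddev, min, max) for the columns
df.describe().show()
```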
Distribution of Features
Target Variable Distribution
Our target variable is “Outcome”. This classification dataset does not have an exactly equal number of instances in each class, but the imbalance is not important for now.
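A quick way to check the class balance (a sketch, since the original snippet isn’t included here):

```python
# Count how many records fall into each Outcome class
df.groupBy("Outcome").count().show()
```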
Check whether the dataframe contains any null values
Now, let’s find null values in our dataframe. Run this piece of code:
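The snippet itself isn’t included in this extract; one common way to count nulls per column looks like this:

```python
from pyspark.sql.functions import col, count, when

# Count the null values in every column of the DataFrame
df.select([
    count(when(col(c).isNull(), c)).alias(c) for c in df.columns
]).show()
```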
Luckily, we have a data set with no null values.
User Defined Functions (UDF)
We may want to make various transformations in our data. A user-defined function (UDF) is a function written by the user for situations where built-in functions cannot do the required work. We can use UDFs for data manipulation.
a) Change the Column Name and Convert the Target Variable (Outcome)
We want to change the “Outcome” values from 0 to “No” and from 1 to “Yes”. We’ll use y_udf to change the name and the values of the target variable.
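A rough sketch of this step; the new column name “HasDiabetes” is an assumption, since the exact name isn’t shown in this extract:

```python
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# y_udf maps 0 -> "No" and 1 -> "Yes"; "HasDiabetes" is an assumed column name
y_udf = udf(lambda value: "No" if value == 0 else "Yes", StringType())
df = df.withColumn("HasDiabetes", y_udf(col("Outcome"))).drop("Outcome")
df.select("HasDiabetes").show(3)
```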
The data type of this column has now become string.
b) Create a new column — Age Groups
Our goal is to create groups according to age ranges with the udf_multiple UDF.
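A sketch of such a UDF; only the “Under 25” and “Over 50” labels appear later in the post, so the exact boundaries and labels below are assumptions:

```python
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# udf_multiple assigns an age-group label to each record (boundaries are assumed)
udf_multiple = udf(
    lambda age: "Under 25" if age <= 25
    else "Between 25 and 35" if age <= 35
    else "Between 36 and 50" if age <= 50
    else "Over 50",
    StringType(),
)
df = df.withColumn("Age_group", udf_multiple(col("Age")))
df.groupBy("Age_group").count().show()
```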
Window Function - Age Groups Distribution
We can use groupBy together with a Window function to see the Glucose distribution by age group. The Window lets us compute each age group’s percentage of the total dataset.
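A sketch of how this can be done (the “Age_group” column name follows the earlier sketch):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# An empty window spans the entire DataFrame, so sums over it give the grand total
w = Window.partitionBy()

age_stats = (
    df.groupBy("Age_group")
      .agg(F.count("Age_group").alias("count"),
           F.round(F.avg("Glucose"), 2).alias("avg_glucose"))
      .withColumn("percentage",
                  F.round(100 * F.col("count") / F.sum("count").over(w), 2))
)
age_stats.show()
```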
Pearson Correlation
We can use the PySpark statistics library to check whether there are highly correlated features in our data. First of all, we determine the numerical columns and collect them into the “df_corr” dataframe.
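A sketch of the correlation check; the excluded column names (“HasDiabetes”, “Age_group”) follow the assumptions made in the earlier sketches:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Assemble the numeric columns into a single vector and compute the Pearson matrix
numeric_cols = [c for c in df.columns if c not in ("HasDiabetes", "Age_group")]
corr_assembler = VectorAssembler(inputCols=numeric_cols, outputCol="corr_features")
df_corr = corr_assembler.transform(df).select("corr_features")

pearson_matrix = Correlation.corr(df_corr, "corr_features", "pearson").head()[0]
print(pearson_matrix.toArray().round(2))
```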
We checked the Pearson correlation between the numerical variables. As you can see, the highest correlation is 0.43, between Insulin and SkinThickness. Clearly there are no highly correlated numeric variables, so we will keep all of them for the model.
Getting Ready for Modelling
Firstly, we should apply 5 important transformers/estimators from the pyspark.ml library before we start building the model.
After applying them, the data will be ready to build a model.
- StringIndexer
- OneHotEncoderEstimator
- VectorAssembler
- LabelIndexer
- StandardScaler
StringIndexer
StringIndexer converts a single column to an index column. It simply replaces each category with a number; the most frequent value gets the first index (0.0). As we see below, “Under 25” has been given the index value 0.0. The “Over 50” group has the fewest records in our dataset, so it gets the largest index value.
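A minimal sketch of this step, using the assumed “Age_group” column:

```python
from pyspark.ml.feature import StringIndexer

# Index the Age_group column; the most frequent category gets index 0.0
age_indexer = StringIndexer(inputCol="Age_group", outputCol="Age_group_indexed")
df_indexed = age_indexer.fit(df).transform(df)
df_indexed.select("Age_group", "Age_group_indexed").distinct().show()
```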
OneHotEncoderEstimator
We use “OneHotEncoderEstimator” to convert categorical variables into binary SparseVectors.
With one-hot encoding, we create a dummy variable for each value in the categorical columns and give it a value of 1 or 0. This method produces different results from pandas: if we did this transformation in pandas, four new columns would be produced for the four groups, whereas Spark produces a vector of length 3, because the last category is dropped by default. If the age group is “Over 50”, Age_encoded will be (0.0, 0.0, 0.0).
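A sketch of the encoding step, continuing from the StringIndexer sketch above:

```python
# OneHotEncoderEstimator exists in Spark 2.3/2.4; it was renamed OneHotEncoder in Spark 3.x
from pyspark.ml.feature import OneHotEncoderEstimator

# Encode the indexed age group as a binary SparseVector (last category dropped by default)
encoder = OneHotEncoderEstimator(inputCols=["Age_group_indexed"],
                                 outputCols=["Age_encoded"])
df_encoded = encoder.fit(df_indexed).transform(df_indexed)
df_encoded.select("Age_group", "Age_encoded").distinct().show()
```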
VectorAssembler
Transform all features into a vector using VectorAssembler.
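A sketch of the assembly step; whether the raw Age column or only the encoded age group goes into the vector is an assumption here:

```python
from pyspark.ml.feature import VectorAssembler

# Combine the numeric features and the encoded age group into one "features" vector
feature_cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction", "Age_encoded"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_features = assembler.transform(df_encoded)
```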
LabelIndexer
Convert label into label indices using the StringIndexer.
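A sketch of the label indexing, using the assumed “HasDiabetes” column name:

```python
from pyspark.ml.feature import StringIndexer

# Index the string label ("No"/"Yes") into 0.0/1.0
label_indexer = StringIndexer(inputCol="HasDiabetes", outputCol="label")
df_labeled = label_indexer.fit(df_features).transform(df_features)
df_labeled.select("HasDiabetes", "label").distinct().show()
```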
“No” has been assigned the value 0.0, and “Yes” the value 1.0.
StandardScaler
Standardization of a dataset is a common requirement for many machine learning estimators: they may behave badly if the individual features do not look more or less like normally distributed data (e.g. Gaussian with zero mean and unit variance).
StandardScaler standardizes features by removing the mean and scaling to unit variance.
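A sketch of the scaling step (by default Spark’s StandardScaler scales to unit standard deviation without centering):

```python
from pyspark.ml.feature import StandardScaler

# Scale the assembled feature vector
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
df_scaled = scaler.fit(df_labeled).transform(df_labeled)
```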
Model Pipeline
We use a Pipeline to chain multiple Transformers and Estimators together to specify our machine learning workflow. A Pipeline’s stages are specified as an ordered array.
An Easier Way To Build Pipeline
The following approach is adapted from Databricks’ official site. First of all, we determine the categorical columns. Then each categorical column is indexed using the StringIndexer, and the indexed categories are converted into one-hot encoded variables, with the resulting binary vectors appended to the end of each row. We use the StringIndexer again to encode our labels into label indices. Next, we use the VectorAssembler to combine all the feature columns into a single vector column. As a final step, we use StandardScaler to standardize our features.
Run the stages as a Pipeline. This puts the data through all of the feature transformations we described in a single call.
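A sketch of the whole pipeline under the assumptions made so far (the “HasDiabetes” and “Age_group” names, and the choice to use the encoded age group instead of the raw Age column, are assumptions):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import (OneHotEncoderEstimator, StringIndexer,
                                VectorAssembler, StandardScaler)

categorical_cols = ["Age_group"]
stages = []

# Index and one-hot encode every categorical column
for c in categorical_cols:
    indexer = StringIndexer(inputCol=c, outputCol=c + "_indexed")
    encoder = OneHotEncoderEstimator(inputCols=[indexer.getOutputCol()],
                                     outputCols=[c + "_encoded"])
    stages += [indexer, encoder]

# Encode the string label into label indices
stages += [StringIndexer(inputCol="HasDiabetes", outputCol="label")]

# Assemble all feature columns into a single vector, then scale it
numeric_cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                "Insulin", "BMI", "DiabetesPedigreeFunction"]
assembler_inputs = [c + "_encoded" for c in categorical_cols] + numeric_cols
stages += [VectorAssembler(inputCols=assembler_inputs, outputCol="features")]
stages += [StandardScaler(inputCol="features", outputCol="scaled_features")]

pipeline = Pipeline(stages=stages)
pipeline_model = pipeline.fit(df)
df_transformed = pipeline_model.transform(df)
```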
Train / Test Split
There are 768 records in total. We split the data into training and test sets, with 20% held out for testing.
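A minimal sketch of the split (the seed is an assumption for reproducibility):

```python
# 80/20 split of the transformed data
train, test = df_transformed.randomSplit([0.8, 0.2], seed=42)
print("Training records:", train.count())
print("Test records:", test.count())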
Model Training
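The training code isn’t included in this extract; a minimal sketch with pyspark.ml’s LogisticRegression (maxIter=10 is an assumption) looks like this:

```python
from pyspark.ml.classification import LogisticRegression

# Fit logistic regression on the scaled features produced by the pipeline
lr = LogisticRegression(featuresCol="scaled_features", labelCol="label", maxIter=10)
lr_model = lr.fit(train)

# Score the held-out test set
predictions = lr_model.transform(test)
predictions.select("label", "prediction", "probability").show(5)
```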
Model Evaluation
We can use BinaryClassificationEvaluator to evaluate our model.
Note that the default metric for the BinaryClassificationEvaluator is areaUnderROC.
The ROC is a probability curve, and the AUC represents the degree of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at distinguishing patients with diabetes from those without.
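A sketch of the evaluation, continuing from the training sketch above:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# areaUnderROC is the default metric
evaluator = BinaryClassificationEvaluator(labelCol="label")
print("Test areaUnderROC:", evaluator.evaluate(predictions))
```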
Model Accuracy
accuracy = predictions.filter(predictions.label == predictions.prediction).count() / float(predictions.count())
print("Accuracy : ", accuracy)
Accuracy :  0.7446808510638298
Cross Validation and Parameter Tuning
Now we will try tuning the model with the ParamGridBuilder and the CrossValidator.
If you are unsure which params are available for tuning, you can use explainParams() to print a list of all params and their definitions.
print(lr.explainParams())
As we specify 3 values for regParam (regularization parameter), 3 values for maxIter (number of iterations), and 2 values for elasticNetParam (elastic net parameter), this grid will have 3 x 3 x 2 = 18 parameter settings for CrossValidator to choose from. We will create a 5-fold cross validator.
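A sketch of the tuning setup; the grid shape follows the text above, but the exact values are assumptions:

```python
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Build the parameter grid (3 x 3 x 2 = 18 combinations; values are assumed)
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.1, 0.5])
              .addGrid(lr.maxIter, [10, 50, 100])
              .addGrid(lr.elasticNetParam, [0.0, 0.5])
              .build())

# 5-fold cross validation using areaUnderROC as the selection metric
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)
cv_model = cv.fit(train)
```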
Best Model Performance
Best Model Feature Weights
Best Model Parameters
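The original output isn’t included in this extract; a sketch of how the best model can be inspected, following the names used in the sketches above:

```python
# Retrieve and inspect the best model found by cross-validation
best_model = cv_model.bestModel
print("Feature weights:", best_model.coefficients)
print("Intercept:", best_model.intercept)

# Evaluate the best model on the held-out test set
print("Best model test AUC:", evaluator.evaluate(best_model.transform(test)))
```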
Source code can be found on Github.