Predictive Analytics using Logistic Regression in Power BI

Tasmia A
Slalom Insights
Published in
7 min readApr 18, 2019

Why you should use R with Power BI

The ability to incorporate R visualizations into Power BI allows users to develop complex charts that might not be readily available in Power BI. I’ve often referenced the R Graph Gallery to pull code for complex scatter plots or specific maps to fit the themes of my reports. These charts can be filtered using the dimensions in the Power BI model and allow for customization through R code. The Power BI R visual also provides troubleshooting tips that help when developing these visualizations.

Outside of creating neat visuals, R can also be used for many types of analyses including predictive analytics. The R language has many built-in functions to build different types of models and predict outcomes based on changes within the data. Coupled with Power BI’s built-in “what if” parameter functionality, building out a logistic regression model and predicting outcomes can be both interactive and visualized using different types of R plots.

When to use Logistic Regression

Logistic regression is a method used to analyze data in order to predict discrete outcomes. The data below is a snapshot of passengers that were on the Titanic. The data shows each passenger, their details including fare, sex, age, class of ticket purchased, siblings/spouses, and parents/children as well as whether or not the passenger survived. If we wanted to predict the chances of survival, we could use logistic regression because the response variable (Survived) of this data is binary (1 or 0).

Snapshot of the Titanic data

How to write code for Logistic Regression using R

The first steps to developing logistic regression model and using it for prediction would be to determine which variables will be present in the model and which ones will be manipulated using “what if” parameters to understand the prediction results. We will cover the customized inputs using “what if” parameters in the next section. To keep things simple, I’ve decided to run my model to predict the outcome of survival dependent upon ticket class (labeled ‘Pclass’ in the dataset), age, and sex.

  1. Select the R visual from the Visualizations menu in Power BI.

2. After selecting the R script visual, drag and drop the column names from the data that will be used to make the model into the Values section of the visual (Survived, Age, Sex, Pclass). Run the glm() (generalized linear model) function on the outcome variable and include the independent variables in order to perform logistic regression. Setting the “family = binomial” in the code will indicate that you are running a logistic regression function.

3. The next step is to write some code to predict the outcome based on certain features. In the code below, the line 8 creates a data frame that sets the Pclass = 1, Sex = female, and Age = 30. Line 9 creates a variable called “pred.test” which runs the predict.glm() function on the model created in step 1. Line 10 of code is a barplot to show the value of the predicted outcome. Line 11 adds a label to the bar in order to see the predicted outcome value.

Currently, our model is predicting that a 30-year-old female with a first-class ticket has an 87% chance of survival.

The Power BI R script visual pops up with both a script editor and an onscreen space for the visual that’s being created. Unlike more exhaustive R terminals or IDEs, the Power BI script editor will only display visualizations that are built with R. This means that some of the initial analysis might need to be done prior to translating the code into Power BI. This would include steps to identify the significant variables, the model accuracy, and the tests run to evaluate the goodness of fit. There are many online resources that walk through the process of building a logistic regression model in R.

“What If” Parameters and Logistic Regression

Now that we’ve put together a simple plot, it would be more impactful to be able to manipulate the variables to see how the changes would affect survival rate. We can incorporate Power BI “what if” parameters to change age and sex in order to see how different values affect the survival rate across ticket holder classes. In order to do this, we will replace some of the code in the previous section with variables from our parameters.

1. To create a “what if” parameter in Power BI, go to the Modeling tab and select New Parameter. A pop-up window will appear. I set the minimum age to 1 and the maximum to 100 with an increment of 1. After pressing OK, a slicer will appear on the page as well as a table on the left-hand side that contains the incremented age and the value selected for age on the slicer.

2. The next variable to manipulate will be the sex of the passenger. “What if” parameters can only be numerical values; therefore, we cannot use the same method to create a slicer for sex. Instead, we can go into the Home tab and select Enter Data. From there, we can make a simple table called “Sex Dimension” and add a column with both female and male. Press Load once the data has been inputted.

3. Select the ellipses next to the newly created table called “Sex Dimension” and create a new measure called “Sex Value”. Type the DAX code from the screenshot below and press the check mark to finish creating the measure. This will provide the selected value from the newly created table.

4. Select the Slicer visual and drag in the “Sex” column from the table, “Sex Dimension”, created in step 2.

5. Now that there are 2 parameters to manipulate, we can go back to the R code and make the graph interactive. Select an age and sex on the 2 slicers. Click on the R visual bar plot from the previous section. Drag the “Age Value” and “Sex Value” columns into the values section so that they can be used in the R code. Change the Sex = “female” to Sex = dataset$”Sex Value”. Repeat this process for Age to create dataset$”Age Value”. Also, update the barplot code to select the first value from the predicted dataset.

We can now manipulate the slicer to see different survival rates.

6. To finish out this visual representation of survival prediction, I decided to create a prediction for all 3 ticket classes and plot the survival rate side by side on the bar plot. Duplicate the code for the prediction and set the Pclass = 2 for the second prediction. Duplicate the code again and set the Pclass = 3 for the 3rd prediction.

7. Create a new variable called “y” that will contain the prediction results for each ticket class. This means that the “y” variable will be a vector with 3 data points. Add a title to the bar graph and labels to the bars. The plot will now show the differing survival rate predictions based on age and sex.

8. The final code for the R visual will be the following:

model <- glm(Survived ~ factor(Pclass) + Age + Sex, family=binomial, data = dataset)
predict.data <- data.frame(Pclass=1, Sex = dataset$”Sex Value”, Age = dataset$”Age Value”)
pred.test = predict.glm(model,predict.data,type=”response”)
predict2.data <- data.frame(Pclass=2, Sex = dataset$”Sex Value”, Age = dataset$”Age Value”)
pred2.test = predict.glm(model,predict2.data,type=”response”)
predict3.data <- data.frame(Pclass=3, Sex = dataset$”Sex Value”, Age = dataset$”Age Value”)
pred3.test = predict.glm(model,predict3.data,type=”response”)
y <- c(pred.test[1], pred2.test[1], pred3.test[1])
xx <- barplot(y, main=”Survival Rate”, names.arg=c(“Class 1”,”Class 2", “Class 3”))
text(x = xx, y = y, label = y, pos = 1, cex = 0.8, col = “black”

This completed visual allows us to compare and draw conclusions about the survival rates for different passengers on the Titanic.

With further analysis, we could compare our predictions to the actuals and determine the accuracy of this model. Parul Pandey’s article, A Guide to Machine Learning in R for Beginners: Logistic Regression, discusses how to build out a confusion matrix in order to evaluate the performance of the model.

Final Thoughts

This logistic regression model is useful because it can be implemented in many different predictive analytic scenarios. A business could use a logistic regression model to predict whether emails they’re receiving are spam and implement a filter to intercept predicted incoming spam. The impact of this type of analysis ultimately enables organizations to make more informed business decisions. Incorporating R analysis into Power BI dashboards will allow companies to make strides towards understanding their customer base and developing solutions for the future.

--

--