Classification Algorithm using PyCaret and Power BI

Predictive Analytics with Machine Learning

Abhishek Shivdekar
Globant
6 min read · Feb 13, 2024


Image created using https://gencraft.com/ an AI image generator with prompt “Data prediction analysis for business users”

We are witnessing an era where data is the new oil. When this data is used to predict outcomes, we can make better-informed decisions, optimize processes, and mitigate risk and uncertainty.

There are many open-source and subscription-based cloud tools available to do this job. I will use Microsoft’s Power BI, which Gartner places among the leaders in business intelligence (BI) platforms, along with PyCaret, an open-source Python library.

https://pycaret.org/

What is PyCaret?

PyCaret is a Python library. It automates machine learning workflows. Moez Ali created it as an open-source, low-code solution for machine learning. People with no prior experience with coding can also use it.

Power BI supports built-in machine learning features such as Quick Insights, Decision Trees, Linear Regression, Forecasting, and Clustering, but its choice of algorithms is limited. Pairing it with PyCaret gives us access to algorithms that are not available in Power BI.

PyCaret provides a variety of classification algorithms, including Decision Tree, AdaBoost, and Light Gradient Boosting, among others.

In Machine Learning, hyperparameters are the settings of a model that are chosen before training rather than learned from the data. These settings control the learning ability of the model and its overall performance, so finding good values for them matters. The tune_model function in PyCaret is used to tune a model’s hyperparameters.

Below are a few advantages of PyCaret’s hyperparameter tuning:

  • Built-in Tuning: By default, PyCaret performs a random search over a search space that is predefined for each model you choose. You can control the search space by setting custom ranges or values for hyperparameters. For more advanced tuning, other search strategies are available, including Bayesian optimization (through Optuna or Hyperopt) and TPE (Tree-structured Parzen Estimator).
  • Easy Optimization: PyCaret works with distributed computing libraries such as Ray Tune, which allows tuning to scale across many machines.
  • Logging and visualization: When experiment logging is enabled, PyCaret records model tuning results in MLflow, including the hyperparameter combinations that were tested. This gives a simple way to explore the tuning process and its effects. PyCaret also provides visualization tools that help users compare model and hyperparameter performance.
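As a rough illustration of the tuning options listed above, the sketch below wraps tune_model in a small helper. The helper name tune_xgboost, the values in CUSTOM_GRID, and session_id=42 are assumptions made for this example and are not settings from the case study; 'prognosis' is the target column used later in this post.

```python
# Hypothetical search space for illustration; the values are not tuned.
CUSTOM_GRID = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
}


def tune_xgboost(data, target="prognosis", n_iter=5):
    """Return (default-tuned, custom-grid-tuned) XGBoost models."""
    # Lazy import: PyCaret is only needed when the function is called.
    from pycaret.classification import setup, create_model, tune_model

    setup(data, target=target, session_id=42)  # fixed seed for repeatability
    base = create_model("xgboost")

    # Default behaviour: random search over PyCaret's predefined space.
    tuned_default = tune_model(base, n_iter=n_iter, optimize="Accuracy")

    # Same kind of search, restricted to the custom grid defined above.
    tuned_custom = tune_model(
        base, custom_grid=CUSTOM_GRID, n_iter=n_iter, optimize="Accuracy"
    )
    return tuned_default, tuned_custom
```

Calling tune_xgboost(df) on a pandas DataFrame that contains a 'prognosis' column would return both tuned models. tune_model also accepts a search_library argument (for example 'optuna'), which only works if that library is installed in the environment.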

What does Classification mean in terms of Machine Learning?

Classification is a type of supervised learning. The algorithm learns from a labeled dataset, the ‘Training Data set’, in which each data point already has a known category. It then assigns each record of a new dataset, the ‘Test Data set’, to one of a few classes, such as 0 or 1, or yes or no.
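To make the idea concrete, here is a toy sketch of classification in plain Python. It is a one-nearest-neighbour rule on made-up symptom flags, chosen only because it fits in a few lines; it is not the algorithm used later in this post, and the rows and labels are invented for illustration.

```python
def hamming(a, b):
    """Count the positions where two equal-length rows differ."""
    return sum(x != y for x, y in zip(a, b))


def predict(train_rows, train_labels, row):
    """Return the label of the training row closest to `row`."""
    nearest = min(range(len(train_rows)),
                  key=lambda i: hamming(train_rows[i], row))
    return train_labels[nearest]


# Made-up training data: each row is (fever, cough, rash, itching), 1 = present.
train_rows = [(1, 1, 0, 0), (1, 0, 0, 0), (0, 0, 1, 1), (0, 1, 1, 1)]
train_labels = ["Disease A", "Disease A", "Disease B", "Disease B"]

# Unlabeled "test" rows: the classifier has to supply the label.
print(predict(train_rows, train_labels, (1, 1, 0, 1)))  # Disease A
print(predict(train_rows, train_labels, (0, 0, 1, 0)))  # Disease B
```

The labeled rows play the role of the training data set, and the two unlabeled rows play the role of the test data set.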

Classification in ML is used in Spam Filtering, Image Recognition, Fraud Detection, Customer Segmentation, Medical Diagnosis, Sentiment Analysis, Speech recognition, Document classification, Risk assessments, and many other areas.

Problem Statement

A city hospital wants to predict a patient’s likely disease from the symptoms recorded. This helps them diagnose and treat patients in earlier stages.

Early diagnosis helps patients recover quickly. It also allows the hospital to analyze the types of patients it has, which in turn helps it plan which doctors it needs, manage medical inventory, and prepare for emergencies.

This data is obtained from the URL below, which provides free sample data:
https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning

There are two files: Training (the data used to train the model) and Test (the data on which we will test the model’s predictions).

Process Summary

Below are the steps to integrate PyCaret with Power BI and create a dashboard:

1. Environment Setup.

2. Importing data in Power BI and Model Training using Python script.

3. Dashboard.

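Step 1: Environment Setup

Install Anaconda Distribution with Python.
Download Anaconda Distribution with Python 3.7 or greater from the Anaconda website.
Open Anaconda Prompt from the Start menu and execute the following command to create an environment. The Python version in the command should be one that the PyCaret release you install supports, so adjust it if pip reports an incompatibility.

conda create --name myenv python=3.6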

PyCaret can be installed by executing the commands below in Anaconda Prompt:

conda activate myenv
pip install pycaret
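Before moving on to Power BI, it can help to confirm that the new environment contains what the scripts need. The short check below is a sketch, not part of the original walkthrough: the function name environment_report is made up for this example, and pandas is included because Power BI passes data to Python scripts as a pandas DataFrame.

```python
import importlib.util
import sys


def environment_report(packages=("pycaret", "pandas")):
    """Report the Python version, interpreter path, and package availability."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "executable": sys.executable,
    }
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


print(environment_report())
```

Run it with python inside the activated environment. The executable path it prints sits inside the environment's home directory, which is the directory to point Power BI at in the next step.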

Check in Power BI Desktop.

The Anaconda environment needs to be connected to Power BI for the scripts to run. This can be done in the global settings of Power BI Desktop (File → Options and settings → Options → Global → Python scripting)*.

* Note: If the Anaconda installation does not appear in the detected home directories because another Python installation is already present, uninstall the existing Python and install Anaconda by following the steps above.

Step 2: Import data in Power BI and Model Training using PyCaret.

Open Power BI Desktop from the Windows menu or shortcut icon on the desktop.

Click on the ‘Get Data’ option on the ‘Home’ ribbon and Import the ‘Training’ file from the URL mentioned in the problem statement.
In the ‘Transform’ Tab, click ‘Run Python Script’. Inside the Python script editor, use the below code shown in the screenshot.
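For readers who cannot see the screenshot, the following is a minimal sketch of such a training script, reconstructed from the function calls named in this section (setup, finalize_model, save_model). It assumes a PyCaret 3.x style API and that XGBoost is available in the environment. The path in MODEL_PATH and the helper name train_and_save are hypothetical; replace the path with a folder that exists on your machine.

```python
# Power BI exposes the current query to the script as a pandas DataFrame
# named `dataset`.

# Hypothetical save location; PyCaret appends ".pkl" to this name.
MODEL_PATH = "C:/models/disease_xgboost"


def train_and_save(data, target="prognosis", model_path=MODEL_PATH):
    """Train an XGBoost classifier with PyCaret and save it to disk."""
    # Lazy import: PyCaret is only needed when the function is called.
    from pycaret.classification import (
        setup, create_model, finalize_model, save_model,
    )

    # Initialise the experiment; `target` is the column holding the disease.
    clf1 = setup(data, target=target)
    # Train and cross-validate an XGBoost model.
    xgboost = create_model("xgboost")
    # Refit the model on the complete dataset, including the hold-out part.
    final_xgboost = finalize_model(xgboost)
    # Save the preprocessing pipeline and the model together.
    save_model(final_xgboost, model_path)
    return data


# Inside Power BI `dataset` exists, so training runs; elsewhere it is skipped.
if "dataset" in globals():
    dataset = train_and_save(dataset)
```

The guard at the end is only there so the script can also be loaded outside Power BI without failing; inside Power BI it behaves like a plain sequence of calls.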

After downloading, check the CSV file for an additional blank column at the end. If there is one, remove it either in Power Query or directly in the CSV file before running the script.
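If you prefer to fix the CSV file directly, the clean-up can be scripted. The sketch below uses only the Python standard library; the function name drop_blank_trailing_columns and the sample rows are made up for this example, and it assumes the blank column is blank in every row, including the header.

```python
import csv
import io


def drop_blank_trailing_columns(src, dst):
    """Copy CSV rows from src to dst without trailing all-blank columns.

    Returns the number of columns removed.
    """
    rows = list(csv.reader(src))
    if not rows:
        return 0
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    rows = [row + [""] * (width - len(row)) for row in rows]

    keep = width
    # Step left while the last kept column is blank in every row.
    while keep > 0 and all(row[keep - 1].strip() == "" for row in rows):
        keep -= 1

    csv.writer(dst, lineterminator="\n").writerows(row[:keep] for row in rows)
    return width - keep


# Tiny in-memory sample with a trailing comma on every line (made-up rows).
sample = io.StringIO(
    "itching,skin_rash,prognosis,\n1,1,Disease A,\n0,1,Disease B,\n"
)
cleaned = io.StringIO()
removed = drop_blank_trailing_columns(sample, cleaned)
print(removed)             # 1
print(cleaned.getvalue())  # the same rows without the empty last column
```

For real files, pass file objects opened with newline="" (for example open("Training.csv", newline="")) and write to a new file rather than overwriting the original.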

Note that in the above code, the call clf1 = setup(dataset, target = ‘prognosis’) sets the target parameter to ‘prognosis’. This is the column in which diseases are mentioned. Since this is a training dataset, the disease is already known for every row in this column.

Also, the line final_xgboost = finalize_model(xgboost) finalizes the model built with the ‘xgboost’ algorithm in PyCaret, which performs the Classification explained above, by refitting it on the complete dataset.

The trained model is then saved to the location passed to the ‘save_model’ function.

There is a wide range of classification algorithms. In our case, the data can be complex and non-linear, and two candidates that handle such data are Artificial Neural Networks (ANNs) and eXtreme Gradient Boosting (XGBoost). Of the two, XGBoost is typically less computationally expensive to train than ANNs. Hence, in our case, XGBoost is a good fit!
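One way to check this choice on your own data is PyCaret's compare_models function, which trains several algorithms and reports their scores side by side. The sketch below is an assumption-laden illustration rather than part of the original workflow: the helper name compare_candidates and session_id=42 are made up, and 'mlp' (PyCaret's identifier for its multi-layer perceptron classifier) stands in for an ANN.

```python
# Model identifiers as PyCaret names them: XGBoost and an MLP neural network.
CANDIDATES = ["xgboost", "mlp"]


def compare_candidates(data, target="prognosis", candidates=CANDIDATES):
    """Cross-validate the candidate models and return (best_model, score_grid)."""
    # Lazy import: PyCaret is only needed when the function is called.
    from pycaret.classification import setup, compare_models, pull

    setup(data, target=target, session_id=42)  # fixed seed for repeatability
    # Train only the listed models and rank them by accuracy.
    best = compare_models(include=list(candidates), sort="Accuracy")
    # pull() returns the score grid that compare_models just produced.
    return best, pull()
```

The returned score grid includes a training-time column alongside the accuracy metrics, so both the quality and the computational cost of each candidate can be compared.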

Click on the ‘Table’ option. The Python script is successfully executed, and the Training dataset is now ready, as seen below:

Now import the Test data using the ‘Get Data’ option on the Home ribbon. Note that I have intentionally deleted the ‘prognosis’ column so that only the raw symptom data remains, and the model has to predict the disease.

Open the Python script editor in the ‘Transform’ tab.
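A minimal sketch of a scoring script for this step is shown here, assuming a PyCaret 3.x style API and a model saved earlier with save_model. The path in MODEL_PATH and the helper name score are hypothetical; the path must match whatever location was used when the model was saved.

```python
# Power BI exposes the Test query to the script as a pandas DataFrame
# named `dataset`.

# Hypothetical location; use the same path that was passed to save_model.
MODEL_PATH = "C:/models/disease_xgboost"


def score(data, model_path=MODEL_PATH):
    """Load the saved pipeline and return `data` with prediction columns added."""
    # Lazy import: PyCaret is only needed when the function is called.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_path)
    # predict_model returns the input rows plus the prediction columns.
    return predict_model(model, data=data)


# Inside Power BI `dataset` exists, so scoring runs; elsewhere it is skipped.
if "dataset" in globals():
    dataset = score(dataset)
```

Because the script reassigns dataset, Power BI picks up the scored table as the output of the step.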

Click on Table, and you should see the output below.

You will see that two new columns named ‘prediction_label’ and ‘prediction_score’ are populated. The model has done its job and, based on the training data, predicts the type of disease for each patient.

Click on ‘Close & Apply’ to start loading the table data.
The tables are loaded, and we can now build reports on them.

Step 3: Create a Dashboard
We can create a dashboard based on the data loaded to provide meaningful insights to users. This dashboard shows us the prediction of diseases for patients.

Conclusion

PyCaret is a powerful tool that can be integrated with Power BI. This integration allows Power BI users to leverage the benefits of machine learning algorithms with a low-code approach. This helps create effective Power BI dashboards with data predictions within them.

In this blog post, we presented a case study on hospital data in which we used a classification algorithm to train a model with PyCaret. The model was then used to make predictions on test data, and the results were displayed on a Power BI dashboard.
