Microsoft Fabric: Another Hype or the Next Gold Standard of Data Analytics & Machine Learning?

Selim Berntsen
Sogeti Data | Netherlands
9 min read · Dec 5, 2023
Photo by Thomas Habr

Microsoft has been one of the industry leaders in the field of integrated data solutions with Azure Synapse Analytics, Azure Machine Learning Studio, Data Factory, and fully managed Databricks integration. Now it is offering a new service named Microsoft Fabric. Microsoft Fabric is a Software as a Service (SaaS) offering with a long-term vision: to simplify lifecycle management for any data-centric software development project. It uses a lake-centric approach that stores all data in one place, which aims to solve the data integration problems that come with the wide variety of tools available these days. The vision is to bring infrastructural simplicity and, in doing so, get the most value out of all the available data.

In our blog series last year, we talked about the different phases involved in the lifecycle management of MLOps projects. Microsoft Fabric aims to simplify the most crucial of these phases, such as continuous integration and continuous delivery (CI/CD), monitoring, and real-time analytics with Power BI.

In this blog we explore how well Fabric's features and personas support Data Science and AI/ML use cases, and we give you a sneak peek into some of the new features that Microsoft Fabric offers.

To assess the different features and personas of Microsoft Fabric, we chose an exciting use case in which we could test Fabric's capabilities end-to-end, from data ingestion up to making Machine Learning predictions. In the process, the data is accessed, processed, and analyzed by different data experts, such as data engineers and data scientists. The end goal of the project is to train a model that can predict the price per night of an Airbnb listing based on its characteristics.

Dataset

Via Opendatasoft we found a comprehensive Airbnb dataset with data about Airbnb listings all over the world. It has around 500k records and more than 70 columns, and it comes with the real-world challenges that are missing from common demo datasets such as iris and cars: missing values, variables that hold natural language descriptions, and other issues that we tackle during the exploratory data analysis and data preparation. You can find more information on the dataset on the Opendatasoft page.

Data ingestion

The first step of the data supply chain is data ingestion. For this purpose, Microsoft Fabric offers an embedded Data Factory. It looks and feels familiar to data engineers who have experience with Azure and Azure Data Factory orchestration. Because Fabric is a SaaS solution, we do not have to worry about configuring resources or setting up integration runtimes. It is pretty much plug and play and feels almost like an arcade-like experience.

Opendatasoft provides an open API endpoint to access their data sources; no authorization is needed. We used the building blocks of Data Factory to create the connection to the REST API and copy the data from source to sink. Guided by a wizard, we were able to easily create the REST API connection to Opendatasoft.
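
We built this connection with Data Factory's copy activity rather than with code, but conceptually it boils down to a request like the sketch below. The endpoint and dataset id are assumptions based on Opendatasoft's public Search API; the nested records/fields structure in the response is what the notebook code parses later on.

import json
import requests

# Endpoint and dataset id are assumptions for illustration
BASE_URL = "https://public.opendatasoft.com/api/records/1.0/search/"
params = {"dataset": "airbnb-listings", "rows": 100, "start": 0}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()
payload = response.json()

# The nested "records" -> "fields" structure is what we flatten later in the notebook
with open("airbnb_listings_sample.json", "w") as f:
    json.dump(payload, f)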

What about storage in Fabric? Where do we want to copy the data to? Fabric comes with a new storage solution called OneLake, a single, unified data lake that follows the same SaaS fundamentals as Fabric itself. The user does not have to deploy a storage account with specific configurations but instead simply pays per amount of storage used. All data lives in one data lake instead of in different data silos. OneLake also supports Delta Lake capabilities and even automatically stores data files in the Delta (Parquet) format when the user converts files to tables via the lakehouse functionality.

Let us get back to the Airbnb dataset. Because we get the data from a REST API, the raw data is in a nested JSON format. Unfortunately, Fabric Data Factory does not support complex data transformations like parsing nested JSON data into flat CSV files. To parse the JSON files, we chose to write a script with Fabric Notebooks, which are similar to Jupyter or Databricks notebooks. This notebook could be seamlessly added as an activity to the data ingestion pipeline. Another benefit of notebooks in Fabric is that there is no need to actively provision or spin up a Spark cluster, which saves a significant amount of time (3–5 minutes) per run. You only pay while the code is running instead of risking paying for costly, underused Spark clusters.

Below you can see the PySpark code that was used to parse the JSON file and write the result to the data lake.

# First load the JSON and flatten the data structure; all we need for now are the fields
from pyspark.sql.functions import explode, col

df = spark.read.format('json')\
    .option("inferSchema", "true")\
    .option("multiLine", "true")\
    .load(f'Files/store/opendatasoft/{filename}.json')\
    .select(explode("records").alias("data"))\
    .select("data.*")\
    .select("fields.*")

# Write the dataframe in Delta format to the 'loaded' layer of the lakehouse
df.write.format("delta")\
    .mode("overwrite")\
    .save(f'Files/loaded/opendatasoft/{filename}/')

Data exploration and transformation

When working with a data lakehouse, a customary practice is to store data in different layers of maturity. We stored the raw JSON files extracted from the API in the store layer, while the Delta table version is stored in the loaded layer.

Figure 1: Data lakehouse view

Loading the data requires some basic PySpark commands, which might not be familiar to every data scientist. However, once the data is loaded into your session you can easily convert it to your preferred format, such as a (PySpark) pandas dataframe or NumPy arrays.
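
A minimal sketch of that step, reusing the filename variable from the ingestion notebook and assuming the Delta files were written to the loaded layer as shown above:

# Read the Delta files from the 'loaded' layer
df = spark.read.format("delta").load(f"Files/loaded/opendatasoft/{filename}/")

# Convert to the format you prefer for exploration
pdf = df.toPandas()        # plain pandas dataframe (collected to the driver)
psdf = df.pandas_api()     # pandas API on Spark, stays distributed
print(pdf.shape)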

After the data transformation is done, you will likely want to store the new dataframe. You can store it as a flat CSV or Parquet file in your data lake, but we would suggest storing it directly as a Delta table. This makes your data easier to govern and maintain, and Fabric will show a preview of the table within the lakehouse view, as can be seen in figure 1.
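
A minimal sketch of that last step, where df_clean and the table name are assumptions for illustration:

# Writing as a managed Delta table makes it show up under Tables in the lakehouse view
df_clean.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("airbnb_listings_clean")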

Data modeling

Now we have arrived at the exciting part: building our prediction model! What does Fabric offer for building, monitoring, and maintaining Machine Learning and Deep Learning models? When we click on the Data Science persona, we see the following features: Model, Experiment, and Notebook. This looks similar to the MLflow functionality available in Databricks. All these features redirect you to the notebook environment. We chose to start from a code template provided in Microsoft Fabric. The template provides basic code for setting up an MLflow experiment, which is quite helpful if you do not have experience with MLflow.
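
In a Fabric notebook, pointing MLflow at an experiment takes little more than the following sketch; the experiment name here is our own assumption, and tracking comes pre-configured, so there is no tracking URI to set up.

import mlflow

# Creates the experiment in the workspace if it does not exist yet
mlflow.set_experiment("airbnb-price-prediction")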

For our use case we had to write the code ourselves, but the template gave us a bit of a head start. Because we are predicting prices, we are facing a regression problem, which means that not all ML algorithms are suitable. We decided to go for Linear Regression, Decision Trees, and a Random Forest. We used the MLlib package, as it supports distributed Machine Learning training on the Spark cluster, which speeds up training. We also wanted to try a somewhat more complex algorithm, so we built a deep learning multilayer perceptron (MLP) as well. MLlib does not support MLPs for regression, so we implemented this algorithm with PyTorch. We used MLflow to store all model runs and model artifacts. Furthermore, to track the performance of the models and be able to compare them later on, the mean squared error and mean absolute error were stored for each model run.

When training the models, we ran into the performance limits of Spark. Especially with the MLP architectures there are numerous possible parameter combinations, and because of the degraded performance we were not able to run as many combinations as we would ideally want. Fabric unfortunately did not offer any GPU-accelerated compute resources at the time of writing. Also, for more complex deep learning architectures we suspect the development environment of Fabric is too limited, as it is, for example, not possible to store plain Python files or Python objects within the workspace.
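
For reference, the kind of regression MLP we built with PyTorch boils down to a sketch like the following; the layer sizes, feature count, and hyperparameters are assumptions for illustration.

import torch
import torch.nn as nn

class PriceMLP(nn.Module):
    def __init__(self, num_features, hidden_size=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),  # single output: predicted price per night
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Typical ingredients for training a regression MLP
model = PriceMLP(num_features=70)                       # feature count is an assumption
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)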

Below is a code example of how we set up our MLflow experiment in a Fabric notebook.

# Function that trains a model, evaluates it, and logs everything in an MLflow run
import mlflow
import mlflow.spark
import numpy as np
from mlflow.models.signature import infer_signature
from pyspark.ml.evaluation import RegressionEvaluator

def train_evaluate(model, name, train_df, test_df):
    with mlflow.start_run(run_name=name) as run:
        # Train the model
        mlflow.log_param("num_training_rows", train_df.count())
        model = model.setFeaturesCol("features").setLabelCol("price")
        trained_model = model.fit(train_df)

        # Make predictions
        predictions = trained_model.transform(test_df)

        input_sample = train_df.drop("price")
        output_sample = predictions.select("prediction")
        # signature = infer_signature(input_sample, output_sample)

        # Evaluate the model for mse
        evaluator_mse = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")
        mse = evaluator_mse.evaluate(predictions)

        # Evaluate the model for mae
        evaluator_mae = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mae")
        mae = evaluator_mae.evaluate(predictions)

        # Evaluate r2 if the model is a linear regression
        if name == "Linear Regression":
            evaluator_r2 = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")
            r2 = evaluator_r2.evaluate(predictions)

        # Store artifacts: the first 100 predictions next to the ground truth
        pred_array = np.array(predictions.select("prediction").limit(100).collect()).flatten()
        truth_array = np.array(predictions.select("price").limit(100).collect()).flatten()

        # Combine predictions and ground truth into a single array
        combined_array = np.stack((pred_array, truth_array), axis=-1)
        print(combined_array)

        # Save them to a .npy file
        np.save("combined_predictions_truth.npy", combined_array)

        # Log the combined numpy array as an artifact
        mlflow.log_artifact("combined_predictions_truth.npy")

        # Log metrics
        mlflow.log_metric("mse", mse)
        mlflow.log_metric("mae", mae)
        if name == "Linear Regression":
            mlflow.log_metric("r2", r2)

        # Log the trained Spark ML model itself
        mlflow.spark.log_model(trained_model, "price_predictions_airbnb")
        print("Model saved in run_id=%s" % run.info.run_id)

        # mlflow.register_model(
        #     "runs:/{}/price_predictions_airbnb".format(run.info.run_id), name)

        print(f"{name} MSE: {mse}")
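
Calling this function for the MLlib models could then look like the following sketch; it assumes a prepared dataframe with a 'features' vector column and the 'price' label, and the hyperparameters shown are placeholders.

from pyspark.ml.regression import LinearRegression, DecisionTreeRegressor, RandomForestRegressor

# prepared_df is assumed to hold a 'features' vector column and the 'price' label
train_df, test_df = prepared_df.randomSplit([0.8, 0.2], seed=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(numTrees=100),
}

for model_name, regressor in models.items():
    train_evaluate(regressor, model_name, train_df, test_df)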

Model Evaluation

Now that we have trained our models and saved the model parameters, it is time to compare them and select the best model based on performance. Under the hood, Fabric generates experiment objects when running MLflow experiments, and we use these objects to compare the different model runs. All runs and their specifics are summarized in a table, and there is functionality for model selection and model comparison, as shown in figure 2. Although the MLP was the most complex model to implement and took by far the longest to train (more than 8 minutes), it had the worst performance of all models. Based on our experiment, the Random Forest had the lowest MSE, followed closely by the single decision tree. None of these models performed particularly well, with a mean absolute error of around 55 euros per listing, so we would not bring any of them into production until we have a model that performs considerably better.

Fabric does provide a GUI around MLflow to compare different models over time and share results with colleagues. We must note, however, that we did not find any option to export or persistently store the comparison graphs, which would be useful for sharing these results.

Figure 2: Model comparison in Fabric

When a model is logged using MLflow, it appears in your experiment objects. You can choose to save models within an experiment as artifacts to your workspace, which is what we did for our Random Forest model. In the model object we see that the following files are stored:

  • MLmodel file with MLflow information about the model
  • Conda configuration file
  • Python environment configuration file
  • Text file with requirements
  • SparkML model objects (the actual model)

This model object makes it possible to reuse, further develop, and export your models. You can also store different versions of your model within a model object, which lets you track how your model evolves across versions. Now that we have seen how a data scientist can make use of Microsoft Fabric, we have reached the culmination of our Airbnb use case. What have we learnt about Microsoft Fabric during this process?

Drawing conclusions

Microsoft Fabric offers an embedded Data Factory for data ingestion, making the development of data pipelines seamless and effective. However, complex data transformations proved challenging within Fabric's Data Factory. The platform introduces a storage solution called OneLake, which advocates a centralized data storage approach. We explored the platform's ML capabilities by building models such as Linear Regression, Decision Trees, and a Deep Learning multilayer perceptron. Unfortunately, the lack of GPU compute resources and the limited development environment posed challenges for complex modeling tasks.

In conclusion, Microsoft Fabric shows potential and delivers on some of its promises, such as data ingestion with Data Factory and seamless integration between distinct parts of the data value chain. There are still areas where it did not meet our expectations, particularly complex data operations and advanced machine learning tasks. Furthermore, we would have loved to evaluate the Copilot feature and share the results with you, but unfortunately this feature was not yet in preview. We feel that Microsoft Fabric's SaaS data platform gives a peek into the near future of data solution integration; it may, however, take more time to become mature enough to be the complete solution for data-intensive use cases.

Do you want to know more about Microsoft Fabric and the other technologies we used in this blog? Below we have provided links to resources you might find interesting, including a link to the GitHub page where you can find the notebooks we used in Fabric and the configuration files of the Data Factory pipelines.
