Microsoft Fabric: New Age Analytics

Amit Chandak · Published in Microsoft Power BI · May 24, 2023

Microsoft Fabric: New Age Analytics | Lakehouse | Warehouse | Dataflow Gen2 | Power BI | Data Pipeline | Spark | Notebook | KQL

Evolution of Data

We live in an era of data and data-driven decision-making. Modern organizations depend on data and data engineering processes to make quality decisions. It all started with the emergence of RDBMS in the 1970s.

Relational Database Management Systems (RDBMS) have played a pivotal role in the evolution of data management and data engineering since their inception. From their humble beginnings to their current prominence, RDBMS have transformed the way organizations store, manage, and retrieve data.

With the need to analyze large volumes of data, data warehousing and OLAP technologies gained prominence. We explore the integration of these technologies with RDBMS and their role in decision support systems.

The explosion of unstructured and semi-structured data challenged the relational model's suitability. This led to the emergence of data lakes and NoSQL databases with their various models, which have significantly impacted the data engineering landscape. Recognizing the strengths and weaknesses of both approaches, a convergence between RDBMS and NoSQL technologies began, combining the benefits of both worlds.

With the evolution of machine learning and AI capabilities, the integration of predictive analytics, data mining, and machine learning algorithms has become part of the data engineering process. This has enabled advanced analytics and automation.

In the age of technological advancements, data has become the centerpiece of discussions. AI and BI have revolutionized industries, all driven by the immense amount of data collected over the past few decades. AI models like ChatGPT have emerged, providing intelligence that rivals human capabilities. It seems like a thrilling chapter in human progress.

However, amidst this transformative era, organizations grapple with data engineering challenges. Despite the promises of AI and BI, outcomes often fall short, either due to delays or the inability to achieve objectives. Why is this happening just as we enter an era of AI and BI? The answer lies in the enigmatic nature of the data engineering process, particularly during its initial phases.

The Challenges

Data engineering, the foundation of successful data analysis and insights, involves various stages such as data collection, transformation, cleaning, and integration. It is an intricate process, similar to solving a complex puzzle. The initial phase, known as data wrangling, serves as the gateway to the entire journey. Yet data engineering often becomes a black box.

Imagine an organization embarking on a data engineering project. They possess a wealth of data collected over the years, encompassing customer information, sales figures, market trends, and more. Excitement surges as they aim to unlock the hidden potential within this treasure trove. Yet, as the data engineers delve into the project, they encounter numerous obstacles.

The first challenge they face is data quality. The collected data often comes from diverse sources, each with its own format, structure, and accuracy. It is like gathering puzzle pieces from different sets, each with its own unique shape and color scheme. Aligning these pieces becomes an arduous task, as data engineers struggle to ensure consistency and reliability. Missing values, outliers, and inconsistent formatting plague the dataset, introducing complexities that must be resolved.

The next hurdle is data integration. Organizations often possess multiple data sources, each providing valuable insights into different aspects of their operations. However, merging these sources seamlessly becomes a daunting endeavor. Incompatible data formats, conflicting definitions, and varying granularity levels hinder the creation of a unified view. The jigsaw puzzle grows more perplexing as data engineers strive to establish relationships and linkages between disparate datasets.

Furthermore, the sheer volume of data overwhelms engineers. With each passing day, the data landscape expands exponentially. Organizations collect vast amounts of structured and unstructured data, including social media feeds, sensor readings, and customer interactions. It is as if the puzzle keeps growing in size, with new pieces added constantly. Managing and processing Big Data becomes a challenge, requiring robust infrastructure and efficient algorithms.

In the midst of these challenges, deadlines loom, and organizational stakeholders eagerly await actionable insights. However, data engineering operates behind closed doors, shrouded in complexity and uncertainty. It is no surprise that delays occur, leaving stakeholders frustrated and questioning the effectiveness of the AI/BI revolution.

Data Replication

Data engineering processes often become complex due to the current technological landscape and choices made regarding data storage. Let’s examine how data flows from its source to the end user for consumption.

Multiple Data Sources -> Data Lake -> Data Warehouse -> BI Tools

Multiple Data Sources -> Data Lake -> AI Engine -> Data Lake -> Data Warehouse -> BI Tools

Multiple data sources feed into a data lake, which serves as a central repository for raw data. From the data lake, the data is then transformed and loaded into a data warehouse, which acts as a structured storage solution for analytics and reporting. Finally, business intelligence (BI) tools access the data from the data warehouse for analysis and visualization.

In some cases, an additional step involves routing the data from the data lake to an AI engine for machine learning and AI modeling. This allows advanced data processing and modeling capabilities.

Data lakes are particularly suitable for data science, ML, and AI modeling because they typically store data in columnar file formats. They provide the flexibility and scalability required for these types of analytical workloads.

Data warehouses, on the other hand, store data in proprietary formats, which can vary from one database to another. When BI tools interact with data warehouses, they typically operate in import mode to optimize performance, which creates yet another copy of the data in the BI tool's own proprietary format. This redundancy can lead to multiple copies of the data throughout the process.

As a result, we end up with multiple copies of data, each in a proprietary format, requiring data validation and quality assurance at each layer. The process can feel like a black box, with limited visibility for the quality assurance team and end users until the data reaches a BI tool. This lack of transparency can cause delays and limited data availability at certain stages unless a dedicated technical QA team is involved. I also often feel that the process is not well aligned with a true Agile/Scrum approach.

And this count does not even include copies replicated purely to achieve higher performance.

Microsoft Fabric

Microsoft has taken a significant leap forward to address unnecessary data replication. Their goal is to ensure that data can be stored and accessed by all their tools in a standardized format. This approach eliminates data duplication within the Microsoft ecosystem, allowing others to read and utilize the data as well.

Microsoft Fabric aims to revolutionize data engineering and data analytics by promoting the concept of a single copy of data stored in a central lake (OneLake), using the industry-standard Delta Parquet format. This means organizations can avoid unnecessary data replication and have a true single source of data.

The advantage of this approach is that it enables seamless compatibility with various Microsoft tools, including data pipelines, dataflows, notebooks, the SQL engine, the Kusto engine, and Power BI. All of these tools can read or write data in the standardized Delta Parquet format, ensuring consistency and ease of use across the entire Microsoft ecosystem.

Image Source: Microsoft
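
To make the "one copy, one format" idea concrete, here is a minimal sketch from a Fabric notebook that writes a Delta table once; the table and column names are illustrative assumptions, and the same Delta table would then be readable by the SQL endpoint, pipelines, and Power BI without further copies.

```python
# Minimal sketch (Fabric notebook, PySpark). Table and column names are
# illustrative assumptions, not part of any real dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-created in Fabric notebooks

sales = spark.createDataFrame(
    [(1, "2023-05-01", 120.50), (2, "2023-05-02", 75.00)],
    ["order_id", "order_date", "amount"],
)

# One write, one copy: the table lands in OneLake as Delta Parquet and is
# immediately visible to the other Fabric engines.
sales.write.format("delta").mode("overwrite").saveAsTable("sales_orders")
```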

By adopting this unified approach, organizations can now create a more efficient and streamlined data engineering process. They no longer need to replicate data solely to work with different technologies. This eliminates redundancies, simplifies data management, and promotes a true single source of data for analysis and decision-making.

In summary, Microsoft’s Fabric represents a significant step towards redefining the data analytics landscape. It allows for the storage and utilization of data in a standardized format, eliminating unnecessary replication and enabling seamless integration with various Microsoft tools. This advancement empowers organizations to work with a true single source of data, streamlining their workflows and maximizing their data assets.

The Microsoft Fabric platform provides customers with a SaaS-ified, open, lake-centric, full-featured data, analytics, and AI platform that meets all of their data analytics needs.

Microsoft Fabric targets different sets of users, such as C-level executives, department heads, data scientists, business analysts, and data analysts, by providing a single source of truth, democratized access, reduced duplication, less complexity, greater visibility, an integrated environment, enhanced productivity, and better collaboration. Data, the new oil, will become a new advantage.

These are the key features:

  1. Power BI, Azure Synapse, and Azure Data Explorer are all brought together in Microsoft Fabric in a single integrated environment, so each user is presented with a customized experience tailored to their needs.
  2. There are many advantages to integrating Fabric into your organization. For example, Fabric offers a wide array of deeply integrated analytics capabilities, shared experiences across familiar interfaces, easy access and reuse of assets for developers, a unified data lake that is compatible with your preferred analytics tools, and central administration and governance.
  3. A simplified user experience: Microsoft Fabric's SaaS model integrates data and services seamlessly and allows IT teams to centrally configure core enterprise capabilities. Across all services, permissions and data sensitivity labels are applied automatically, so creators can focus on their work without worrying about the underlying infrastructure.

Please note: The information source for Microsoft Fabric is Microsoft's official documentation.

Microsoft Fabric provides:
1. A unified analytics platform
2. A lake-centric and open design
3. Empowerment for every user
Image Source: Microsoft

Components of Microsoft Fabric:

  1. Data Engineering: Provides data engineers with a powerful Spark platform designed for large-scale data transformations and for democratizing data through the Lakehouse. With Data Factory integration, you can schedule and orchestrate notebooks and Spark jobs with ease.
  2. Data Factory: Combines the simplicity of Power Query with the scalability of Azure Data Factory, allowing connection to over 200 native connectors for on-premises and cloud data sources.
  3. Data Science: Enables seamless building, deploying, and operationalizing of machine learning models within Fabric. The Azure Machine Learning integration provides experiment tracking and model registries.
  4. Data Warehouse: Provides industry-leading SQL performance and scale with separate compute and storage components. Data is stored in the open Delta Lake format.
  5. Real-Time Analytics: Designed for analyzing observational data collected from various sources, such as apps and IoT devices. Offers efficient analytics for semi-structured data with high volume and shifting schemas.
  6. Power BI: The leading business intelligence platform, giving users quick and intuitive access to data within Fabric for better data-driven decision-making.

OneLake

All Fabric services are built on the foundation of a data lake. The Microsoft Fabric lake is called OneLake.

There are two primary ways to start using Fabric:

Lakehouse:

Microsoft Fabric Lakehouse is a cutting-edge data architecture platform designed to efficiently store, manage, and analyze both structured and unstructured data within a unified storage system. By seamlessly integrating structured and unstructured data, the Lakehouse enables organizations to derive valuable insights from their data assets in a single, consolidated location. Notably, the Lakehouse simplifies data access through the automatic generation of a read-only SQL endpoint and a default dataset upon creation. This SQL endpoint provides users with convenient access to the data for analysis and querying purposes.
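
As a hedged sketch of how that read-only SQL endpoint can be used, the snippet below queries a Lakehouse table over T-SQL with pyodbc; the server, database, and table names are placeholders you would copy from the Lakehouse settings page.

```python
# Hedged sketch: query a Lakehouse through its auto-generated, read-only SQL
# endpoint. Server, database, and table names are placeholders (assumptions).
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-sql-endpoint>.datawarehouse.fabric.microsoft.com;"
    "Database=<your_lakehouse>;"
    "Authentication=ActiveDirectoryInteractive;"  # Azure AD sign-in
)

for row in conn.execute("SELECT TOP 5 * FROM dbo.sales_orders;"):
    print(row)
```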

Warehouse (Synapse Data Warehouse):

The Warehouse provides comprehensive support for transactional Data Definition Language (DDL) and Data Manipulation Language (DML). It functions as a data warehouse with robust T-SQL capabilities similar to those found in enterprise data warehouses. Managed by a SQL engine, the Warehouse can be updated through pipelines, SQL, Spark, and shortcuts. Files are stored in the Delta Parquet format and are presented as tables. Users can use SQL for data updates and transformations.

Unlike the SQL endpoint, where tables are generated automatically and access is read-only, in the Warehouse you have complete control over creating tables, loading data, performing transformations, and executing queries. This can be done through either the Microsoft Fabric portal or T-SQL commands. Additionally, the Warehouse provides a default dataset for your convenience.
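
Here is a minimal, hedged sketch of that full DDL/DML control, again using pyodbc with placeholder names; the same statements could equally be run from the Fabric portal's SQL editor or any T-SQL client.

```python
# Hedged sketch: unlike the read-only Lakehouse endpoint, the Warehouse accepts
# full T-SQL DDL and DML. Endpoint, database, and table names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-warehouse-endpoint>.datawarehouse.fabric.microsoft.com;"
    "Database=<your_warehouse>;"
    "Authentication=ActiveDirectoryInteractive;"
)
conn.autocommit = True

conn.execute("CREATE TABLE dbo.dim_customer (customer_id INT, name VARCHAR(100));")  # DDL
conn.execute("INSERT INTO dbo.dim_customer VALUES (1, 'Contoso');")                  # DML
print(conn.execute("SELECT * FROM dbo.dim_customer;").fetchall())
```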

What does this mean for data engineering and data analytics processes?

With the default dataset, you can bring data into Power BI as soon as it arrives in the Lakehouse or Warehouse. This enables the quality assurance team and end users to analyze raw data quality accurately and explain the underlying logic directly within Power BI.

As a data engineer, you have the flexibility to bring in data using data pipelines and dataflows, and to transform data using options such as SQL and Spark/notebooks (Python/SQL/Scala), depending on your technology preferences. The choice of technology depends on:
1. How much data do you have?
2. How large are the individual datasets?
3. Which technology landscape are you comfortable with?

In Microsoft Fabric, you can embark on your data journey from the Lakehouse, which provides a default SQL endpoint and a Power BI dataset. You can enhance the dataset by adding measures specific to your analytical needs.

Alternatively, you can start your journey directly from SQL-managed warehouses, which also come with a default Power BI dataset.

To avoid unnecessary data replication, users can create shortcuts in the Lakehouse, which let them access and view data from other Lakehouses and Warehouses without copying it. This promotes data efficiency and reduces redundancy.
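
As a hedged illustration, once a shortcut has been created (for example, through the Lakehouse UI), it reads like any local table or folder; the shortcut name and path below are assumptions.

```python
# Hedged sketch: reading through shortcuts in a Fabric notebook. The shortcut
# name "orders_shortcut" and the Files/ path are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-created in Fabric notebooks

# A table shortcut under Tables/ behaves like a local Delta table, with no copy made.
orders = spark.read.table("orders_shortcut")
orders.show(5)

# A file shortcut appears under Files/ and is read by path instead.
raw = spark.read.parquet("Files/external_orders/")
```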

Now, let’s delve into some further details and explore the intricacies of this unified data analytics platform.

Lakehouse

Microsoft Fabric Lakehouse is a data architecture platform that enables storing, managing, and analyzing structured and unstructured data in a single location. It offers flexibility and scalability to handle large volumes of data with various processing and analysis tools.
1. Store, manage, and analyze all data in one location for easy sharing across the enterprise.
2. Flexible and scalable solution for handling diverse data types and sizes.
3. Ingest data from multiple sources and store it in an open format.
4. Easily import data from different sources.
5. Directly upload files from your computer.
6. Connect to 100+ data sources and apply transformations using Dataflows.
7. Copy data at scale through the Copy activity in pipelines.
8. Store data in an open format for accessibility.
9. Provides a fully managed process for converting files into tables (a minimal notebook sketch follows this list).
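
Here is that sketch: a hedged example of turning a CSV uploaded to the Lakehouse Files area into a managed Delta table. The file path and table name are illustrative assumptions.

```python
# Hedged sketch of "files into tables": read an uploaded CSV and register it
# as a managed Delta table. Path and table name are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("Files/uploads/customers.csv")  # illustrative upload location
)

raw.write.format("delta").mode("overwrite").saveAsTable("customers")
```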

The Lakehouse creates a serving layer with an auto-generated SQL endpoint and default dataset, allowing users to work directly on delta tables for a seamless experience from data ingestion to reporting.
1. Tables are automatically discovered and registered.
2. SQL endpoint and semantic dataset enable working on delta tables.
3. The default warehouse (SQL endpoint) is read-only and limited in its T-SQL capabilities.
4. Only tables in Delta format are accessible via the SQL Endpoint.

Data Factory - A persona for Data Engineers

Data Factory in Microsoft Fabric provides a cutting-edge data integration experience that empowers both citizen and professional developers to ingest, prepare, and transform data from a wide range of data sources. These sources include databases, data warehouses, Lakehouses, real-time data, and more. With intelligent transformations and a comprehensive set of activities, users can efficiently manipulate and optimize their data.

One of the key features introduced in Data Factory is Fast Copy, which brings rapid data movement capabilities to both dataflows and data pipelines. With Fast Copy, users can seamlessly transfer data between their preferred data stores with blazing speed. This functionality is particularly significant as it allows users to bring data into the Microsoft Fabric Lakehouse and Data Warehouse for advanced analytics and insights.

Data Factory offers two primary high-level features: dataflows and pipelines.

Dataflows provide a robust data transformation environment with a dataflows designer offering over 300 transformations. This comprehensive set of transformations ensures that users can easily and flexibly manipulate their data, surpassing the capabilities of other tools. Moreover, Data Factory leverages intelligent AI-based data transformations to further enhance the efficiency and effectiveness of data manipulation.

Data pipelines deliver powerful data orchestration capabilities out-of-the-box. Users can leverage these capabilities to compose flexible data workflows that align with the specific needs of their enterprise. The rich set of data orchestration features empowers users to seamlessly manage and optimize their data processing workflows, enabling smooth data movement and transformations.

Data Science — A persona for Data Scientists

Microsoft Fabric provides comprehensive Data Science experiences that empower users to perform end-to-end data science workflows, enabling data enrichment and generating valuable business insights. With these experiences, users can seamlessly execute a wide range of activities throughout the entire data science process. This includes tasks such as data exploration, preparation, and cleansing, as well as experimentation, modeling, model scoring, and serving predictive insights to BI reports.

The Data Science experience within Microsoft Fabric enables users to effortlessly build, deploy, and operationalize machine learning models. It seamlessly integrates with Azure Machine Learning, offering built-in capabilities for experiment tracking and model registry. This integration enhances the workflow for data scientists, enabling them to enrich organizational data with predictions derived from machine learning models. These predictions can then be seamlessly integrated into business intelligence (BI) reports, allowing business analysts to incorporate valuable insights into their reporting processes.
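
Since Fabric's Data Science experience builds on MLflow-style experiment tracking, here is a minimal hedged sketch of that pattern; the experiment name, parameter, and metric are illustrative, not from any real project.

```python
# Hedged sketch: standard MLflow experiment tracking, which Fabric Data Science
# notebooks support. Experiment, parameter, and metric names are illustrative.
import mlflow

mlflow.set_experiment("churn-model")  # illustrative experiment name

with mlflow.start_run():
    mlflow.log_param("max_depth", 5)   # a hyperparameter you chose
    mlflow.log_metric("auc", 0.87)     # a metric computed after training
```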

OneLake

Microsoft Fabric’s foundation lies in the data lake, which is also known as OneLake. OneLake is an integral part of the Fabric service, providing a centralized location to store all organizational data, ensuring seamless operation across various experiences.

OneLake, similar to OneDrive, is automatically included with every Microsoft Fabric tenant. It serves as the single repository for all analytics data, catering to the entire organization. This logical data lake, OneLake, is built on top of Azure Data Lake Storage (ADLS) Gen2, offering a unified Software-as-a-Service (SaaS) experience for both professional and citizen developers. By simplifying the user experience, OneLake eliminates the need for users to have a deep understanding of infrastructure concepts like resource groups, Role-Based Access Control (RBAC), Azure Resource Manager, redundancy, or regions. In fact, having an Azure account is not even a requirement to utilize OneLake.

OneLake embraces openness at every level, leveraging the capabilities of Azure Data Lake Storage Gen2. It supports any type of file, structured or unstructured. Within the Fabric ecosystem, all data items such as data warehouses and lakehouses automatically store their data in OneLake in the Delta Parquet format. This means that regardless of whether a data engineer loads data into a lakehouse using Spark or a SQL developer loads data into a fully transactional data warehouse using T-SQL, all contributions build the same unified data lake. Tabular data, specifically, is stored in OneLake in the Delta Parquet format.
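
Because OneLake is built on ADLS Gen2, its contents are addressable with the familiar abfss:// syntax. In the hedged sketch below, the workspace, lakehouse, and table names are placeholders, while onelake.dfs.fabric.microsoft.com is OneLake's documented endpoint.

```python
# Hedged sketch: reading OneLake data through an ADLS-style path. Workspace,
# lakehouse, and table names are placeholders (assumptions).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = (
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/"
    "<lakehouse>.Lakehouse/Tables/sales_orders"
)

df = spark.read.format("delta").load(path)
df.printSchema()
```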

You can find more details here- https://learn.microsoft.com/en-us/fabric/

How to enable Fabric, find details here- https://learn.microsoft.com/en-us/fabric/admin/fabric-switch

Microsoft Fabric license details here - https://learn.microsoft.com/en-us/fabric/enterprise/licenses

Microsoft Fabric administration here- https://learn.microsoft.com/en-us/fabric/admin/microsoft-fabric-admin

My next blog: https://amitchandak.medium.com/microsoft-fabric-what-why-who-when-how-638b391df4f1

Microsoft is revolutionizing the data landscape, offering an unparalleled data experience that removes concerns about the ever-changing technological landscape. Moreover, it provides enhanced visibility into the data engineering process, aligning it more effectively with Agile/Scrum methodologies. Let’s delve into the exciting realm of data engineering with the public preview.

Comprehensive Free Course for Your Team to Learn Microsoft Power BI, SQL, and Data Factory

Microsoft Fabric: A Brief Introduction

How to enable Microsoft Fabric on Power BI Service | Start the Fabric (Preview) trial

Microsoft Fabric-What it is, Why Use, Who Should, When and How to use

Microsoft Fabric: Various Components| Microsoft Fabric Create Workspace, Lakehouse, Warehouse

Microsoft Fabric: How to load data in Lakehouse using Dataflow Gen2

Microsoft Fabric: How to load data in Warehouse using Dataflow Gen2

Microsoft Fabric: How to load data in Warehouse using Data Pipeline | End-to-End Scenario

Microsoft Fabric: How to load data in Lakehouse using Spark; Python/Notebook

Microsoft Fabric: Integrated Machine Learning- k-Means clustering

Microsoft Fabric: Create a Date table in Dataflow Gen2, and use in Lakehouse and Warehouse| Time Intelligence with Power BI

Microsoft Fabric: How to load data in Lakehouse using Data Pipeline | End-to-End Scenario

Microsoft Fabric: What are the options to load local files in Lakehouse

Microsoft Fabric: Shortcut with Fabric and for Azure Data Lake | File vs Table Shortcut

Microsoft Fabric: Import Azure SQL Data to Warehouse | Multiple tables using Pipeline

Microsoft Fabric: Incremental Data Load(ETL) for Warehouse using Dataflow Gen 2, SQL Procedure, Data Pipeline

Microsoft Fabric: How about incremental without Dataflow, Only using pipeline and SQL procedure for Microsoft Fabric Warehouse

Microsoft Fabric: Load local SQL server data in Lakehouse using on-premise Gateway and Dataflow Gen2

Microsoft Fabric: Dataflow Gen2: Native warehouse Destination- Power Query Transformation — Pivot, Unpivot, Transpose Data

Microsoft Fabric: Load Azure SQL Data to Lakehouse & Warehouse simultaneously with Dataflow Gen2

Microsoft Fabric: Load Snowflake DB Data to Lakehouse & Warehouse simultaneously with Dataflow Gen2

Microsoft Fabric: How to append only Incremental data using Data Pipeline in Lakehouse

Introduction to Microsoft Fabric in Hindi | What it is, Why to Use, Who Should, When and How to use

Microsoft Fabric in Hindi Part 2: Create Workspace, Lakehouse, Warehouse

Microsoft Fabric in Hindi Part 3: How to load data in Lakehouse using Data Pipeline | End to End Scenario

Microsoft Fabric in Hindi Part 4: How to load data in Lakehouse using Dataflow Gen 2 | End to End Scenario

Microsoft Fabric in Hindi Part 5: How to load data in Warehouse using Dataflow Gen 2 | End to End Scenario

Microsoft Fabric in Hindi Part 6: How to load data in Warehouse using Data Pipeline | End to End Scenario

To take your data journey to the next level, I’ll be providing more examples in my upcoming blogs and videos.

Complete Power BI in one Video — 11 Hours

Click here to access all my blogs and videos in a jiffy via an exclusive glossary using Power BI. Please like, share, and comment on these blogs. I would appreciate your suggestions for improvement, challenges, and suggestions for topics so that I can explore these in more depth.

In addition, I have over 500 videos on my YouTube channel related to Power BI, Tableau, and Incorta. With the help of these videos, you will learn hundreds of DAX and Power Query functions, in addition to hundreds of Power BI use cases that will assist you in becoming an expert in Power BI. Make sure you subscribe, like, and share them with your friends.

Master Power BI
Expertise in Power BI
Power BI For Tableau User
DAX for SQL Users
Learn SQL


Amit Chandak
Microsoft Power BI

Amit has 20+ years of experience in Business Intelligence, Analytics, and Data Science. He is Chief Analytics Officer at Kanerika and a Power BI community Super User.