Microsoft Fabric

Alain Chardonnens
21 min read · May 29, 2023

Reading notes

What is Microsoft Fabric

Microsoft Fabric is an analytics SaaS offering, covering end-to-end activities. It groups under a single umbrella all the services required to implement a full, coherent, all-encompassing data science solution across an organization.

Architecture

Microsoft Fabric is divided into physical and logical layers. As a managed solution, the implementation of the physical layers is of no concern to admins, developers, or end users.

The logical layers are a storage layer and a compute layer, upon which services are built and exposed.

The storage layer is implemented as OneLake, the Data Lake equivalent of OneDrive, which can store all of an organization’s data in a single place and make it available to any Microsoft Fabric service that might require it. The compute layer supports multiple APIs to access its underlying data, and exposes them to the various services built on top of it.

There is only a single OneLake instance per Fabric tenant, and a Fabric tenant is usually aligned with an Azure AD tenant.

A Fabric capacity is a set of resources made available in a tenant. A single tenant can have multiple capacities active at the same time.

A Fabric domain is a logical grouping of Fabric workspaces. Domains present a convenient way to manage access and permissions for groups of resources.

A Fabric workspace is a collection of items working together to exploit a capacity and its functionalities in order to bring a service to life.

A Fabric item is an object that can be created and managed in Fabric.

Security and governance are cross-cutting concerns implemented at each layer.

Storage

OneLake is accessible by all components of the Fabric platform. Data no longer needs to be copied explicitly between tools, nor safeguarded in multiple places.

OneCopy enables multiple engines and services to work from a single copy of the data, without duplicating it across regions and clouds.

OneLake combines data from physically different regions and clouds into a single logical Data Lake. It is built on top of Azure Data Lake Storage (ADLS). Data can be stored as Delta Parquet, CSV, or JSON files. Other formats are supported as well, but these three are the most common, with Delta Parquet being the default. This storage layer can contain warehouses, lakehouses, KQL databases, semantic models, and so on.

Shortcuts enable data that is stored in separate storage systems to appear as if it were part of OneLake, thus becoming accessible to any Microsoft Fabric service.

Lakehouse

A Lakehouse combines the storage capabilities of a Data Lake, which can hold structured, semi-structured, and unstructured data in the same logical container, with the analysis capabilities of a Data Warehouse, which provides a querying environment with rich analysis functionalities.

The Lakehouse is seen as a database from the outside, and it is built on top of data held in OneLake using Delta format tables. Relevant properties of a Lakehouse are:

  • Querying is accomplished using Spark and SQL engines. These are well-known, standard ways to write data manipulation code and they can handle big data quite efficiently.
  • Data is organized in a schema-on-read format. This makes it possible to expose the same data in multiple ways without the need to conform to a single, predefined schema during design and ingestion.
  • ACID properties are supported when working with Delta tables. This provides the atomicity, consistency, isolation, and durability required from data management systems.

Data can be uploaded manually to the Lakehouse, or through ingestion pipelines developed using Data Factory/Synapse pipelines or Gen2 Dataflows (Power Query Online).

Data can then be transformed using Spark Notebooks, Spark Job Definitions, and Gen2 Dataflows.

Delta Lake

The open-source Delta Lake storage layer is used in Microsoft Fabric to add relational semantics to Spark-based data lake processing. It uses Delta tables, which separate a table’s metadata (schema) from its actual data (stored in the Delta format). Each table has a corresponding folder containing parquet files for its data and a subfolder containing JSON files as its transaction log.

It is the Delta Lake layer that is responsible for providing the ACID properties required by a relational system.

Delta Lake also provides versioning of data, enabling time travel.

Tables can be managed or external. A managed table has both its data and metadata stored in the Delta Lake (the path to the data is relative to the root of the Delta Lake), and any operation on the metadata impacts the data as well. Deleting a managed table deletes both data and metadata. For an external table, the Data Lake stores only metadata for data that is managed separately (the path to the data is absolute). Deleting an external table only affects the metadata. External table schemas are created from the actual data schema.

Spark SQL adds the USING DELTA option to CREATE TABLE and, in the case of an external table, the LOCATION option is used to give the path to the data.
DESCRIBE HISTORY <table> can be used to see the transactions that have been applied to a table. Time travel itself can be achieved by specifying either the VERSION AS OF <version number> or the TIMESTAMP AS OF <date-time> modifier on a table reference in the FROM/JOIN clauses.
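
By way of illustration, here is a minimal PySpark sketch of these statements as they might be run from a Fabric notebook; the table names and the external path are hypothetical, and spark is the session provided by the notebook.

    # Managed Delta table: metadata and data both live under the Lakehouse.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_managed (id INT, amount DOUBLE)
        USING DELTA
    """)

    # External Delta table: only the metadata is registered; the data stays at
    # the absolute path given by LOCATION (hypothetical OneLake path).
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_external
        USING DELTA
        LOCATION 'abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Files/external/sales'
    """)

    # Inspect the transaction log of a table.
    spark.sql("DESCRIBE HISTORY sales_managed").show(truncate=False)

    # Time travel: read the table as it was at a given version or timestamp.
    v1 = spark.sql("SELECT * FROM sales_managed VERSION AS OF 1")
    old = spark.sql("SELECT * FROM sales_managed TIMESTAMP AS OF '2023-05-01'")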

Lakehouse Explorer

The Lakehouse Explorer tool enables browsing data through files, folders, shortcuts, and tables.

Microsoft Fabric Lakehouse Artifacts

Creating a Lakehouse resource in Microsoft Fabric generates multiple artifacts, all of which are stored in OneLake.

  • A Lakehouse resource, which will hold the data and various definitions.
  • A default dataset, a data model of the underlying data, that can be used for querying.
  • A SQL endpoint, enabling T-SQL queries against the data.

An Apache Spark pool is also created.

Shortcuts

Shortcuts are logical pointers to data stored in external systems, which make that data available to Microsoft Fabric services as if it were part of OneLake.

Compute

Microsoft Fabric provides the following services, all relying upon OneLake for their data storage:

  • Data Warehousing: Synapse Data Warehouse.
  • Data Engineering: Spark
  • Data Integration: Data Factory, Power Query Online, Synapse pipelines
  • Data Science: Azure Machine Learning, Spark
  • Real-time Analytics: Synapse Real-Time Analytics
  • Business Intelligence: Power BI

These services access data through a compute layer, which is physically and logically separated from the OneLake storage layer. It supports, amongst others, T-SQL, Spark, KQL, and Analysis Services queries.

Microsoft Fabric is natively integrated with Microsoft Purview Information Protection. This enables the use of sensitivity labels across all data assets, which are understood by all services that are part of the Microsoft Fabric platform, from ingestion to export.

Power BI

Power BI has been enhanced with a Direct Lake option. This allows the lineage view in the Power BI service interface to provide more detailed information.

Spark

Each Microsoft Fabric workspace has its own Spark cluster, whose settings can be adjusted by an administrator. These include:

  • Node family: Virtual machine configuration (by default, Memory Optimized)
  • Runtime version: If there is a specific Spark version to be used
  • Spark properties: Spark-specific settings applied to the cluster

If the libraries included by default are not sufficient, the Library management menu option can be used to manage additional libraries.

Notebooks can be created and executed directly from within the Microsoft Fabric interface, with an experience comparable to Databricks.

Job definitions can also be created in the interface, and the code for the job can be uploaded into its definition or referenced from an Azure Data Lake Storage Gen2 account.

Using dataframes, it is possible to load and process data in Spark. They can also be used to save the transformed data back into storage, using the write method; various formats can be used, although the recommended one remains parquet. Data can also be partitioned when writing.
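
A minimal sketch of this load/transform/write pattern, with hypothetical file names and columns (spark is the notebook's session):

    from pyspark.sql.functions import col, year

    # Load raw data from the Lakehouse Files area.
    orders = (spark.read
              .format("csv")
              .option("header", "true")
              .load("Files/raw/orders.csv"))

    # Transform: cast the date and derive a column to partition on.
    clean = (orders
             .select("order_id", "customer_id", "order_date", "amount")
             .withColumn("order_date", col("order_date").cast("date"))
             .withColumn("order_year", year(col("order_date"))))

    # Save the result back to storage as parquet, partitioned by year.
    (clean.write
          .mode("overwrite")
          .partitionBy("order_year")
          .parquet("Files/curated/orders"))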

The Spark catalog can be manipulated to make data structures available to Spark SQL scripts. It is possible to create a Spark view, whose lifetime ends with the session the code runs in, or a table, which will persist.

Tables in the catalog are purely metadata. They can be created over existing data, as well as from a dataframe, which will persist its data. The two concepts remain separate, even though both the metadata creation operation and the data persistence operation are executed in a single instruction. It is also possible to create an empty table, i.e. adding the metadata for a table and creating the storage space, but not filling it, in the same operation.
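
A short sketch contrasting a temporary view, a persisted table, and an empty table (all names are illustrative):

    # Small in-memory dataframe just to make the sketch runnable.
    df = spark.createDataFrame([(1, 10.0), (2, 25.5)], ["order_id", "amount"])

    # Temporary view: metadata only, dropped when the Spark session ends.
    df.createOrReplaceTempView("orders_view")
    spark.sql("SELECT SUM(amount) AS total FROM orders_view").show()

    # Saving as a table registers the metadata in the catalog and persists the
    # data in a single instruction.
    df.write.format("delta").saveAsTable("orders")

    # Empty table: metadata and storage are created, but no data is loaded yet.
    spark.sql("CREATE TABLE IF NOT EXISTS orders_empty (order_id INT, amount DOUBLE) USING DELTA")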

Data Factory

In Microsoft Fabric Data Factory, Copy activities can now specify whether the source and/or destination is the workspace or an external store. Selecting the workspace provides the option to indicate its type and then configure further options.

Pipelines can be validated before being run or scheduled.

The usual monitoring and tracing functionalities are available.

Dataflows

Dataflows can be created from either the Power BI or Data Factory workspaces. They can also be created directly from the Lakehouse.

The Microsoft Fabric Dataflow editor provides the ability to indicate the destination (Lakehouse, for example) directly in the query pane of the editor.

Dataflows can be triggered from a Microsoft Fabric Data Factory pipeline, using the Dataflow activity.

Data Warehouse

Microsoft Fabric data warehouses are built as a relational model of tables in a schema following multidimensional modeling principles. The data warehouse experience itself does not provide any tables or models, but rather the tools and functionalities to build the ingestion, storage, transformation, and analysis artifacts.

Data warehouses can be created from the Create menu option or from within a workspace.

Data can be ingested through multiple means, including pipelines, dataflows, cross-database querying, and the T-SQL COPY INTO statement. Tables can be created on the fly from an ingestion tool or directly through the Fabric UI.

The recommended approach involves the following steps (a sketch of the load and merge steps follows the list):

  • Ingest new data into a Data Lake, applying basic transformation and cleansing.
  • Load data from the Data Lake into staging tables in the relational warehouse.
  • Merge dimensional data into dimension tables from the staged data.
  • Merge factual data into fact tables from the staged data, resolving dimension lookups.
  • Apply post-load adjustments, in particular indexing and statistics updates.
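
As a rough illustration of the load and merge steps, the sketch below sends T-SQL to the warehouse’s SQL endpoint from Python using pyodbc. The connection details, storage paths, and table names are hypothetical, and the exact T-SQL surface supported by Fabric warehouses may differ; a simple INSERT ... SELECT stands in for the dimension merge.

    import pyodbc

    # Hypothetical SQL endpoint and warehouse name; assumes the Microsoft ODBC
    # Driver 18 and an interactive Azure AD sign-in.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=myendpoint.datawarehouse.fabric.microsoft.com;"
        "Database=SalesWarehouse;"
        "Authentication=ActiveDirectoryInteractive;"
    )
    cur = conn.cursor()

    # Load staged data from the lake into a staging table (illustrative COPY INTO).
    cur.execute("""
        COPY INTO stg_customer
        FROM 'https://mystorage.blob.core.windows.net/staging/customer/*.parquet'
        WITH (FILE_TYPE = 'PARQUET')
    """)

    # Merge new dimension members from staging into the dimension table.
    cur.execute("""
        INSERT INTO dim_customer (customer_key, customer_name, city)
        SELECT s.customer_key, s.customer_name, s.city
        FROM stg_customer AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM dim_customer AS d WHERE d.customer_key = s.customer_key
        )
    """)
    conn.commit()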

The difference between Microsoft Fabric data warehouses and the usual platforms is that most, if not all, data is already available in OneLake, so the need to import and copy data is reduced. Instead, cross-database querying can be put to work.

When a data warehouse is created, a corresponding default dataset artifact is also made available. This is a semantic model to which reporting tools can connect. In the case of Microsoft Fabric data warehouses, default datasets are managed and maintained automatically. New tables declared in the warehouse are automatically added to the default dataset. Custom datasets can also be created explicitly.

Querying the data warehouse can be accomplished through the UI using either the Query editor UI or the Visual query editor UI. External tools such as SSMS and Azure Data Studio can also execute queries using the SQL endpoint made available for each workspace.

The data warehouse interface offers different views to perform different actions.

The Data view shows the tables available in the dataset.

The Query view shows the queries that have been used to define and populate the dataset.

The Model view shows the tables and their relationships. This is where relationships can also be defined and DAX measures created.

Securing data in a data warehouse is achieved through multiple layers, and the same processes of authentication and authorization are enforced regardless of which means of access is used.

  • RBAC can be applied
  • Data in transit is secured using SSL encryption
  • Data at rest (and in transit) is secured using Azure Storage Service Encryption
  • Azure Monitor and Azure Log Analytics are fed with audit and activity data
  • Azure AD MFA and other conditional access policies can be applied
  • Managed identities can be used
  • Workspace permissions are controlled using workspace roles
  • Workspace items can be further secured using item-level permissions such as Read (allows connection), ReadData (allows reading data from tables and views), and ReadAll (allows reading data from the underlying parquet files).

Monitoring can be achieved using dynamic management views, as is the case with Azure SQL databases. At the time of writing, the available views are sys.dm_exec_connections, sys.dm_exec_sessions, and sys.dm_exec_requests.

Synapse Real-Time Analytics

Microsoft Fabric’s Synapse Real-Time Analytics uses a KQL database for storage and the Kusto Query Language (KQL) for querying and analyzing data. KQL querysets can be used by Power BI to build reports on. Streaming time-series data is ingested, analyzed, and exposed through these tools.

The KQL database, or Kusto database, contains tables, stored functions, materialized views, shortcuts, and datastreams. Tables have a well-defined schema (columns and their respective data types) and data is viewed as rows. Functions are named, reusable pieces of queries, that can accept parameters. Materialized views store precomputed query results. A datastream is a representation of KQL eventstreams connected to the KQL database.
KQL querysets are used to run queries and manipulate their results. Queries can be saved, exported, and shared.
The Eventstream is used to integrate streaming data from various sources, including Azure Event Hubs.

KQL in Microsoft Fabric can make use of notebooks as a way to save and share queries. Notebooks can exploit telemetry data and be saved along with the rest of a solution in a Git repository.
KQL queries are read-only, although they can produce an output which can then be ingested into a table.
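
For illustration, a KQL query can also be run from Python using the azure-kusto-data SDK; the query URI, database name, table, and columns below are hypothetical placeholders.

    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    # Hypothetical query URI of the KQL database; authenticates via the Azure CLI.
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://trd-example.kusto.fabric.microsoft.com")
    client = KustoClient(kcsb)

    # Read-only KQL query over a hypothetical telemetry table.
    query = """
    DeviceTelemetry
    | where Timestamp > ago(1h)
    | summarize avg_temp = avg(Temperature) by bin(Timestamp, 5m)
    | order by Timestamp asc
    """

    response = client.execute("TelemetryDb", query)
    for row in response.primary_results[0]:
        print(row)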

Data Science

Experiments are written and run in notebooks, and the resulting model can be saved. Notebooks run on Spark.

Experiments can be created from the Data Science module in Microsoft Fabric. The MLflow library is already installed and configured in new notebooks.

Experiments saved as models can be versioned, and an experiment can either create a new model or update an existing one.
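
A minimal sketch of an experiment run with MLflow in a Fabric notebook; the experiment and model names are hypothetical, and the toy data is only there to make the snippet runnable.

    import mlflow
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # MLflow is preconfigured in Fabric notebooks, so no tracking URI is set here.
    mlflow.set_experiment("sales-forecast-experiment")

    # Toy data standing in for real features and labels.
    X = np.arange(10).reshape(-1, 1)
    y = 2 * X.ravel() + 1

    with mlflow.start_run():
        model = LinearRegression().fit(X, y)
        mlflow.log_param("model_type", "LinearRegression")
        mlflow.log_metric("r2", model.score(X, y))
        # Registering under a name either creates the model or adds a new version to it.
        mlflow.sklearn.log_model(model, "model", registered_model_name="sales-forecast-model")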

Governance

Components of Microsoft Fabric are managed from one place, the admin center, including security features. Data can be secured and exposed on a need-to-know basis, and can be made discoverable by the various services that are part of the platform.

Microsoft Fabric administrators are responsible for, and have tooling enabling them to:

  • Manage security and access control to data and services. This includes not only access to items and data, but also whether or not users can share these resources further and how they can distribute content inside and outside of the organization.
  • Design and enforce data governance policies.
  • Customize and/or configure the Microsoft Fabric platform for their tenant. Amongst others, the look and feel of reports and dashboards can be adjusted; data classification policies can be defined; private links can be put in place.
  • Monitor and optimize the usage of platform resources, although a very large part of these tasks is managed by the platform itself, and many of the required tools are implemented by default.

The Microsoft Fabric Admin Center provides management functionalities for all of these aspects through its Web portal.

As with most Azure offerings, its APIs are publicly available and SDKs are provided, including PowerShell cmdlets.

Built-in governance features include:

  • Content endorsement. Promoting and certifying items as trusted sources makes them more visible to users across the organization.
  • Metadata scanning helps catalog and report on the organization’s items. Once set up, the scanner API can be used to check data warehouses, data pipelines, datasets, reports, and dashboards for sensitive data exposure.
  • Data lineage through Microsoft Fabric flows and transformations.

Lifecycle Management

Microsoft Fabric integrates with tools that enable communication and collaboration between members of development teams.
As a development workspace shared by all team members will be constantly overwritten with the latest changes from source control, the recommended approach is to have a different development workspace for each team member. Alternatively, development can be done in Power BI Desktop and/or Visual Studio Code, and committed to the development branch.

Git Integration

Microsoft Fabric workspaces and their contents integrate with Git for version control (currently only with Git repositories in Azure DevOps, which must be available to the user developing in the workspace). At the moment, only Interactive Reports (i.e., not Paginated Reports) and Datasets (except Push Datasets, Live Connections, and Model v1 Datasets) are supported, but plans exist to include more items in the future. Currently unsupported items will simply be ignored by the different processes. Custom visuals are not supported at the moment either.

A Microsoft Fabric development workspace can be connected to a Git repository by a workspace administrator. Checking out a branch will apply the supported items to the workspace so that they match the contents of the branch. The process then is the same as with any other software project source control tool. Changes are applied to the development environment; once tested they are committed to the development branch, then merged into the main branch (either directly or through a pull request, depending on the policies in place for the specific project). The resulting state of the main branch is then deployed to the integration environment.

Once source control integration is set up, workspace items display their current status (synced, conflicting, unsupported, uncommitted, update required, problematic). The bottom of the workspace interface also includes additional information, such as the branch’s name, the date and time of the last synchronization, and the short identifier of the current commit.

Synchronization can then be performed in both directions.

Git integration in Microsoft Fabric has to be enabled by a Microsoft Fabric tenant administrator.

A workspace can only be connected to one repository and one branch at a time.

Any operation resulting in a modification of the connection between a workspace and a Git repository has to be performed by a user with the Workspace Administrator role in the workspace and the Read=Allow Git permission in the repository. These operations include connecting to and disconnecting from a repository, changing the branch, creating a new branch from within the workspace (which also requires the Write role and the Create branch permission in Git), and synchronizing the workspace with the repo. The details of an existing connection and the status of the items in the workspace can be viewed by any of the Admin, Member, or Contributor workspace roles.

Committing workspace changes to Git and updating the workspace from changes in Git require the Contributor workspace role (with write permission on all items), being the owner of the items (if the tenant is configured to block updates for non-owners), and the Build role on external dependencies (if any). From the workspace to Git, on the Git side, the Read=Allow and Contribute=Allow permissions are required, and the development branch’s policies must allow direct commit; from Git to the workspace, only the Read=Allow permission is required.

Committing changes to the repository is blocked as long as there are updates from the repository that have not been applied to the workspace.

There are a few limitations applying to various aspects of the system:

  • The maximum length of a branch name is 244 characters.
  • The maximum length of a file’s path is 250 characters.
  • The maximum size of a file is 25 MB.
  • .pbix files can no longer be downloaded from the workspace after they are deployed with Git integration.
  • The maximum size for a commit is 25 MB.
  • Exporting items with sensitivity labels is currently not supported; these items might be blocked.
  • Duplicated names are not allowed, even if the Power BI service allows them.
  • B2B is not supported.

Each item is saved in a subfolder as a collection of files. There are also two files that are always present, irrespective of the type of item:

  • item.metadata.json: this file describes the type of the item, its name, and its description if it has been provided. The name property can be changed to rename the item in the workspace; renaming the subfolder will not rename the item.
  • item.config.json: this contains a version number for the item, as well as its logical identifier, which is used to uniquely identify an item in a workspace and in a branch. It is the key used by the synchronization process to match items in both systems.

A dataset will have two files describing it:

  • definition.pbidataset
  • model.bim

A report will have these two files:

  • definition.pbir
  • report.json

Conflicts can be resolved using one of two approaches:

  • Revert to a previous state. Using the Undo command will revert the workspace to the state of the previous synchronization. Using the git revert command in the repository will revert the branch to the previous commit. It is also possible to simply disconnect the workspace from source control and reconnect it, then apply an initial synchronization in one direction, which will overwrite the other side.
  • Resolve in Git. By creating a new branch, it is possible to avoid losing any changes that cannot be committed due to the conflicting pending update. By then merging the new branch into the standard development branch using Git tooling, it is then possible to handle the conflict in the source files the usual way. The workspace should then be reconnected to the original branch, and synchronized from Git to the workspace.

Updates can fail to apply, for example under the following circumstances:

  • Dependent items were changed or deleted
  • Circular dependencies were created
  • Items were renamed
  • Item data is invalid

An error message will be displayed, and any user accessing the workspace source control pane will also see a banner indicating that there is a problem.

Correcting the situation might require reverting some or all of the items, or correcting things outside of the workspace, directly in the branch.

Deployment Pipelines

Deployment pipelines can be used to automate the deployment of workspace items to different environments.

Pipelines can connect to three different workspaces, each one representing a stage in the development lifecycle.

  • Development: A workspace shared by the team, in which development activities take place.
  • Test: A workspace where updated items can be made available to testers and validation processes.
  • Production: A workspace actively used by end users.

Deployments can be performed in either direction, forwards and backwards, although the latter requires the destination workspace to be empty.

Only one workspace can be assigned to a stage in a pipeline. A workspace cannot be assigned to multiple stages.

Pipelines currently recognize the following types of items:

  • Datasets (except datasets not created from a pbix file and Push datasets)
  • Reports (unless based on an unsupported dataset)
  • Dataflows (except streaming dataflows)
  • Datamarts
  • Dashboards
  • Paginated reports

Template app workspaces are not supported, and neither are workbooks and metrics.

App contents are not deployed. Apps have to be updated separately.

When a pipeline is asked to deploy to a stage that is not linked to a workspace, a new workspace is created and assigned. All the selected items from the source workspace are copied to the destination workspace (only if the user has the required permissions). The user becomes owner of the datasets and workspace. If the user has a Premium Per User license, the new workspace is associated with that license, and only users with the same type of license can access the items.

Deploying to an existing workspace will completely replace the selected items, matched by name. Item IDs, URLs, and permissions are not changed. Links between items in a workspace are automatically maintained when deploying to another workspace. Both sides of the dependency must be present in the destination workspace or the deployment will fail. The Select related option prevents this situation from happening by automatically identifying dependencies and making sure that dependent objects are taken along. Links between items in different pipelines can also be maintained as long as they are in the same stage and available.

Parameters can be configured for each stage, usually to indicate connectivity details for the items. These settings will take precedence over the automated linking performed by the pipeline in case of dependencies between items. Parameter rules are used to provide the values. Deployment rules can be set for dataflows, datasets, datamarts, and paginated reports (only data source rules). They can be configured as data source rules or parameter rules. When a data source is parameterized, the parameter is not available for rules if the data source itself already is. Supported data sources at the time of writing are Azure Analysis Services, Azure Synapse, SQL Server Analysis Services, SQL Server, OData, Oracle, SAP HANA (import mode), SharePoint, and Teradata. Rules cannot be created for the development stage. Deleting an item also removes its associated rules. If the item is added back, the rules have to be recreated. The same applies when workspaces are reassigned.

Deploying an updated version of an item does not automatically refresh its contents. This still has to be triggered explicitly.

Before deploying, it is possible to add a note giving context to the deployment. These notes, as well as the details of the deployment, can be seen when looking at the history of deployments in the pipeline.

Pipelines can be controlled through an API, to automate the process further. An extension is available for Azure DevOps pipelines to control Microsoft Fabric pipelines. The PowerShell cmdlets are available in the MicrosoftPowerBIMgmt module.
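
As a hedged sketch of such automation from Python, the Power BI REST API exposes a deploy-all operation on pipelines; the pipeline ID and the way the access token is obtained are hypothetical placeholders here, and the PowerShell cmdlets or the Azure DevOps extension can be used instead.

    import requests

    pipeline_id = "00000000-0000-0000-0000-000000000000"   # hypothetical pipeline ID
    token = "<Azure AD access token for the Power BI API>"  # obtained separately

    # Deploy everything from the Development stage (order 0) to the next stage.
    response = requests.post(
        f"https://api.powerbi.com/v1.0/myorg/pipelines/{pipeline_id}/deployAll",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "sourceStageOrder": 0,
            "options": {"allowCreateArtifact": True, "allowOverwriteArtifact": True},
        },
    )
    response.raise_for_status()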

Pipeline management (sharing, editing, deleting) requires the Admin permission. Deploying to a stage requires being pipeline admin and being either an admin or a contributor in both the source and destination workspaces. Sharing a pipeline with a user makes them pipeline admin.

Best practices

  • Content preparation
    Use separate development workspaces for different development teams.
    Plan the permission model beforehand.
    Connect different stages to different databases.
    Use parameters for configurations that will change between stages.
  • Development
    Back up work into Git from a separate development environment (either a dedicated workspace or working locally).
    Use an individual branch until work is ready for merging into the main development branch.
  • Test
    Simulate the production environment (data volume, usage volume, capacity).
    Use deployment rules to point to a dedicated data source, not to the development data source.
    Deploy related items.
    Update data items, such as datasets, after deploying them.
  • Production
    Limit who can deploy to production.
    Use deployment rules.
    Update the production app.
    Use release branches when deploying to production from Git.
    Deployment to production should not require any manual modification or configuration in the source code or deployment scripts.

How to…

Enable Microsoft Fabric

Microsoft Fabric must first be enabled at the organization level. It can be activated for the whole tenant or at a capacity level. This can be done by someone with one of these roles:

  • Power BI tenant admin
  • Power Platform admin
  • Microsoft 365 admin

Microsoft Fabric can be made available to the whole organization or to specific users or groups (both Microsoft 365 and Azure AD security groups), with either a specific list of groups or an exclusion list. It is also possible to delegate the ability to manage this setting to other users (at the capacity level).

Once enabled, Microsoft Fabric is available to existing Power BI users. Others need to first get a Power BI license, even if it’s only a trial one.

Access Microsoft Fabric

Once enabled, Microsoft Fabric’s tools can be accessed through the Power BI service user interface, using the Power BI icon at the bottom of the navigation pane.

Microsoft Fabric items are accessed through Fabric workspaces, which require a Premium capacity (either a full Premium instance or a Premium per user license).

Once a Fabric-enabled workspace is created, it can contain all the resources that can be created in Microsoft Fabric. Creating resources is accomplished using the “Create” option in the Power BI navigation menu from the workspace.

From the Power BI service user interface, it is possible to switch to the other experiences using the Power BI icon at the bottom of the navigation menu.

Assign Microsoft Fabric licenses to users

Licenses are managed through the Microsoft 365 Admin Center portal.

Connect a Microsoft Fabric workspace to an Azure Repo

In the workspace, open the Workspace settings interface (either through the Workspace settings button in the menu or through the three dots in the menu).

In the interface’s navigation menu, select the Git integration option to access configuration options for connecting the workspace to an Azure repo.

Selecting the Organization causes the Project drop-down to populate; picking a Project will make the Git repository drop-down available, which will in turn make Branch selection possible. It is also possible to create a new branch from this interface. Finally, a folder can be selected or created, which will host all the item definitions. If the workspace is empty, connecting to a branch will copy the selected folder’s contents to the workspace. If the workspace is not empty, the user is prompted to pick the direction of this first synchronization; the contents of the destination will be overridden by the contents of the source.

Commit workspace changes to source control

From the workspace, select the Source control button, which will be showing the number of uncommitted changes.

The Source control pane will display the changes ready for committing and provides a zone to input a commit message. Each item has a status, indicating whether it is new, modified, conflicting, or deleted. By default all items are selected, but it is possible to commit only some of them.

Clicking on the Commit button will update the display by removing committed items, and align the workspace’s contents with the new commit.

Update a workspace with changes from the Git repository

When changes are committed directly to the branch to which the workspace is connected, a notification appears. Deploying the changes is done through the Source control button. The Source control pane has an Updates tab, showing how many changes are available for deployment.

Updating the items removes them from the list of available updates and updates the workspace items status.

References

Get started with Microsoft Fabric, Microsoft Learn

Lifecycle Management Documentation, Microsoft Learn
