Itlize

Data Science and Microsoft Fabric

Billy Yann
Data Scientist
Deep learning and machine learning specialist, well-versed in cloud infrastructure, blockchain technologies, and big data solutions.
November 06, 2023

Introduction: Microsoft Fabric

Recently, at Microsoft Build 2023, the Microsoft data and analytics platform team announced a new service named Microsoft Fabric. The message of Microsoft Fabric is simplicity. Current data analytics products and services are complex enough that running a successful project means investing a huge amount of time just in learning them. Each product and service also comes with its own licensing plan, which makes things even more complicated. Simplicity is the need of the hour, and Microsoft Fabric aims to offer it for data analytics projects from now on.

Microsoft Fabric was launched to simplify these products and services. The Microsoft team spent two years on the new offering and on ways to simplify things, so that data analytics teams can worry less about the technology and focus on results. Microsoft Fabric is an umbrella for data platform services. It brings together three main Microsoft data analytics products: Power BI (itself an umbrella of Power View, Power Query, and Power Pivot), Azure Data Factory, and Azure Synapse.

Microsoft Fabric brings many data analytics products and services together: storage from OneLake, data integration from Azure Data Factory, data engineering, data warehousing, data science, and real-time analytics from Azure Synapse, business intelligence from Power BI, an action platform from Data Activator, and governance from Purview. With all of these collected under the Microsoft Fabric umbrella, developers get one place to create and edit them as required, one structure that ties them together in workspaces, and one setup for security and configuration. Licensing is far simpler than before, because one set of licensing plans covers all the products and services listed above.

Data engineers, data scientists, and data analysts all work in a single, shared environment, so the focus of a data analytics project stays on results, where it should be. Fabric is also built from products with proven track records of successful implementation.

Data Science and Microsoft Fabric

Microsoft's cloud-hosted data lake and lakehouse platform has recently gained new data science tools that open up Power BI datasets to Python, R, and Spark SQL. The modern enterprise is powered by data: it brings together information from across the organization and uses business analysis tools to answer relevant questions. These tools give access to real-time data alongside historical data, and they use both to predict future trends from the current state of the business.

For this to work, an enterprise needs a common data layer that brings many different sources together and provides a single place to query the data. This common data layer, or data fabric, gives the organization a baseline of truth that informs both short-term and long-term decision making. It powers instantaneous dashboard views and machine learning models, which help identify trends and issues.

Designing from the Data Lake

Unsurprisingly, Microsoft has responded to this need by bringing many of its data analysis tools together under the Microsoft Fabric brand. Fabric works with a mix of relational and non-relational data, stored in cloud-hosted data lakes that are managed as lakehouses.

Microsoft Fabric is built on the open-source Delta table format and the Apache Spark engine. It takes big data concepts and makes them accessible to common programming languages and specialized analytics tooling. Examples include the visual data exploration and complex query engines provided by Power BI.

The initial preview releases of Microsoft Fabric focused on building the data lakes and lakehouses that are essential for developing at-scale, data-driven applications.

Even with Microsoft Fabric, a lot of heavy lifting is needed to shape the data estate to suit the scale of a project. The data therefore has to be engineered before more complex applications can be designed and built on it.

Integration of Data Science with Data Engineering

As Microsoft Fabric remains in preview, Microsoft continues to add new features and tools. The latest updates target software developers, adding integration with familiar developer tools and services along with other features that go beyond a basic set of REST APIs. With Microsoft Fabric, data scientists get a simplified and powerful tool that integrates Power BI datasets into Microsoft Azure's existing data science platform.

For instance, Power Query in Power BI is one of the most important tools in Microsoft's data analysis platform. Power Query is best thought of as an extension of the pivot table tools in Microsoft Excel. It is a way of slicing and dicing large amounts of data across multiple sources, so that relevant data can be extracted quickly and easily. Key to Power BI's analytical capabilities is Data Analysis Expressions (DAX), a formula and query language for analyzing data that provides the tools to filter and refine data as required. (Power Query's own transformation steps are written in a separate language, M.)

Use of Semantic Link within the Microsoft Fabric workspaces

The new semantic link feature of Microsoft Fabric provides a bridge between the data-centric world of Power BI and the data science tools available in languages like Python. It does this through familiar pandas and Apache Spark APIs. The new libraries can be added to Python code, so developers can use semantic link from inside notebooks to build machine learning models with frameworks like PyTorch. Developers can also use Power BI data with any of Python's numerical analysis tools, which allows complex analysis to be applied to datasets.

Semantic link also makes collaboration between teams more effective. The Power BI team uses tools like DAX to build report datasets, and these are linked to the models and notebooks used by data science teams, so both teams keep working with the same data and models. As mentioned above, the semantic link Python API follows familiar pandas conventions. Its methods can discover and list the datasets and tables created by Power BI, evaluate the measures associated with them, and run DAX from Python code.
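As a minimal sketch of the discovery step described above, the helper below counts the tables and measures in one dataset. It assumes the `list_tables` and `list_measures` functions of `sempy.fabric`. The helper name `profile_dataset` and the dataset name in the usage note are hypothetical, and the module is passed in as an argument so the helper can be tried outside a Fabric notebook.

```python
from typing import Any, Dict


def profile_dataset(fabric: Any, dataset: str) -> Dict[str, Any]:
    """Collect basic facts about one Power BI dataset.

    `fabric` is expected to be the `sempy.fabric` module, or any object that
    exposes `list_tables(dataset)` and `list_measures(dataset)` (assumption).
    """
    tables = fabric.list_tables(dataset)      # one row per table
    measures = fabric.list_measures(dataset)  # one row per measure
    return {
        "dataset": dataset,
        "table_count": len(tables),
        "measure_count": len(measures),
    }
```

Inside a Fabric notebook, this would be called as `profile_dataset(fabric, "Sales Sample")` after `import sempy.fabric as fabric`, where "Sales Sample" stands in for a real dataset name.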

The semantic link library can be installed with standard Python tools, as it is available from the Python Package Index through pip. Once the library is loaded into the Python workspace, all that is left is to import sempy.fabric to access Fabric-hosted data, which can then be extracted and used in Python code. The semantic link package is a meta package: it contains several different packages, which can also be installed individually as preferred. Among them is a set of functions that lets developers treat Fabric data as geodata, which helps to quickly add geographic information to Fabric data frames and use the geographic tools of Power BI in reports.
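The install-and-import steps might look like the sketch below. It assumes the package is published under the name `semantic-link` and that `sempy.fabric` offers a `read_table` function. The helper names `sempy_available` and `load_table` are hypothetical, and the import is deferred so the file still loads where sempy is not installed.

```python
# In a Fabric notebook cell, the package would typically be installed first:
#   %pip install semantic-link
import importlib.util


def sempy_available() -> bool:
    """Return True when the sempy package can be found in this environment."""
    return importlib.util.find_spec("sempy") is not None


def load_table(dataset: str, table: str):
    """Read one table of a Power BI dataset through semantic link."""
    if not sempy_available():
        raise RuntimeError(
            "sempy is not installed; run `pip install semantic-link` first"
        )
    # Deferred import: only attempted once we know the package is present.
    import sempy.fabric as fabric

    # Assumption: read_table(dataset, table) returns a pandas-style data frame.
    return fabric.read_table(dataset, table)
```

The explicit availability check is a design choice for this example: it turns a missing package into a clear message instead of an import error deep inside a notebook.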

Interactive notebooks are another useful feature for developers working with semantic link, because they can execute DAX code directly through IPython interactive syntax. As with Python code, developers first install the library in their environment and then load sempy as an external module. A DAX command can then be run and its output viewed in place. This approach works well for experimenting with Fabric-hosted data, and it lets data analysts and data scientists work together in the same notebook. DAX queries can also be run directly from Python with the evaluate_dax function of sempy. In addition, the semantic link package helps data scientists validate information, and it can visualize the dependencies between data entities, which helps to refine query results and understand the structure of datasets.
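Running DAX from Python could be sketched as follows. The example assumes that `evaluate_dax` in `sempy.fabric` takes a dataset name followed by a DAX string. The helper names and the table, column, and measure names are hypothetical. In a notebook, the interactive route would, as far as the semantic link documentation describes it, use `%load_ext sempy` followed by a `%%dax` cell.

```python
from typing import Any, Optional


def build_dax_query(table: str, group_by: str, measure: str) -> str:
    """Build a DAX query that evaluates one measure per value of one column."""
    column_ref = f"'{table}'[{group_by}]"
    measure_ref = f"[{measure}]"
    return f'EVALUATE SUMMARIZECOLUMNS({column_ref}, "{measure}", {measure_ref})'


def run_dax(dataset: str, query: str, fabric: Optional[Any] = None):
    """Send a DAX query to a Power BI dataset and return the result.

    `fabric` defaults to the real `sempy.fabric` module; passing another
    object with an `evaluate_dax` method allows a dry run outside Fabric.
    """
    if fabric is None:
        import sempy.fabric as fabric  # only available where sempy is installed
    return fabric.evaluate_dax(dataset, query)
```

For example, `run_dax("Sales Sample", build_dax_query("Customer", "Country", "Total Revenue"))` would ask a dataset called "Sales Sample" for the "Total Revenue" measure grouped by country.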

Foundation for Data Science at Scale

With Microsoft Fabric, developers are not limited to Python notebooks. Big data tooling can work with Power BI and Spark data in a single query, because Fabric treats Power BI datasets as Spark tables. This means that PySpark can query across both Power BI data and Spark tables hosted in Fabric. Developers who prefer them can also use Spark's R and SQL tools.
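A hedged sketch of the PySpark route is shown below. It assumes Power BI datasets are exposed through a Spark catalog named `pbi`, registered with the class name shown in the constant, which is how the semantic link documentation describes the Spark connector at the time of writing. The helper names and the dataset and table names are hypothetical, and `spark` is the session that a Fabric notebook provides.

```python
# Assumption: catalog class name taken from the semantic link Spark documentation.
PBI_CATALOG_CLASS = "com.microsoft.azure.synapse.ml.powerbi.PowerBICatalog"


def pbi_table_query(dataset: str, table: str, limit: int = 10) -> str:
    """Build a Spark SQL statement that reads rows from a Power BI table."""
    # Backticks allow dataset and table names that contain spaces.
    return f"SELECT * FROM pbi.`{dataset}`.`{table}` LIMIT {limit}"


def query_power_bi(spark, dataset: str, table: str, limit: int = 10):
    """Register the Power BI catalog on the session, then run the query."""
    spark.conf.set("spark.sql.catalog.pbi", PBI_CATALOG_CLASS)
    return spark.sql(pbi_table_query(dataset, table, limit))
```

Because the result is an ordinary Spark DataFrame, it could then be joined with lakehouse tables in the same session.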

Conclusion

A lot is happening with Microsoft Fabric right now, with new features being added to the service previews on a monthly cadence. It is already clear that the semantic link library is only the beginning of bridging the divide between data analysis and data science, which should make it easier for users to build data-driven applications and services.