Power Query, R is here to help you!
Microsoft Power BI has an embedded ETL tool called Power Query that is great for basic data transformation. You can handle many data transformation tasks with just a few clicks. But when your transformation tasks become complex, you often have to resort to writing custom M code. M is a relatively young programming language and is not that intuitive, and because the community of M developers is small, finding help is harder than it is for more established programming languages.
Fortunately, Microsoft added the R programming language to Power BI, which lets you work around many of the limitations above. There are three ways you can leverage R in Power BI:
1. Use R to bring data into Power BI
2. Use R in your Power Query scripts
3. Use R to create custom visualizations for your reports and dashboards
I already wrote about the first way you can leverage R in Power BI. Here is a link to that blog. In this post I will illustrate the second way. You can use R within Power Query scripts for anything from simple to advanced data transformations. You can do something simple, like applying a function that exists in R but not in M to a column in your data. You can also do something advanced, like using R to impute values for missing data with algorithms from packages such as Hmisc or mice. Here are the basic steps you need to follow in order to use R in Power Query:
1. Open the query in Power BI in edit mode.
2. Highlight the last step of your Power Query script.
3. Go to the “Transform” tab of the Power Query ribbon.
4. Click the R icon on the far right.
Clicking the R icon opens the R editor and passes the current Power Query data set to R in the form of a data frame. A data frame in R is like an Excel table but with MANY more features. Once this data frame is in R, you can transform it and then pass the resulting data frame back to Power Query. Back in Power Query, the data frame is exposed as a table that you can add to your data model.
Because this is meant to be at the intro level, I will illustrate this feature with a simple example. I will use the dplyr package in R to select the columns I want to extract from a data set based on a simple pattern. Being able to select columns by pattern is very helpful when you have to pull columns from a very wide data set. I love T-SQL, but what I just described is one of the things R does better than SQL. You can download the *.pbix file to see how I did it here. Here is the code that selects Country and every column with “Trade Balance” in its field name. The data came from Census.gov:
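The original post showed the script as a screenshot, so here is a minimal sketch of the idea. In the Power Query R editor, the incoming table is exposed as a data frame named `dataset`; the sample column names below are stand-ins mimicking the Census.gov trade data, not the real file:

```r
library(dplyr)

# In Power BI, Power Query passes the current table in as a data frame
# named `dataset`. A small stand-in is defined here so the sketch runs
# on its own; the column names are illustrative only.
dataset <- data.frame(
  Country              = c("Canada", "Mexico"),
  `Trade Balance 2015` = c(-15.5, -58.4),
  `Trade Balance 2016` = c(-11.2, -63.2),
  Notes                = c("", ""),
  check.names = FALSE
)

# Keep Country plus every column whose name contains "Trade Balance".
output <- dataset %>%
  select(Country, matches("Trade Balance"))

# Any data frame left in the R environment (here, `output`) is offered
# back to Power Query as a table you can load into the data model.
```

One line of `select()` with `matches()` replaces what would otherwise be a hand-written list of column names.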
Lie to me and say that the above did not impress you! LOL.
I want to leave you with two more things. If you look at the trade balance data set you will notice that it is not in a good format for data analysis. Here is a link to the file if you want to take a closer look. When you are doing data analysis you want your data to be in a “tidy” format. A “tidy” format means that each column represents a variable and each row represents an observation. To make this data set “tidy” you need to reformat the data into the following format: Country, Year, Trade Balance, Exports, and Imports. Here is the M code needed to reformat the data into the described “tidy” format:
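The query itself ships in the .pbix file; the sketch below only outlines the shape of the M involved. It assumes a prior step named `Source` holding the wide table, with a `Country` column plus one “&lt;Measure&gt; &lt;Year&gt;” column per value (e.g. “Trade Balance 2015”):

```m
let
    // Turn every column except Country into Attribute/Value rows.
    Unpivoted = Table.UnpivotOtherColumns(Source, {"Country"}, "Attribute", "Value"),
    // Split "Trade Balance 2015" at the LAST space into Measure and Year.
    SplitName = Table.SplitColumn(
        Unpivoted, "Attribute",
        Splitter.SplitTextByEachDelimiter({" "}, QuoteStyle.None, true),
        {"Measure", "Year"}),
    // Spread the measures back out into Trade Balance, Exports, Imports columns.
    Tidy = Table.Pivot(SplitName, List.Distinct(SplitName[Measure]), "Measure", "Value")
in
    Tidy
```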
I will admit that I was able to do most of this through the interface but it did require some custom M code. You can look at what I did more closely by downloading the pbix file here.
Here is the R code I wrote to accomplish the same thing:
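The R version was also a screenshot in the original post; here is a self-contained sketch of the same reshape using tidyr’s `pivot_longer`/`pivot_wider`. Again, `dataset` is the data frame Power Query hands to R, and the stand-in columns are illustrative:

```r
library(dplyr)
library(tidyr)

# Stand-in for the wide trade table; in Power BI this arrives from
# Power Query as `dataset`, with one column per measure-year pair.
dataset <- data.frame(
  Country              = c("Canada", "Mexico"),
  `Trade Balance 2015` = c(-15.5, -58.4),
  `Exports 2015`       = c(280.0, 236.4),
  `Imports 2015`       = c(295.5, 294.7),
  check.names = FALSE
)

# Melt the measure-year columns into rows, splitting each column name
# into the measure ("Trade Balance") and the year ("2015"), then spread
# the measures back out so each one becomes its own column.
output <- dataset %>%
  pivot_longer(
    -Country,
    names_to      = c("Measure", "Year"),
    names_pattern = "(.*) (\\d{4})",
    values_to     = "Value"
  ) %>%
  pivot_wider(names_from = Measure, values_from = Value)
```

The result has exactly the tidy columns described above: Country, Year, Trade Balance, Exports, and Imports.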
It does not take a rocket scientist to see which version is simpler! The difficulty of writing M code grows quickly as the transformation gets more complex. Not so with R.