R is better than M for difficult data wrangling tasks in Power BI

Yep, I said it! In most cases I believe that it is true. R has been around for over 20 years so it is much more mature than M. It has close to 10K packages many of which are meant just for data wrangling. If you don’t agree with me now I think by the end of this post it’s a good chance that you will think otherwise. This blog is written for those with little exposure to R so I will keep it simple. I am going to start with a very basic file import example and I will end with a somewhat simple but slick multiple file import example.

We are going to use data from the 2008–2009 NBA season. This data set is broken out into individual csv files at the game level. There are 1,176 files in the data set. The actual data set can be found here. The files have a consistent naming convention. The first part of the file name is a 8 digit string that represents the game date in a “YYYYMMDD” format. The second part of the file name is the 3 character abbreviation of the away team followed by a 3 character abbreviation of the home team. The last part is the file type. So, “20081028.CLEBOS.csv” is the game file for the NBA game that was played on 10/28/2008 between Cleveland and Boston at Boston.

There are few things we must do before we get started with our examples. We need to decide on which R IDE we want to use. The R script editor in Power BI is not good for development so it is best practices to use a R IDE such as R Studio or R Tools for Visual Studio (RTVS). Then we must make sure that we have all of the game files in the same folder and designate that folder as our “working directory” in R. We set our “working directory” using the setwd() function. After that we will develop our R script and when we are satisfied with it we will copy and paste it into the R script editor in Power BI. Your R script must return at least 1 R data frame. When you execute the script in Power BI every data frame that your R script returns will be exposed to Power BI as a table. Let’s get started with our first example!

We will start with a simple example of loading a single game file into Power BI. To do so you only need the following 2 lines of code:

Image for post
Image for post

This short script can be used to load a game file into Power BI. In the above example the “read_csv” function from the “readr” package is used to load the file into a variable named “df”. The “read_csv” function returns a data frame so the “df” variable will be exposed to Power BI as a table. The only parameter that is required in the “read_csv” function is the file name as a string. Because the file is in our working directory we don’t need the rest of the file path. There are other parameters in the read_csv function that we could have used that would have allowed us to do things like define the column types, tell R to skip rows, tell R not to use the first row as a header, tell R how many rows to retrieve, and many others.

I know what you are saying, “Ryan, I can easily do that in Power BI through the Power Query interface.” You are right. But if you are a person like me that would rather right the code then you must agree that R is much more succinct than M.

Let’s do another example that is a little more complex. We will use R to combine multiple game files into one data set. The following code combines all 1,176 game files into one data set:

Image for post
Image for post

The script above puts all of the names of the files in the working directory into a character vector named “files”. It then iterates over each name in the “files” character vector and append all of those files into one data set.

I know now you are like, “Come on man! I can easily easily do that via the Power Query interface and I don’t have to write that funny looking R code!” Again look at how succinct and readable the code is. Also, if I were to do the same in M the code will not be as intuitive as the code above.

Now let’s do something that I think is pretty slick. Let’s create a data set that combines the home games of the Pacers (IND) and the home games of the Hawks (ATL). Given the naming convention used by the files we will have to identify the files in our working directory that starts with an eight numeric digits > then a period > then a 3 character team abbreviation for the away team > then either “ATL” or “IND” > then finally “.csv”. We can create a regular expression to find the files that matches that pattern. I did so in the code below:

Image for post
Image for post

In the 4th line a regular expression is used to reduce the character vector of files down to only games that meets the criteria that we want, games where the home team is either IND or ATL. Now our “files” character vector only contains games where IND or ATL is the home team. The script above is very succinct and more intuitive than a “M” script that would accomplish the same. The regular expression in the str_subset function enables R to accomplish in one line of code something that would have taken multiple lines of code in M. Regular expressions enables you to do string pattern matching in a very succinct way. You can leverage regular expressions in R but you can’t in M.

You can do much more with R for Power BI then just import single or multiple files. You can also use R to score data using using models built in Azure ML, bring data into Power BI from a database based on sql statements or a result set from a stored procedure using the RODBC library, use R to perform advance data transformation using libraries like dplyr or tidyr, as well as many other scenarios. Stay tuned for part 2 when I show you how you can use R in a Power Query script to do some advance data transformations.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store