Data Analytics & Visualizations — Replicating the Tableau Example In Excel — Part 1
Extracting The Dataset Into The Excel Data Model
In my previous post, I discussed where the article authors post regarding Excel was incorrect as to Excel’s limitations.
So, I’m going to do a series of posts that will take us through completing the same (or similar) steps — using Excel — to demonstrate the current capabilities of Excel to perform analytics projects many do not consider possible.
In this first step, we will source the data into Excel and start some basic data clean up. Let’s go.
If you would like to download the Excel file created from the steps below, you can access it here in a ZIP format. Since it is almost 1.6 million records, with numerous fields, it’s a large file!
Understanding The Data Source
The data source that the authors provided the following:
“The data set used comes from the National Center for Charitable Statistics(NCCS) website (http://nccs.urban.org/), which maintains the IRS Business Master Files (BMF) using data derived from the IRS Forms 1023 and 1024 of the over 1.5 million U.S. not-for-profit organizations that file IRS Form 990. The authors chose this data set to demonstrate Tableau’s ability to handle big data and the types of data relevant to accountants. It is worth noting that Excel could only load 1,048,576 of 1,584,587 records, leaving 536,011 records unloaded. The data fields used are shown in Exhibit 1; the data itself can be downloaded from the NCCS (https://urbn.is/2IEu5EA).”
When the data is queried, it contains many fields of information. Step 1 is to understand the data structure and NCCS provides a Data Dictionary to enable that — take a look at the most recent version here.
Querying and Importing The Dataset
The data source is provided above.
Using this link, I opened Excel (I use Office 365, Version 2016) and copied the link into Get & Transform (under “Data” tab).
- Paste URL from website source into the dialog box. Click OK and the query begins.
- A preliminary view of the data appears. Notice the URL source and then click on “Transform Data” at the bottom.
- Power Query opens in Edit mode. At this point, I only want to do 2 things — change the name of the query AND notice the “Applied Steps” that have started to be created (at this point, Power Query has done these by itself).
- Then, we load the data into via the Data Model into Excel. Since there are so many records (1,584,587), the data cannot be loaded directly into Excel. However, contrary to the reference article above, Excel can easily handle more records than rows in a spreadsheet.
By the way, the Data Model handles the data much more effectively and efficiently versus dropping it into an Excel table.
If you aren’t sure about all the Power Query load options, this 5 minute clip from #ExcelIsFun channel on YouTube is an excellent overview:
Next, open Queries & Connections and you can see that 1,584,587 records have been loaded into the Data Model.
Let’s open the Data Model and see what this looks like —
Next Steps
The next phase of this process is to shape or “transform” the dataset into what we need for our analysis.
So, the “Extract, Transform, Load, Report” sequence of analytics moves to the Transform stage.
At this point, we have 49 fields:
We will only retain 13 fields, since the others are not relevant to the analysis we are performing.
I won’t go into detail of this process, but it is necessary to review the Data field definitions to understand what each field represents. That information can be found here.
This will be explored in the next post — stay tuned!
About Don
“It’s time for DIFFERENT”
Don is passionate about helping professionals and organizations keep up, develop and adapt to the changing business world that we operate in.