How to Excel with the Graph

Data Engineers have long struggled with extracting data from Excel, but now Microsoft’s Graph API makes things so much simpler.

Carl Follows
Ricoh Digital Services
11 min read · Mar 5, 2020


The problem is that it’s just too useful

I’ve worked with data for 20 years and am still amazed by how heavily companies rely on Excel spreadsheets. I’m not talking about the normal uses of Excel; I’m talking about the big horrible spreadsheets. Probably created by some consultant or part-time techie 10 years ago who loved a bit of VBA and has since moved on to greater things. But long after they’ve moved on, their work is still fundamental to the business and no-one really understands how; it’s just a big lump of risk that no-one can afford to fix. I shouldn’t really be surprised; Excel is a brilliant and very powerful tool which pretty much everyone is comfortable using. And what’s the alternative anyway? Corporate IT departments don’t typically allow business users to buy or create software, and I don’t blame them; I’ve seen greater monstrosities created in MS Access which fewer people could support.

So, we’re left with Excel spreadsheets flying around the business storing large amounts of critical data. In recent years SharePoint and Office 365 have improved the governance and reduced the amount of emailing of these documents. But we’re in an age where companies are trying to maximise their use of data; they want it consolidated, aggregated and analysed. The CEO wants up-to-the-minute stats (on his mobile) about how the business is performing, but how can he get that when the information is spread over countless files? This is the job of the Data Engineer.

It’s not a new problem; we’ve been extracting data from Excel for years, but it never “just worked”. Just as you thought you had a tried and trusted methodology, you found out the customer was running an ancient version of Office, never mind the challenges of 32-bit vs 64-bit.

The requested OLE DB provider Microsoft.ACE.OLEDB.12.0 is not registered. If the 64-bit driver is not installed, run the package in 32-bit mode. Error code: 0x00000000.
An OLE DB record is available. Source: “Microsoft OLE DB Service Components” Hresult: 0x80040154 Description: “Class not registered”.
The ‘Microsoft.ACE.OLEDB.12.0’ provider is not registered on the local machine. For more information, see http://go.microsoft.com/fwlink/?LinkId=219816.
SSIS Error Code DTS_E_CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER.

Even if the versions aligned there was still another problem: most techniques involved having some Office components (if not full Office, then at least the Microsoft Access Database Engine 2010 Redistributable) running on the computer performing the data extraction. Whilst the adoption of SharePoint (O365 or Teams) has meant that the spreadsheets are no longer sitting on file shares or individual computers, we have lost a natural host for these components. In the world of cloud-hosted SaaS (Software as a Service) there just isn’t room for a VM running Office just to allow us to get some data out of Excel, so what next?

The Generous Graph

Although Microsoft Graph has been around for a few years, I’ve been too busy getting on with tried and tested architectures to look up from the weeds and find out about this new technology. As is often the case it takes a different scale of problem, which can’t be solved by any existing techniques, to make us go and learn about how we can adopt something new. That’s exactly what happened with a recent customer and caused me to discover and love Microsoft Graph.

Microsoft Graph is the gateway to data and intelligence in Microsoft 365.

That’s how Microsoft describe it: all delivered through a single REST API endpoint, providing access to SharePoint, Teams, Outlook, Azure Active Directory and much more besides.

Whilst some of the data is read only, there is much that can be written to as well. I’m not going to try and describe all its capabilities here since there’s plenty of documentation already available (plus API reference). Rather I wanted to share an example of how it’s helped me solve a real customer problem and become one of my go-to technologies.

Excel on Steroids

The recent challenge which put Microsoft Graph firmly into my toolkit was with a customer who had built a whole application in Excel and then created several thousand copies of it. These spreadsheets were used by a large number of people in the business, each planning and managing their individual area. The problem was that the company wasn’t able to see what was happening across the organisation; they wanted to be able to compare each area’s performance back to a baseline. A significant amount of functionality was built into these spreadsheets, plus the users relied on so much that Excel provided out of the box, which meant the option of rewriting them as a modern application was just too costly for the benefit it would deliver.

The first part of our project brought these documents under control within a document library in SharePoint Online, and we modified the spreadsheet to collate all the data we were interested in into a single worksheet. But how to extract it?

Make the data Flow

My first thought for interacting with documents in SharePoint was to use Flow, since it has connectors for Excel Online. Alas, the Excel Online connectors in Flow were unable to interact with the spreadsheets because of “Unsupported Features”, i.e. they were full of macros and saved as binary files (.xlsb) to help performance. Whilst trying to figure this out I stumbled across one of John Liu’s blogs and then his walkthrough using the Graph API, and this changed everything.

What I learnt was the ability of the SharePoint API and the Graph API to:

  1. List the Drives (i.e. document libraries) within the SharePoint Site
  2. List the Folders within the document library
  3. List the documents within each folder to get their DriveItemID
    Using the SharePoint GetItems Flow activity with an ODATA filter query
    This DriveItemID is the unique identifier for a document on O365
  4. Interrogate the structure of the Excel document
    To check for the existence of a specified sheet
  5. Extract the contents of the sheet

With a few steps I could iterate through all the files in the document library, check for the existence of the worksheet I needed and extract the data to a CSV. The MS Graph API didn’t care that my documents were full of macros, or what format they were saved in. Nor did it base the structure of the table on the first 8 rows (another favourite feature of the Microsoft Jet OLE DB 4.0 Provider we once used to extract data from Excel). It just pulled the data straight out of the cells.

https://{companyname}.sharepoint.com/sites/{sitename}/_api/v2.0/drives?select=name
https://{companyname}.sharepoint.com/sites/{sitename}/_api/v2.0/drives/{driveid}/root/children?select=name
https://graph.microsoft.com/v1.0/sites/root/drives/{driveid}/items/{driveitemid}/workbook/worksheets?$select=name
https://graph.microsoft.com/v1.0/sites/root/drives/{driveid}/items/{driveitemid}/workbook/worksheets/{worksheetname}/UsedRange(valuesOnly=true)?$select=values

You may notice that my first couple of calls were made to the SharePoint API, whilst the later ones were to the MS Graph API. The key difference between these seems to be how you authenticate (Flow simplifies this) and (more importantly) that MS Graph requires the premium Flow connector “HTTP with Azure AD / Invoke an HTTP request”. Using a premium connector can significantly change what licence you’d need, and if you’re not sure about that then I’d recommend reading Matt’s blog.

Whilst I was very happy with this architecture, my customer wasn’t and so I found myself looking for alternatives.

The Factory

Data Factory is great at moving files between lots of cloud sources; however, there isn’t an out-of-the-box connector (linked service) for a SharePoint document library.

What there is, is a linked service for REST APIs, which brings us nicely back to my first experiments with the MS Graph API. Interacting with the Graph API follows the common pattern of many other REST APIs: we request a token using a set of credentials, then send that token with each request. The token is valid for a period of time but needs to be submitted every time, because by its very nature the Graph API is RESTful and holds no state.

The key difference in the development experience between Flow and Data Factory is that Flow simplified the authentication experience, whilst in Data Factory we have to get our hands dirty and manage this ourselves. There is a nice blog from Lars Bouwens which helped me get started, and the Microsoft documentation helped fill in the gaps.

To get this working we need to:

  1. Create an Application ID
    To call the REST API we need an application registered in Azure AD which has the permission “Sites.Read.All” granted
  2. Get a Token
    Before you can get any data from MS Graph you need to authenticate with the Microsoft identity platform and get an access token.
    This token contains information (claims) that Microsoft will use to validate the caller and their permissions.
  3. Call the API
    Including the token in the authorisation header
    Making the same calls as described above to navigate to the DriveItemID
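The steps above can be sketched in Python (the language used later in this post) to show the shape of the exchange. This is a minimal sketch against the Microsoft identity platform’s v2.0 client-credentials endpoint; the tenant name, client ID and secret are placeholders you’d substitute with your own app registration details (and in practice the secret belongs in a key vault, not in code):

```python
import requests

def build_token_request(tenant, client_id, client_secret):
    """Build the URL and form body for a client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token"
    body = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # .default requests whatever application permissions were granted,
        # e.g. Sites.Read.All from step 1.
        "scope": "https://graph.microsoft.com/.default",
    }
    return url, body

def get_graph_token(tenant, client_id, client_secret):
    """POST the request and return the bearer token (needs network access)."""
    url, body = build_token_request(tenant, client_id, client_secret)
    response = requests.post(url, data=body)
    response.raise_for_status()
    return response.json()["access_token"]

# The token then travels in the Authorization header of every Graph call:
# headers = {"Authorization": "Bearer " + get_graph_token(...)}
```

The v2.0 endpoint with the .default scope is one way of requesting the application permissions granted to the Service Principal; the adal flow shown later in this post achieves the same thing against the v1.0 endpoint.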

Service Principals

We’re authenticating to the MS Graph API through a Service Principal. If you’re unfamiliar with a Service Principal, think “service account” or “daemon”. The idea being that it is just an account that an application uses to access resources without being tied to or logged on as an interactive user.

Plainly there is a security aspect here which needs to be considered. The MS Graph API is incredibly powerful and can drill into information across your data estate on OneDrive and SharePoint Online, amongst other things. So there needs to be control. This is achieved, again, through the Service Principal, by only allowing it access to sections of the MS Graph scope.

In this example, we need to grant the Service Principal that calls the MS Graph API access to the SharePoint Online folder where our Excel files will be stored. This needs to be done by someone with Azure AD administrator permissions. If you only need to read information, then the permission the Service Principal needs is:

  • Sites.Read.All

If you need to add or update lists and their contents, then you need:

  • Sites.Manage.All

Once these permissions have been requested, you then need to have them approved by the Azure AD administrator.

So whilst Data Factory can absolutely be used to invoke the Graph API, this would have introduced a new service into our architecture. My preference is always to limit the number of technologies used within a solution to help with ongoing support and maintenance. So before implementing this solution I thought I’d have one more try.

Brick by Brick

We already had an instance of Databricks running in our solution, and current thinking seems to be that there is no data challenge which Databricks can’t solve, so this was the obvious next candidate to provide a solution.

First I wanted to see how well Python (one of the key languages of Databricks) could shred the Excel documents. The go-to library for manipulating data in Python is pandas, and because Excel is so prevalent it has a method for reading Excel files into a DataFrame (the data structure used to hold a table).

import pandas as pd

df = pd.read_excel(io=file_name,
                   sheet_name=sheet)

Alas whilst .read_excel supports both .xls and .xlsx file types, it doesn’t support .xlsb. But Python is full of open source libraries and of course I’m not the first person with this problem, so eventually I found an Excel 2007–2010 Binary Workbook (xlsb) parser for Python in pyxlsb.

import pandas as pd
from pyxlsb import open_workbook as open_xlsb

tableList = []
fileName = 'MoreThanJustASpreadsheet.xlsb'

with open_xlsb(fileName) as wb:
    with wb.get_sheet('SummaryInfo') as sheet:
        for row in sheet.rows():
            tableList.append([item.v for item in row])

df = pd.DataFrame(tableList[1:],
                  columns=tableList[0])

So Databricks could perform the shredding of the .xlsb file into a usable format, but my solution was not yet complete. The Excel files are managed within an O365 SharePoint document library, so I’d first need to move them onto the Databricks cluster; but they are large files and I didn’t like the idea of continually copying them around to extract such a small amount of information.

But what if ?

Could we invoke the Graph API directly from Databricks? Of course we can!

Any programming language worth its salt these days is able to interact with REST APIs, as they are the lingua franca of the Internet when it comes to integrating systems. Databricks supports Scala and Python amongst other languages, and seeing as Python is a commonly adopted language for data scientists and analysts, we’ll look at an example in Python.

We first need to get a token which we can pass through to the Graph API. This is very simple. In our notebook we can declare the following function:

import requests
import json
import pandas as pd
from adal import AuthenticationContext

# Set some variables
tenantName = "mytenant.onmicrosoft.com"
sharePointName = "mytenant.sharepoint.com"
siteName = "MySite Name"
clientId = "some big GUID"

def get_restsession():
    clientSecret = dbutils.secrets.get(scope='azurekeyvault_secret_scope',
                                       key='access-app-secret')
    auth_context = AuthenticationContext(
        f"https://login.microsoftonline.com/{tenantName}")
    SESSION = requests.Session()
    token_response = auth_context.acquire_token_with_client_credentials(
        "https://graph.microsoft.com", clientId, clientSecret)
    SESSION.headers.update(
        {'Authorization': "Bearer " + token_response['accessToken']})
    return SESSION

The requests library allows us to communicate with the REST service, json allows us to manipulate the REST call results easily, and adal allows us to use the AuthenticationContext for Azure AD. Our service client secrets are stored in Azure Key Vault.

We create the REST session (not a server-side session), then use the acquire_token_with_client_credentials method of the AuthenticationContext object to obtain a token for subsequent calls. This returns a token_response object from which we can construct the correct authentication header for each REST call we make.

From this point on, it’s easy to drill into the MS Graph API as we see fit.

# Initialise a session for the Microsoft Graph REST API calls.
SESSION = get_restsession()

# Now make the call to get the driveId (doc library).
driveIdJson = SESSION.get(
    f"https://graph.microsoft.com/v1.0/sites/{sharePointName}:"
    f"/sites/{siteName}:/drives").json()

# Now get the driveItemIds (Excel files within the folder).
drivesData = pd.DataFrame(driveIdJson["value"])
driveId = drivesData[drivesData['name'] == 'Documents']['id'].iloc[0]
folderPath = "MyFolder/MySubFolder"
driveItemIdsJson = SESSION.get(
    f"https://graph.microsoft.com/v1.0/drives/{driveId}/root:"
    f"/{folderPath}:/children?select=name,id").json()

We can now build a list of the Excel files (DriveItemIDs):

driveItemIds = []
for driveItem in driveItemIdsJson['value']:
    driveItemIds.append(driveItem['id'])

and enumerate over the tabs:

for driveItemId in driveItemIds:
    tabsJson = SESSION.get(
        f"https://graph.microsoft.com/v1.0/drives/{driveId}"
        f"/items/{driveItemId}/workbook/worksheets").json()
    tabNames = []
    for tabJson in tabsJson['value']:
        tabNames.append(tabJson['name'])

Then we can extract the data from the tab we’re interested in, using the UsedRange method to return the grid which contains all the non-null values:

for tabName in tabNames:
    excelDataJson = SESSION.get(
        f"https://graph.microsoft.com/v1.0/drives/{driveId}"
        f"/items/{driveItemId}/workbook/worksheets"
        f"/{tabName}/UsedRange(valuesOnly=true)?$select=values").json()
    rows = excelDataJson['values']
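The values property comes back as a plain grid: a list of lists, one inner list per worksheet row. Assuming the first row of the used range holds the column headers (as in the pyxlsb example earlier), it converts to a pandas DataFrame in one line; a small sketch with made-up data:

```python
import pandas as pd

# Shape of the 'values' grid returned by UsedRange: header row first.
rows = [
    ["Area", "Planned", "Actual"],
    ["North", 120, 115],
    ["South", 90, 102],
]

# The first row becomes the column names, the remaining rows the data.
df = pd.DataFrame(rows[1:], columns=rows[0])
```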

Everything is awesome

That’s it. Once you have a grasp of the Service Principal and the hierarchy of the Graph API, it’s simply a matter of playing with the syntax to drill down to the data you want. If you want to experiment with the syntax before integrating everything into Databricks, then Microsoft Graph Explorer is a must.

Of course, not everything smells of roses, because Excel doesn’t know the difference between formatting and data. It’ll still “helpfully” remove the leading zeros from your phone numbers and convert what should be a code “5E2” into the number 500, but we can only fix one thing at a time. Unfortunately the end users will still ask “can you load the cells coloured green”, but there’s just no helping some people.
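To make the coercion concrete: any numeric parser reads “5E2” as scientific notation, which is exactly what Excel does at entry time. Where the literal text has survived (it often hasn’t, since Excel converts on entry), forcing a string dtype at load time at least stops pandas repeating the mistake. An illustrative sketch using a CSV stand-in for the exported data:

```python
import pandas as pd
from io import StringIO

# A code like "5E2" looks like scientific notation to any numeric parser,
# which is the same coercion Excel applies:
print(float("5E2"))  # 500.0

# Forcing string dtype keeps codes and leading zeros intact, provided the
# source still holds the literal text:
csv = StringIO("code,phone\n5E2,01234\n")
df = pd.read_csv(csv, dtype=str)
```

The same dtype=str argument is accepted by pd.read_excel, though it cannot recover values Excel has already converted in the stored cell.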

As is often the case, what started out as an interesting requirement has led to an exploration and re-evaluation of the latest technologies. I’ve discovered a new architecture to add to my toolkit. So when a customer now talks to me about their problems with Excel documents, I’m thinking about the opportunities to use the Graph API rather than the fear of what I might uncover.

I just wish I’d found the Graph API sooner.


Data Analytics Solutions Architect @ Version 1 | Practical Data Modeller | Builds Data Platforms on Azure