Data Analytics: How to Analyze SharePoint List Data in Python
What Are Multi-Choice Columns?
If you have been using Office 365 or SharePoint as a data management tool or business application platform, you most likely have come across checkbox columns (aka multi-choice columns). Creating a list with multi-choice columns is effortless. As is the data entry.
For instance, you have a custom list to keep track of office locations in various companies. In order to expedite data entry, you build a Choice column and select Checkboxes to accept multiple selections. There you have a multi-choice column!
What Challenges Do Multi-Choice Columns Present?
The fun really begins when you start analyzing the data and generating visualizations from the collected data. The challenges lie within how the mulit-choice data is stored. Similar to HTML, all selected choices are stored as a single field with delimiters. In this case, SharePoint uses “;#” delimiters. This requires parsing and data massage.
In this example, you have a SharePoint list called Company Offices. Each company may have 1 or more global offices. You have been asked to crank out a report on the number of offices in each country. The underlying data in SharePoint looks like this:
There is not an easy mean to perform aggregation within SharePoint. Solutions like SQL Server Reporting Services (SSRS) and SQL Server Integration Services (SSIS) are potential candidates and require a lengthy setup. What if the token requirement is “We needed this yesterday for an executive report”? It will take a little longer to spin up the above infrastructure!
If you excel in Excel, you can pull in the list data and combine formulas and pivot tables to achieve the above objectives. Such solution, however, is not particularly reusable.
Why Python, Pandas, and Seaborn?
Python comes into the solution beautifully. Python is a general purpose scripting language and can be set up on virtually all operating systems within minutes. (Basic Python is beyond the scope of this post. Google many of the excellent primer on the interweb). In addition to core Python libraries, we are importing two libraries for to extend functionalities. Pandas is a Python library providing high-performance, easy-to-use data structures and data analysis tools. Seaborn is a Python visualization library providing a high-level interface for drawing attractive statistical graphics.
Where on Github?
The source code is published on Github (https://github.com/klopmp/sharepoint-data-analysis). Let’s give this a quick run-through.
- In Office 365/SharePoint, export list data as CSV file. We have a tutorial on Exporting Data to Excel.
- In Python, read data into a Pandas DataFrame.
- Parse delimited data into a temporary DataFrame.
- Join original and temporary DataFrames.
- Generate a Seaborn bar graph.
- Office 365/SharePoint excels in data collection and basic reporting. More advanced data analysis is typically handled by another technology stack.
- Microsoft SQL Server Reporting Services (SSRS) and SQL Server Integration Services (SSIS) are a popular server-based solution, which requires additional setup and configuration.
- Excel is the defacto client-based solution, especially for those familiar with Excel formulas and pivot tables.
- Python is a viable alternative, especially with the use of its vast extensibility thanks to an ever growing list of libraries.
Originally published at Kattelo.