Hacking up a reporting pipeline using Python and Excel

When I started as an Analytics Manager in an E-com company, the reporting process was manual, prone to errors and was taking over a full day to consolidate. We were looking to scale up and continuing the way it was wouldn’t allow us to do that effectively. We needed a solution that would be provide a higher degree of automation.

Data Sources

Our reporting process involved extracting order files out of 7 different shops, getting multiple extracts from google analytics for each of these shop, extracting information related to our marketplaces activities on Amazon, Groupon and bol.com, getting spend metrics from more than 8 different advertising channels.

For some of the datasources, we had access to the API, other only a reporting interface that could export to CSV or excel. This was due to some of our operations being outsourced to different agencies at that moment, but we were looking to insource as much of these operations as possible.

Selecting Python & Excel

Given that our business was quickly evolving, we needed a solution that would cater to integrate a growing list of datasources under short notice, the reporting requirements also being evolving. We needed a solution that was flexible and that could still be automated.

Excel provided a flexible way to handle reports, but by itself didn’t quite offer the level of automation that we were seeking. VBA was a poor choice for handling the load and transformation of data into a database & the database connection within only allowed for single data fetch and would have to be worked around to create multistep data process.

We needed a complement to Excel to provide the level of automation we wanted, at the same time as we were looking to automate our report we were also setting up airflow to cater to some of our more advanced analytics use cases. Python had an ecosystem around data processing with libraries such as pandas, SciPy, scikit-learn and a way to interact with Excel through libraries such as openpyxl or xlsxwriter. Libaries such as xlwings, further pushed the interaction between Python and Excel. Given its different qualities python seemed to be a great fit for us.

Automating the process

We were querying order data from our web-shops during the week for multiple purpose and it made sense to consolidate their extracts into a SQLite on a local drive. This made it easier to query one single table rather than dealing with 7 different file extracts.

Decision to integrate directly with Google Analytics reporting API to source the data was quickly made. The workload of have to handle multiple extracts per shops, for 7 shops and the limitation of export size to 5000 rows, justified spending time on integrating the API. Google Analytics’ integration with Adwords allowed us to deprecate one extra data source from our reporting process. Other sources could be initially handled as file exports and processed by the reporting script or imported from the SQLite database.

The data was exposed onto a pandas data-frame before being pushed to an excel report. Using pandas as the data interface and processing layer allowed for flexibility and interoperability. Using pandas datasources could be progressively migrated from raw files, to an api call or a database query without requiring any changes in the downstream pipelines. We were also not losing in terms of functionalities compared to a database, SQL queries could still be performed on the data-frame using pandasql and other operations could be done using the standard pandas API, our data-volume didn’t require using a distributed system and could perfectly run on a local laptop.

The processing step involved merging of datasets with master data, consolidation and structuring of datasets, scoring of orders’ logistic cost and leveraging Google Analytics API. Since we were using the e-commerce tracking feature of Google analytics, we were further able attribute and tie back source channels and campaigns to actual purchase data from our webshop and past user behavior.

We were looking to generate reports by country and wanted to be able to re-use single template for all the shops. Pandas can directly export to excel, but I wanted to be able to use the structured reference functionality of excel tables since they make functions within workbooks easier to understand and manage and this type of export was not supported by pandas. This required writing a wrapper around openpyxl to handle table updates since operations on tables are not natively well supported by the libraries.

Further improvements

Further improvement can be done in phases to increase the degree of automation and increase re-use.These can be done in a phased approach as the possibility to automate certain part of the reporting process become available.

Integrating all the flows into database and API calls, will improve the degree of automation in the process, and remove a time consuming step of the process, ie: extracting the different reports from the backend of websites.

Migrating the different dataset obtained through API call will allow at a later stage to fully decouple the solution. This will allow for the source data being fetched to also serve powering web-dashboards, ad-hoc analysis query…

We have already moved certain data-pull operations to Airflow, but only once every data fetch operations will be automated through API pulls, will it be possible to fully automate the generation of the reports. At that point depending on the business maturity it will be possible to see from the decoupling if using Excel for generating these automated report still make sense or if we should be looking at investing more in the setup of different dashboards.