8 Tips for Improving Your Data Analysis Workflow

Useful tools and practical tips to improve the workflow of your data analytics team

yanyan.li
Slalom Data & AI
10 min read · Nov 30, 2022


Photo by Pavel Danilyuk from Pexels

“Data analysis workflow” refers to a coordinated framework for data analysis tasks. This workflow involves:

  • Discussing business requirements with stakeholders and framing analysis targets
  • Methodology research
  • Data preparation and analysis
  • Building ML models
  • Presenting results
  • Documentation

The daily tasks of a typical data analyst (DA) involve one or more pieces of this workflow, no matter whether it's an ad-hoc project or a static project.

Challenges for data analysis workflow

One of the most popular tools for data analysts is Jupyter Notebook. It provides a simple interface that allows data analysts to configure and arrange their analysis workflows. However, along with the convenience Jupyter Notebook provides, DA teams also face some challenges in the data analysis workflow:

  • Lack of reproducibility with the increasing data processing and model complexity
  • Hard to find relevant documents or reuse the code due to staff turnover
  • Hard to share knowledge/experience within the team
  • Hard to maintain the code quality
  • And many other common challenges that software engineering faces without robust code management

Standardizing your data analysis workflow with the appropriate tools can reduce the workload and mitigate the impact of the challenges described above.

Tools

Before I outline my preferred architecture and flow, let me introduce some of the key tools you can use to help standardize your data analysis workflow.

Databricks

Databricks is a unified analytics engine that aims to help clients with cloud-based big data processing and machine learning (ML). It provides a fast, simple, and scalable way to build a just-in-time data warehouse that eliminates the need to invest in costly ETL pipelines. It scales storage and computing resources independently on-demand, supporting traditional ETL and directly accessing data to accelerate time-to-insight.

Why this tool is great:

  • Easy for DA team to start with
  • Multi-user support
  • Manages Apache Spark clusters
  • Flexible job scheduling
  • Easy multi-cloud integration

GitHub

GitHub is a code hosting platform for collaboration and version control. You and your team can use GitHub to store, track, and collaborate on data analysis projects. Other options that provide similar functions include Bitbucket, Gitlab, and Azure DevOps.

Why this tool is great:

  • Central repository to manage source code
  • Version control
  • Keeps track of development progress
  • Code review

MLflow

MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. It's extremely useful for data scientists to manage their ML projects.

Why this tool is great:

  • Keeps track of all the parameters
  • Keeps track of key KPIs
  • Model versioning
  • Manages model artifacts

Delta Lake

Delta Lake is an open-source data storage layer that unifies atomicity, consistency, isolation, and durability (ACID) transactions, scalable metadata management, and batch and streaming data processing.

Why this tool is great:

  • Time travel and data version control

Example architecture and workflow

In this section, we’ll outline how the above tools can help standardize our data analysis workflow.

Tip 1: Change your mindset

A typical ad-hoc analysis can be arbitrary and full of chaos. For that reason, it isn’t wise to invest too much effort into a one-time job.

As an example, we once received an analysis request and were pretty sure our team had done something similar before. However, we couldn't reuse the analysis code, or even find the code we had previously built.

One of the causes is that there wasn’t a central repository; the code was written in Jupyter Notebook and held by each of the team members, so there was no way to locate the code and reuse it. Moreover, it would be impossible to guarantee the accuracy and readability of the code due to the lack of a proper review process, particularly when the code was written a long time ago.

It’s time to change our mindset from building one-time notebooks to a project that can be shared, maintained, tracked, and reused within the team.

Data analysis projects can be categorized into two types: ad-hoc projects and static projects. However, these two types of projects aren’t contradictory — they can be framed into the same data analytics workflow.

Diagrams: the ad-hoc project workflow and the static project workflow

While we don’t advocate over-engineering simple tasks, this one-time investment on planning and standardizing your data analysis workflow can be reused in the future and reduce the efforts on managing dispersed notebooks.

Tip 2: Organize your project folder structure

Once you decide to standardize your data analysis workflow, you can begin organizing folders, as it’s good to have separate folders for each project. Below are a few examples.

Structure for ad-hoc projects:

  • README.md

A README.md file is a top-level introduction that tells users how to use the project.

  • .gitignore

A .gitignore file specifies intentionally untracked files that Git should ignore, such as your practice Notebooks.

A more complex project, such as an automated campaign analysis that consumes campaign information, calculates predetermined KPIs, and generates reports for stakeholders, should include both the data processing logic and the environment configuration.

Structure for static projects:

  • setup

A setup file helps you set up the environment to run the project.

  • models

This folder stores the main logic for data cleaning, data processing, and result generation.

  • utils

This folder contains the logic for module configuration and utility functions. A config file interprets the input campaign information and model configuration to select the right module and method to process the data and conduct the analysis. A tools file contains the utility functions for completing the whole campaign analysis process.

  • reports

A report file is the template for the end users to automatically generate and customize the campaign analysis report. It is the entry point for this project.
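
Putting these pieces together, a static project repository might be laid out roughly as follows (folder and file names are illustrative, not taken from a real project):

    campaign-analysis/
    ├── README.md
    ├── .gitignore
    ├── setup/                 # environment setup (libraries, cluster configuration)
    ├── models/                # data cleaning, data processing, and result generation
    ├── utils/
    │   ├── config.py          # maps campaign/model configuration to the right module
    │   └── tools.py           # shared utility functions
    └── reports/
        └── campaign_report    # report template and entry point for end users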

Tip 3: Don’t hardcode input value in your Notebook

Sometimes for convenience, we write input variables directly into data analysis logic. This isn’t ideal because if someone wants to use your code in the future, they need to know which values need to be changed in your script, which may lead to mistakes. For this reason, it’s better to use default values, variables, and arguments.

For example, in the automated campaign analysis project mentioned above, we write the campaign information and model configuration to a JSON file and pass it to the analysis module, instead of modifying each notebook in the module individually.

There is no fixed rule on how you pass in parameters. We decided to use JSON in our project for the following reasons:

  • The key-value pair paradigm is simple
  • It can handle data of different levels of structure
  • It’s human-readable

Tip 4: Manage your data analytics using MLflow

MLflow tracking allows you to log metrics or KPIs, such as the placeholder metrics foo and foo1 in the sketch below, that can then be tracked and visualized across runs.
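
A minimal sketch of what this looks like (the run name, parameters, and metric values are made up; it assumes the mlflow package is available, as it is on Databricks ML runtimes):

    import mlflow

    # Log parameters and KPIs for one analysis run so they can be compared
    # across runs in the MLflow tracking UI.
    with mlflow.start_run(run_name="campaign_analysis"):
        mlflow.log_param("model_type", "uplift")      # hypothetical parameter
        mlflow.log_param("training_window_days", 90)
        mlflow.log_metric("foo", 0.87)                # placeholder KPIs, as above
        mlflow.log_metric("foo1", 0.42)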

You can also visualize the KPIs you logged in the MLflow tracking UI.

For more complex cases, MLflow also provides an automatic logging feature that lets you log metrics, parameters, and models without explicit log statements.
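
Enabling it is a single call; this sketch assumes a supported library such as scikit-learn or Spark MLlib is used later in the notebook:

    import mlflow

    # Automatically log parameters, metrics, and models for supported libraries,
    # without writing explicit log_param/log_metric statements.
    mlflow.autolog()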

The MLflow client caches artifact location information on a per-run basis, so you don't need to worry about forgetting where artifacts are stored. The artifact store is a location suited to large data (e.g., Amazon S3) and is where clients log their artifact output (e.g., model files).

Currently, MLflow supports the following storage systems as artifact stores:

  • Local file path
  • Amazon S3
  • Azure Blob Storage
  • Google Cloud Storage
  • SSH File Transfer Protocol (SFTP) server
  • Network File System (NFS)

You can easily get the artifact location for a specific run through the tracking UI or by querying it programmatically.
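
For example, a programmatic lookup might look like this (the run ID is a placeholder you would copy from the tracking UI or find via mlflow.search_runs()):

    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    run = client.get_run("<run_id>")   # placeholder run ID
    # The artifact URI points to the artifact store, e.g. an S3, Azure Blob,
    # GCS, or dbfs:/ location, depending on how the tracking server is set up.
    print(run.info.artifact_uri)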

Tip 5: Use time travel for data versioning

With large amounts of data continuously coming in, it's not a simple task for professionals or organizations to audit data changes or roll back to a previous version in case of accidental deletes. As a data analyst, you sometimes need to access a historical version of the data in order to reproduce, debug, or audit your work. This requires data versioning. Delta Lake time travel allows us to query an older snapshot of a Delta Lake table by version number or by timestamp.
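
A minimal PySpark sketch (the table path is hypothetical, and spark refers to the SparkSession that Databricks creates for you):

    # Read the table as it was at version 5
    df_v5 = (spark.read
             .format("delta")
             .option("versionAsOf", 5)
             .load("/mnt/delta/campaign_events"))   # hypothetical Delta table path

    # Or read the table as it was at a specific point in time
    df_nov29 = (spark.read
                .format("delta")
                .option("timestampAsOf", "2022-11-29")
                .load("/mnt/delta/campaign_events"))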

Tip 6: Use GitHub for project version control

To better control the model development process and improve code quality and transparency, we build the analysis logic in Databricks Notebooks and sync it with a remote GitHub repository.

The first step is GitHub integration. Databricks Repos provides repository-level integration with Git providers. You can develop code in a Databricks Notebook, sync it with a remote Git repository, and use Git commands for updates and source control.

After integrating GitHub, the next step is to set up a branching strategy. In a simple case, we will have two main branches:

  • MASTER: The analysis logic that’s ready for use.
  • DEV: We’re building new features and fixing bugs. After reviewing/testing, the new logic built on this branch will be merged into the MASTER branch.

With this design, the DEV branch will contain commits ahead of the MASTER branch.

After hours of coding, it becomes increasingly difficult for code authors to catch their own bugs and edge cases, which is where a reviewer helps. Code review brings other benefits as well:

  • The review process is a learning opportunity for the reviewers.
  • To work as a team, it's important to maintain the quality of shared code.

Therefore, it’s also important to set up code review standards and internal processes to ensure code quality.

Tip 7: Automate your workflow

There are two ways to automate your data analysis workflow on Databricks:

  • Scheduler: You set a predetermined time at which your job runs.
  • On-demand: You trigger your job manually whenever you want, or in response to events.

This function is extremely useful for static analysis projects. Additionally, after setting up the GitHub integration, we always have access to the up-to-date version of the code. On Databricks, you can point a job's source code at the remote repository.

A few things to note when creating the job:

  • In the Type dropdown menu, select Notebook.
  • In the Source dropdown menu, select Git provider. The Git information dialog appears.
  • In the Git Information dialog, enter details for the repository. For Path, enter a relative path to the Notebook location.
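
As a rough sketch of the equivalent job definition submitted through the Databricks Jobs API (the repository URL, cluster ID, paths, and schedule are placeholders; check the Jobs API documentation for the exact field names):

    {
      "name": "campaign-analysis-report",
      "git_source": {
        "git_url": "https://github.com/<org>/<repo>",
        "git_provider": "gitHub",
        "git_branch": "master"
      },
      "tasks": [
        {
          "task_key": "generate_report",
          "existing_cluster_id": "<cluster-id>",
          "notebook_task": {
            "notebook_path": "reports/campaign_report",
            "source": "GIT"
          }
        }
      ],
      "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC"
      }
    }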

Tip 8: Present results using Notebook

There are a few options for how to present your results:

  • Notebook only
  • Notebook and dashboards or a slide deck

The combination of reporting tools you choose depends on who you’re reporting to and what you’re going to report. The focus here should be on whether they meet your requirements.

In our case, we used a Notebook for internal review and built a slide deck to present the insights to business stakeholders. The Notebook works well for the internal review because it provides details on the entire analysis process, and we can explore the raw data along with the results during the discussion to make sure the process is both accurate and transparent.

At the same time, business stakeholders may not be interested in the details of the process. What they’re looking for is KPIs and insights gained from the campaign, so a slide deck is often the best choice.

With the standardized results automatically generated by the campaign analysis tools, you can simply open your notebook and pick out the most useful information from the tables and plots, saving you and your team valuable time.

Summary

In this article, we introduced the workflow for data analysts and explained the challenges we've encountered when working with client data analytics teams. While we find these eight tips extremely useful, it's essential to have a team that's working toward the same goal and putting effort into maintaining and upgrading the workflow together.

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.
