Unlocking Efficiency: The Power of Reusable Workflows in Data Science

Başak Tuğçe Eskili
Published in Marvelous MLOps
Jun 6, 2023 · 3 min read

Whether you are a developer, an MLOps engineer, or a data scientist turning coffee into code, you would not enjoy copy-pasting the same code over and over to execute the same tasks. Just as we want automation in the (Dev & ML)Ops world, we also like reusability. When it comes to implementing CI/CD pipelines for multiple projects, reusable workflows are here to rescue us from monotony.

GitHub Actions is a great platform for writing CI/CD pipelines that automate testing, building, and deployment.

It has a powerful feature that allows us to write reusable workflows. In this article, we’d like to show how to utilize reusable GitHub Actions workflows in your projects.
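Under the hood, a workflow becomes reusable by declaring the workflow_call trigger; other workflows can then invoke it like a function. A minimal sketch (the repository, file, and job names below are hypothetical):

# In the central repository: .github/workflows/say_hello.yml
name: say_hello
on:
  workflow_call:

jobs:
  greet:
    runs-on: ubuntu-latest
    steps:
      - run: echo "Hello from a reusable workflow"

# In any project repository, the whole job then becomes a single reference:
name: caller
on: push

jobs:
  hello:
    uses: my-org/shared-templates/.github/workflows/say_hello.yml@main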

Start with a central repository. This repository will host pipeline templates and actions.

Example structure for the central repo (see the sketch after this list):
  1. workflows/: In this directory, we keep the reusable workflows, in YML format. Think of testing, building, and deploying pipelines. These workflows are used directly: you can define inputs and pass values when calling them. Each YML file is a workflow with its jobs and steps defined.
  2. actions/: Here we define the reusable actions. Each action resides in its own directory and includes an action.yml file that defines the action’s inputs, outputs, and behavior. These actions can be called within other workflows.
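A possible layout, with set_env_vars taken from the example further below and the other workflow and action names purely illustrative:

awesome-templates/
└── .github/
    ├── workflows/
    │   ├── test.yml
    │   ├── build.yml
    │   └── deploy.yml
    └── actions/
        ├── set_env_vars/
        │   ├── action.yml
        │   └── README.md
        └── deploy_model/
            ├── action.yml
            └── README.md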

You can adapt the pipelines to your specific requirements and preferences; these are example workflows and actions. The main goal is to have a centralized repository that provides reusable components for data science projects, to increase efficiency, improve collaboration, and standardize implementations.
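As an illustration of what such a template could look like, here is a sketch of a reusable testing workflow for a Python-based data science project (the file name, input, and steps are assumptions, not code from an actual repo):

# .github/workflows/test.yml in the central repository
name: test
on:
  workflow_call:
    inputs:
      python_version:
        description: Python version to run the tests with
        type: string
        required: false
        default: "3.10"

jobs:
  run_tests:
    runs-on: ubuntu-latest
    steps:
      # checkout runs against the calling project’s repository
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: ${{ inputs.python_version }}
      - name: Install dependencies and run tests
        run: |
          pip install -r requirements.txt
          pytest tests/

A project repository can then run its entire test pipeline with a single job that has uses: {organization_name}/awesome-templates/.github/workflows/test.yml@master.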

Some Best Practices

Modularity: Keep your actions as small as possible and avoid packing complex tasks into a single action. Actions should be self-contained and easy to combine with other actions in different workflows. This boosts flexibility and increases reuse.

I/O: Define the inputs and outputs of your actions clearly. What is the type of each input (str, int, etc.)? Is it mandatory? What is the default value? Add example inputs to the documentation.

Example: .github/actions/set_env_vars/action.yml

name: set_env_vars
description: >
  Define environment variables
inputs:
  azure_credentials:
    description: Azure credentials
    required: true
  databricks_host:
    description: Databricks host
    required: true

runs:
  using: "composite"
  steps:
    - name: Log in with Azure
      uses: azure/login@v1
      with:
        creds: ${{ inputs.azure_credentials }}

    - name: Setup env vars
      id: setup_env_vars
      shell: bash
      # 2ff814a6-... is the well-known Azure AD resource ID of Azure Databricks
      run: |
        echo "DATABRICKS_TOKEN=$(az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d | jq .accessToken --raw-output)" >> $GITHUB_ENV
        echo "DATABRICKS_HOST=${{ inputs.databricks_host }}" >> $GITHUB_ENV
        echo "GIT_TAG=${{ github.sha }}" >> $GITHUB_ENV
        echo "PROJECT_NAME=$( echo ${{ github.repository }} | cut -d "/" -f2 )" >> $GITHUB_ENV

Documentation: Include a detailed explanation of each workflow and action and how to use them. You can add a README.md file for every action and workflow.

Example usage in .github/actions/set_env_vars/README.md

  - name: Set up env vars
    id: define_env_vars
    uses: {organization_name}/awesome-templates/.github/actions/set_env_vars@master
    with:
      azure_credentials: ${{ secrets.AZURE_CREDENTIALS }}
      databricks_host: ${{ secrets.DATABRICKS_HOST }}

Versioning: Tag each release and make sure updates or changes do not break existing workflows. Utilize versioning to give users a stable experience.
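For example, the README snippet above points at @master, which moves with every commit; once releases are tagged, callers can pin to an immutable tag instead (the tag v1.0.0 is illustrative):

  - name: Set up env vars
    uses: {organization_name}/awesome-templates/.github/actions/set_env_vars@v1.0.0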

Testing: How can you test reusable actions and workflows? You can write simple tests by calling the workflow within the same repository, but is that really testing? To guarantee that each action and workflow works, we suggest creating a separate repository for testing purposes. There you create testing workflows that call each template and check the status, as sketched below. We will deep dive into this part in upcoming articles.
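A minimal sketch of such a test, assuming a separate testing repository that has access to the same secrets (the file name is hypothetical):

# .github/workflows/smoke_test_set_env_vars.yml in the testing repository
name: smoke_test_set_env_vars
on:
  push:
  workflow_dispatch:

jobs:
  smoke_test:
    runs-on: ubuntu-latest
    steps:
      - name: Call the action under test
        uses: {organization_name}/awesome-templates/.github/actions/set_env_vars@master
        with:
          azure_credentials: ${{ secrets.AZURE_CREDENTIALS }}
          databricks_host: ${{ secrets.DATABRICKS_HOST }}
      - name: Fail if the expected variables were not set
        run: |
          test -n "$DATABRICKS_TOKEN"
          test -n "$DATABRICKS_HOST"
          test -n "$PROJECT_NAME"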

As machine learning engineers supporting multiple teams and projects, we are big fans of reusable code. Anything that can be shared among projects instead of being reimplemented saves us a lot of time. Therefore, we implement reusable actions and workflows to execute the same tasks in each data science project repository. Of course, the tasks provided in the templates vary from toolkit to toolkit.

Next week we’ll deep dive into some actual code & examples.
