Setting up Self-Service Analytics Engineering in a Business Intelligence Team

Valdemar Hernandez Siles
Published in HeyJobs Tech · Jan 16, 2024

In collaboration with Kim Friedel

A story of how we defined, implemented, and improved a self-service framework based on the needs of our organization.


Background

As Data Engineers in the Business Intelligence (BI) team at HeyJobs, part of our responsibilities is fulfilling requests from the Data Analysts to create or enhance tables in our Data Warehouse. This allows them to explore datasets from new perspectives and answer different, more complex business questions, requests that have grown in frequency as the company grows and matures.

Over time, we realized that the BI Engineering team was becoming a bottleneck for some of the analysts’ investigations, mainly because of two reasons:

  1. The number of requests we could attend to was limited.
  2. The code changes remained on hold for several days until the next weekly release.

While we understood the need to fix this situation, we did not simply opt to increase the size of our team or de-prioritize some of our existing projects. Instead, we decided to adopt a different BI framework: Self-Service Analytics Engineering.

As the name suggests, Self-Service Analytics Engineering refers to the BI paradigm where users such as Data Analysts and Data Scientists are enabled and encouraged to access and transform data on their own without having to rely directly on Engineering specialists. For our team, this approach meant a significant shift in how tasks and responsibilities are allocated between Engineers, Analysts, and, in the future, Data Scientists.

The two main aspects that would change with this redefined framework are the following:

  • Engineers’ Responsibilities: Engineers would be tasked with developing the major data products. These are foundational tools and datasets designed to be broad in scope and highly applicable across a wide range of use cases. This includes crafting robust, scalable solutions that form the backbone of our data architecture, ensuring that these resources are versatile enough to serve the diverse needs of our organization.
  • Analysts’ (and Data Scientists’) Autonomy: Analysts would be granted the liberty to develop custom models tailored to their specific needs, such as creating cohort models for our core Reporting platform, experimentation, or Financial Planning & Analysis (FP&A). Looking ahead, this autonomy and self-service capability can also be extended to Data Scientists, who would then be enabled to apply advanced analytical techniques in response to complex business questions.

In the upcoming sections, we describe how we implemented this self-service framework within the BI team at HeyJobs, detailing the setup we chose, its initial implementation, our learnings from the first few months, and, finally, the enhancements we made based on those learnings.

Choosing the Right Setup

As part of our continuous initiative for improvements within the BI Engineering team, we had previously completed an investigation on dbt, the data transformation tool. Early on, we recognized dbt’s potential for streamlining operations, such as table creation and dependency tracking, which made it an attractive option for empowering our analysts with self-service capabilities.

We then embarked on an in-depth evaluation of its two primary setups: dbt Core, the open-source command-line tool, and dbt Cloud, which provides additional features such as task scheduling and data model exploration through a web-based UI. Our evaluation centered on five key requirements vital for a seamless integration into our existing architecture. The table that follows offers a detailed look at these requirements, describing each one and explaining how dbt Core and dbt Cloud fulfill it.

Comparison between dbt Core and dbt Cloud. The setup that was considered best for fulfilling that particular requirement is highlighted in green.

As it turned out, the winner was dbt Core. While an argument could still be made for dbt Cloud given its ease of use and comprehensive feature set, setting up the necessary permissions for the cloud project to access the Data Warehouse would have required dbt's Enterprise plan, which we did not want to commit to at the time given what was already possible with dbt Core.

Self-Service Analytics Engineering in BI

Implementation and Configuration

The dbt Core project configuration was established in a new repository specifically designed to bridge with our Amazon Redshift Data Warehouse. As part of the setup, we implemented the following measures to ensure both efficiency and security:

  • Dynamic Credential Management: This setup lets analysts run tests in the staging environment with staging credentials and automatically employs production credentials once the code is released (see the profiles.yml sketch after this list).
  • Schema Defaulting: We defaulted all new data transformations to a new, dedicated schema to keep our environment organized and avoid any potential overlap with existing structures.
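A minimal sketch of what this environment-driven setup can look like in dbt's profiles.yml. The profile name, environment variables, and database names here are illustrative assumptions, not our actual configuration; the dedicated schema corresponds to the 'BI Analytics' schema described below.

```yaml
# profiles.yml (illustrative): credentials are resolved from environment
# variables at run time, so the same project runs against staging locally
# and against production once released, without code changes.
heyjobs_bi:
  target: default
  outputs:
    default:
      type: redshift
      host: "{{ env_var('REDSHIFT_HOST') }}"
      user: "{{ env_var('REDSHIFT_USER') }}"
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      dbname: "{{ env_var('REDSHIFT_DB') }}"
      schema: bi_analytics  # dedicated schema that new models default to
      port: 5439
```

Analysts export staging credentials in their local environment, while the release pipeline injects production credentials, so promoting a model requires no profile changes.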

These isolation measures helped us ensure that our new integration did not interfere with our previously established processes.

The next step involved deploying our dbt project code into our Apache Airflow instance, allowing us to reliably execute the new data transformations according to our predefined schedules. By installing dbt-redshift 1.3.0 in our Airflow instance, we were able to execute dbt commands directly from Airflow DAGs.
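A minimal sketch of what such a DAG can look like, assuming dbt is invoked through its CLI; the DAG id, schedule, and project path are illustrative assumptions:

```python
# dags/dbt_bi_models.py: a sketch of running dbt Core from Airflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt_project"  # assumed location of the synced project

with DAG(
    dag_id="dbt_bi_models",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # real schedules differ per pipeline
    catchup=False,
) as dag:
    # Build all models, then run their tests.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"cd {DBT_DIR} && dbt run",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"cd {DBT_DIR} && dbt test",
    )
    dbt_run >> dbt_test
```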

The diagram below illustrates how the code is deployed to the Airflow instance. Once the new models are ready to be deployed, the updated code is stored in an Amazon S3 bucket designated for the BI data pipelines. Finally, once our Airflow instance detects a new version of the code, it downloads the files so that the most recent dbt models run in our production environment.
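To make the "detect and download" step concrete, here is a hypothetical sketch using boto3; the bucket and artifact names are assumptions, not our actual pipeline:

```python
# A sketch of the sync step: compare the S3 object's ETag with the last
# version seen and pull the dbt project code only when it has changed.
from typing import Optional

import boto3

s3 = boto3.client("s3")
BUCKET = "bi-data-pipelines"  # assumed bucket name
KEY = "dbt_project.tar.gz"    # assumed artifact name

def sync_dbt_project(last_etag: Optional[str]) -> str:
    """Download the dbt project from S3 if a newer version exists."""
    remote_etag = s3.head_object(Bucket=BUCKET, Key=KEY)["ETag"]
    if remote_etag != last_etag:
        s3.download_file(BUCKET, KEY, "/opt/airflow/dbt_project.tar.gz")
    return remote_etag
```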

Deployment Process

To better describe how the deployment process for analysts differs from the one followed by engineers, we can look at the diagram below.

Deployment process for the BI Data Pipelines. The scripts deployed to production coming from the second repository can only write into the ‘BI Analytics’ schema in the Data Warehouse.

In this revamped framework, analysts can test their models locally and, following a rigorous review process, deploy them directly to our production environment. The process includes a peer review among analysts, focused on the business logic applied, and a review by the Engineering team, which remains pivotal in this workflow: engineers examine the analysts' pull requests (PRs) with a keen eye on code optimization, adherence to coding conventions, and avoiding duplicated logic. This careful scrutiny is instrumental in preserving the integrity of, and trust in, our BI department.

Once the changes are approved, analysts can bypass the regular release schedule and start working immediately with their new or updated models. It's a strategic balance: we maintain rigorous standards of quality and consistency while giving analysts the speed and independence necessary for responsive, dynamic data handling.

Once the framework was finalized, we handed it over to the analysts through a training session and a User Manual covering the initial installation steps, how to run and test models locally, and the review process. This allowed the analysts to start using dbt, officially kicking off the first version of our Self-Service Analytics Engineering framework.

Initial Learnings

Several months after introducing our data analysts to the self-service framework, we touched base with them through a survey to pinpoint any challenges, identify potential bottlenecks, and gauge their overall satisfaction with the existing functionality.

We sought to uncover specific areas where the data analysts might be encountering obstacles or where our support could streamline their development workflow. The feedback we received was telling.

To begin with, our data analysts were unaccustomed to the rigorous protocols that we, as data engineers, adhere to during our pull request process, especially those aimed at maintaining code quality and documentation standards. Much of this revolves around code formatting, a criterion that has long been a staple of our reviews. However, this focus on formatting over substantive review of code logic turned out to be a sticking point and, frankly, a suboptimal use of everyone's time.

Additionally, there was a noticeable demand to expand our dbt toolkit. Despite the availability of dbt's advanced features, such as Snapshots and incremental models, we initially held back on introducing them to keep the learning curve manageable, assuming the analysts would only need to build simple models. Yet as the analysts grew more proficient with dbt and its intricacies, including the Jinja syntax, it became clear that enabling them to leverage these robust features was necessary. In response, we have been enhancing our documentation to empower them to construct more sophisticated and efficient data models, fully embracing the power of dbt.
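As a flavor of what these features unlock, here is a minimal sketch of an incremental dbt model; the source, table, and column names are hypothetical:

```sql
-- models/events_enriched.sql (illustrative): an incremental model only
-- processes rows that arrived since the last run instead of rebuilding
-- the whole table.
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    created_at
from {{ source('tracking', 'events') }}

{% if is_incremental() %}
  -- on incremental runs, restrict to rows newer than the current maximum
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```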

Main Takeaways

In light of our enhanced self-service capabilities, we see a positive shift in the dynamics of team scaling, particularly concerning the hiring of Engineers and Analysts. The advancements in our self-service framework mean that Engineering is no longer a rate-limiting factor in the execution of data projects, which opens up new possibilities for scaling our team in a more efficient and targeted manner.

On the Analysts’ side, the enhanced self-service tools have significantly increased their autonomy and ability to contribute directly to business impact. These tools have not only empowered Analysts to independently tackle more intricate data modeling and analysis tasks, which were traditionally managed by Engineers, but have also expanded the versatility of their output. With the self-service framework, their data models can be seamlessly integrated and utilized across multiple other platforms, such as Tableau, Pigment, and Growthbook, with which they can build reports, conduct A/B testing, and work on sophisticated analytical tasks. This shift not only increases the value each Analyst can provide but also allows us to scale our Analyst team more efficiently.

For Engineers, the move towards self-service means that their expertise can now be channeled into improving and maintaining the infrastructure that empowers our Analysts, rather than being tied up in day-to-day data handling tasks.

Not only are we witnessing considerable improvements within our team dynamics, particularly among Engineers and Analysts, but we are also actively refining our infrastructure and feature set to bolster these advancements. These technical upgrades automate and refine our workflows further, complementing the organizational enhancements and enabling our teams to work with greater efficiency and impact. Five of them stand out:

1. Enhancing Test Capabilities with dbt-expectations

Recognizing the necessity for robust testing, we incorporated dbt-expectations, a testing framework inspired by the Great Expectations library. This expansion not only improved the rigor with which data models could be tested but also provided our analysts with a broader range of pre-built tests, enabling them to assert data quality with greater precision and confidence and reducing the time spent identifying and rectifying data inconsistencies.
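For example, once the package is added to packages.yml, analysts can declare tests like the following in a model's schema.yml; the model and column names here are hypothetical:

```yaml
# models/schema.yml (illustrative)
version: 2
models:
  - name: job_applications
    columns:
      - name: application_id
        tests:
          - not_null
          - unique
      - name: conversion_rate
        tests:
          # dbt-expectations test asserting values fall within [0, 1]
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1
```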

2. Implementing Pre-Commit Hooks

To ensure that every code commit adhered to our high standards, we introduced pre-commit hooks. These automated checks run predefined tests before a commit is finalized, verifying that documentation is complete, models are properly tested, and other vital criteria are met. This preemptive measure significantly decreased the instances of commits that would otherwise fail to meet our pull request standards, leading to a smoother and more efficient review process.
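The checks described above map naturally onto the open-source dbt-checkpoint hooks, so a hypothetical .pre-commit-config.yaml could look like the following (the hooks and pinned revision are assumptions, not a copy of our config):

```yaml
# .pre-commit-config.yaml (sketch)
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v1.2.1  # assumed revision
    hooks:
      - id: check-model-has-description  # every model must be documented
      - id: check-model-has-tests        # every model must have tests
        args: ["--test-cnt", "1", "--"]
```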

3. Code Formatting with SQLFluff

In order to address the issue of time-consuming code formatting reviews, we integrated SQLFluff, a SQL linter and formatter, into our pre-commit hooks. By automating the enforcement of consistent coding styles, we were able to shift our focus from formatting to the substantive review of code logic, thereby enhancing the effectiveness and efficiency of our code reviews.
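SQLFluff ships official pre-commit hooks, so the integration can be as small as extending the configuration sketched above; the revision and templater dependencies below are assumptions:

```yaml
# Appended to .pre-commit-config.yaml (sketch)
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 2.3.5  # assumed revision
    hooks:
      - id: sqlfluff-lint
        additional_dependencies: ["dbt-redshift", "sqlfluff-templater-dbt"]
      - id: sqlfluff-fix
        additional_dependencies: ["dbt-redshift", "sqlfluff-templater-dbt"]
```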

4. Simplifying Data Exports with Macros

To facilitate an easier way for our analysts to export their models to Object Storage (S3), we developed macros that could be invoked with a single line of configuration. This approach abstracted away the complexities of data exports, allowing our analysts to focus on model building without worrying about the intricacies of data storage and transfer.
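The macro itself is internal, but a hypothetical version built on Redshift's UNLOAD command illustrates the idea; the macro name, IAM role variable, and S3 path are assumptions:

```sql
-- macros/export_to_s3.sql (hypothetical): renders an UNLOAD statement that
-- copies a model's contents to S3 as Parquet files. Intended to be attached
-- as a post-hook so the export runs right after the model is built.
{% macro export_to_s3(relation, s3_path) %}
    UNLOAD ('SELECT * FROM {{ relation }}')
    TO '{{ s3_path }}'
    IAM_ROLE '{{ env_var("REDSHIFT_IAM_ROLE_ARN") }}'
    FORMAT AS PARQUET
    ALLOWOVERWRITE
{% endmacro %}
```

A model then opts in with a single configuration line, for example: {{ config(post_hook="{{ export_to_s3(this, 's3://example-bucket/exports/my_model/') }}") }}.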

5. Automating Documentation with Doc Templating

Finally, we recognized the importance of accurate and up-to-date documentation but also the tediousness of its maintenance. To address this, we built a tool for doc templating that generates the necessary schema.yml files from the model SQL code. This tool, invoked with a simple make command, drastically reduced the manual effort required to create and update documentation, ensuring our models were well-documented with minimal extra effort from our team.
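Our templating tool is built in-house, but the open-source dbt-codegen package offers a comparable starting point; a hypothetical make target wrapping it could look like this (the model name is illustrative):

```makefile
# Makefile target (sketch): prints draft schema.yml entries for a model,
# ready to be reviewed and pasted into the project's documentation files.
docs:
	dbt run-operation generate_model_yaml \
		--args '{"model_names": ["job_applications"]}'
```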

Each of these enhancements was chosen not just for its individual merit but also for how it integrates into the holistic development ecosystem, creating a seamless workflow that encourages best practices and promotes high-quality output. As we continue to refine these capabilities, we anticipate further benefits in the efficiency and strategic alignment of our hiring practices.

In summary, the self-service enhancements have not only improved the day-to-day efficiency of our existing team but have also laid the groundwork for a more agile and impactful approach to scaling our workforce, setting a new standard for excellence within our team.

Interested in joining our team? Browse our open positions or check out what we do at HeyJobs.
