Data Pipelines for Analytics: Starting from Scratch

John Lee
Hybrid Cloud How-tos
5 min read · Jan 3, 2023
Photo by Stephen Dawson on Unsplash

When our data and analytics team started modernizing our data access platform, we envisioned where we wanted to be a year later. We had a lot of work ahead of us, and we’ve come a long way since.

As the Data and Analytics squad within IBM’s CIO division, we use data and automation to transform our internal application portfolio and drive hybrid cloud platform adoption. When we began this project, sharing data using CSVs was the norm. This article shares the six steps we took to move from a world of CSVs to automated end-to-end data pipelines and live dashboards, which the organization uses today.

Step 1 — Collect stakeholder requirements

We began our modernization project by working closely with our stakeholders to understand their expectations. One group of stakeholders is the IBM internal application teams that want to move their applications from legacy infrastructure to our hybrid cloud platform. We create modernity indices to drive hybrid cloud platform adoption and track progress.

Our other stakeholder group is the hybrid cloud platform team, which is also part of the CIO division. They consume the data to get feedback on the hybrid cloud platform’s performance and to identify areas for improvement.

After many interviews and interactions with both groups of stakeholders, we came up with three categories and 13 modernity indices. The categories are:

· Monitoring and security

· Adoption

· Automation

Modernity indices are metrics we defined to track usage of the key features and best practices we expect the CIO division to implement. For example:

• Monitoring and security metrics include the Four Golden Signals (latency, traffic, errors, and saturation) from the Google Site Reliability Engineering (SRE) team’s best practices for monitoring.

• Adoption metrics include multi-zone and multi-region deployment for high availability and redundancy with load balancing.

• Automation metrics include horizontal and vertical pod autoscaling for automated resource optimization.

In addition to the three categories, each containing its own metrics, we also created a scoring system to compare stakeholders’ hybrid cloud migration progress. (Please leave a comment if you are interested in learning more about our modernity metrics and scoring system.)
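The details of our scoring system are beyond this article’s scope, but a weighted roll-up of per-index scores gives the general flavor. Everything in the sketch below is hypothetical: the index names, weights, and 0-to-1 normalization are illustrative placeholders, not our actual system.

```python
# Hypothetical illustration only: the index names, weights, and 0-1
# normalization below are examples, not our actual scoring system.

MODERNITY_INDICES = {
    "monitoring_and_security": {"golden_signals_coverage": 0.20, "vulnerability_scanning": 0.10},
    "adoption":                {"multi_zone": 0.20, "multi_region": 0.20},
    "automation":              {"hpa_enabled": 0.15, "vpa_enabled": 0.15},
}

def modernity_score(app_metrics: dict) -> float:
    """Weighted sum of per-index scores (each index normalized to 0-1)."""
    total = 0.0
    for indices in MODERNITY_INDICES.values():
        for index, weight in indices.items():
            total += weight * app_metrics.get(index, 0.0)
    return round(100 * total, 1)   # report on a 0-100 scale

# An application with full multi-zone adoption and partial horizontal autoscaling:
print(modernity_score({"golden_signals_coverage": 1.0, "multi_zone": 1.0, "hpa_enabled": 0.5}))
```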

Step 2 — Create ETL scripts and data pipelines

Once we collected all the requirements, we built the extract, transform, and load (ETL) scripts that target specific datasets to produce consumption-ready output. Our data sources provide various interfaces for data integration: APIs, SQL database queries, CSV files, and more. Our stakeholders also have various reporting mechanisms: dashboards, CSVs, APIs, and data warehouse access.

To accommodate these disparate requirements, we established a robust pipeline with the following capabilities:

Data-ingestion module with:
· A reusable API module for multiple data sources
· Database connection management, including Db2 and PostgreSQL
· Database queries
· CSV file reading

Data cleansing and normalization operations for:
· Removal of stale or invalid records
· Splitting into granular fields
· Case normalization

Data transformations to support:
· A reporting dashboard for the front-end user interface
· CSV generation for different teams’ needs
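To make these capabilities concrete, here is a minimal sketch of how one such pipeline hangs together. The function names, file name, and field names are illustrative, not our actual modules.

```python
# Minimal ETL sketch; function, file, and field names are illustrative.
import csv

import requests


def extract_from_api(url: str, token: str) -> list[dict]:
    """Pull raw records from a REST data source."""
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"}, timeout=30)
    resp.raise_for_status()
    return resp.json()


def extract_from_csv(path: str) -> list[dict]:
    """Read raw records from a CSV export."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def cleanse(records: list[dict]) -> list[dict]:
    """Drop stale records and normalize keys and values."""
    return [
        {k.lower().strip(): str(v).strip() for k, v in r.items()}
        for r in records
        if r.get("status") != "retired"   # remove stale or invalid records
    ]


def transform(records: list[dict]) -> list[dict]:
    """Reshape cleansed records into consumption-ready rows."""
    return [{"app_id": r["id"], "zone_count": int(r.get("zones") or 0)} for r in records]


if __name__ == "__main__":
    raw = extract_from_csv("apps_export.csv")   # or extract_from_api(...) for API sources
    ready = transform(cleanse(raw))
    print(f"{len(ready)} consumption-ready rows")
```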

Step 3 — Package scripts to Python libraries

As our projects grew in size and complexity, we built reusable and robust libraries shared across all our pipelines to handle data ingestion, cleansing, and transformation. For reusability, packaged Python libraries help us encapsulate functionality in classes, hide the implementation, and provide simple-to-use interfaces that our peer teams can use. For robustness, packaged Python libraries make unit tests much more maintainable across our code base, and we are on track to meet a high unit-test coverage rate in Q1 2023.
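As a simplified example, a packaged library module might look like the following; the package name (cio_data_pipelines), class name, and paging convention are hypothetical.

```python
# Hypothetical module inside an installable package, e.g.
# cio_data_pipelines/ingest.py; names and paging convention are illustrative.
import requests


class ApiSource:
    """Small public interface for peer teams; implementation details stay hidden."""

    def __init__(self, base_url: str, token: str):
        self._base_url = base_url.rstrip("/")
        self._token = token   # private: callers never handle auth directly

    def fetch(self, resource: str) -> list[dict]:
        """Return all records for a resource, hiding paging and authentication."""
        records, page = [], 1
        while True:
            resp = requests.get(
                f"{self._base_url}/{resource}",
                params={"page": page},
                headers={"Authorization": f"Bearer {self._token}"},
                timeout=30,
            )
            resp.raise_for_status()
            batch = resp.json()
            if not batch:   # an empty page marks the end of the data
                return records
            records.extend(batch)
            page += 1
```

A peer team only needs to install the package and call ApiSource(base_url, token).fetch("applications"); the paging and authentication logic can change underneath without breaking callers, and that same encapsulation keeps unit tests small and maintainable.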

Step 4 — Establish a data warehouse

We use a data warehouse to store our metrics persistently for reporting, analyzing adoption progress, forecasting migration with machine learning and artificial intelligence (ML/AI), and more. We decided that IBM Db2 Warehouse on Cloud was the most effective solution for our team because:

· It is highly optimized for read-query performance

· It is optimized for storing large amounts of data

· It is low maintenance

· It enables high availability and disaster recovery by default

· It is hosted on IBM Cloud
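To illustrate how a pipeline writes into the warehouse, the sketch below inserts a batch of metric rows with the ibm_db Python driver; the connection values, table name, and columns are placeholders rather than our actual schema.

```python
# Sketch of loading consumption-ready rows into Db2 Warehouse on Cloud.
# Connection values, table name, and columns are placeholders.
import os

import ibm_db   # IBM-provided Python driver for Db2

conn_str = (
    "DATABASE=BLUDB;"
    f"HOSTNAME={os.environ['DB2_HOST']};"
    "PORT=50001;PROTOCOL=TCPIP;SECURITY=SSL;"
    f"UID={os.environ['DB2_USER']};PWD={os.environ['DB2_PASSWORD']};"
)
conn = ibm_db.connect(conn_str, "", "")

insert_sql = (
    "INSERT INTO MODERNITY_METRICS (APP_ID, METRIC_NAME, METRIC_VALUE, SNAPSHOT_DATE) "
    "VALUES (?, ?, ?, CURRENT DATE)"
)
stmt = ibm_db.prepare(conn, insert_sql)

rows = [("app-123", "multi_zone", 1.0), ("app-123", "hpa_enabled", 0.5)]
for row in rows:
    ibm_db.execute(stmt, row)   # parameterized insert, one row per call

ibm_db.close(conn)
```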

Step 5 — Set up cron jobs

We started with ad-hoc reports as proofs of concept to ensure stakeholders agreed on the output. For each use case we defined, we went through a few iterations to refine the quality and clarity of the data output.

Once we met the requirements and were happy with the outcomes, we needed a scalable way to produce our consumption-ready data on a regular schedule. Using our existing tools, we containerized our ETL applications, hosted them in our IBM hybrid cloud platform, and ran them on a cron schedule. The cron jobs produce a steady stream of data loaded into our Db2 Warehouse on Cloud for consumption using dashboards and other reporting approaches.
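For a sense of what one of these scheduled jobs looks like from the inside (every name below is illustrative), the container’s entry point performs a single end-to-end pass and exits, leaving the scheduling to the platform’s cron mechanism:

```python
# Illustrative entry point for one containerized ETL job. The platform's cron
# schedule launches the container on a fixed cadence, so the script performs a
# single end-to-end pass and exits; every name here is a stand-in.
import logging
import sys


def run_once() -> int:
    """Run one extract-transform-load pass and return the number of rows loaded."""
    # In the real job these steps call our packaged ingestion, cleansing, and
    # warehouse-loading libraries; they are stubbed here for illustration.
    records = [{"app_id": "app-123", "multi_zone": 1.0}]                      # extract (stubbed)
    rows = [(r["app_id"], "multi_zone", r["multi_zone"]) for r in records]    # transform
    logging.info("Would load %d rows into the warehouse", len(rows))          # load (stubbed)
    return len(rows)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        logging.info("Run complete: %d rows", run_once())
    except Exception:
        logging.exception("ETL run failed")
        sys.exit(1)   # a non-zero exit code lets the platform flag the failed run
```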

Step 6 — Report with dashboards

No data pipeline is complete without a proper consumption mechanism. Because we set up our data warehouse to be consumption-ready at the outset, it is straightforward to create dashboards and reports by reading from it. We chose the Apache Superset data exploration and visualization platform because it offers a complete self-service package for end users. It provides:

· Granular permissions for different user groups

· Custom SQL query capability

· Dashboards accessible through public links

· Report downloads
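Most of this consumption is self-service through Superset’s UI, but the same content is also reachable programmatically through Superset’s REST API. The sketch below logs in and lists dashboards; the host and service-account credentials are placeholders, and the routes shown are Superset’s standard /api/v1 endpoints.

```python
# Sketch of programmatic access to Superset's REST API; the host and the
# service-account credentials are placeholders.
import requests

SUPERSET = "https://superset.example.internal"   # placeholder host

# Authenticate to obtain a bearer token.
login = requests.post(
    f"{SUPERSET}/api/v1/security/login",
    json={"username": "reporting-bot", "password": "***", "provider": "db", "refresh": True},
    timeout=30,
)
login.raise_for_status()
token = login.json()["access_token"]

# List the dashboards this account is permitted to see.
resp = requests.get(
    f"{SUPERSET}/api/v1/dashboard/",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
for dashboard in resp.json()["result"]:
    print(dashboard["id"], dashboard["dashboard_title"])
```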

Where are we today?

Today our pipeline readily ingests, transforms, and loads consumable data into our warehouse. Our dashboards read from our warehouse to provide vital key performance indicators (KPIs), statistics, and reports to stakeholders across our organization. And Superset gives our stakeholders the flexibility to explore data and create dashboards for their purposes.

What’s next?

As with many organizations, our data access is not yet fully streamlined. We are working with stakeholders to encourage API development and adoption wherever possible, because APIs provide the advantage of a data abstraction layer. They also bring security benefits and streamlined access methods.

Second, we are investigating how we can structure future data sources, especially graph databases, for ML operations.

We learned a lot through this process. Our main takeaways are:

· Take the time to architect the complete pipeline to identify any gaps in requirements and create checkpoints and timelines to completion.

· Communicate closely and often with stakeholders to ensure the execution is on the right track.

· Always think about how the current work will affect future projects to identify areas of opportunity for optimization.

John Lee is a Data Engineer at IBM based in RTP, NC. This article is co-authored by Chloe Wang — Leader/Architect of Data & Analytics based in Armonk, NY. The above article is personal and does not necessarily represent IBM’s positions, strategies or opinions.
