How Standardized Tooling and Metadata Saved Our Data Organization
At KeepTruckin, data is at the center of how we operate. It powers all of our customer-facing products, such as our fleet management platform, AI dashcams, fuel optimization and more. Data also drives how we execute internally as a team, informing sales, marketing, growth and financial analytics.
Our data has grown exponentially alongside the company, and our internal processes, tooling and best practices must scale with it. In this blog, we discuss the data challenges we face at KeepTruckin and how the Data Platform team has tackled them by building standardized tooling that exposes powerful metadata we can leverage to scale our data organization.
We’ll dive into how we use standardized tooling to populate data catalogs, release bug fixes and feature enhancements across all data pipelines, and grow our data ecosystem while remaining operationally stable.
What’s So Difficult About Data Processing at KeepTruckin?
Any data organization will face some of the challenges below, but a few nuances make tackling them an even more fun and challenging endeavor at KeepTruckin.
Diverse Data Sources
We have a diverse data ecosystem that requires us to support a wide variety of data pipelines and customizations. We are a software company powered by custom-made hardware, selling multiple integrated products that work closely together. KeepTruckin sells vehicle gateways to track commercial vehicles, asset gateway devices to track trailers and other heavy equipment, and dashcams that provide our customers with road and driver facing videos. We’ve built software products for tracking, load matching, fuel management, fleet management, driver coaching, and more. We also ingest data from third-party sources and vendors to augment and enhance our products and internal analytics.
Data from multiple sources is ingested into KeepTruckin’s internal data ecosystem as both structured and unstructured data, in all shapes, sizes and volumes:
- IoT data (AWS IoT)
- Dashcam data (videos)
- KeepTruckin Application data (stored in PostgreSQL / Kafka / AWS S3)
- Third-party data (Salesforce / Pendo / Mixpanel / etc…)
- Ad hoc data (CSV uploads / Google Sheets)
Each piece of data can present its own unique challenge and may require specialized technologies and solutions. For example, IoT data generally arrives as many small files and must be compacted first; otherwise, Spark pipelines cannot efficiently read them all. In addition, diagnostic data constantly changes schema as we add more logging and parameters, which requires a solution that can store evolving semi-structured data or blobs. When building a data platform, we have to consider how we can scalably support different types of data pipelines and customizations.
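The small-files compaction idea can be sketched in plain Python: many tiny newline-delimited JSON files are merged into one larger file before a distributed job reads them. The file layout and record fields below are illustrative, not our actual ingestion code.

```python
import json
import os
import tempfile

def compact_small_files(input_paths, output_path):
    """Merge many small newline-delimited JSON files into one larger file,
    so a downstream Spark job opens one file instead of thousands."""
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line:
                        out.write(line + "\n")

# Illustrative usage: compact 100 tiny device-event files into one.
tmp = tempfile.mkdtemp()
small_files = []
for i in range(100):
    path = os.path.join(tmp, f"events_{i}.json")
    with open(path, "w") as f:
        f.write(json.dumps({"device_id": i, "odometer": i * 10}) + "\n")
    small_files.append(path)

compacted = os.path.join(tmp, "compacted.json")
compact_small_files(small_files, compacted)
```

In production this job runs on distributed storage and schedules, but the principle is the same: pay the file-listing cost once, up front.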
Large Volume of Data
While our company is growing at a very fast pace, our data is growing even faster. Our data growth requires us to build systems that can support large amounts of storage and processing.
As of the time of this blog (December 2021), we’re working with:
- 100s of TB of data
- Raw data: 3k+ tables
- Derived data: 700+ tables
- Complex dependency graphs between all of these tables
Reliable, Fresh, and Accessible Data
To be useful, data must be reliable, fresh and accessible. It is not enough to simply hoard tons of data. Customers, internal and external, must be able to trust and rely on the data so that they can build great products.
For a data engineer, building a data pipeline may be the first step to providing insights to our customers and building data-driven products. But how will people know the data is correct? What happens when the data pipeline breaks? When was the data last updated? These are some questions that we have to answer in order to build trust in the data products we are managing at KeepTruckin.
If we have done our job right, customers will be able to trust and rely on our data products and data insights.
Big Data Tooling Is Not Enough
There are a lot of data engineering challenges presented above. Over the last 10–20 years, different technical systems have solved different parts of the problem, enabling us to democratize data and data pipelines across organizations.
For example, Hadoop solved our big data storage challenge, allowing us to store large amounts of data in distributed file systems like HDFS, with query engines and database layers like Hive and HBase on top. Big data processing tools like MapReduce and Spark allowed us to run complex analytics over terabytes of data, and technologies like Oozie, Airflow and Luigi enabled us to orchestrate complex data engineering workflows. Visualization tools like Redash and Superset help us build insights from the data, and the newer area of data catalogs, such as Amundsen, Datahub and Marquez, allows us to understand the context of the data better.
At KeepTruckin, we use the types of tools mentioned above and still risk ending up with an unmanageable data ecosystem. To demonstrate why, let us look at an example in the next section.
Avoiding a Data Jungle Within an Organization
Imagine we have three teams within a small organization.
- Say you start with a single pipeline, syncing one table from the production database to the data warehouse to create the first data artifact. The team building it is mostly business analysts who work with SQL. They might use dbt and SQL to build the table transformation.
- Then let’s say you have a data science team, working with Pandas and Python notebooks every day. They want to use the data artifact from the first team and build a pandas transformation that will create a second data artifact.
- Finally, let’s say a third team has a large dataset they want to join with the second data artifact. Their data is in S3 and they need to perform distributed processing. So, they code up the transformation job in Spark to create a third data artifact.
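The first team's step might look like the following sketch, with sqlite3 standing in for both the production sync and the warehouse; the table and column names are hypothetical, not from our actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "Production" table, synced into the warehouse as-is.
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "acme", 100.0), (2, "acme", 50.0), (3, "globex", 75.0)],
)

# The analysts' SQL transformation produces the first derived data artifact.
conn.execute(
    """
    CREATE TABLE orders_by_customer AS
    SELECT customer, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer
    """
)
rows = conn.execute(
    "SELECT customer, total_amount FROM orders_by_customer ORDER BY customer"
).fetchall()
```

The second and third teams then read `orders_by_customer` from entirely different runtimes (pandas, Spark), which is exactly where the sprawl begins.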
The data pipelines within this example can quickly span multiple technologies, storages and (possibly) multiple teams. Now, imagine the organization isn’t limited to three data artifacts, but hundreds or thousands of data artifacts across multiple teams, as depicted below.
This is no longer simply an issue of how you can store big data, process big data or orchestrate multiple pipelines. As the number of data artifacts increases, the complexity of understanding and managing all of the data pipelines and data artifacts also increases. Stakeholders and developers end up unable to trust the data quality or understand the data context and data lineage within a complex data ecosystem.
The Solution: Standardization
We took a platform approach to managing the complex data ecosystem through the following steps:
- Build interfaces / APIs around design patterns (ETL)
- Promote best practices through these standardized APIs and tooling:
- Documentation / dependency lineage
- Ownership / alerting
- Testing / quality
- Standardizing the tooling around common patterns standardizes data pipeline metadata
We focused on standardizing how data moves through our system, building a layer around all our transformations and how those transformations are orchestrated. We avoided being too prescriptive about the actual underlying technologies, letting authors continue to build Spark, SQL or Pandas and store their data in different storage technologies.
Through this layer, we then have control over all types of transformations and data moving through our data ecosystem. By leveraging our standardized tooling, we can promote best practices and continuously push improvements across all of the data pipelines at KeepTruckin. We can ensure all pipelines are documented and have dependencies defined so data context and data lineage can be easily understood. We make sure owners and alerts are added so questions and issues can be routed to the right people. And we help make sure data tests are added to ensure data quality.
Finally, standardizing our tooling generates powerful metadata we can use for data discovery, data lineage, monitoring and insights.
In the next section, we’ll introduce the TableAPI and Pipeline frameworks built at KeepTruckin that help scale our data organization.
TableAPI and Pipelines at KeepTruckin
TableAPI builds a layer around how transformations are built. Pipelines build a layer around how those transformations are orchestrated and scheduled. We’ll take a look at an example of building a new data artifact for fleet management: tracking how much distance a driver has driven in a given week.
First, the data engineer will need to define a Schema for the data artifact that is being created. In the example below, we have three fields with data types and comments defined. Under the hood, we can deploy this schema to different data sources with database DDL statements. You can see here we encourage data engineers to write descriptive comments about the table and the columns upfront.
Then, data engineers can define their transformation function in their language of choice. They can write raw SQL, or code against Pandas or Spark DataFrames. In this example, the transformation will need to sum up the distance traveled, grouped by driver and week, so we can easily implement it in either SQL or using a DataFrame as seen below. When operating on DataFrames, the TableAPI framework will automatically fetch the dependent tables to be used directly in the implemented function.
Example: SQL Transformation
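A raw-SQL version of this transformation might look like the following sketch; the source table `driving_periods` and its columns are our assumptions, and sqlite3 stands in for the warehouse so the query can be sanity-checked:

```python
import sqlite3

# The raw SQL a data engineer might register with the framework.
# The source table `driving_periods` is a hypothetical upstream artifact.
TRANSFORMATION_SQL = """
    SELECT driver_id,
           week_start,
           SUM(distance_miles) AS distance_miles
    FROM driving_periods
    GROUP BY driver_id, week_start
"""

# Sanity-check the query against an in-memory database with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE driving_periods "
    "(driver_id INTEGER, week_start TEXT, distance_miles REAL)"
)
conn.executemany(
    "INSERT INTO driving_periods VALUES (?, ?, ?)",
    [(1, "2021-11-29", 120.5), (1, "2021-11-29", 80.0), (2, "2021-11-29", 45.0)],
)
result = conn.execute(TRANSFORMATION_SQL + " ORDER BY driver_id").fetchall()
```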
Example: Spark Transformation
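Since the internal API isn't public, here is the DataFrame flavor of the same logic sketched with pandas; a Spark version would use the analogous `groupBy`/`agg` calls, with the framework fetching and injecting the `driving_periods` dependency as the function argument:

```python
import pandas as pd

def transform(driving_periods: pd.DataFrame) -> pd.DataFrame:
    """Sum distance per driver per week. In the framework, the
    `driving_periods` DataFrame would be auto-injected from the
    table's declared dependencies."""
    return (
        driving_periods
        .groupby(["driver_id", "week_start"], as_index=False)["distance_miles"]
        .sum()
    )

# Illustrative input rows.
periods = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "week_start": ["2021-11-29", "2021-11-29", "2021-11-29"],
    "distance_miles": [120.5, 80.0, 45.0],
})
weekly = transform(periods)
```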
Next, the engineer will tie the transformation function and the schema together in a Table definition, which will define other information like dependencies (which is how the DataFrames above are auto-injected into the function call), what storage(s) to deploy this table to, and a table level comment.
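A hypothetical sketch of such a Table definition follows; every class and field name here is our stand-in for the internal API, chosen only to show which pieces get tied together:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Table:
    """Hypothetical Table definition tying schema, transform, and metadata."""
    name: str
    schema: Dict[str, str]    # column name -> type; stand-in for a Schema object
    transform: Callable       # the SQL or DataFrame transformation function
    dependencies: List[str]   # upstream tables, auto-injected into `transform`
    storages: List[str]       # which stores to deploy this table to
    comment: str = ""         # table-level documentation

driver_weekly_distance = Table(
    name="driver_weekly_distance",
    schema={"driver_id": "BIGINT", "week_start": "DATE",
            "distance_miles": "DOUBLE"},
    transform=lambda driving_periods: driving_periods,  # placeholder transform
    dependencies=["driving_periods"],
    storages=["snowflake", "hive"],
    comment="Total distance driven per driver per week",
)
```

The key point is that dependencies, storages, and documentation are declared data, not buried in code, so the platform can read them back out as metadata.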
Finally, we can schedule this table transformation by attaching it to a Pipeline with a cron schedule. Here we require the data engineer to define owners and add alerts, which is an extra push toward best practices. Under the hood, this can deploy the pipeline on Airflow and run the transformations on scalable infrastructure, whether Kubernetes or Spark. This layer gives us control over the infrastructure and the orchestration of the actual code and pipelines, so the data platform can optimize the infrastructure that runs all of the data pipelines. Data engineers, in turn, can focus on the business logic without having to write Airflow pipelines or know how to package up their code to run on Kubernetes or Spark.
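A hypothetical sketch of the Pipeline layer, showing how owners and alerts can be made mandatory; the class, fields, owner address, and alert channel are all our assumptions:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Pipeline:
    """Hypothetical Pipeline definition. Owners and alert channels are
    required at construction time, which is how the framework enforces
    best practices rather than merely suggesting them."""
    name: str
    tables: List[str]
    schedule: str              # cron expression handed to the scheduler
    owners: List[str]
    alert_channels: List[str]

    def __post_init__(self):
        if not self.owners:
            raise ValueError("Pipeline must declare at least one owner")
        if not self.alert_channels:
            raise ValueError("Pipeline must declare an alert channel")

fleet_pipeline = Pipeline(
    name="fleet_management_daily",
    tables=["driver_weekly_distance"],
    schedule="0 6 * * *",                  # every day at 06:00
    owners=["data-platform@example.com"],  # hypothetical owning team
    alert_channels=["#data-alerts"],       # hypothetical alert channel
)
```

Validating at definition time means an unowned pipeline fails code review and CI, long before it fails silently in production.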
TableAPI Successfully Scales Our Data Ecosystem
In this section, we’ll describe the many benefits of standardizing how data pipelines are developed.
Improving Platform and Infrastructure Development
TableAPI gives us a central way to push out improvements across all pipelines. In the last year, we’ve been able to deploy:
- Data store migrations (Snowflake / Hive)
- Bug fixes (Airflow scheduling bugs / Spark configurations / performance optimizations)
- Environment agnostic pipelines (production / staging / development environments using same code)
- Automatic schema evolution for tables
- Data testing frameworks across all pipelines (Great Expectations integrated with existing pipelines)
These types of changes would not have been possible without a central way to manage all of the different data pipelines and data artifacts that we are building at KeepTruckin. The Data Platform team plans to continue making new and bold changes to improve our infrastructure. TableAPI helps hide the complexity of the infrastructure, so data engineers can concentrate on the business logic while the platform team concentrates on improving the infrastructure.
Standardizing Data Pipeline Metadata
As mentioned earlier, standardizing our tooling standardized the data pipeline metadata across all of our pipelines, which lets us better monitor our data ecosystem. We can now look across the whole data landscape and identify pipelines that are failing abnormally or missing tests. We can also grade data pipelines on their health, based on whether they adhere to the latest recommendations, documentation, alerting and tests, to help push data engineers to build better pipelines. This metadata helps us plan what to work on next, which pipelines might need more support, and which best practices to promote across the organization.
Standardized metadata also powers our lineage and documentation tooling. Most lineage tools have to infer lineage from SQL queries or Spark jobs through complex parsing and profiling; we get it explicitly from our pipelines, at a quality and consistency that holds up even with hundreds of tables and data pipelines deployed. We can also feed richer metadata into data catalogs for data discovery and exploration.
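Because every table declares its dependencies explicitly, lineage reduces to a graph traversal over metadata rather than query parsing. A minimal sketch, with hypothetical table names:

```python
def upstream_lineage(table, dependencies):
    """Return every transitive upstream table of `table`, given the
    dependency metadata each table definition declares explicitly."""
    seen = set()
    stack = [table]
    while stack:
        current = stack.pop()
        for dep in dependencies.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# Hypothetical metadata collected from table definitions.
deps = {
    "driver_weekly_distance": ["driving_periods"],
    "driving_periods": ["raw_vehicle_gateway_events"],
    "fleet_scorecard": ["driver_weekly_distance"],
}
lineage = upstream_lineage("fleet_scorecard", deps)
```

The same traversal run in the other direction answers impact questions like "which downstream tables break if this raw table is late?"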
Growing the Data Ecosystem
During the last year, we’ve seen the number of data artifacts and pipelines double at KeepTruckin! Investing in TableAPI and Pipeline tooling allowed us to support this growth and remain operationally stable.
Below is the number of derived tables created from TableAPI during the last year.
KeepTruckin has a complex data ecosystem because we collect and process a large amount of diverse data for our many products. As a Data Platform team, it is our job to create trust in our data. The TableAPI and Pipeline tools we have built allow us to maintain high-quality data pipelines and promote best practices. The framework also helps our team scale by letting the Data Platform team make continuous updates to our data infrastructure across all data pipelines, and it exposes rich metadata that can be used for monitoring and data discovery to continuously enrich our data ecosystem.
We built TableAPI internally to manage the growing number of ETL pipelines we were building, and we have found a lot of success with it. We hope this helps teams consider how they might scale data engineering with similar principles. We’re looking forward to seeing how the data engineering ecosystem evolves toward standardization and core principles.
Finally, many thanks to the awesome KeepTruckin engineers and specifically our current Data Platform team (would like to call out Tianyao Zhang, Tina Bu, Angelica Heeney, Cynthia Xin) for helping scale KeepTruckin data engineering efforts as our data grows!