Armatron — ETL, done well
By: Eric Slick
If you missed the first part on why we chose to build our own Swiss Army knife ETL solution, check out our “Why Armatron?” post!
Logistics Wins Wars
TrueCar is in a constant struggle for access to data. It is our lifeblood. To succeed, we must deliver a constant stream of data to our frontline troops, so they can bring the best possible consumer experience to the car buying process.
Armatron is an important part of that logistics backbone, and it solves a deceptively simple problem: the need to transfer files and database data from one location to another in a timely and reliable manner.
Our Original Solution Was Problematic
Our original solution grew organically over time and was highly decentralized. We used many different (and often expensive) third-party tools, and each team was responsible for how it obtained its own data. Some solutions were designed for engineers and others for data analysts, and still others were simple one-off scripts.
As we grew, this approach created an unacceptable amount of risk, resulting in a logistics backbone that was growing more and more brittle: There were just too many pieces in too many different locations working in too many different ways. And worst of all, knowledge of how they worked was being lost over time.
Our fix for this problem was Armatron. Armatron provides us with a single tool that centralizes all our data transfer and encourages shared knowledge, which makes managing jobs much easier.
For a more in-depth look at the “why” behind the decision to build Armatron, see our “Why Armatron?” post.
This blog post talks about some of the interesting things Armatron does. It’s not a comprehensive or in-depth exploration of all its capabilities. But we hope you’ll find what it does cover interesting and useful in your own struggle with data transfer.
Easy to Use
Armatron was designed from the ground up for the non-technical (but savvy) user, with a straightforward interface for defining, scheduling and monitoring jobs. Users do not need to be programmers or data engineers.
Contextual, Adaptive User Interface
We chose to keep all job definitions on a single user page. To help with the potential clutter of a single-page interface, we show and hide different interface elements based on the selections being made. We intentionally avoided a multi-step process for defining jobs as we felt a single page would be easier on our end users.
While this added complexity to the development of Armatron, we think the tradeoff was worth it. On a single page, we can define a job (source, destination(s), transforms, options, etc.), control access to it and manage how to receive notifications.
Passports and Encryption Keys
Connection information is often shared between jobs, so we developed the concept of a passport. A passport is an object that contains all the information needed to connect to a server (FTP, S3, Postgres, etc.), along with controls over who has access to it. Once defined, a passport is available to use in any job with a simple selection from a drop-down.
We handle encryption keys the same way. Our users can define them and control access to them, and they are available from a simple drop-down selection.
This allows jobs to share passports and encryption/decryption keys easily and securely, which provides us with a great deal of re-usability and makes managing keys and connection information much easier.
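Since Armatron runs on Sidekiq, it is presumably a Ruby application, so here is a minimal Ruby sketch of the passport idea. The names and fields are illustrative, not Armatron's actual schema; the point is that connection details and an access list travel together as one reusable object.

```ruby
# Hypothetical sketch of a "passport": a reusable bundle of connection
# details plus an access-control list, shared across jobs.
Passport = Struct.new(:name, :protocol, :host, :credentials, :allowed_users,
                      keyword_init: true) do
  # Only users on the allow list may attach this passport to a job.
  def usable_by?(user)
    allowed_users.include?(user)
  end
end

sftp_passport = Passport.new(
  name: "vendor-sftp",
  protocol: :sftp,
  host: "sftp.example.com",
  credentials: { user: "etl", key_path: "/secrets/vendor.pem" },
  allowed_users: ["alice", "bob"]
)

# Any job can now reference the passport by name instead of duplicating
# host and credential details.
sftp_passport.usable_by?("alice") # => true
```

Encryption keys would work the same way: define once, gate access, select from a drop-down in any job.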
Powerful Notifications
We have a powerful notification system that allows users to select events that will trigger notifications: success, failure, abort, file not found, among others. Notifications can be sent to PagerDuty, Slack, email, and even a database. Plus, we can define multiple notifications with different rules for who gets notified for which events. This has proven very useful for multi-layered monitoring.
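An illustrative Ruby sketch of such per-event rules (not Armatron's real schema): each rule subscribes to a set of events and names a channel plus recipients, and a job event fans out to every matching rule.

```ruby
# Hypothetical notification rule: which events it cares about,
# which channel to hit, and who to tell.
NotificationRule = Struct.new(:events, :channel, :recipients, keyword_init: true)

RULES = [
  NotificationRule.new(events: [:failure, :abort],
                       channel: :pagerduty, recipients: ["data-oncall"]),
  NotificationRule.new(events: [:success, :failure, :file_not_found],
                       channel: :slack, recipients: ["#etl-jobs"]),
  NotificationRule.new(events: [:failure],
                       channel: :email, recipients: ["job.owner@example.com"])
]

# Fan a job event out to every rule subscribed to it.
def channels_for(event, rules)
  rules.select { |r| r.events.include?(event) }.map(&:channel)
end

channels_for(:failure, RULES) # => [:pagerduty, :slack, :email]
channels_for(:success, RULES) # => [:slack]
```

Multi-layered monitoring falls out naturally: a failure can page on-call, post to Slack and email the owner, all from independent rules.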
Access and Transparency
Users have control over who can see or edit their jobs, and we provide them a user-defined dashboard so they can monitor the activity of their jobs. In addition, they can view the history of all job runs and view all the logging associated with each run.
Fun Fact: We use Armatron jobs to test Armatron functionality and inform developers when jobs are having problems.
While we strive for ease of use, we also provide powerful options to help job creators control exactly how their jobs should behave.
Many Sources and Destinations
Data can be moved from many different sources (FTP, FTPS, SFTP, S3, BigQuery, Postgres, Redshift, SQL Server, Google Cloud Storage, Google Drive, HTTPS, HDFS, Web APIs) to many different destinations (FTP, FTPS, SFTP, S3, HDFS, Postgres, Redshift, SQL Server).
Dynamic Substitutions
Our users need to determine certain things at runtime, such as the path to the source files and the path of the files on the destination server. Database queries, too, often need values that are known only at runtime.
So we wrote a small but powerful DSL that allows our users to insert just about anything they need into source paths, destination paths and database queries. As a simple example, time is often an important factor in determining the location of files or of data in a database query. A source path template can embed the job’s run date so that, for a job running on January 23, 2019, it expands to a path like “files/2019/01/23/data_*.txt”: all files inside the folder “files/2019/01/23/” that start with “data_” and end with “.txt” will be found and moved.
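To make the expansion concrete, here is a toy Ruby version of dynamic substitution. The {{YYYY}}/{{MM}}/{{DD}} token syntax is our own invention for this sketch and may differ from Armatron's real DSL, but the idea is the same: tokens are expanded against the job's scheduled time before paths are resolved.

```ruby
# Toy dynamic substitution: expand hypothetical date tokens in a path
# template using the job's scheduled run time (not the wall clock).
def substitute(template, scheduled_at)
  template
    .gsub("{{YYYY}}", scheduled_at.strftime("%Y"))
    .gsub("{{MM}}",   scheduled_at.strftime("%m"))
    .gsub("{{DD}}",   scheduled_at.strftime("%d"))
end

path = substitute("files/{{YYYY}}/{{MM}}/{{DD}}/data_*.txt",
                  Time.new(2019, 1, 23, 12, 0, 0))
# path == "files/2019/01/23/data_*.txt"
```

The same expansion step can feed destination paths and SQL queries, which is what makes one small DSL useful across the whole job definition.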
Date/time substitutions are common, but this is only one aspect of the DSL. We can also create variables, apply regex, access parts of the source path and more. This gives job creators plenty of control over where the source files are located and where they will end up on the destination server. The number of options available to job creators is too extensive for this post, but suffice it to say that if you need it, it can probably be done with Dynamic Substitutions.
Transforms
We support a number of transforms. Some are quite common, and others are more suited to our specific needs. For instance, users can…
- Remove CSV Columns
- Rename CSV Columns
- Flatten Avro files into CSV (and control what data is converted into the CSV file)
- Compress or Decompress files (gzip and zip)
- Decrypt and Encrypt files (single or multi-key)
Note: Transforming data is not the primary role of Armatron, but we’ve included common transforms based on our own use cases as a convenience for the job creator.
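As one example from the list, gzip compression and decompression can be sketched with Ruby's standard Zlib library. Armatron's own transform pipeline is surely more involved; this only shows the underlying operation.

```ruby
require "zlib"
require "stringio"

# Gzip-compress a string in memory.
def gzip(data)
  buffer = StringIO.new
  gz = Zlib::GzipWriter.new(buffer)
  gz.write(data)
  gz.close
  buffer.string
end

# Inverse: decompress a gzip payload back to the original bytes.
def gunzip(compressed)
  Zlib::GzipReader.new(StringIO.new(compressed)).read
end

round_trip = gunzip(gzip("col_a,col_b\n1,2\n"))
# round_trip == "col_a,col_b\n1,2\n"
```

In a transfer job, a transform like this sits between reading from the source and writing to the destination, so the file never needs to land on disk uncompressed.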
Database Support
Armatron can talk to PostgreSQL, Redshift, SQL Server and BigQuery. We also take advantage of Redshift’s ability to send data directly to S3, and of BigQuery’s ability to send CSV/JSON files directly to Google Cloud services.
For database operations, we support multi-step processes, such as running ETL queries to populate an aggregate table before pulling data from it.
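A hedged sketch of such a multi-step job: the preparatory statements run first, and only the final statement's result set is extracted. The SQL and table names below are made up for illustration.

```ruby
# Classify each step of a multi-step database job: every step but the
# last is preparation; the last step produces the rows to extract.
def plan(steps)
  steps.each_with_index.map do |sql, i|
    role = (i == steps.length - 1) ? :extract : :prepare
    [role, sql]
  end
end

steps = [
  "CREATE TEMP TABLE daily_totals AS " \
  "SELECT dealer_id, SUM(amount) AS total FROM sales GROUP BY dealer_id",
  "SELECT dealer_id, total FROM daily_totals ORDER BY total DESC"
]

# A runner would execute each step in order on one connection,
# streaming only the :extract step's rows to the destination.
plan(steps).map { |role, _| role } # => [:prepare, :extract]
```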
Jobs Trigger Other Jobs
Jobs often need to run one or more jobs after they are finished. This is easy in Armatron. We also make it easy to see which jobs are triggered by other jobs, including showing the entire chain of jobs triggering jobs.
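Conceptually, the trigger relationships form a graph that can be walked to show the whole chain, roughly what the Armatron UI displays. This Ruby sketch uses made-up job names and a plain hash in place of whatever Armatron actually stores.

```ruby
# Hypothetical trigger map: each job lists the jobs to enqueue
# when it finishes.
TRIGGERS = {
  "pull_vendor_feed" => ["normalize_feed"],
  "normalize_feed"   => ["load_warehouse", "archive_raw"],
  "load_warehouse"   => [],
  "archive_raw"      => []
}

# Depth-first walk of every job reachable from a starting job.
def chain(job, triggers, seen = [])
  return seen if seen.include?(job) # guard against accidental cycles
  seen << job
  triggers.fetch(job, []).each { |nxt| chain(nxt, triggers, seen) }
  seen
end

chain("pull_vendor_feed", TRIGGERS)
# => ["pull_vendor_feed", "normalize_feed", "load_warehouse", "archive_raw"]
```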
Job Execution and Monitoring
Armatron currently uses Sidekiq to run our jobs. This suits our needs quite well, as Sidekiq assumes chaos and is good at rerunning jobs that fail. However, we do not stop there.
Every job that runs is monitored, and the monitor will notify us if a job is running longer than expected. In addition, we have another monitor that checks if a job has failed to run at its scheduled time. In both cases, we let the job owners know via the notifications they’ve already set up.
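The two monitors can be sketched as simple predicates; the field names and thresholds here are illustrative, not Armatron's actual code.

```ruby
# Monitor 1: is a run exceeding its expected duration?
def running_too_long?(started_at, expected_seconds, now)
  (now - started_at) > expected_seconds
end

# Monitor 2: has a scheduled run failed to start within a grace period?
def missed_schedule?(scheduled_at, started_at, grace_seconds, now)
  started_at.nil? && (now - scheduled_at) > grace_seconds
end
```

When either predicate fires, the alert routes through the notification rules the job owner has already configured, so no separate escalation setup is needed.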
Extensive, Structured Logging
Armatron lends itself well to structured logging. All logs are stored in our database, tied directly to the job run that generated them, and displayed on that job run’s details page, which makes viewing the right logs dead simple. All logs are sent to a traditional logging store as well, but we rarely need to look at them: in nearly every case, all we care about is what happened with a specific job run, and viewing those logs directly in Armatron is extremely convenient.
Since these logs are visible to our users, they can see exactly what their job is doing. This allows our users to self-diagnose issues that they can often fix themselves without needing to talk to a developer.
Fun Fact: Old structured logs are archived to S3 and then safely deleted from the database using Armatron jobs.
Guaranteed Job Runs
All jobs will run in the context of their scheduled time regardless of when they actually run. If a job was scheduled to run at noon, but it actually ran at 12:30pm, when the job does run, it will still behave as if it’s running at noon. This allows job creators to reliably use time as a means to determine the location of files or the destination of transferred files.
Some of our jobs need to run every possible scheduled time regardless of outages. So even if Armatron were down for hours or days, once it came back up, these jobs would still run for every missed time.
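A sketch of that catch-up behavior, assuming for illustration an hourly schedule: enumerate every missed slot between the last completed run and now, then run each slot in its own scheduled-time context (as described above, each run behaves as if it ran exactly on time).

```ruby
# Enumerate every scheduled slot missed between the last completed run
# and now, for a fixed-interval schedule (hourly by default).
def missed_slots(last_run, now, interval_seconds = 3600)
  slots = []
  t = last_run + interval_seconds
  while t <= now
    slots << t
    t += interval_seconds
  end
  slots
end

last_run   = Time.new(2019, 1, 23, 12, 0, 0) # last slot that completed
back_up_at = Time.new(2019, 1, 23, 15, 0, 0) # when Armatron recovered
missed_slots(last_run, back_up_at).length
# => 3 (the 13:00, 14:00 and 15:00 slots, each run in its own context)
```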
We’ve worked hard to keep Armatron stable. To that end, we use Armatron to test Armatron. We do this through a suite of jobs in our QA environment that exercises a variety of Armatron sources, destinations and features. Using a notifier, we inform a special Slack channel when any of these jobs fail. Like canaries in a coal mine, if our code breaks something, it will most likely show up through one of these jobs. This method has allowed us to significantly reduce the number of regressions introduced to our production environment.
Fun Fact: Armatron runs thousands of jobs every day and has been running stably for two years.
We could go into much more detail; this has been only a limited, high-level view of Armatron. But we hope it shows that Armatron is an easy-to-use, powerful and centralized tool for delivering data from one location to another, one that anyone can use to build and manage jobs.
In the end, it’s all about logistics and how good we are at moving data from its source to its proper destination. Armatron tries to make that easy, robust and powerful.
We hope you enjoyed reading about how we solved our ETL woes at TrueCar. If you missed the first part of this two-part post, check out our “Why Armatron?” post!
We are hiring! If you love solving problems, please reach out; we would love to have you join us!