By: David Giffin
TrueCar Has a Heart and We Need to Keep it Beating
At the heart of TrueCar is data: lots of it, coming from many different places. Automotive data is generally messy; it's manually entered by dealers into a wide assortment of systems. The industry also relies heavily on file transfer protocol (FTP): we pull data from and push data to FTP servers for partner reporting, inventory feeds, transaction feeds, and more. On top of that, we must consume various APIs and gather data from many automotive data vendors such as Chrome and Vast.
By 2014, TrueCar's data movement solution was a mishmash of organically grown tools and scripts spread across many code bases and platforms. When developers encountered a new data movement need or "extract, transform, load" (ETL) challenge, they would often create a new tool for the job.
This led to data movement sprawl. We had DEX, Data Importer, Data Exporter, and many other internal services, all built by various teams over time. They all provided the same basic functionality: get data from a source and move it to a destination. Additionally, many of these tools were in scope for SOX compliance, which required additional controls around creating, modifying, and deploying data movement jobs.
For example, one of the tools we relied on heavily was Talend, an off-the-shelf product for data integration and ETL that had been purchased for our data warehouse team. It provided an Eclipse plugin for editing jobs, but that proved difficult to work with. Our teams ran into Java dependency issues, so they used Talend only as a DAG scheduler and shelled out to external scripts to perform the actual ETL work. Given these complexities and its general user-unfriendliness, the way we implemented Talend never used it to its full capacity.
Finding the Right Tool for Data Movement
Around 2015, we started looking into consolidating data movement into a single tool. We originally considered using Talend as that tool, but after talking with developers and surveying our current jobs, we quickly realized that developers simply did not enjoy using Talend.
The other internal tools we had for data movement each dealt with only a single, specific source-and-destination pair. They were not built to handle the full range of our data movement and ETL woes: none of them could move data from FTP to PostgreSQL, PostgreSQL to FTP, or SQL Server to PostgreSQL.
Around that time, we decided to replatform TrueCar's technology stack, moving to Ruby on Rails back-end services hosted on the AWS cloud. None of our existing data movement tools aligned with the new stack, so we decided to write yet another data movement tool on top of it. We were obviously adding to the data movement tool sprawl, but we wanted to invest in the future and build a tool that people actually enjoyed working with.
Building Armatron and DataMover
When we started building Armatron, we didn't originally intend for it to be the one tool for moving all our data. We initially took a focused approach and tackled the one thing we knew we would do all the time in AWS: moving files between S3 and FTP servers. Using Sidekiq as a background processor with cron-like scheduling gave us the basic building blocks for a simple data movement tool and solved an important part of our data movement needs.
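The building blocks described above can be sketched in a few lines of Ruby. This is a minimal illustration with hypothetical class names (`InMemoryAdapter`, `FileTransferJob`), not Armatron's actual code: in the real system, a job like this would include `Sidekiq::Worker` so that Sidekiq supplies queueing and retries, a cron-style scheduler would enqueue it, and the adapters would wrap the AWS S3 SDK and Ruby's `Net::FTP` rather than an in-memory hash.

```ruby
# Hypothetical sketch of an S3 <-> FTP transfer job. In production the
# adapters would wrap aws-sdk-s3 and Net::FTP; here an in-memory stand-in
# shows the shape. An adapter only needs #read(path) and #write(path, data).
class InMemoryAdapter
  def initialize
    @files = {}
  end

  def read(path)
    @files.fetch(path)
  end

  def write(path, data)
    @files[path] = data
  end
end

class FileTransferJob
  # In production: include Sidekiq::Worker, and let Sidekiq deliver the
  # job arguments from the queue on a cron-like schedule.
  def perform(source, source_path, destination, destination_path)
    data = source.read(source_path)
    destination.write(destination_path, data)
    data.bytesize # bytes moved, handy for logging and monitoring
  end
end
```

Because the job only talks to the adapter interface, swapping S3 for FTP (or anything else) on either side is just a matter of passing a different adapter.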
Elsewhere in the company, around the same time, another team set out to address its own data movement needs using our new technology stack. Their effort focused on generalizing how we handled data sources (beyond just S3 and FTP) and on performance when moving larger volumes of data. It ultimately led to DataMover, an internal gem that aimed to connect any kind of data source to any kind of destination.
At that point, we had Armatron, the easy-to-use interface for defining and managing jobs, and DataMover, packaged up as a Ruby gem to do the heavy lifting for moving all kinds of data in a performant way. This opened up possibilities for many new locations we could move data between, including databases.
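The any-source-to-any-destination idea can be illustrated with a hypothetical sketch (this is not DataMover's actual API): as long as every source implements `each_batch` and every destination implements `write_batch`, the mover never needs to know whether it is talking to S3, FTP, or a database, and data streams through in batches rather than being loaded entirely into memory.

```ruby
# Hypothetical sketch of a gem that connects any source to any destination.
# Real adapters would stream from S3, FTP, PostgreSQL, etc.; the mover only
# ever sees the two-method interface.
class ArraySource
  def initialize(rows, batch_size: 2)
    @rows = rows
    @batch_size = batch_size
  end

  # Yield rows in batches so large datasets are never fully in memory.
  def each_batch
    @rows.each_slice(@batch_size) { |batch| yield batch }
  end
end

class ArrayDestination
  attr_reader :rows

  def initialize
    @rows = []
  end

  def write_batch(batch)
    @rows.concat(batch)
  end
end

module DataMoverSketch
  # Stream every batch from source to destination; returns rows moved.
  def self.move(source, destination)
    count = 0
    source.each_batch do |batch|
      destination.write_batch(batch)
      count += batch.size
    end
    count
  end
end
```

With this shape, supporting a new system means writing one adapter, not a new pairwise tool: every existing source immediately works with the new destination and vice versa.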
With these new capabilities, we made Armatron the de facto data movement tool and began expanding Armatron and DataMover to handle everything needed to bring legacy processes under Armatron's care. That meant taking on the huge effort of migrating off the legacy tools: between Talend and DEX alone, we had hundreds of data movement jobs that produced the majority of our internal reports as well as reporting for our external partners and dealers.
A Stronger Heartbeat
This was a massive project that took many months to complete, but one by one, we removed our reliance on the legacy data movement tools, deleted one-off scripts, and brought their jobs under Armatron's management. We finally had a single tool, and with it, a greatly simplified data movement backbone.
It’s hard to overstate the value of having a centralized place for all our data movement. It’s far easier to maintain jobs when knowledge of the tool is shared among everyone. Visibility into how data is moved has been massively improved. Brittleness has been removed from the system. Overall, risk and repetitive manual work have been dramatically reduced.
Our data movement heartbeat has never been healthier!
We hope you enjoyed this post on the motivation behind building our own ETL framework. Read the next post in this two-part series to find out exactly how we did it and what makes Armatron tick!
We are hiring! If you love solving problems, please reach out; we would love to have you join us!