Wormhole: Unlocking the Power of Instant Data

Published in Driven by Code · 5 min read · May 31, 2019

By: Anil Gupta

This is the first post in a series describing how we at TrueCar implemented an ESB (Enterprise Service Bus) we call Wormhole.

At TrueCar, we have been working hard to revamp our technology platform as part of our Capsela initiative, which migrates our legacy platform from our two data centers to AWS (Amazon Web Services). Hadoop (Spark/MapReduce) gave us the ability to scale our data processing, but its batch nature meant latency ranging from a few minutes to hours. TrueCar prides itself on providing real-time, accurate vehicle pricing; it is one of our strongest value propositions.

It used to take a few hours to update vehicle pricing after our dealers changed their prices on Dealer Portal, the inventory and prospect/lead management website for our dealers. A latency of a few hours is far from optimal, since dealers need their changes reflected on our website as soon as possible. Besides the stale pricing between processing runs, dealers found it time-consuming and cumbersome to verify the latest price after each batch job execution.

These issues also made it impossible to build new user engagement features, such as notifying users of changes to their favorite/saved vehicles, price drops, and so on. In essence, consumers and dealers wanted to see the impact of their interactions with the product instantly. Hence, in 2017 we decided to build Wormhole, a highly scalable event processing system.

Choosing Event Bus/Publisher-Subscriber

We decided to do a POC with Kafka (Confluent Cloud) and Kinesis because we wanted:

  1. A persistent bus
  2. Support for multiple consumers subscribing to a single topic/stream
  3. High scalability and fault tolerance
  4. The ability to reprocess messages without relying on external storage like S3/HDFS (see the sketch after this list)
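To make requirement 4 concrete: both systems retain messages on the bus itself, so a consumer can rewind and reprocess. Here is a minimal sketch using Kinesis and boto3 (the stream name is hypothetical; Kafka offers the equivalent by rewinding consumer offsets):

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# A TRIM_HORIZON iterator starts at the oldest record still retained in the
# stream, so a consumer can reprocess everything in the retention window
# without ever staging data in S3/HDFS.
shard_iterator = kinesis.get_shard_iterator(
    StreamName="wormhole-poc",            # hypothetical stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while shard_iterator:
    resp = kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    if not resp["Records"]:
        break  # caught up; a real consumer would keep polling
    shard_iterator = resp.get("NextShardIterator")
```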

We evaluated SQS initially, but SQS is not built for multiple consumers or for persisting and reprocessing messages, so we did not include it in the POC.

Note: In this article, all data points and analysis are based on the features available in Confluent Cloud and Kinesis in August 2017. At the time of this POC, Confluent Cloud was available only in beta. Confluent Cloud and Kafka can be treated as synonymous here.

Kafka vs Kinesis

Kinesis and Kafka are pub-sub systems built on a partitioned, distributed log. Kinesis is essentially a “Kafka-esque” fully managed service offered by AWS. Both are distributed and share a similar data model comprising shards (partitions in Kafka), partition keys, and sequence numbers (offsets in Kafka).
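As a minimal sketch of that data model (boto3; the stream name and key are hypothetical), a producer chooses a partition key, and the bus assigns the shard and sequence number:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# Records with the same partition key always land in the same shard, which
# preserves per-key ordering (e.g., all price updates for one vehicle).
resp = kinesis.put_record(
    StreamName="wormhole-poc",                  # hypothetical stream name
    PartitionKey="vehicle-12345",               # hypothetical key
    Data=b'{"vin": "12345", "price": 27500}',
)

print(resp["ShardId"], resp["SequenceNumber"])
```

In Kafka, the equivalent is a producer send with a message key, which yields a partition and an offset.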

For the POC, we tried Confluent Cloud, a fully managed Kafka service hosted in AWS, instead of a self-managed cluster, since we had run into challenges operating our own Kafka v0.89 cluster, such as problems deleting topics and daemons requiring restarts.

Integration with Our Stack

Since we were re-platforming the entire stack in AWS, ease of maintenance and integration of the event bus with the AWS ecosystem and our toolset (Spark, HBase, Elasticsearch) were our most important considerations. To test the waters, we wrote logs to Elasticsearch and Splunk via the event bus: AWS Lambda read data from the bus and wrote to both sinks.

With Kafka, Lambda doesn’t support auto-scaling (there was no native Kafka trigger), so we had to invoke an API Gateway endpoint every 200 milliseconds to trigger a Lambda function that pulled data from Kafka and wrote to Elasticsearch and Splunk.
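A minimal sketch of that workaround, assuming kafka-python and hypothetical topic, broker, and sink helpers:

```python
# Lambda handler invoked (via the scheduled API Gateway calls) every 200 ms.
from kafka import KafkaConsumer

def handler(event, context):
    consumer = KafkaConsumer(
        "wormhole-logs",                      # hypothetical topic
        bootstrap_servers="broker:9092",      # hypothetical broker address
        group_id="log-shipper",
        enable_auto_commit=True,
        consumer_timeout_ms=150,              # drain and stop before the next tick
    )
    batch = [msg.value for msg in consumer]   # iteration ends at the timeout
    consumer.close()
    # write_to_elasticsearch(batch); write_to_splunk(batch)  # elided sinks
    return {"records": len(batch)}
```

Creating a consumer on every invocation also triggers group rebalancing, which is part of what made this setup feel clunky compared to a native trigger.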

For Kinesis, we configured a Kinesis Firehose delivery stream connected to a Lambda function to write to Elasticsearch and Splunk. Lambda auto-scales with Kinesis and is triggered automatically as records arrive. Kinesis has better out-of-the-box integration with native AWS tools like Lambda, API Gateway, and Redshift because it is a native AWS service.
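For contrast, a minimal sketch of a Lambda handler fed directly by a Kinesis event source mapping (sink helpers are elided; a Firehose transformation Lambda uses a slightly different event shape):

```python
import base64
import json

# With a Kinesis event source mapping, Lambda is invoked automatically with
# batches of records; no external scheduler is needed.
def handler(event, context):
    docs = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        docs.append(json.loads(payload))
    # index_into_elasticsearch(docs); forward_to_splunk(docs)  # elided sinks
    return {"records": len(docs)}
```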

Summary: Kinesis integrates seamlessly with the AWS ecosystem and scales linearly with AWS Lambda. Integration with Spark Streaming, HBase, and Elasticsearch required a similar effort for both Kafka and Kinesis.

Performance

For producers, a Kinesis shard allows a maximum of 1000 puts or 1 MB of data per second with a max size of 1 MB per message. Kafka does not have any of these limitations at the shard or message level.

For consumers, a Kinesis shard can support up to five read transactions or 2 MB of data per second. Kafka does not have any of these limitations.
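These per-shard limits drive capacity planning. As a quick sketch of the arithmetic, using the workload figures from the cost section below:

```python
import math

# Per-shard limits stated above.
WRITE_LIMIT_MB_S = 1   # 1 MB/s (or 1,000 puts/s) per shard for producers
READ_LIMIT_MB_S = 2    # 2 MB/s (or 5 reads/s) per shard for consumers

# Hypothetical workload: 5 MB/s writes, 10 MB/s reads.
write_mb_s, read_mb_s = 5, 10

shards = max(
    math.ceil(write_mb_s / WRITE_LIMIT_MB_S),   # 5 shards to absorb writes
    math.ceil(read_mb_s / READ_LIMIT_MB_S),     # 5 shards to serve reads
)
print(shards)  # -> 5, matching the 5-shard stream in the cost estimate
```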

Summary: In our testing, Kafka performed slightly better than Kinesis. Kafka also provides more flexibility and performance per shard.

Cost

Kinesis charges per shard, while Kafka charges on the basis of the read/write throughput of the cluster. For an event system supporting 10 MB/s reads and 5 MB/s writes with 7-day data retention, Kinesis would cost us $266/month using a 5-shard stream; Kafka would cost significantly more.

Summary: Kinesis is much cheaper than Kafka.

POC Decision

Kafka provides slightly better performance, but Kinesis is easier to manage, integrates well with AWS tools like Lambda and API Gateway, and is significantly cheaper.

Winner: KINESIS!

So, what is Wormhole?

Wormhole is our event processing system: Kinesis serves as the event bus; Spark Streaming, Lambda, and Ruby and Java applications act as event processors; and Postgres, Elasticsearch, HBase, and S3 serve as sinks.
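As one illustration of how those pieces connect, a Spark Streaming consumer of the Kinesis bus might look like the sketch below (stream and application names are hypothetical; Spark of this era needed the spark-streaming-kinesis-asl connector on the classpath):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="wormhole-pricing")
ssc = StreamingContext(sc, batchDuration=1)   # 1-second micro-batches

# Hypothetical stream/app names; the Kinesis Client Library tracks
# checkpoints in DynamoDB under the application name.
events = KinesisUtils.createStream(
    ssc, "wormhole-pricing", "wormhole-events",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, checkpointInterval=10,
)

events.pprint()   # a real job would parse records and write to a sink

ssc.start()
ssc.awaitTermination()
```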

Wormhole has accelerated the adoption of event-based processing and has laid the foundation for more interactive, responsive products. It has fundamentally reduced our data processing latency while providing a more interactive experience to both dealers and car buyers.

Over the past 2 years Wormhole has enabled:

  • Instant updates to vehicle pricing.
  • A user engagement system that notifies users of changes in vehicle inventory.
  • Instant processing of vehicle configurations to provide accurate pricing on a vehicle.
  • Near-real-time processing of clickstream data rather than a 24-hour delay.

In the near future, we would like to use Wormhole to process inventory and to build a near-real-time recommendation engine. Primarily, we will leverage Wormhole to build a highly scalable, fault-tolerant “Event Platform” with near-real-time latency so that we can build more engaging products.

Stay tuned for follow-up posts on building PricingManager and on overcoming production problems in Wormhole.

We are hiring! If you love solving problems, please reach out. We would love to have you join us!

