Ingesting Files from IoT Devices at KeepTruckin Scale
KeepTruckin Vehicle Gateways (VG) generate large volumes of data, including high-frequency telematics information, dashcam footage, and sensor logs. This data is stored in files on the local file system of the VG and is uploaded to our platform for further processing across dozens of workflows. To correctly and efficiently upload and process all of this data, we have to design not only for high scalability but also for adverse field conditions. For instance, since a typical VG has a 4G LTE connection with bounded bandwidth usage and can be on a truck in any corner of the United States with choppy cellular connectivity, an upload operation with minimal retries becomes imperative for efficient bandwidth consumption. Too many retries waste bandwidth and battery; too few retries delay data. In this case, is exponential back-off sufficient?
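The question is left open on purpose; for illustration only, here is a minimal Go sketch of one common refinement, capped exponential back-off with full jitter, which bounds both the bandwidth spent on retries and how long a retry can be deferred. The function names, attempt limit, and durations are assumptions, not the algorithm FUS actually ships:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// uploadOnce stands in for a single upload attempt over a flaky LTE link;
// here it simply fails the first few tries.
func uploadOnce(attempt int) error {
	if attempt < 3 {
		return errors.New("network unreachable")
	}
	return nil
}

// retryWithBackoff retries with capped exponential back-off plus full
// jitter: maxAttempts bounds bandwidth and battery spend, maxDelay bounds
// how long a retry can be deferred, and jitter spreads out retries from
// many VGs that lost connectivity at the same time.
func retryWithBackoff(maxAttempts int, base, maxDelay time.Duration) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err := uploadOnce(attempt); err == nil {
			return nil
		}
		backoff := base << attempt // base * 2^attempt
		if backoff > maxDelay {
			backoff = maxDelay
		}
		sleep := time.Duration(rand.Int63n(int64(backoff))) // full jitter
		fmt.Printf("attempt %d failed, retrying in %v\n", attempt, sleep)
		time.Sleep(sleep)
	}
	return errors.New("upload failed after max attempts")
}

func main() {
	if err := retryWithBackoff(5, time.Second, 30*time.Second); err != nil {
		fmt.Println(err)
	}
}
```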
This is just one example of challenging algorithmic and design decisions we deal with to ensure our drivers are compliant and critical data is available to our fleet operations managers.
In this post, we will discuss the end-to-end scalable architecture we have developed to ingest data from hundreds of thousands of VGs in the field to meet our scalability and reliability requirements.
Architecture and Workflow
File Upload Service (FUS)
- The FUS is a Unix daemon written in C that runs on a VG and handles file upload requests from the other services running on the VG. These services interface with the FUS by storing requests in a database local to the VG, using an API built into our core embedded library that requires the following data: UUID, File path, Type, and Delete flag. A typical local database entry looks as follows:
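For illustration, a hypothetical entry with those four fields might look like this (field names and values are assumptions; the actual schema is internal to the embedded library):

```json
{
  "uuid":      "f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
  "file_path": "/var/data/dashcam/clip_0001.tar.gz",
  "type":      "dashcam_video",
  "delete":    true
}
```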
- The FUS queries the local database for new upload requests and, on finding one, requests temporary credentials from the AWS Security Token Service (if the previous ones have expired). The credentials are composed of four fields: Access Key, Secret Key, Session Token, and Expiration Time. The credentials are valid for a maximum of one hour, after which FUS refreshes them. A typical temporary credential looks as follows:
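The four fields map onto the Credentials object that STS returns. A hypothetical example (the key and token values are placeholders, not real credentials):

```json
{
  "AccessKeyId":     "ASIAXXXXXXXXXXXXXXXX",
  "SecretAccessKey": "wJalr...placeholder...EXAMPLEKEY",
  "SessionToken":    "FwoGZXIvYXdzE...placeholder...",
  "Expiration":      "2020-02-01T12:00:00Z"
}
```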
- Leveraging the temporary credentials, FUS uploads the file to a dedicated AWS S3 bucket, on an S3 path that is dynamically crafted from the VG identifier, the file type being uploaded, and the current date and time, as follows:
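The exact layout is internal; an illustrative path built from those components could look like this (bucket name, VG identifier, and segment order are assumptions):

```
s3://<bucket>/<vg-id>/<file-type>/<YYYY>/<MM>/<DD>/<HH>/<file-name>
s3://vg-file-uploads/vg-12345/dashcam_video/2020/02/01/12/clip_0001.tar.gz
```

A Go sketch of the upload call itself appears after this list.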
- S3 Transfer Acceleration is enabled on this S3 bucket; it leverages AWS CloudFront's geographically distributed edge servers to achieve fast, easy, and secure file transfer from a VG mounted on a truck that could be in any corner of the United States.
- Once the upload operation succeeds, the local file is deleted if the Delete flag is set, and the local database is updated to reflect the successful upload.
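FUS itself is written in C, but for illustration here is a minimal Go sketch of the equivalent flow using the aws-sdk-go v1 SDK: static temporary credentials from STS, the Transfer Acceleration endpoint, and a dynamically crafted key. The region, bucket name, file path, and key layout are all assumptions:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	// Temporary credentials as handed out by AWS STS; placeholder values.
	creds := credentials.NewStaticCredentials(
		"ACCESS_KEY", "SECRET_KEY", "SESSION_TOKEN")

	sess, err := session.NewSession(&aws.Config{
		Region:          aws.String("us-east-1"), // assumed region
		Credentials:     creds,
		S3UseAccelerate: aws.Bool(true), // use the Transfer Acceleration endpoint
	})
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("/var/data/dashcam/clip_0001.tar.gz") // hypothetical file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Key crafted from the VG identifier, file type, and current date/time,
	// mirroring the path layout described above.
	key := fmt.Sprintf("vg-12345/dashcam_video/%s/clip_0001.tar.gz",
		time.Now().UTC().Format("2006/01/02/15"))

	out, err := s3manager.NewUploader(sess).Upload(&s3manager.UploadInput{
		Bucket: aws.String("vg-file-uploads"), // hypothetical bucket
		Key:    aws.String(key),
		Body:   f,
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("uploaded to", out.Location)
}
```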
File Ingestion Service (FIS)
- Once FUS successfully uploads a file to S3, a pre-configured S3 Event publishes a notification to a queue in AWS SQS.
- The FIS is a platform service written in Golang. FIS reads notifications from the SQS queue in the order they arrive and processes them.
- FIS is mainly composed of N consumer goroutines (configurable, presently set to 10) and one producer goroutine. Consumer goroutines poll the SQS queue for new notifications and, on getting one, kick off the file processing workflow (a minimal sketch of this pattern follows this list).
- The file processing workflow involves downloading the file from S3 to the Kubernetes pod, untarring it, processing the constituent files, and uploading them back to S3 in our Data Lake. A protobuf message is also appended to a buffered channel if the file type qualifies. As the final step, the notification is deleted from the SQS queue and the input S3 file is archived. As the file ages, pre-defined S3 lifecycle policies move it to colder storage tiers before purging it, for cost optimization.
- The producer goroutine reads the protobuf messages from the buffered channel and dispatches them to Kafka on custom topics defined per team or feature. Upstream services listen on the Kafka topics of interest and, on receiving a new message, kick off their custom workflows.
- To ensure that any single notification from the SQS queue is processed by only one consumer goroutine, we leverage the Visibility Timeout feature of SQS, which hides a message from the other consumers for a configured period once it has been read by one consumer. This feature comes in handy to achieve the desired synchronization among the consumer goroutines.
- To ensure high scalability and availability, FIS runs on a Kubernetes cluster with an appropriately configured PodDisruptionBudget and HorizontalPodAutoscaler. The PodDisruptionBudget addresses availability by ensuring that a minimum number of pod replicas is always running. The HorizontalPodAutoscaler addresses scalability by scaling the number of pods based on the current CPU load.
- During scale-down or maintenance, when FIS pods need to be removed or replaced, we ensure graceful draining of in-flight requests by leveraging terminationGracePeriodSeconds in Kubernetes and Golang's Context.
- Since FIS uses the SQS and S3 Golang SDKs, we leverage Localstack for better coverage of our unit tests. Localstack mocks AWS APIs and can run both locally and in continuous integration environments.
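Here is the minimal sketch promised above of the FIS consumer/producer pattern, including context-based draining of the kind used during pod termination. SQS polling and Kafka dispatch are stubbed out; the names, channel size, and timings are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// notification stands in for an SQS message; event stands in for the
// protobuf handed to the producer. Both are simplified for illustration.
type notification struct{ key string }
type event struct{ payload string }

// consume polls for notifications (stubbed with a timer here) and runs
// the file-processing workflow, handing qualifying results to the
// producer via the buffered channel.
func consume(ctx context.Context, id int, out chan<- event, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-ctx.Done(): // graceful drain on pod termination
			return
		case <-time.After(100 * time.Millisecond): // stand-in for an SQS long poll
			n := notification{key: fmt.Sprintf("file-from-consumer-%d", id)}
			// ...download from S3, untar, process, re-upload to the Data Lake...
			out <- event{payload: n.key}
			// ...delete the SQS message and archive the input file...
		}
	}
}

// produce drains the buffered channel and dispatches each event to Kafka
// (stubbed with a log line here).
func produce(ctx context.Context, in <-chan event) {
	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-in:
			fmt.Println("dispatching to Kafka:", ev.payload)
		}
	}
}

func main() {
	// In FIS the lifetime would be tied to SIGTERM and
	// terminationGracePeriodSeconds; a timeout keeps this demo finite.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	events := make(chan event, 100) // buffered channel; size is an assumption
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // N consumer goroutines, presently 10 in FIS
		wg.Add(1)
		go consume(ctx, i, events, &wg)
	}
	go produce(ctx, events)
	wg.Wait()
}
```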
Statistics
Since FUS and FIS launched in production in February 2020, as of this writing:
- 35 million files have been successfully uploaded and processed. We expect this number to keep growing, at an even higher pace, as more customer-facing and internal features start leveraging this system.
- 38 seconds is the longest time a notification has spent sitting idle in the SQS queue so far. This is one of our key performance criteria, as it measures the gap between the time a notification lands in the SQS queue and the time it is read by one of the FIS consumer goroutines.
Key Takeaways
- AWS SQS: highly scalable; persists data for a maximum of 14 days; prevents loss of input data during service maintenance.
- Kubernetes Horizontal Pod Autoscaler: Fire and forget!
- Goroutines: blazingly fast; minimal footprint; really easy to work with.
- Buffered Channels: built-in synchronization; really easy to work with.
- Localstack: mocks AWS APIs allowing more unit test coverage.
- Datadog Metrics and Log Explorer: Simply awesome.
Contributions
Kudos to everyone involved in making it happen:
- Platform: Piyush Kansal, Chris Argeros, Siraj Memon, Luis Zaldivar, Filipe Martinho
- Embedded: Andrew Gieraltowski, Colin Maykish, Joe Pulver
KeepTruckin is an AWS shop. With the help of top talent and leveraging the latest in cloud technologies, we roll out new features at an impressive pace. This makes working at KeepTruckin an engaging and fun experience. If you are interested in learning more about the opportunities at KeepTruckin, check out our Careers page.