An Overview of Engineering (Cloud!) Challenges at Ather

Published in Ather Engineering, Feb 11, 2022

by Vishwas Bhat

Ather has grown exponentially in the last few years, with thousands of scooters on the road. It has truly been a remarkable journey, from a few hundred scooters in Bengaluru alone to a presence in more than 29 cities across the country.

With all the learning accumulated, the technology under the hood (the connectivity, the software and the cloud) has seen massive advancement too. “The Data Platform” on the cloud has moved from processing a few GB of data two years ago to processing close to 100 TB in real time now.

In this post, I will delve into the challenges and requirements of building the cloud data stack. We’ll explore each of them in detail in future posts.

Moving from managed services to a self-managed connectivity stack has enabled us to handle this scale and deliver new and unique features. Designing the architecture from scratch is critical in itself, and scale adds a further dimension of complexity, especially when design choices must hold over a 7–10 year horizon to keep up with the lifetime of the vehicle. Other dimensions include newer model variants, new generations of vehicles, and unique customer features, all of which are evolving rapidly.

The Range of the Problems at Hand

Before addressing the problems at scale, I would like to quickly go through Ather’s use-cases that the data platform caters to, and split them into three categories:

1. Customer-facing features and value additions

Ride statistics give a snapshot of the ride in the app right after the user ends the ride. The Ather Labs platform enables unique data-driven experiences that build on vehicle connectivity and intelligence.

2. Diagnostics and Monitoring

Each device acts as a telemetry source, which allows our service and engineering teams to monitor, identify and diagnose issues remotely via our internal application stack.

3. Research and Development

For newer features on the cloud, as well as newer generations of the bike, the data from the fleet defines our learning baseline and enables our teams to build better and unique features with a faster turnaround time.

Image: A snapshot of the rides from across the country, through the day

Problem 1: Addressing Data Problems at Scale

Imagine handling this flow at scale: data is sent at a very high frequency to the cloud via the IoT broker. The stream is continuous and comes with aggressive reconnection strategies. The IoT broker pushes data via topics using a Message Queue (MQ). The MQs are set up with different topics that serve different purposes: raw data collection, sorting of data by time and vehicle, cloud storage at different stages of processing, and pushing data to processing units that produce useful information on dashboards for users and riders. Dashboard data can be strictly internal, such as the vehicle’s health, the current state of the battery, errors, etc. Data sent via backend applications to the mobile app, on the other hand, can be turned into customer-facing features. (A good read on real-time data processing, by Ram Bhavaraju)
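To make the "sorting of data by time and vehicle" step concrete, here is a minimal sketch in Python. It is not Ather's implementation: the Message type, its field names and the sample values are assumptions made purely for illustration, and a real pipeline would do this on a stream rather than on an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical message shape; the real data model is not described in this post.
    vehicle_id: str
    timestamp: float  # seconds since epoch, as reported by the vehicle
    payload: dict


def sort_by_vehicle_and_time(messages):
    """Group messages per vehicle, then order each group by timestamp."""
    grouped = defaultdict(list)
    for msg in messages:
        grouped[msg.vehicle_id].append(msg)
    return {
        vehicle_id: sorted(msgs, key=lambda m: m.timestamp)
        for vehicle_id, msgs in grouped.items()
    }


# Messages can arrive out of order, for example after a reconnection.
raw = [
    Message("scooter-b", 12.0, {"soc": 80}),
    Message("scooter-a", 11.0, {"soc": 55}),
    Message("scooter-a", 10.0, {"soc": 56}),
]
ordered = sort_by_vehicle_and_time(raw)
```

Grouping before sorting keeps each vehicle's timeline independent, so a late message from one scooter does not affect the ordering of another.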

Image: Overview of the data platform

Ather’s self-managed cloud broker handles 40 billion messages per day, and the various applications process between 3,000 and 5,000 messages per second for customer use-cases in real time. The broker gateway, along with Kafka and a basket of other applications for different use-cases, needs to deliver in various scenarios such as:

  • Seasonality of rides during the day or week
  • Sudden surge of bikes connecting to the cloud at scale
  • Ride patterns and behaviors varying the data throughput
  • Erratic network conditions that can cause high latency in communication with the broker
  • Multiple vehicle generations publishing data via different data models
  • Addition of newer bikes to the platform with every new purchase; >3000 bikes a month today (an additional 5 TB processed every month)
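For a sense of scale, the figures above can be turned into rough averages. This is back-of-the-envelope arithmetic only: it assumes traffic is spread evenly over the day (the seasonality noted above means real peaks are higher), that 1 TB is 1000 GB, and that the additional 5 TB is attributable to the roughly 3000 newly added bikes.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Broker load: 40 billion messages per day, averaged over the day.
messages_per_day = 40_000_000_000
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY

# Growth: about 3000 new bikes a month adding about 5 TB of processing a month.
new_bikes_per_month = 3000
added_tb_per_month = 5
gb_per_new_bike_per_month = added_tb_per_month * 1000 / new_bikes_per_month

print(f"average broker rate: {avg_messages_per_second:,.0f} messages/second")
print(f"data per new bike: {gb_per_new_bike_per_month:.2f} GB/month")
```

The average works out to roughly 463,000 messages per second at the broker, and about 1.67 GB per new bike per month. Since the post says more than 3000 bikes are added, the per-bike figure is best read as an upper estimate.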

The data stack is built on a strong foundation of Kubernetes and auto-scalable DevOps processes, and this combination of resources and expertise enables us to scale quickly. To top it all off, we manage this most effectively by balancing workloads between on-demand and spot resources (preemptible VMs).

Problem 2: Data, Availability and SLAs

The SLAs and availability are critical for all application use-cases at Ather. The data stack at Ather follows a microservices architecture where multiple applications orchestrate the outcome of any feature. The requirement for ride stats is always to push the data snapshot to the mobile app within 7 minutes of the ride being completed. The same goes for the features on Ather Labs, where low latency is a need. Several factors come into play in meeting this highly critical SLA: network, data ingest, data flow and processing, and database availability. In the case of failures or intermittent latencies, the strategy we have put in place enables retries and rebalances nodes if needed, so that priority is always maintained for customer-feature outcomes.
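The idea of retrying within an SLA window can be sketched as follows. This is only an illustration under assumptions, not the strategy actually deployed: the function names, the exponential backoff and the delay values are hypothetical, and only the 7-minute window comes from the requirement above. Node rebalancing is not shown.

```python
import time


def run_with_retries(task, deadline_seconds, base_delay=1.0, max_delay=30.0):
    """Call task() until it succeeds or another wait would overrun the deadline."""
    start = time.monotonic()
    delay = base_delay
    while True:
        try:
            return task()
        except Exception:  # broad catch for brevity; real code would be narrower
            remaining = deadline_seconds - (time.monotonic() - start)
            # Give up (re-raise) when waiting again would exceed the deadline.
            if remaining <= delay:
                raise
            time.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped


# Hypothetical flaky step: fails twice, then succeeds.
attempts = {"count": 0}


def push_ride_snapshot():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("database temporarily unavailable")
    return "snapshot delivered"


result = run_with_retries(push_ride_snapshot, deadline_seconds=7 * 60, base_delay=0.01)
```

Tying the retry loop to the deadline rather than to a fixed attempt count keeps the behaviour aligned with the SLA: the step keeps trying while there is still time to deliver, and fails loudly once there is not.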

The other, internal use-cases are slightly less latency-critical. However, the systems enabling our monitoring, service, diagnostics and other internal R&D are built resiliently, with similar considerations for near real-time data availability.

We moved from single siloed systems to a multi-tenant architecture, enabling scalable and highly-available applications.

Problem 3: Accessing Data at Scale — Data science, Analytics and Data engineering

Image: Scale trend through the day — data rates that hit the cloud

There are numerous use-cases across the organization. Let’s break them down for easier understanding.

1. Query raw data for a handful of bikes

These queries are generally used by engineering teams across the organization to optimize, analyze, test and debug algorithms on the ECU, to hypothesize and test durability and reliability simulations, and more. The data is fed into a plethora of tools: MATLAB, Python, simulation tools and more.

2. Analytics and fleet monitoring

Analysts query fleet data to identify issues, understand ride and charge behaviors and identify patterns across the fleet that could quickly help build hypotheses catering to various business problems.

3. Fleet-level analysis for distributions

Geared more towards engineering applications, these tools help with bivariate and multivariate distributions and visualizations that use fleet data for research and development.

4. Data science, Data engineering and pipelines

Building and evolving algorithms, testing and productionizing them on the cloud, and potentially moving the model to the edge for specific use-cases. Many of the problem statements revolve around identifying anomalies and patterns that could lead to proactive diagnostics on the fleet and trigger corrective action. Time is of the essence in such scenarios: the platform should quickly help identify patterns across thousands of bikes over weeks and enable different engineering teams to address them faster. This has been a significant learning at Ather Energy.

At the scale at which Ather is growing, the data engineering platform has the huge task of being highly resilient, scalable and accurate. With a combination of managed and self-hosted services, it becomes increasingly critical to revamp much of the data engineering and data enrichment to cater to the evolving use-cases. With the Hive metastore as a backbone, our partitioning and storage strategies have evolved through a few versions and will continue to do so as more mature use-cases emerge.
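As an illustration of what a partitioning strategy can look like, the sketch below builds storage paths in the key=value directory layout that Hive-style partitioned tables commonly use. The partition columns (model, date, hour), the base path and the sample values are assumptions chosen for the example; the post does not describe the actual scheme.

```python
from datetime import datetime, timezone


def partition_path(base, vehicle_model, event_time):
    """Build a Hive-style partition path: base/model=.../date=.../hour=..."""
    return (
        f"{base}"
        f"/model={vehicle_model}"          # hypothetical partition column
        f"/date={event_time:%Y-%m-%d}"     # hypothetical partition column
        f"/hour={event_time:%H}"           # hypothetical partition column
    )


path = partition_path(
    "telemetry",
    "gen2",  # made-up model label
    datetime(2022, 2, 11, 9, 30, tzinfo=timezone.utc),
)
print(path)
```

Partitioning by time and by a coarse vehicle attribute lets a query for a handful of bikes over a few days scan only the matching directories, while fleet-wide analysis can still read whole date ranges. Which columns to partition on depends on the dominant query patterns.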

In a series of upcoming posts, we will talk in depth about the specific problem statements and give an overview of the solutions and best practices we’ve adopted to address the engineering challenges above and be future-ready.

Originally published at https://blog.atherenergy.com on February 11, 2022.
