Meet the Adobe I/O Team: Jegan Thurai on Managing the BigData Processing Pipeline

Jegan Thurai is a tech lead for the Insights team that provides analytical insights on Adobe I/O. He develops the BigData pipeline that processes more than six billion records per day to provide different metrics, and he is responsible for making the pipeline handle ever-increasing traffic and keeping those metrics available at all times.

We caught up with him to find out more about his role at Adobe, the tech stack behind the BigData Pipeline, what kind of challenges he has to deal with day to day, and more.

Jegan Thurai is a tech lead for the Insights team that provides analytical insights on Adobe I/O

How did you get started in the industry, and what’s your career path at Adobe been like?

I started my career as an intern at a company called Portal IDC in Bangalore, India. Within a couple of months, Portal was acquired by Oracle. My intern friends and I used to joke that we brought good times to Portal.

I joined the I/O Insights team at Adobe’s San Jose office in December 2017. It has been a wonderful experience since then, as there are unique opportunities to innovate and explore new technologies and ideas. The Insights team also offers a lot of challenges in terms of scaling and optimizing the pipeline and keeping it available all the time. I learned a lot in this one year, mainly about processing streaming data, containerizing Spark applications, and working with Mesos and Kubernetes, to name a few.

What exactly do you do in the Insights team?

The Insights team provides analytics on API Gateway and Events Gateway traffic, which is useful for understanding how much a particular service is used and whether there are any errors. I am responsible for the BigData Processing Pipeline that crunches the raw stream of events, aggregates and prepares the data in a queryable format, and loads it into a data store. We also have an Anomaly Detection component that monitors the real-time traffic and alerts if there are any anomalies in the incoming traffic, based on different thresholds configured by the service providers. I’m also responsible for designing and developing new components that are future-proof and can scale to our ever-increasing traffic.
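To give a rough idea of the threshold-based alerting described above, here is a minimal Scala sketch. The types and field names (ThresholdConfig, TrafficWindow, per-minute bounds) are illustrative assumptions, not the team’s actual code:

```scala
// Minimal sketch of a threshold-based anomaly check (illustrative types, not production code).
final case class ThresholdConfig(minPerMinute: Long, maxPerMinute: Long)

final case class TrafficWindow(serviceId: String, minute: Long, requestCount: Long)

object AnomalyCheck {
  // Returns the traffic windows whose request counts fall outside the bounds
  // configured for their service provider.
  def detect(windows: Seq[TrafficWindow],
             thresholds: Map[String, ThresholdConfig]): Seq[TrafficWindow] =
    windows.filter { w =>
      thresholds.get(w.serviceId).exists { t =>
        w.requestCount < t.minPerMinute || w.requestCount > t.maxPerMinute
      }
    }
}
```

In practice, windows that come back from a check like this would trigger the alerts the service providers have configured.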

What’s the biggest challenge in managing the BigData pipeline, and how do you do it?

Today the Insights pipeline processes more than six billion events per day, and that number keeps increasing. Last December it was less than two billion events per day; in just one year the volume has grown threefold! Making sure our system can handle this surge in traffic, and that we are not losing any events, is a challenging task.

We are also looking at giving service providers the option to charge back for the usage of their services. This means we have to be very accurate and can’t afford to lose any data. To achieve this, we have to make sure that our components are fault-tolerant, performant, scalable, and highly available. When something unexpected happens, we have to have all the details and metrics about the system, so we can quickly find the issue and provide a solution. A delay in the fix could mean losing data.

Can you tell us about the tech stack used for the BigData pipeline?

We use Amazon Kinesis as the streaming engine, which acts as a buffer between the producers of the data (the API and Event Gateways) and Insights. We use Scala and Apache Spark for the data processing layer: Apache Spark for its higher-level abstractions for developing BigData applications, and Scala for its functional nature and conciseness. Spark and Scala have become the de facto standard for developing BigData applications. For Spark jobs, we use the Spark History Server, which keeps track of past jobs (both successful and failed) and their lineage. This is very useful in triaging what exactly happened to a particular Spark job.
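As an illustration of the kind of aggregation this processing layer performs, here is a small Spark/Scala sketch. It operates on an already-loaded DataFrame rather than the live Kinesis stream, and the column names and S3 paths (serviceId, statusCode, timestamp, s3a://example-bucket/…) are placeholders, not the team’s actual schema:

```scala
// Sketch of rolling raw gateway events up into queryable hourly usage counts.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object UsageAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("insights-usage-aggregation-sketch")
      .master("local[*]")
      .getOrCreate()

    // In the real pipeline the raw events arrive via Kinesis; this path is hypothetical.
    val rawEvents = spark.read.json("s3a://example-bucket/raw-events/")

    // Aggregate into per-service, per-status-code, per-hour request counts.
    val hourlyUsage = rawEvents
      .groupBy(
        col("serviceId"),
        col("statusCode"),
        window(col("timestamp").cast("timestamp"), "1 hour"))
      .agg(count(lit(1)).as("requests"))

    // Persist the rollups in a queryable format (hypothetical path).
    hourlyUsage.write.mode("overwrite").parquet("s3a://example-bucket/hourly-usage/")
    spark.stop()
  }
}
```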

We also use Amazon EFS (Elastic File System) as intermediate storage to back up the events. Amazon S3 is used as permanent storage and could also be used for running our Machine Learning pipeline. Elasticsearch is used as the data store for querying different metrics across various dimensions. This data is exposed to the UI through a REST layer built with Spring MVC.
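For the load step into Elasticsearch, a minimal sketch might look like the following, assuming the elasticsearch-spark (elasticsearch-hadoop) connector is on the classpath; the index name, host settings, and input path are placeholders, not the team’s actual configuration:

```scala
// Sketch of loading aggregated metrics into Elasticsearch so the REST layer can query them.
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

object LoadToElasticsearch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("insights-load-to-es-sketch")
      .master("local[*]")
      .config("es.nodes", "localhost") // placeholder Elasticsearch host
      .config("es.port", "9200")
      .getOrCreate()

    // The hourly rollups produced by the aggregation step (hypothetical path).
    val hourlyUsage = spark.read.parquet("s3a://example-bucket/hourly-usage/")

    // Each row becomes a document, queryable by service, status code, and time window.
    hourlyUsage.saveToEs("insights-hourly-usage")

    spark.stop()
  }
}
```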

All the applications are dockerized and run on Kubernetes, and Amazon EKS is used to run the Kubernetes cluster. We use the metrics exposed by the AWS components to understand how our pipeline is functioning. Datadog is used for monitoring and alerting across the various components. When there is an anomaly, it sends an alert that someone attends to.

The Insights Architecture.

What kind of achievements are you most excited about in your work?

In the last 12 months, we have accomplished many major milestones. We have provided analytics for the Event Gateway and migrated from Mesos to Kubernetes. There are also interesting things coming up, like usage billing and a service mesh. I am very excited about usage billing, which will open up a new set of challenges. These new features will definitely take Insights to the next level, and I am confident they will put Insights in a unique position within Adobe.

What’s the best piece of advice you’ve received in your career?

“Keep learning.” It’s never too late to learn something new. I always try to stay up to date with the technologies I’m interested in, either by reading books and articles or by trying something hands-on. This helps improve my problem-solving skills and allows me to make better design decisions. I also like the phrase, “Always code as if the guy who ends up maintaining your code is a violent man who knows where you live. Code for readability.”

Follow the Adobe Tech Blog for more developer stories and resources, and check out Adobe I/O on Twitter for the latest news and developer products.