Evolution of Pluto TV Data Streaming

Young Kim
PlutoTV Data Engineering
Jul 29, 2022

by Lin Chear and Young Kim

[Developing a data streaming application is both simple and complex. ]

They say “Change is the only constant,” and that is certainly true in the world of data engineering. From its inception, Pluto TV has been a data-driven organization, and our ability to adapt skillfully to change has helped us become a leader in the streaming space.

In The Beginning : The Legacy

In the beginning, there was nothing but a giant green field. Pluto TV was founded in 2014 as a startup dedicated to streaming entertainment. Data was essential in reacting to customer behavior and market changes and, as a result, Pluto TV set out to develop a data pipeline to capture the events being generated from its active users. The first iteration had simple requirements of capturing the event data stream and presenting it as digestible reports for the decision makers.

As a startup that needed to be cost-conscious in its approach to a robust data streaming solution, the Pluto TV engineering team used what was available to it. With the increasing popularity of web-based applications, it was natural to leverage the existing infrastructure and build the data streaming application on top of web application logs. This approach was simple and allowed for quick turnaround time and relatively low development effort.
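To make the log-based approach concrete, here is a minimal sketch of turning web application log lines into events. The log format, the `/beacon` path, and the `event` and `channel` parameters are hypothetical and chosen only for illustration; the original pipeline's actual format is not described here.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical log format, assumed for illustration only:
#   <ISO timestamp> <HTTP method> <request path with query string> <status>
def parse_log_line(line):
    """Turn one web log line into an event dict, or None if it is not an event."""
    parts = line.strip().split(" ")
    if len(parts) != 4:
        return None
    timestamp, method, target, status = parts
    if not status.isdigit():
        return None
    url = urlsplit(target)
    # parse_qs returns a list of values per key; keep the first value of each.
    params = {key: values[0] for key, values in parse_qs(url.query).items()}
    if "event" not in params:
        return None  # an ordinary request, not an event beacon
    return {
        "timestamp": timestamp,
        "method": method,
        "status": int(status),
        "event": params.pop("event"),
        "properties": params,
    }

def events_from_logs(lines):
    """Yield events from an iterable of log lines, skipping lines that do not parse."""
    for line in lines:
        event = parse_log_line(line)
        if event is not None:
            yield event

sample = [
    "2014-06-01T12:00:00Z GET /beacon?event=play&channel=news 200",
    "2014-06-01T12:00:05Z GET /healthcheck 200",
]
events = list(events_from_logs(sample))
```

A parser like this is quick to write, which matches the low initial effort described above, but every new event type or log format change means touching the parsing code again.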

Ad Hoc Custom Solution

Pros:

  • Simple to implement basic features

Cons:

  • Difficult to maintain
  • Complex to manage
  • Difficult to add features
  • Not feasible to add real-time applications

Off-The-Shelf Software : The Migration

As Pluto TV grew, so did its data needs. The inherent drawbacks of the ad hoc solution gave way to solutions geared specifically towards data streaming. While the original implementation “worked,” it became cumbersome to manage and failed to provide modern features of a data streaming pipeline.

As a result, off-the-shelf software applications sprang up to meet the demands of providing a scalable and configurable data pipeline. If your organization’s data needs can be solved by the software package out of the box, then implementation is straightforward. In reality there will be some customization needed, but as long as you don’t stray too far from the prescribed solution, it should be a relatively painless affair.

But even with dedicated turnkey solutions, we ran into situations where Pluto TV’s data needs veered away from what off-the-shelf software could provide. Attempting to shoehorn vendor solutions into solving our data needs led to less-than-ideal performance and limited what we could do with the incoming data.

As Pluto TV’s user base increased, so did the volume of incoming data. And as the organization grew, so did the way data was used and analyzed. Real-time applications, AI and machine learning were increasingly necessary to help sift through the data.

This quickly exposed bottlenecks and the upper limits of scalability and configurability of the off-the-shelf packages and led us to seek better solutions.

To address this, the underlying data pipeline engine had to be migrated to a dedicated data streaming technology. After exploring several different data streaming engines, Pluto TV’s data engineering team chose Apache Kafka as the main data streaming engine for the following reasons: scalability, high throughput and high availability.

Off-The-Shelf Software Approach

Pros:

  • Relatively quick to deploy
  • Meets the immediate needs of the organization
  • Transition from the batch to the streaming architecture

Cons:

  • Future expansion limited by software vendor or implementation
  • Might not be optimized for future organization growth
  • Organization may outgrow the software sooner rather than later
  • Scaling is limited by the software architecture’s design
  • Limited customization
  • Expensive consultation fees for customization

Custom Built Data Streaming Software : Back To the Drawing Board

By adopting Kafka as the data streaming engine with the off-the-shelf turnkey solution, we were able to support the data streaming pipeline during the warp-speed growth of Pluto TV in the international video streaming market.

As viewership continued to grow rapidly, we started to experiment with scaling up the data streaming application even further to support natural growth and viral events.

Ultimately, Pluto TV’s data engineering team had to develop its own software package to ingest and enrich data. Using tools such as JMeter, JMX, and Grafana, we were able to identify where our current pipeline was bottlenecked. Computation-heavy processes such as schema validation and data enrichment were profiled using available profiling tools.
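As a rough illustration of what profiling these stages can look like, the sketch below times a toy schema validation step and a toy enrichment step separately. It uses only Python's standard-library timer as a stand-in for the tools named above, and the schema, the `CLIENT_PLATFORMS` lookup table, and all field names are hypothetical.

```python
import time
from collections import defaultdict

# Hypothetical schema: required field name -> expected Python type.
SCHEMA = {"event": str, "client_id": str, "timestamp": int}

# Hypothetical lookup table used for enrichment.
CLIENT_PLATFORMS = {"c1": "web", "c2": "mobile"}

def validate(event):
    """Return True when every required field is present with the expected type."""
    return all(isinstance(event.get(field), kind) for field, kind in SCHEMA.items())

def enrich(event):
    """Return a copy of the event with a platform field added."""
    enriched = dict(event)
    enriched["platform"] = CLIENT_PLATFORMS.get(event["client_id"], "unknown")
    return enriched

def run_with_timings(events):
    """Process events and accumulate wall-clock seconds spent in each stage."""
    timings = defaultdict(float)
    output = []
    for event in events:
        start = time.perf_counter()
        is_valid = validate(event)
        timings["validate"] += time.perf_counter() - start
        if not is_valid:
            continue  # drop events that fail schema validation
        start = time.perf_counter()
        output.append(enrich(event))
        timings["enrich"] += time.perf_counter() - start
    return output, dict(timings)
```

Calling `run_with_timings` on a representative batch returns the processed events along with the seconds spent per stage, which is enough to see which stage dominates before reaching for a full profiler.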

This allowed us to see where data was being read from and written to the pipeline multiple times, and to consolidate repetitive computation tasks. It also allowed us to identify parts of the pipeline that were bottlenecked by the data pipeline’s limitations and rewrite our collector and applications around them to avoid scaling problems. This is one area where we could squeeze out maximum performance from our pipeline. Certain off-the-shelf software solutions are generalized to serve many different pipelines, which, while convenient and simple to roll out, can lead to a less-than-optimal implementation on any particular platform. With ownership of the code and intimate knowledge of the business’s data needs, we can engineer our software solutions to meet them in both the short and long term.
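The idea of consolidating repeated reads and writes can be sketched as follows. This is an illustrative comparison under assumed conditions, not Pluto TV's actual code: messages are JSON strings, each stage in the "staged" version decodes and re-encodes every message, and the `region` field and function names are hypothetical.

```python
import json

def staged(raw_messages):
    """Validate and enrich in separate passes, decoding each message per pass."""
    decodes = 0
    validated = []
    for raw in raw_messages:
        event = json.loads(raw)
        decodes += 1
        if "event" in event:
            validated.append(json.dumps(event))  # written back between stages
    enriched = []
    for raw in validated:
        event = json.loads(raw)  # read again by the next stage
        decodes += 1
        event["region"] = "unknown"
        enriched.append(json.dumps(event))
    return enriched, decodes

def consolidated(raw_messages):
    """Validate and enrich in a single pass, decoding each message once."""
    decodes = 0
    enriched = []
    for raw in raw_messages:
        event = json.loads(raw)
        decodes += 1
        if "event" not in event:
            continue
        event["region"] = "unknown"
        enriched.append(json.dumps(event))
    return enriched, decodes
```

Both functions produce the same output, but the consolidated version decodes each message only once and skips the intermediate write, which is the kind of saving that adds up at high event volumes.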

There was more work to be done, however. Engineering our own solution meant we needed to develop the requisite software engineering practices to support the application’s development. Instead of simply deploying someone else’s solution and calling it a day, we had to implement all the necessary practices of a software engineering team. This meant developing good unit tests, integration tests, load tests, code reviews, automated deployment pipelines and other quality assurance strategies. The infrastructure needed to build and deploy in-house software can be much greater than what is needed to deploy a vendor’s solution. Each organization has to weigh the pros and cons of deploying an in-house solution. Having said that, for Pluto TV, the in-house solution is already yielding benefits: more efficient data utilization, pipeline bandwidth optimization and lowered computation costs.

Pros:

  • Built to solve present and near future problems of the organization
  • Solves industry-specific problems
  • Maximum performance tuning
  • Self-determined release cycle

Cons:

  • On-going development / support
  • Development costs

Looking Ahead : The Future is Bright

As Pluto TV’s viewership continues to grow exponentially and the need for data in real time increases, it is crucial for the data streaming application to continue to evolve and adopt the latest technologies and trends. We are just getting started on this exploration and are looking forward to a new era of data streaming.
