Exploring scalable systems for video intelligence powered by machine learning

Jesse Gumz
JW Player Engineering
Sep 12, 2018

What is “video intelligence”?

Tens of thousands of new video assets pass through JW Player’s platform every day. The Media Intelligence team at JW Player is one of several teams at the forefront of researching how to derive meaningful information from these videos and build a greater sense of understanding about the assets. Knowing what is embedded within the content of a video can help drive insights that make video workflows easier, saving publishers and developers time.

When we talk about video intelligence then, we’re referring to deeper features from video such as:

  • Speech transcripts of a video’s audio
  • Video thumbnail quality scores
  • Video categorization
  • Language detection
  • Optical Character Recognition
  • Shot detection and scene detection
A quick preview of how some of this intelligent metadata can enrich our knowledge about videos.

Several of these features required machine learning models deployed for real-time inference. In some cases we leveraged TensorFlow for training and deployment; in others, where we didn’t have enough data of our own to train a viable model, we made use of third-party APIs.

But beyond thinking about what features we wanted to create, we also needed to think about the how. There is a leap from producing a feature once as a proof of concept in a sandbox environment to producing it repeatedly, in production, for a daily influx of video. While it took a few iterations and proofs of concept, we ultimately developed a small distributed system of workers that can chug away on different forms of metadata extraction around the clock. We call this experimental service the Metadata Extraction Platform (MEP).

MEP 101

The MEP is a queue-based system that orchestrates and runs work continuously to produce intelligent metadata artifacts for downstream consumers. In front of this system we have an HTTP API that:

  • Accepts a payload of input for one or more jobs.
  • Validates the payload’s fields and computes a hash of the inputs to check whether the same inputs have been submitted before — useful for deduplicating work.
  • Enqueues new jobs for different features into the core MEP system, based on the inputs provided.
  • Returns one or more unique job IDs for the created jobs right away, without waiting for any of that enqueued work to be finished.
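The deduplication step above hinges on hashing a job’s inputs deterministically. As a minimal sketch — the actual hashing scheme isn’t described in the post, so canonical JSON plus SHA-256 is an assumption — it might look like:

```python
import hashlib
import json

def compute_input_hash(payload: dict) -> str:
    """Hash a job's inputs so identical requests can be detected.

    Canonicalize the payload (sorted keys, no insignificant whitespace)
    before hashing, so semantically equal inputs always collide.
    """
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two payloads with the same fields in a different order hash identically,
# so the second submission can be recognized as duplicate work.
a = compute_input_hash({"media_id": "abc123", "feature": "transcript"})
b = compute_input_hash({"feature": "transcript", "media_id": "abc123"})
```

The API can then look the hash up before enqueueing, and return the existing job ID instead of creating a new job.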

Even at the level of designing an API around a machine learning model that ranked thumbnails, we realized that we couldn’t always expect to apply a straightforward, synchronous API pattern to processes that can be long-running in nature. We decided on an asynchronous system backed by queues and pools of workers — a system that guarantees to the client that work will be completed eventually, just not right away.

We utilized Celery and RabbitMQ for our task management framework and queueing infrastructure. Working in Celery allowed us to define a feature’s concrete tasks as Python functions and associate those functions with different queues. For instance, preprocessing tasks such as audio extraction from a video fall on their own queue and have their own pool of workers, while other queues handle lightweight I/O tasks like making requests to in-house or third-party machine learning models for inference. With this separation of work at the task and queue level, we could work on different steps of distinct features for many jobs simultaneously.
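To make the queue/worker separation concrete, here is a stdlib-only simulation of the pattern (the real system uses Celery workers consuming from RabbitMQ; the queue names, pool sizes, and task bodies below are illustrative, not the production configuration):

```python
import queue
import threading

# Two independent queues, mirroring the separation described above:
# heavyweight preprocessing vs. lightweight inference I/O.
preprocess_q: "queue.Queue" = queue.Queue()
inference_q: "queue.Queue" = queue.Queue()
results = []
results_lock = threading.Lock()

def preprocess_worker():
    while True:
        media_id = preprocess_q.get()
        if media_id is None:  # sentinel: shut this worker down
            break
        # Stand-in for audio extraction; hand the result to the next queue.
        inference_q.put(f"{media_id}.wav")
        preprocess_q.task_done()

def inference_worker():
    while True:
        audio = inference_q.get()
        if audio is None:
            break
        with results_lock:
            results.append(f"transcript({audio})")  # stand-in for a model call
        inference_q.task_done()

# Pool sizes are independent knobs: scale preprocessing up when large
# videos bottleneck, scale inference down when a model is overwhelmed.
pre_pool = [threading.Thread(target=preprocess_worker) for _ in range(4)]
inf_pool = [threading.Thread(target=inference_worker) for _ in range(2)]
for t in pre_pool + inf_pool:
    t.start()

for media_id in ["vid1", "vid2", "vid3"]:
    preprocess_q.put(media_id)

preprocess_q.join()   # all preprocessing handed off
inference_q.join()    # all inference finished
for _ in pre_pool:
    preprocess_q.put(None)
for _ in inf_pool:
    inference_q.put(None)
for t in pre_pool + inf_pool:
    t.join()
```

Because each stage only talks to a queue, a pool can be resized or restarted without any other stage noticing — the property the next section relies on.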

Sample MEP feature workflow for generating speech transcripts.

With the decoupling of tasks and workers, it became possible for us to have greater control over system performance. If a machine learning service we rely on for inference were to become overwhelmed, we could scale down the workers making requests to that service or scale up the machine learning service (if it’s one we own). Alternatively, if several large videos were causing a bottleneck in preprocessing tasks like audio extraction, we could simply scale up the number of preprocessing workers available to help chew through the pending messages on the preprocessing queue. In situations in which a hotfix might be needed, we could take a pool of workers offline for a patch while letting messages simply build up in the queue; we wouldn’t need to worry about losing jobs mid-flight. That said, it’s best not to let messages build up too far when using RabbitMQ: if the instances running the queues run out of memory, they’ll start paging to disk, which can wreak havoc on both performance and stability.

Of course, moving to an asynchronous system means having to deal with some caveats, such as the fact that clients can’t expect results immediately in the API response. Also, jobs may not necessarily finish in the order in which they were requested. With distributed systems it can be rather difficult to make firm guarantees that work will be picked up in a deterministic order. However, we felt that these tradeoffs were well worth the gains in performance and scaling that were previously unattainable.

Video in, metadata out

Even after figuring out the logistics of feature creation at scale, we still needed to think about how to expose the products of the MEP to downstream consumers. An issue that emerged over time was the concept of “canonical” artifacts for features — in other words, the most recent, valid artifacts for a specific feature and input media. By maintaining state in a database about the artifacts produced for a given media and feature, we would need to update the canonical artifact any time we re-ran the same feature for that media. However, this soon turned into a headache because of the complex queries required and the potential for race conditions.

We eventually found a better solution in the form of a log-based architecture. Earlier this year JW Player’s engineering team launched new infrastructure with Kafka that allowed any application to become a producer and consumer of various topics. One topic in particular broadcasts updates to media in real time by performing “streaming left joins on database tables” — we call this joining system Southpaw and we’ll be open-sourcing it in the near future, so stay tuned!

Having a topic to read from for media updates was a good solution for the ingestion side of the MEP, and it served as inspiration for the production side of the MEP as well. We decided to start using a Kafka topic as the source of truth for artifacts, rather than trying to store and maintain state about canonical artifacts in a database. As a result, other engineers can now build their own downstream consumer applications to read from the artifacts topic and determine how to handle the data for their own use cases.

With the log, consumers can see a more transactional history of artifacts and determine for their own applications how to interpret that history. Offloading the complexity of artifact consumption to the actual consumers of artifact data helped to make the MEP as a producer of data much easier to reason about and maintain. As it turned out, the log was a natural complement to the asynchronous system that we had designed.¹
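A consumer that does want canonical artifacts can derive them itself by folding over the log. A minimal sketch of that reduction — the record fields (`media_id`, `feature`, `data`) are illustrative, not the real topic schema:

```python
def canonical_artifacts(log):
    """Reduce an ordered artifact log to the latest artifact per
    (media_id, feature) key -- the compaction that previously required
    complex upserts against a database."""
    latest = {}
    for record in log:  # records are consumed in log order
        latest[(record["media_id"], record["feature"])] = record
    return latest

log = [
    {"media_id": "m1", "feature": "transcript", "data": "v1"},
    {"media_id": "m1", "feature": "thumbnails", "data": "t1"},
    {"media_id": "m1", "feature": "transcript", "data": "v2"},  # re-run wins
]
canon = canonical_artifacts(log)
```

A consumer that instead cares about the full history — say, for auditing re-runs — simply skips the reduction and keeps every record; the producer doesn’t have to anticipate either choice.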

What’s next

The big picture, as presented at JW Insights earlier this year

The MEP turned out to be at once simple and complex. Developing on it required various degrees of familiarity with distributed systems, RESTful API design, asynchronous patterns, databases and caching, queues, message logs, and last but certainly not least, machine learning.

As we look ahead to what’s next for the MEP and how it can contribute to JW Player’s efforts to create video intelligence at scale, we are exploring some exciting follow-on research:

  • The creation of special task classes in Celery that can use Celery’s result backend framework to help with the caching of tasks, so that similar work across features and jobs need not be duplicated.
  • Orchestrating dependent workflows so that the MEP can have output from one or more features feed into more complex features: for instance, using the output of a shot detection feature as input for an animated thumbnails feature.

We look forward to doing more work to see how intelligent metadata can power better products at JW Player and improve both the publisher and the end user experience.

Footnotes:

  1. For more reading on using logs as a robust data store, we recommend this piece by Martin Kleppmann: https://www.confluent.io/blog/using-logs-to-build-a-solid-data-infrastructure-or-why-dual-writes-are-a-bad-idea/
