Real-time machine learning inference at scale has become an essential part of modern applications. GumGum’s Verity engine powers the industry’s most sophisticated contextual targeting product by analyzing thousands of pieces of digital content every second, around the clock. This is a challenging undertaking that requires deploying deep learning models with an event-driven streaming architecture on an elastic, cloud-native cluster.
At GumGum, we use Apache Kafka’s high-throughput, scalable streaming platform to connect the various components of our machine learning pipelines. Until recently, we deployed the underlying inference microservices solely on Amazon ECS, which is a great choice due to its security…
At GumGum, we use Computer Vision (CV) to leverage page visuals for Verity, our contextual targeting and brand suitability product. We process millions of images every hour, and at this rate, our long-term inference costs dwarf the upfront training costs. So, we tackled this issue head-on. In this post, I’ll benchmark multi-threading for I/O operations and batch processing for inference, and highlight why both matter. Note that implementing these strategies may be overkill if your application processes only a few thousand images an hour.
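To make the two ideas concrete, here is a minimal sketch of threaded I/O followed by batched inference. It is not GumGum’s implementation: `fetch_image`, `predict_batch`, and `run_pipeline` are hypothetical names, the download is simulated with a short sleep, and the "model" simply returns the byte length of each input.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_image(url: str) -> bytes:
    """Hypothetical stand-in for an image download; sleeps to mimic network latency."""
    time.sleep(0.01)
    return url.encode("utf-8")


def predict_batch(images: list[bytes]) -> list[int]:
    """Hypothetical stand-in for one model forward pass over a whole batch."""
    return [len(image) for image in images]


def run_pipeline(urls: list[str], batch_size: int = 4, num_threads: int = 8) -> list[int]:
    # Threads overlap the I/O waits; pool.map() returns results in input order.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        images = list(pool.map(fetch_image, urls))

    # Group images into fixed-size batches so the model is called once per batch
    # rather than once per image.
    predictions: list[int] = []
    for start in range(0, len(images), batch_size):
        predictions.extend(predict_batch(images[start:start + batch_size]))
    return predictions
```

With a real model, the batch size and thread count would be tuned against measured throughput and memory limits; the values above are placeholders.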
Let’s look at our application components:
Software Engineer with entrepreneurial spirit. Passionate about building Machine Learning applications at scale. PhD in ECE, Univ. Minnesota. Caltech Alumnus.