Migrating all our reports in Looker from Redshift to Snowflake was a multi-quarter project for us at GumGum. We had strategically planned this migration in phases (see our previous blog post) to ensure business as usual and cost optimization. The first phase was the longest: we gave ourselves enough time to experiment and to learn from the gotcha moments. Right after that first phase, we realized that we needed to build a process for future migrations.
Data warehouse migration is the process of moving data from one location or application to another. There are various reasons to migrate, such as cost optimization, upgrading technology for performance and scalability, or consolidating data in one location. It is similar in concept to human migration, which is driven by climate, food availability, and other environmental factors.
At GumGum, the business is constantly expanding, and along with it our data storage and compute requirements are growing exponentially. To strike a balance between scalability and cost, we decided to move our reporting data from Redshift to Snowflake. This translated…
Data warehousing solutions have existed for decades and have been the backbone of the reporting and analytics needs of both small and large-scale enterprises. Even in today’s world of Big Data, Data Lakes, and NoSQL databases, SQL remains the most powerful querying language, and the combination of data warehouses and SQL continues to dominate modern data applications.
At GumGum, Amazon Redshift has been the primary warehousing solution for years. Redshift is a fully managed, petabyte-scale cloud data warehouse that has worked very well for our needs. However, our data footprint has…
GumGum receives around 30 billion programmatic inventory impressions, amounting to 25 TB of data, each day. An inventory impression is the real estate available to show a potential ad on a publisher page. By generating near-real-time inventory forecasts based on campaign-specific targeting rules, we enable account managers to set up successful future campaigns. This talk, Real-Time Forecasting at Scale using Delta Lake and Delta Caching, which Jatinder Assi and I presented at Spark + AI Summit 2020, highlights the data pipelines and architecture that help us achieve a forecast response time of under 30 seconds at this scale. Spark jobs efficiently…
Our advertising data engineering team at GumGum uses Spark Streaming and Apache Druid to provide business stakeholders with near-real-time analytics for analyzing and measuring advertising business performance.
Our biggest dataset is RTB (real-time bidding) auction logs, which amount to ~350,000 msg/sec during peak hours every day. It is crucial for the data team to leverage distributed computing systems like Apache Kafka, Spark Streaming, and Apache Druid to process these huge volumes of data, perform business-logic transformations, apply aggregations, and store the results to power real-time analytics.
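As an illustration of that kind of pipeline, here is a minimal sketch in PySpark (using Structured Streaming rather than the older DStream API) that reads auction events from Kafka, parses them, and computes windowed aggregates. The topic name, schema, brokers, and console sink are hypothetical placeholders, not GumGum's actual job.

```python
# Minimal sketch: windowed aggregation of hypothetical RTB auction events from Kafka.
# Requires the spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("rtb-auction-aggregates").getOrCreate()

# Hypothetical schema of one auction log message.
schema = StructType([
    StructField("auction_id", StringType()),
    StructField("publisher_id", StringType()),
    StructField("bid_price", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder brokers
       .option("subscribe", "rtb-auction-logs")               # hypothetical topic
       .load())

# Business-logic transformation: parse the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*"))

# Aggregation: auction count and average bid per publisher over 1-minute windows,
# tolerating 5 minutes of late-arriving data.
aggregates = (events
              .withWatermark("event_time", "5 minutes")
              .groupBy(F.window("event_time", "1 minute"), "publisher_id")
              .agg(F.count("*").alias("auctions"), F.avg("bid_price").alias("avg_bid")))

# Console sink for the sketch; a real job would write to a store such as Druid.
query = aggregates.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```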
What is an anomaly? It’s something that deviates from what is standard, normal, or expected. In some cases, you can find anomalies by setting up thresholds or bounds. You can also use supervised machine learning algorithms, but that requires training them on a dataset that already contains labeled anomalies. But what if the anomalies are hidden in a time series dataset? You can spot them by plotting a trend; they usually look out of place in such a plot. For example:
The picture above shows a time series dataset at GumGum. The red point is an…
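As a concrete illustration of the threshold/bounds approach mentioned above, here is a minimal pandas sketch that flags points falling outside a band of rolling mean ± 3 standard deviations computed from the preceding observations; the data and window size are made up for the example.

```python
# Minimal sketch: flag points outside rolling mean ± 3 std of the preceding values.
import pandas as pd

# Made-up hourly metric with one obvious spike.
ts = pd.Series(
    [120, 118, 125, 122, 119, 121, 480, 123, 117, 124],
    index=pd.date_range("2020-01-01", periods=10, freq="H"),
)

# Rolling statistics over the *previous* 5 points (shifted so the current point
# cannot mask itself by inflating the window's mean and std).
prior = ts.shift(1).rolling(window=5, min_periods=3)
upper = prior.mean() + 3 * prior.std()
lower = prior.mean() - 3 * prior.std()

anomalies = ts[(ts > upper) | (ts < lower)]
print(anomalies)  # the 480 spike is flagged
```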
Kafka is a powerful distributed publish/subscribe system that helps you build complex asynchronous applications. GumGum was an early adopter of this technology and today runs hundreds of brokers across multiple clusters.
Kafka cluster operations are a topic of their own (scaling clusters in and out, recovering from a dead broker, reassigning partitions across the cluster…), but if you want to build performant client applications on top of Kafka, you need to pay close attention to your consumers:
Kafka Consumer Lag is an indicator of…
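Consumer lag is the gap between the newest offset written to a partition and the offset the consumer group has committed for it; when lag keeps growing, the application is falling behind the producers. Here is a minimal sketch of measuring it with the kafka-python client; the broker address, topic, and group id are hypothetical placeholders.

```python
# Minimal sketch: per-partition consumer lag with the kafka-python client.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",  # placeholder broker
    group_id="rtb-analytics",            # hypothetical consumer group
    enable_auto_commit=False,
)

topic = "rtb-auction-logs"               # hypothetical topic
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]

# Lag = latest offset in the partition minus the group's committed offset.
latest = consumer.end_offsets(partitions)
for tp in partitions:
    committed = consumer.committed(tp) or 0
    print(f"partition {tp.partition}: lag = {latest[tp] - committed}")
```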
The video above contains two talks presented at the Kafka meetup hosted at GumGum on November 19, 2019.
Speaker: Alex Woolford
In this session, Alex Woolford will walk us through some practical near real-time examples that touch on analytics, systems integration, and machine learning. We’ll show you:
- how to build streaming apps using a familiar SQL interface (KSQL), as sketched below
- how to capture events from external systems
- how to deploy a machine learning model in a stream of events
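To make the first of these concrete, here is a minimal sketch (not taken from the talk) that submits a KSQL statement to a KSQL server's REST endpoint from Python; the server URL, topic, and stream name are hypothetical placeholders.

```python
# Minimal sketch: register a KSQL stream over a Kafka topic via the KSQL REST API.
import json
import requests

KSQL_SERVER = "http://localhost:8088"  # hypothetical KSQL server address

statement = """
CREATE STREAM pageviews (user_id VARCHAR, page_id VARCHAR, viewtime BIGINT)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');
"""

response = requests.post(
    f"{KSQL_SERVER}/ksql",
    headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
    data=json.dumps({"ksql": statement, "streamsProperties": {}}),
)
response.raise_for_status()
print(response.json())
```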
Bio: Alex is a systems engineer at Confluent, the company founded by the original creators of…
When organizations scale and data explodes, it becomes vital to have a scalable data architecture. This post revisits the problem statement discussed here, but for an entirely different scale. To give a quick recap, the goal is to forecast inventory impressions per day, given a set of targeting rules and sample data. This time, the inventory being forecasted is programmatic inventory. In part one of this blog post, Jatinder Assi discussed the data architecture and distributed sampling of the programmatic inventory in detail. …
Forecasting is a common data science task at many organizations, helping with sales forecasts, inventory forecasts, anomaly detection, and many other applications. For GumGum’s advertising division, it is critical for our sales team to forecast available ad inventory in order to set up successful ad campaigns.
Our Data Engineering team already uses time series forecasting for directly sold ad inventory (400+ million impressions/day). As the advertising industry evolves, there has been exponential growth in programmatic ad buying, and as a result GumGum now has 30+ billion programmatic inventory impressions/day. Producing high-quality forecasts is not an easy problem at GumGum’s programmatic advertising…
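For illustration of what a daily inventory forecast can look like in code, here is a minimal sketch using Prophet, one common open-source library for this kind of time series forecasting; the post does not prescribe a specific library, and the data and 30-day horizon below are made up.

```python
# Minimal sketch: forecast daily inventory impressions 30 days ahead with Prophet.
# (The package is named fbprophet in older releases.)
import pandas as pd
from prophet import Prophet

# Made-up history of daily impressions with a simple weekly pattern.
history = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=90, freq="D"),
    "y": [4.0e8 + 1.0e7 * (i % 7) for i in range(90)],
})

model = Prophet(weekly_seasonality=True)
model.fit(history)

future = model.make_future_dataframe(periods=30)   # extend 30 days into the future
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```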