Why Streaming First at Promoted
Good engineering practice is to build an MVP, prove fit, and then upgrade components as necessary. However, in search, discovery, and ads, data processing latency is extremely difficult to upgrade from batch to streaming later. Many top companies (Grab, Pinterest, Snapchat, and others) get stuck for years trying to upgrade batch processing to stream processing. At Promoted, we started with streaming first, even though this is “overkill” given the state of downstream consumers and the pressure on startups to show early value. We chose to start with streaming because starting with batch produces a fundamentally different, inferior product, even after upgrading to streaming later. Here’s why:
Your company is building top-tier discovery, search, or ads, but timelines are tight and the business demands revenue now. Engineering decides that a batch system for data will suffice “for now.” Streaming would be ideal, but that’s hard to build, and it can be done “after the MVP.” Also, any engineer can use the in-house ETL system (e.g., Hive + Workflows) to make a daily pipeline and outsource SRE to the pipeline system operations team.
Fast forward: the batch system has grown to consume hours of computation. A large fraction of on-call time goes to babysitting and restarting pipelines. There is a team working on streaming, but A/B tests don’t show much improvement from streaming over batch, so the necessary productionization and migration work gets deprioritized. Meanwhile, customers are complaining about cold-start performance, and the ML engineering team spends more and more time backfilling and updating batch queries in the ETL system. This drags on for years until leadership commits several dozen engineers to force a migration to streaming, and the migration takes at least a year. The result is a modest metrics improvement at a large cost, but the product still doesn’t feel “real-time,” and cold-start performance is still a top complaint.
Why This Happens
At Promoted, we started with streaming first because it’s so hard to migrate from batch to streaming later. Streaming is an engineering challenge, but the main blockers are product and organizational debts, not technical difficulties. Systems designed around batch processing don’t evolve to benefit from streaming. Changing the system requires product leadership, and product leadership is not driven by incremental engineering improvements. Streaming happens when it stops being managed as an engineering component upgrade and starts being managed as a fundamentally new product. This can take years, or it may never happen.
How Streaming Leads Product Design
User Design: Reactive vs Personalized
TikTok and Facebook Newsfeed both use the same technology: personalized recommendations that optimize for engagement. The key difference is that TikTok reacts per video, whereas Facebook designed Newsfeed to update daily. Facebook does use streaming data, but TikTok is TikTok because it’s reactive, and that reactivity is a qualitative leap to a new product experience rather than a minor metrics improvement. Slow TikTok is basically “lame YouTube.” Faster Newsfeed is basically still “Newsfeed.”
Another, more technical example: Facebook ads’ discount pacer (aka Lowest Cost Bidder). This system works as a real-time feedback-control loop based on ad spending compared to projected ad opportunities. If spending is too fast, slow down; if it is too slow, speed up. The batch alternative to this real-time system is “daily bid suggestions.” This feels like updating a spreadsheet every day, and it performs poorly. A batch system that automates the spreadsheet behind the scenes still performs poorly and feels unpredictable, clunky, and untrustworthy. A bid suggestion system that updates a little bit faster may be a little bit more efficient, but it will still feel clunky compared to sleek, modern auto-bidders.
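To make the feedback loop concrete, here is a minimal sketch of a proportional pacing controller. It is not Facebook's or Promoted's implementation; the class name `Pacer`, the `gain` value, and the multiplier bounds are all assumptions chosen for illustration. It compares actual spend against the spend expected given the fraction of projected opportunities already seen, and nudges a bid multiplier accordingly.

```python
from dataclasses import dataclass


@dataclass
class Pacer:
    """Hypothetical proportional controller for a bid multiplier."""

    budget: float            # total budget for the pacing window
    gain: float = 0.5        # how strongly to react to pacing error (assumed value)
    min_mult: float = 0.1    # lower bound on the multiplier (assumed value)
    max_mult: float = 2.0    # upper bound on the multiplier (assumed value)
    multiplier: float = 1.0  # current bid multiplier

    def update(self, spent: float, fraction_of_opportunities_seen: float) -> float:
        """Adjust the multiplier from actual spend vs. expected spend so far."""
        target = self.budget * fraction_of_opportunities_seen
        if target <= 0:
            return self.multiplier  # nothing to compare against yet
        # error > 0 means underspending (speed up); error < 0 means overspending (slow down)
        error = (target - spent) / target
        self.multiplier *= 1.0 + self.gain * error
        # Clamp so a single noisy reading cannot push bids to an extreme.
        self.multiplier = max(self.min_mult, min(self.max_mult, self.multiplier))
        return self.multiplier


pacer = Pacer(budget=100.0)
# Halfway through the projected opportunities, 60 of 100 has been spent: slow down.
slower = pacer.update(spent=60.0, fraction_of_opportunities_seen=0.5)
```

The controller is the same whether `update` runs every few seconds or once a day. What changes is the size of the error that accumulates between corrections: a daily batch job gets one correction per day, which is the “daily bid suggestion” experience described above.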
Measurement and Survivor Bias
The features, media, and users who succeed and survive in the slow system are the same features, media, and users that appear in A/B tests when you try to roll out streaming systems. If a more reactive system was critical in the past, and you didn’t have one then, those users would no longer exist, and you can’t measure product impact for them now.
Product management, as typically practiced in consumer product engineering organizations, depends on A/B testing to make decisions. However, streaming upgrades can show a small A/B improvement relative to their high engineering cost, making them easy to deprioritize forever.
For example, in advertising systems, big brands already have long (quarterly) purchasing cycles. They may benefit from a faster measurement-response cycle, but that’s not critical to their continued use of the product. In contrast, small businesses and startups demand immediate feedback. Because they were unable to get immediate feedback in the past, they will not be using the ad product now. An A/B test would not include these hypothetical small business advertisers, so any impact on them cannot be measured. Meanwhile, big brands continue to spend well enough.
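The arithmetic behind this argument can be sketched in a few lines. Everything here is hypothetical: the `Segment` type, the segment names, and all the numbers are made up purely to illustrate the mechanism, not taken from any real test. The point is that an A/B test averages only over segments that survived the batch-era product, so a large benefit to a departed segment contributes nothing to the measured result.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    potential_spend: float      # spend if the segment were an active customer
    lift_from_streaming: float  # relative spend increase from a faster feedback loop
    survived_batch: bool        # still a customer after years of the batch product


def measured_lift(segments: list[Segment]) -> float:
    """Spend-weighted lift an A/B test sees: only surviving segments are in it."""
    present = [s for s in segments if s.survived_batch]
    total = sum(s.potential_spend for s in present)
    if total == 0:
        return 0.0
    return sum(s.potential_spend * s.lift_from_streaming for s in present) / total


def invisible_spend(segments: list[Segment]) -> float:
    """Potential spend from segments that never appear in the test at all."""
    return sum(s.potential_spend for s in segments if not s.survived_batch)


# Made-up numbers for illustration only.
SEGMENTS = [
    Segment("big_brands", potential_spend=1000.0, lift_from_streaming=0.01, survived_batch=True),
    Segment("small_businesses", potential_spend=400.0, lift_from_streaming=0.50, survived_batch=False),
]

lift_seen_by_test = measured_lift(SEGMENTS)      # driven entirely by big_brands
spend_test_cannot_see = invisible_spend(SEGMENTS)  # small_businesses never enter the test
```

Under these assumed numbers the test reports only the small lift for big brands, while the entire small-business segment is absent from both arms and therefore from the decision.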
Another example: new and anonymous users and new items (discovery). On any given day, the impact of each new user or item is small as a fraction of total engagement. In an A/B test, it’s easier to improve the majority experience a little bit than to dramatically improve the minority experience for the same measured impact. However, the total experience is the aggregation of all the new users and new items finding success over time. This will never be measured in an A/B test.
Organization Shifts and Operational Costs
Sometimes the engineers who built V0 will also own the upgrades. This is rarely true for the batch-to-streaming transition. Upgrading to streaming rarely happens bottom-up from empowered, passionate engineers who built the first version using batch processing.
The types of engineers who build data pipelines are rarely the types of engineers who build streaming systems. Batch ETL data pipelines can be and are built by anybody: data scientists, interns, or any backend engineer. They are easy to write and don’t require special understanding to operate. Debugging can happen at an interactive SQL prompt. When batch pipelines fail, which can happen about once per week or so, debugging consists of rerunning the pipeline, skipping a day, or increasing computing resources.
Streaming data systems are different. They require a detailed understanding of programming, the production deployment environment, and the streaming systems themselves. When they fail, it is a production emergency, and failures can happen at any time, requiring 24/7 SRE on call.
Because a different, more specialized, more expensive, and more operationally demanding team will build and operate streaming, this team must exist before streaming can happen; it will not organically materialize from the existing engineers and technical staff in the project area. This requires planning, hiring, and transferring responsibilities to a new team. Given the product design and measurement challenges above, unless engineering quality is a top company priority, this investment can be challenging for senior leadership to approve.
Why Promoted Leads with Streaming
When we first started Promoted a year and a half ago, Dan Hill, my CTO and co-founder, asked what our latency requirements were for data. I said: “we need as real-time as possible to build discount pacing.” And so, down the rabbit hole of streaming we went, and despite the costs, it has absolutely been worth it. Because Promoted is streaming first, we can prioritize and build qualitatively better product experiences that would be infeasible otherwise. It focuses our thinking on customers who are also building toward the future and whose engineering investments have already taken them beyond batch processing improvements. And finally, it has instilled a rigorous discipline of always watching metrics and systems performance, 24/7. Choosing the more ambitious path in infrastructure has helped make Promoted the solution for top marketplaces and e-commerce apps.