Having your cake and eating it too: How Vizio built a next-generation data platform to enable BI reporting, real-time streaming, and AI/ML

Parveen Jindal
6 min read · Apr 6, 2023


Authors: Parveen Jindal, Darren Liu, Alina Smirnova

VIZIO is the leading Smart TV brand in the United States, harnessing data from our Smart TVs to power our platform business and create engaging experiences for our customers. As a leader in the data and analytics space, we have made great strides in innovating the viewing experience. As we considered the future needs of our business, we migrated to the Databricks Lakehouse to support our rapid growth.

Before Databricks Lakehouse, we had no single platform for running a data-as-a-service business at large scale, which requires ingesting and processing data in real time from millions of TVs. So we got creative, stitching together many data services and leveraging a data warehouse to power our business. It was a brilliant system, but as data volumes and the number of new features the business wanted grew, it became prohibitively expensive and time-consuming to manage.

Furthermore, it would have been a massive undertaking to bolt a separate real-time streaming and production ML system onto our existing Data Warehouse to support new features. This would have required building these systems from scratch, taking data out of the Data Warehouse, and governing that data (along with any models) entirely outside of it.

It was clear we needed something more than a newer data warehouse with additional products bolted on to cater to different use cases…

Journey To Databricks Lakehouse:

First, we identified a slew of options for standardizing our future platform and evaluated the following:

● Staying on our current Data Warehouse + homegrown solutions

● Moving to a different Data Warehouse (along with using DBT, Airflow, an ML platform, a separate streaming layer, etc.)

● Self-hosting Spark and the relevant other services needed

● Databricks Lakehouse

Almost all solutions were infeasible and simply created different “Frankenstein architectures” that would force us down the same path again.

TL;DR — Databricks was the simplest and most cost-effective solution of any we tested. With the other Data Warehouse vendors we considered, we would’ve had to build our own systems for real-time streaming, exploratory data science, orchestration, and production MLops. Databricks offered the full gamut of aforementioned tooling, which enabled us to get to production quickly and manage the environment easily.

Here were the primary criteria that drove our decision:

Open — Databricks is built on open-source components such as Spark, Delta Lake, and MLflow, which are battle-tested, industry-standard projects with years of support.

Scalability — We process hundreds of terabytes of data a day, so having a platform robust enough to handle this scale and keep our business running was paramount.

○ Databricks with Photon provided excellent performance for our join-heavy workloads, directly on the data lake in open table formats, with costs growing linearly with data volume even at massive scale.

○ Specifically, Databricks Photon proved to be 3X faster for our needs than other data warehouse vendors. This gave us confidence that the system could scale well.

Cost — When running a platform at this scale, keeping costs in line is critical. Databricks enabled us to scale costs linearly as our data grew and to run the platform as efficiently as possible.

○ Specifically, Databricks was the only vendor we tested that let us “form fit” compute to each use case: compute-optimized instances for better parallelism in ETL transformations, storage-optimized instances for join-heavy, business-ready datasets, and memory-optimized instances for our real-time streaming workloads (a minimal sketch of these cluster shapes follows after this list). The other Data Warehouse vendors we considered offered either monolithic clusters or a T-shirt-sizing model, neither of which gives us that optionality.

○ Thanks to Databricks Photon, we now have a viable path to reducing our costs by up to 32% compared to the other options we evaluated.

○ Also, thanks to the decoupled compute and storage architecture, our costs scale linearly with data growth.

AI/ML — Since we are a data-forward company, scaling our ML practice was very important to us.

○ We needed a solution that offered a multi-language notebook environment for exploratory data analysis and feature engineering, automated experiment tracking and governance, multi-node model training, production-grade model deployment for real-time inference, and a feature store to facilitate the reuse of features across the business (see the MLflow sketch after this list).

Real-time streaming — Our business requirements demanded fresher data, which only a streaming architecture could provide. Since we have hard SLAs to hit, it was critical to be able to control the frequency of micro-batches. Databricks met all of these criteria nicely.
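
To make the “form fit” compute point under Cost concrete, here is a minimal sketch of the kinds of cluster shapes involved. The dicts below follow the shape of a Databricks cluster spec, but the instance types (AWS examples), Spark runtime version, and sizes are illustrative assumptions rather than our production configuration.

```python
# Illustrative cluster shapes, "form fit" to each workload.
# Instance types, runtime version, and sizes are placeholders.
etl_cluster = {                      # transformation-heavy ETL: parallelism matters most
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "c5.4xlarge",    # compute-optimized
    "autoscale": {"min_workers": 2, "max_workers": 20},
}

joins_cluster = {                    # join-heavy, business-ready (gold) datasets
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "i3.2xlarge",    # storage-optimized; local NVMe helps shuffles and caching
    "autoscale": {"min_workers": 2, "max_workers": 20},
}

streaming_cluster = {                # always-on streaming with sizable state
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "r5.4xlarge",    # memory-optimized
    "num_workers": 4,
}
```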
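
Likewise, for the AI/ML requirements above, automated experiment tracking is what MLflow provides out of the box on Databricks. Here is a minimal sketch with a hypothetical experiment path, toy data, and a toy model, just to show the tracking flow:

```python
# Minimal MLflow tracking sketch; the experiment path, parameters, and data are hypothetical.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

mlflow.set_experiment("/Shared/viewing-engagement-demo")  # hypothetical experiment path

X, y = make_regression(n_samples=1_000, n_features=10, random_state=42)  # toy data

with mlflow.start_run():
    params = {"n_estimators": 100, "max_depth": 8}
    model = RandomForestRegressor(**params).fit(X, y)
    mlflow.log_params(params)                         # parameters tracked with the run
    mlflow.log_metric("train_r2", model.score(X, y))  # metrics tracked with the run
    mlflow.sklearn.log_model(model, artifact_path="model")  # versioned artifact, ready for the registry
```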

Ultimately, Databricks was the only platform that could handle ETL, monitoring, orchestration, streaming, ML, and data governance in a single place. Not only was Databricks SQL + Delta able to run queries faster on real-world data (3x faster in our analysis), but we also no longer needed to buy other services just to run the platform and add features in the future. This made the move to a Lakehouse architecture very compelling for solving our current challenges while setting ourselves up for success on our future product roadmap.

Welcome to the Lakehouse:

As we’re actively transitioning in 2023, the benefits of Databricks Lakehouse were palpable. Our core ETL pipelines that were once hard to manage and scaling poorly, are now robust pipelines in Databricks Workflows that drive Structured Streaming jobs with a fully visible pipeline.

What was once a manually managed series of monolithic batch loads in our previous Data Warehouse is now a fully elastic job running on ephemeral compute that automatically grows and shrinks to exactly the right capacity.

For example, within a single job, all parts of the Databricks Lakehouse work seamlessly with one another:

Delta Lake — All tables are open Delta Tables that are performant and easy to manage. With ZORDER, Auto compaction, time travel, and more, we have a fully governed Lakehouse in an open format.
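
As a small illustration (the table name and columns are placeholders, and spark is the SparkSession every Databricks notebook provides), typical Delta maintenance and time travel look something like this:

```python
# Sketch of common Delta Lake operations; "events" and its columns are placeholder names.
# Enable optimized writes and auto compaction on the table.
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true'
  )
""")

# Co-locate frequently filtered columns so scans prune more files.
spark.sql("OPTIMIZE events ZORDER BY (device_id, event_date)")

# Time travel: query an earlier version of the table for audits or rollbacks.
previous = spark.sql("SELECT * FROM events VERSION AS OF 12")
```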

Workflows — A native, extremely robust orchestrator built right into the platform at no extra cost. It provides alerting, conditional task orchestration, and automatic cluster management, which makes running the platform incredibly simple. All compute in Workflows (Jobs Compute) is ephemeral and autoscaling, which drastically reduces costs since it automatically fits the compute to the exact problem at hand at runtime. This kind of native orchestration is nearly impossible to achieve on other Data Warehouses without adding third-party tools.
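
As a rough sketch of what such a job looks like, here is a hypothetical multi-task definition in the shape of a Jobs API payload; the job name, notebook paths, email address, and cluster settings are placeholders (in practice, each task can point at a differently shaped job cluster, as in the earlier compute sketch):

```python
# Hypothetical multi-task Workflows job; names, paths, and settings are placeholders.
job_spec = {
    "name": "viewing_data_pipeline",
    "email_notifications": {"on_failure": ["data-eng-alerts@example.com"]},  # built-in alerting
    "job_clusters": [{
        "job_cluster_key": "ephemeral_autoscaling",
        "new_cluster": {
            "spark_version": "12.2.x-scala2.12",
            "node_type_id": "c5.4xlarge",
            "autoscale": {"min_workers": 2, "max_workers": 16},  # grows and shrinks per run
        },
    }],
    "tasks": [
        {
            "task_key": "ingest",
            "job_cluster_key": "ephemeral_autoscaling",
            "notebook_task": {"notebook_path": "/pipelines/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # runs only after ingest succeeds
            "job_cluster_key": "ephemeral_autoscaling",
            "notebook_task": {"notebook_path": "/pipelines/transform"},
        },
    ],
}
```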

Structured Streaming Engine — All pipelines are now Structured Streaming jobs that provide automatic state management, failure recovery, incremental processing, and throughput management. Instead of brittle hourly batch logic in Python, all we need to do to get data faster is change the trigger interval of a pipeline, and Structured Streaming handles the rest. This makes broken state a thing of the past for our team.
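
For instance, an hourly pipeline becomes a five-minute pipeline by changing a single trigger setting. Here is a minimal sketch with placeholder table names, columns, and checkpoint path (spark is the notebook's SparkSession):

```python
# Minimal Structured Streaming sketch; table names, columns, and paths are placeholders.
from pyspark.sql import functions as F

events = spark.readStream.table("bronze_events")  # incremental reads from a Delta table

counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "device_id")
    .count()
)

query = (
    counts.writeStream
    .option("checkpointLocation", "/checkpoints/event_counts")  # automatic state and failure recovery
    .outputMode("append")
    .trigger(processingTime="5 minutes")  # change this one line to change data freshness
    .toTable("silver_event_counts")
)
```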

Notebooks — Pipelines can be built directly in notebooks and immediately scheduled as production jobs, cutting time to market in half without sacrificing governance. Now that Databricks offers IDE support, we have the best of both worlds.

Photon — Our ETL is complex, and Databricks’ Photon engine made it possible to run our pipelines not only faster but also much more cheaply than our previous Data Warehouse solution. Before Photon, this kind of performance for data-warehousing-style workloads (think lots of joins, groupings, and transformations) was simply not possible on an open data lake.

Databricks Serverless SQL — Databricks’ native serverless warehousing offering runs our data quality system, which sends automatic alerts, creates native data quality profile dashboards, and lets users perform ad hoc SQL analytics directly on their Delta Lake, just like any other cloud warehouse, with instant startup and shutdown.
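
For the ad hoc SQL side, a small data quality check can also be run against a Serverless SQL warehouse from Python using the open-source databricks-sql-connector; the hostname, HTTP path, token, and table below are placeholders:

```python
# Hypothetical ad hoc data quality query against a Databricks SQL warehouse.
# Connection details and the table name are placeholders.
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="dbc-xxxxxxxx.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/xxxxxxxxxxxxxxxx",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT event_date,
                   COUNT(*) AS row_count,
                   SUM(CASE WHEN device_id IS NULL THEN 1 ELSE 0 END) AS null_device_ids
            FROM silver_event_counts
            GROUP BY event_date
            ORDER BY event_date DESC
            LIMIT 7
        """)
        for row in cursor.fetchall():
            print(row)
```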

Conclusion:

Putting all this together, we now have one unified platform that consolidates our data platform use cases (BI, AI, streaming), scales costs linearly with data, provides full observability and automated state management, and sets us up for success with the more pioneering advanced analytics products on our roadmap.

Not only are we set up to grow our business, but our engineers are happier, more productive, and can now focus on staying at the bleeding edge of Smart TV innovation.
