On the Integration of Data Products

Yu Zhou
Jul 24, 2017 · 3 min read

I have been building a data product with my teammates. The product shows the desired results and is promising for further advancement, and I would love to deploy it for real-world testing. However, my managers and business partners think differently: they want to integrate it into a couple of existing products.

At first, I felt very confused by this integration idea, because the data product was developed for its own purpose but is now being asked to feed something else. Why would you optimize for one purpose and later pivot to another? This is particularly challenging for data products because it may be logically impossible to incorporate one math formula into another. Adding one data product into another data product is not like adding one tab to a webpage. I believed that each data product was unique (because its underlying models were built and tuned differently), and therefore each data product deserved its own deployment for optimized usage.

[Image: Philips products]

Later, I read a Harvard Business Review article on the dark side of innovation. The article used some of Philips's history to make a point. Philips used to be very innovative, developing a wide range of products from health care to electronics (the Philips MRI scanner and the Philips razor). Such a large variety of product offerings put huge pressure on Philips's business operations, sales, and customer service. The solution to this variety pressure was product integration (and/or elimination). Philips started consolidating product lines and business units, and the rest is history. We now see Philips with fewer but more focused product groups. After reading the article, I think product integration makes sense. Nevertheless, data products are different from physical products.

Data product integration is necessary but difficult. When we have three numbers (a weight, a height, and a body temperature), we can integrate weight and height to create BMI, but we cannot integrate weight and temperature to create anything meaningful. (A computer will combine numbers one way or another regardless of whether the results are valid. Decision makers without a finger in the technical details may mistakenly think an integration works just because the computer could complete a desired computation.) In real life, working with high-dimensional data also blurs the line between what is possible and what is not, because there exist many logical touch points where we can try and either succeed or fail. I always hear people say, let's try feature engineering another way to achieve the goal. This is the curse of dimensionality at work.
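To make the point concrete, here is a minimal sketch of the BMI example above. The function names and sample values are illustrative, not from the article; the point is that weight and height combine into a meaningful index because the formula (weight in kilograms divided by the square of height in meters) has a defined semantic, while no such formula exists for weight and temperature.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight divided by height squared.

    A meaningful integration of two measurements: the formula has
    defined units (kg/m^2) and an established interpretation.
    """
    return weight_kg / height_m ** 2


# A computer will happily "integrate" weight and temperature too,
# but the resulting number has no valid interpretation.
def meaningless(weight_kg: float, temp_c: float) -> float:
    return weight_kg / temp_c  # computes fine, means nothing


print(round(bmi(70.0, 1.75), 1))  # 70 / 1.75**2 -> 22.9
```

The computer completes both computations without complaint, which is exactly why a completed computation is no evidence that an integration is valid.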

Another common way to integrate data products is to build "second-stage models". Suppose two data products both generate predictions about similar but not identical events. It seems logical to use one prediction to improve the other. In a deterministic, Newtonian world, one object can be used to physically improve another. In a probabilistic world, however, two objects can be worse off when combined. If one prediction is 90% accurate and another prediction is 90% accurate, chaining them together may leave us with a prediction that is only 81% accurate, which is 90% × 90%. We want modeling to take away noise and strengthen signal; instead, we may disproportionately increase noise and decrease signal when we combine two or more models. In this case, less is more.
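A quick simulation illustrates the multiplication of error rates described above, under the simplifying assumption (mine, not the article's) that the two stages fail independently and the chain is only correct when both stages are correct:

```python
import random

random.seed(0)

N = 100_000
ACC = 0.9  # each stage is independently correct 90% of the time

# A chained prediction is counted correct only when both
# independent stages are correct on the same trial.
both_correct = sum(
    (random.random() < ACC) and (random.random() < ACC)
    for _ in range(N)
)

print(round(both_correct / N, 2))  # approaches 0.9 * 0.9 = 0.81
```

If the stages' errors were correlated, or if the second stage could correct the first stage's mistakes, the combined accuracy could be better than this; the sketch only shows the worst-case compounding the article warns about.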

Written by

Yu Zhou
Senior Data Scientist at Cloudability | Twitter: @yuzhouyz
