5 Data takeaways from Google, Ada Digital Health, Botco.ai and Verdigris Technologies

Keertan Menon
Published in DataSeries
3 min read · Jul 18, 2018

Panel @ Inescapable at Tech Open Air 2018, Berlin

The Inescapable AI panel at Tech Open Air 2018 in Berlin brought together a diverse group of AI practitioners: Cassie Kozyrkov (Chief Decision Scientist at Google), Claire Novorol (Co-founder and Chief Medical Officer of Ada Digital Health), Rebecca Clyde (Co-Founder of Botco.ai) and Mark Chung (CEO of Verdigris Technologies). Moderated by OpenOcean’s very own Mike Reiner, Co-Founder of World Summit AI and City AI, the discussion covered a range of topics, with a focus on both the simplification and the complexities of the data sets many of us deal with today.

Below are four key takeaways.

1. Choose your models first based on your problem and then based on your data

The more data you have, the greater the statistical power. Simple models, such as a straight-line fit, need comparatively little data. More complex models, such as neural networks, can handle data sets as complex as one likes, but they generally need more data to train. The key is for the model to generalise from the data rather than memorise it, especially in the context of the overarching problem being solved. The problem should always come first when choosing an ML model, with the availability and type of data serving as a useful secondary guide. It should also be noted that extensive knowledge of the different ML techniques is needed to map the problem to the right model effectively.
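To make the "generalise, don't memorise" point concrete, here is a minimal sketch, not something shown by the panel. It assumes a made-up straight-line relationship (`true_fn`), a small noisy training set, and two candidate models fitted with NumPy: a straight line and a degree-11 polynomial flexible enough to pass through every training point. All names and numbers are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def true_fn(x):
    """Hypothetical ground truth: a simple straight line."""
    return 2.0 * x + 1.0


# Twelve noisy training points on [0, 1].
x_train = np.linspace(0.0, 1.0, 12)
y_train = true_fn(x_train) + rng.normal(0.0, 0.3, size=x_train.size)

# Unseen points halfway between the training points, scored against the
# known ground truth (possible only because the data is synthetic).
x_test = (x_train[:-1] + x_train[1:]) / 2
y_test = true_fn(x_test)


def mse(model, x, y):
    """Mean squared error of a fitted model on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))


# A simple model and a model with as many parameters as training points.
line = Polynomial.fit(x_train, y_train, deg=1)
wiggly = Polynomial.fit(x_train, y_train, deg=11)

results = {
    "line": {"train": mse(line, x_train, y_train), "test": mse(line, x_test, y_test)},
    "wiggly": {"train": mse(wiggly, x_train, y_train), "test": mse(wiggly, x_test, y_test)},
}

for name, scores in results.items():
    print(f"{name:>6}: train MSE = {scores['train']:.4g}, test MSE = {scores['test']:.4g}")
```

The flexible polynomial reproduces the training points almost exactly, which is memorisation, yet it does worse than the straight line on the unseen points. The simpler model matches the problem, so it generalises better.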

2. Distinguish between research-side and applied-side

In graduate school, before becoming a Googler, Cassie Kozyrkov studied Mathematical Statistics with a focus on Machine Learning and AI. She recalls that most student researchers, herself included, created algorithms that were not used to solve specific problems directly; instead they were passed on to practitioners and data scientists as tools, and only then applied to real-world problems. In retrospect, she realized that her primary goal as a graduate student was to create algorithms… only to re-create better ones. Now at the helm of Google’s Decision Science department, she looks beyond the algorithm alone. Where researchers and academics build algorithms and then apply them to their data in order to achieve their goals, Cassie works in reverse: she starts from the problem to be solved and only then creates the algorithm. A more concentrated effort must be made to differentiate the research side of AI from the applied side.

3. Cleanup of large data sets takes time, which delays model decisions

More data is not always a good thing. Mark warns against using massive data sets that require a lot of clean-up, especially if you still need to figure out which models to use. Instead, he recommends taking the most pragmatic route. One way to sidestep data clean-up is to create your own synthetic data, which saves time while you are still testing different models. Depending on the objective, this strategy is available to some teams, but it is not always an option.
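As a rough illustration of the synthetic-data idea, and not Mark's actual workflow, the sketch below generates a clean data set from a relationship we choose ourselves and checks whether a candidate model recovers it. The function `make_synthetic`, the coefficients and the noise level are all hypothetical.

```python
import numpy as np


def make_synthetic(n_rows, coefs, noise_sd, seed=0):
    """Generate features X and a target y from a known linear relationship.

    Because we pick `coefs` ourselves, there is nothing to clean and the
    ground truth is known in advance.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, len(coefs)))
    y = X @ np.asarray(coefs) + rng.normal(0.0, noise_sd, size=n_rows)
    return X, y


# Hypothetical ground-truth coefficients chosen for the experiment.
true_coefs = [1.5, -2.0, 0.5]
X, y = make_synthetic(500, true_coefs, noise_sd=0.1)

# Candidate model under test: ordinary least squares.
estimated, *_ = np.linalg.lstsq(X, y, rcond=None)

# How far is the model from the truth we planted in the data?
max_abs_error = float(np.max(np.abs(estimated - np.asarray(true_coefs))))
print("estimated coefficients:", np.round(estimated, 3))
print("largest coefficient error:", round(max_abs_error, 4))
```

A model that cannot recover a relationship planted in clean synthetic data is unlikely to do better on messy real data, so a check like this can rule out candidates before any clean-up effort is spent.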

4. Flawed data is not always easy to spot

It is obvious that bad data leads to bad outcomes. Garbage in, garbage out. However, it is not always easy to spot flawed data, especially biased data. Claire, for instance, mentioned that in certain countries doctors get paid more for treating more severe illnesses. This incentive can subtly influence the data that is logged and, consequently, any ML-based diagnosis built on it. Rebecca mentioned another problem: outdated data, such as job roles. Highly relevant leads, for instance, can become completely irrelevant over time. There will always be a ‘human factor’ when analysing data sets. The process of deciding which parameters to use, models to overfit, biases to instill, and experiments to launch or not launch, not to mention what data is worth feeding into a given system, is all still dictated by us. This is something we must always keep in mind.
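Outdated records are one kind of flawed data that can at least be flagged mechanically. The sketch below is only an assumption about how such a check might look. The record fields (`name`, `job_role`, `last_verified`), the sample leads and the one-year threshold are invented for illustration and do not describe Botco.ai's systems.

```python
from datetime import date, timedelta


def split_stale(records, today, max_age_days=365):
    """Split records into (fresh, stale) by the age of their last verification."""
    cutoff = today - timedelta(days=max_age_days)
    fresh = [r for r in records if r["last_verified"] >= cutoff]
    stale = [r for r in records if r["last_verified"] < cutoff]
    return fresh, stale


# Hypothetical leads; the job role is only as current as `last_verified`.
leads = [
    {"name": "lead_a", "job_role": "Head of Marketing", "last_verified": date(2018, 5, 1)},
    {"name": "lead_b", "job_role": "Marketing Intern", "last_verified": date(2015, 3, 12)},
]

fresh, stale = split_stale(leads, today=date(2018, 7, 18))
print("fresh:", [r["name"] for r in fresh])
print("stale:", [r["name"] for r in stale])
```

A check like this catches only staleness. Biased data of the kind Claire describes leaves no timestamp behind and still needs human judgement.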

Be sure to check out the full video, here!
