Hard Problems in Data Science: Causality, Sequential Learning and Complex Dynamic Theories

In the second of four informal discussion sessions, Professor Maurits Kaptein from Tilburg University discussed the methodological challenges of data science.

‘Before discussing the biggest challenges we face in Data Science, we need to first have some common understanding about what data science actually is.’

Maurits starts his talk by attempting to define data science and to explain the current interest in the topic. He highlights that over the past decades we have become better and better at learning mappings (or functions) from inputs to outputs: due to tremendous advances in AI and machine learning we can now learn very flexible mappings. One example is a model that takes an image as input and outputs which person is in the image. Another takes as input all the characteristics of a consumer and all the products in stock, and outputs the expected monetary value of that customer when pitching a specific product. Our ability to learn very flexible input-output mappings and use these mappings in practical applications is at the core of what drives the recent interest in data science.

However, Kaptein states, the art and science of better learning is primarily the realm of AI and statistical learning; data science has, at least in popular opinion, a broader interpretation. Data science concerns not only efficient methods for learning mappings, but also the social, ethical, legal, and economic consequences of putting these mappings into practice: ‘if a machine-learning algorithm is used to identify you when you enter the airport, and the mapping from the input image to the output name wrongly identifies you as a terrorist, how should we deal with this?’

In this definition, data science broadly studies the uses and consequences of AI and machine learning in practical situations. However, according to Kaptein, these uses also inspire new theoretical and methodological challenges. ‘Where traditional AI and machine learning methods all too often assume that sufficient data is available, that learning does not come at a cost, and that decisions made based on data do not affect the underlying process, such assumptions are in practice very often violated. Consider my own work on computational personalization of treatments in healthcare. While in theory having access to a mapping that predicts exactly the health outcome of a patient given a specific treatment makes the problem of personalized healthcare a trivial optimization problem (namely, choosing the treatment with the best predicted outcome), in practice this application highlights a number of challenges that are too often ignored.’

Kaptein cites three such challenges that face data science:

1. Causality — Finding causes
Using personalized healthcare as an example, Kaptein observes that simple methods of learning mappings on observational data often fail to capture the causal effects of interest. For example, when naively training a model that takes as input the use of chemotherapy for treating breast cancer, and outputs the estimated survival of a patient, one finds that not using chemotherapy improves survival. Hence, using this mapping to choose treatments would dictate not using chemotherapy. However, this erroneously interprets the observational data used to learn the mapping as modeling a causal effect: in reality, the severity of the tumor is a cause of both the decision of the caregiver to administer chemotherapy and the survival rate. Patients with severe tumors are more likely to receive chemotherapy and less likely to survive regardless of treatment, so the raw comparison makes chemotherapy look harmful. Refraining from administering chemotherapy to those with severe tumors may be disastrous.
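The confounding pattern described here can be reproduced in a small simulation. The sketch below is purely illustrative: the cohort is synthetic, and every number in it (treatment probabilities, survival rates, a +0.10 benefit of chemotherapy) is a made-up assumption rather than clinical data. It shows how the naive comparison can get the sign of the effect wrong, while comparing within severity groups recovers the benefit that was built in.

```python
import random

def simulate(n=200_000, seed=0):
    """Synthetic cohort; all probabilities are illustrative assumptions."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5
        # Severity drives the treatment decision (the confounding path) ...
        chemo = rng.random() < (0.8 if severe else 0.2)
        # ... and severity also drives survival. Chemo helps by +0.10 in both groups.
        p_survive = (0.40 if severe else 0.85) + (0.10 if chemo else 0.0)
        survived = rng.random() < p_survive
        rows.append((severe, chemo, survived))
    return rows

def survival_rate(rows, chemo, severe=None):
    """Observed survival rate for a treatment arm, optionally within one severity group."""
    selected = [s for (sev, c, s) in rows
                if c == chemo and (severe is None or sev == severe)]
    return sum(selected) / len(selected)

rows = simulate()

# Naive comparison: ignores severity, so it mixes the treatment effect with confounding.
naive_effect = survival_rate(rows, True) - survival_rate(rows, False)

# Stratified comparison: compares like with like inside each severity group.
effect_severe = survival_rate(rows, True, True) - survival_rate(rows, False, True)
effect_mild = survival_rate(rows, True, False) - survival_rate(rows, False, False)

print(f"naive effect of chemo:     {naive_effect:+.3f}")
print(f"effect among severe cases: {effect_severe:+.3f}")
print(f"effect among mild cases:   {effect_mild:+.3f}")
```

In this toy setting the naive estimate comes out negative even though chemotherapy helps in both groups by construction. Stratifying only works here because the single confounder is observed; in real data that is rarely guaranteed.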

In many data science applications it is not immediately clear whether similar errors are lingering. Very often when we use a mapping to make a decision, we are actually interested in the predicted outcome that would have been realized had the world been different. Estimating such causal effects based on observational data or combinations of observational and experimental data is a huge challenge. While scholars like Rubin and Pearl have made tremendous advances in this field in recent years, Maurits argues that a number of the assumptions that underlie these models, such as the assumption in DAGs that causes are unidirectional, are violated in practice. Furthermore, it is unclear how to use methods such as propensity score weighting when new treatments arrive over time. Hence, we still need to further develop methods that can deal with real-life complexities.
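As a rough illustration of propensity score weighting, the sketch below applies inverse propensity weighting to a synthetic cohort with a fixed set of two treatment options. All numbers are made-up assumptions, the helper names are hypothetical, and the propensity score is estimated simply as the observed treatment frequency per severity group. It does not address the harder case raised above, in which new treatments arrive over time.

```python
import random

def simulate_patients(n=200_000, seed=1):
    """Synthetic (made-up) cohort of (severe, treated, survived) tuples."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        severe = rng.random() < 0.5
        treated = rng.random() < (0.8 if severe else 0.2)
        # Assumed true benefit of treatment: +0.10 survival probability.
        p_survive = (0.40 if severe else 0.85) + (0.10 if treated else 0.0)
        patients.append((severe, treated, rng.random() < p_survive))
    return patients

def estimate_propensity(patients):
    """Estimate P(treated | severity) from observed frequencies."""
    scores = {}
    for level in (False, True):
        flags = [treated for severe, treated, _ in patients if severe == level]
        scores[level] = sum(flags) / len(flags)
    return scores

def ipw_effect(patients, scores):
    """Inverse-propensity-weighted estimate of the average treatment effect."""
    n = len(patients)
    # Each treated patient is weighted by 1 / P(treated | severity) ...
    treated_mean = sum(y / scores[s] for s, t, y in patients if t) / n
    # ... and each untreated patient by 1 / P(untreated | severity).
    control_mean = sum(y / (1 - scores[s]) for s, t, y in patients if not t) / n
    return treated_mean - control_mean

patients = simulate_patients()
scores = estimate_propensity(patients)
ate = ipw_effect(patients, scores)
print(f"estimated propensities: {scores}")
print(f"IPW estimate of the treatment effect: {ate:+.3f}")
```

The weights make the treated and untreated groups resemble the full population with respect to severity. The estimate is only trustworthy if every relevant confounder enters the propensity model, which is exactly the kind of assumption that may fail in practice.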

2. Sequential learning — Learning from our mistakes
Sticking with healthcare, Kaptein touches on the use of data science in decision-making processes for prescribing new treatments or medicines. While the standard learning model in ML and AI is that data is available already, in personalized healthcare we are faced with a very different situation: we do not have much data for many possible treatment-patient pairs. If we want to collect such data, to eventually improve our mapping, we need to administer the treatment. In such situations we need to balance ‘exploration and exploitation’; we need to balance trying out new options with using the knowledge that we have already.

‘Although effective methods of addressing the exploration-exploitation problem exist for stylized problems, we are still faced with a reality in which medical practitioners rely primarily on the randomized clinical trial to gather their evidence; a strategy that is demonstrably suboptimal for sequential learning problems. However, alternative strategies such as Thompson sampling might be computationally challenging, less transparent, and in theory never lead to deterministic choices, properties that might make them infeasible in practice’. Kaptein argues that there is still a lot of work to be done to develop effective sequential learning methods for treatment personalization in healthcare and other domains.
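To make the contrast concrete, here is a minimal sketch of Thompson sampling on a stylized problem: three hypothetical treatments with fixed, made-up success rates and binary outcomes. It is compared with uniform random allocation, used here as a crude stand-in for trial-style randomization. The function names and all parameters are assumptions for illustration only.

```python
import random

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling with a uniform Beta(1, 1) prior per treatment."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k
    failures = [0] * k
    for _ in range(n_rounds):
        # Draw one plausible success rate per treatment from its current posterior.
        draws = [rng.betavariate(successes[i] + 1, failures[i] + 1) for i in range(k)]
        # Administer the treatment whose draw is highest, then observe the outcome.
        arm = draws.index(max(draws))
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    pulls = [s + f for s, f in zip(successes, failures)]
    return sum(successes), pulls

def uniform_allocation(true_rates, n_rounds, seed=0):
    """Assign every patient to a treatment uniformly at random, never adapting."""
    rng = random.Random(seed)
    pulls = [0] * len(true_rates)
    total = 0
    for _ in range(n_rounds):
        arm = rng.randrange(len(true_rates))
        pulls[arm] += 1
        total += rng.random() < true_rates[arm]
    return total, pulls

TRUE_RATES = [0.3, 0.5, 0.7]  # hypothetical success probabilities, unknown to the learner
N_ROUNDS = 5000

ts_successes, ts_pulls = thompson_sampling(TRUE_RATES, N_ROUNDS)
rct_successes, rct_pulls = uniform_allocation(TRUE_RATES, N_ROUNDS)
print(f"Thompson sampling:  {ts_successes} successes, allocations {ts_pulls}")
print(f"uniform allocation: {rct_successes} successes, allocations {rct_pulls}")
```

In this stylized setting Thompson sampling shifts most patients to the best treatment as evidence accumulates. Because every choice depends on a random posterior draw, the allocation never becomes fully deterministic, which is one of the practical concerns raised in the quote.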

3. Complex, dynamic theories
Causality and sequential learning, especially when approached with a focus on actual application rather than the asymptotic behavior of methods in stylized problems, both present a clear challenge for data science. The final challenge Kaptein highlights is of a different nature: ‘Admittedly less obvious than the previous challenges is a data science challenge inspired by the changing role of theory when using “black-box” machine learning methods to learn mappings between input and output. Many ML and AI methods, while often having very good predictive performance, mostly lack transparency and ease of interpretation.

On the one hand, we should be working on making the existing models more transparent, a research program that is already being carried out at JADS. However, I think we should also work in the other direction: in recent years, especially in the social sciences, we have seen the emergence of many complex and dynamic theoretical models that are agent-based. Agent-based models provide very transparent mechanisms, and can potentially provide parsimonious theories to explain complex phenomena such as treatment heterogeneity.

However, our ability to fit these types of theory-driven models to data is currently limited. We need to actively develop ways in which we can bridge the gap between theory and data, both by eliciting theory from black-box models, and by developing methods to fit complex dynamic generative models to real-life data even when the specification of distance measures or likelihoods is unclear.’
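One existing family of likelihood-free methods, approximate Bayesian computation (ABC), gives a feel for what fitting a generative model without a likelihood can look like. It is not necessarily the approach Kaptein has in mind. The sketch below applies rejection ABC to a deliberately tiny, hypothetical agent-based adoption model. The model, its parameter `beta`, the summary statistic, the distance and the tolerance are all assumptions chosen for illustration, and the "observed" value is itself simulated rather than real data.

```python
import random

def simulate_adoption(beta, rng, n_agents=1000, n_steps=6, base_rate=0.05):
    """Toy agent-based model: at every step, each agent that has not yet adopted
    does so with probability base_rate + beta * (current share of adopters).
    Returns the final share of adopters."""
    adopters = 0
    for _ in range(n_steps):
        p_adopt = min(1.0, base_rate + beta * adopters / n_agents)
        new_adopters = sum(rng.random() < p_adopt for _ in range(n_agents - adopters))
        adopters += new_adopters
    return adopters / n_agents

def abc_rejection(observed, n_draws=500, tolerance=0.03, seed=1):
    """Rejection ABC: keep the prior draws whose simulated summary lands close to the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        beta = rng.random()  # assumed prior: beta ~ Uniform(0, 1)
        simulated = simulate_adoption(beta, rng)
        # Distance between summaries: absolute difference in final adoption share.
        if abs(simulated - observed) <= tolerance:
            accepted.append(beta)
    return accepted

TRUE_BETA = 0.3  # used only to generate the synthetic "observed" data
observed = simulate_adoption(TRUE_BETA, random.Random(0))

accepted = abc_rejection(observed)
posterior_mean = sum(accepted) / len(accepted)
print(f"observed final adoption share: {observed:.3f}")
print(f"accepted {len(accepted)} of 500 draws, posterior mean of beta: {posterior_mean:.3f}")
```

The method needs only the ability to simulate from the model, but its results depend on the chosen summary statistic, distance and tolerance. That dependence is the difficulty the quote points to when the right distance measure is unclear.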

About Professor Maurits Kaptein
A social scientist and statistician, Kaptein is primarily interested in quantitative research methods and computational methods for treatment personalization. He is a professor of Data Science & Health at the Statistics and Research Methods group at Tilburg University and a Principal Investigator at JADS where he runs the Computational Personalization lab (and yes, they are hiring!).