Ola Spjuth, PhD
Uppsala University and Scaleout Systems
I recently participated in the 8th Symposium on Conformal and Probabilistic Prediction with Applications in Varna, Bulgaria, on Sept 9–11, 2019. Together with co-authors, I had two manuscripts accepted, and I presented both orally at the conference.
For those of you new to conformal prediction, it is a methodology for complementing predictions with valid measures of confidence. Instead of producing point predictions (like most traditional machine learning methods), a conformal predictor outputs a prediction interval for a confidence level set by the user. Naturally, if the application requires higher confidence, the prediction interval will be wider. Conformal predictors are mathematically proven to be valid under the assumption that the data are exchangeable, meaning that the error rate will not exceed the specified significance level: if you request prediction intervals with 95% confidence, the true value will fall within the interval at least 95% of the time.
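To make this concrete, here is a minimal sketch of one common variant, split (inductive) conformal regression, using the absolute error as nonconformity score. The toy model, the simulated data, and the function name conformal_interval are my own assumptions for illustration and are not taken from the papers discussed below.

```python
import math
import random


def conformal_interval(cal_residuals, point_prediction, confidence):
    """Return (lower, upper) around a point prediction.

    cal_residuals holds the absolute errors |y - y_hat| measured on a
    held-out calibration set that was not used to fit the model.
    """
    n = len(cal_residuals)
    # Rank of the calibration residual that becomes the half-width.
    k = math.ceil((n + 1) * confidence)
    if k > n:
        # Too few calibration examples to support this confidence level.
        return (-math.inf, math.inf)
    half_width = sorted(cal_residuals)[k - 1]
    return (point_prediction - half_width, point_prediction + half_width)


def model(x):
    # Hypothetical, already trained point predictor.
    return 2.0 * x


# Simulated calibration set: y = 2x + Gaussian noise.
random.seed(0)
calibration = []
for _ in range(199):
    x = random.uniform(0, 10)
    calibration.append((x, 2.0 * x + random.gauss(0, 1)))

residuals = [abs(y - model(x)) for x, y in calibration]

narrow = conformal_interval(residuals, model(5.0), confidence=0.80)
wide = conformal_interval(residuals, model(5.0), confidence=0.95)
print("80% interval:", narrow)
print("95% interval:", wide)
```

Requesting 95% instead of 80% confidence selects a larger calibration residual as half-width, so the interval can only stay the same or grow.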
The book Algorithmic Learning in a Random World by Vladimir Vovk, Alex Gammerman, and Glenn Shafer is the main reference, and the website http://www.alrw.net collects many new working papers from the authors' continued research in the field. I have personally worked with applications of conformal prediction to problems in drug discovery, together with collaborators from pharmaceutical companies including AstraZeneca (see list of my research papers).
Combining Prediction Intervals on Multi-Source Non-Disclosed Regression Datasets (Paper 1)
In this paper we explore the case where multiple agents (e.g. companies) would like to make predictions using their collected data sets, but where the training data is held by different agents in separate data sources that cannot be pooled. This could, for example, be due to: i) Regulatory reasons (you are not allowed to move data from location A to location B, e.g. for medical data); ii) Privacy reasons (e.g. sensitive, personal data); iii) Security reasons (you do not trust the security of the other parties); and iv) Practical reasons (e.g. data size is large).
Here we consider the regression case and propose a method where a conformal predictor is trained on each data source independently, and where the prediction intervals are then combined into a single interval. We call the approach Non-Disclosed Conformal Prediction (NDCP), and we evaluate it on a regression dataset from the UCI machine learning repository using support vector regression as the underlying machine learning algorithm, with a varying number of data sources and sizes.
Our results indicate improved performance when comparing a combined prediction with the individual sources, and the effect is more pronounced for 4 and 6 data sources than when only 2 sources are combined. This approach has the advantage that it is rather simple to implement in real-world scenarios, and the only data sent between the agents are the new object to predict and the resulting prediction intervals to aggregate.
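The data flow can be sketched in a few lines of Python. Note the assumptions: the four agents below are hypothetical stand-ins that return a fixed-width interval around a local point prediction, and taking the median of the lower and upper bounds is just one possible aggregation rule that I chose for illustration. See the paper for the combination method we actually evaluated and for its properties.

```python
from statistics import median


def combine_intervals(intervals):
    """Aggregate per-agent (lower, upper) intervals into one interval.

    Assumption: the median of the bounds is used as aggregation rule.
    """
    lowers = [lower for lower, _ in intervals]
    uppers = [upper for _, upper in intervals]
    return (median(lowers), median(uppers))


def make_agent(slope, half_width):
    """Hypothetical agent: a local model plus a local interval half-width.

    In practice each agent would train a conformal predictor on its own
    private data; here the result is hard-coded to keep the sketch short.
    """
    def predict_interval(x):
        point = slope * x
        return (point - half_width, point + half_width)
    return predict_interval


agents = [
    make_agent(2.0, 1.5),
    make_agent(2.25, 1.5),
    make_agent(1.75, 2.0),
    make_agent(2.5, 1.5),
]

# Only the new object goes out to the agents, and only intervals come back.
new_object = 4.0
intervals = [agent(new_object) for agent in agents]
combined = combine_intervals(intervals)
print(intervals)
print(combined)  # (7.0, 10.0)
```

No training data leaves any agent in this scheme, which is the point of the non-disclosed setting.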
Spjuth, O., Brännström, R.C., Carlsson, L. and Gauraha, N.
Combining Prediction Intervals on Multi-Source Non-Disclosed Regression Datasets
Proceedings of the Eighth Symposium on Conformal and Probabilistic Prediction and Applications, Proceedings of Machine Learning Research (PMLR) 105:53–65, 2019.
Split Knowledge Transfer in Learning Under Privileged Information Framework (Paper 2)
Learning Under Privileged Information (LUPI) is a methodology that enables the inclusion of additional (privileged) information when training machine learning models, that is, data that is not available at prediction time. One example could be training a model on high-resolution imaging data together with low-resolution data, but making predictions only on low-resolution data. Vapnik et al. showed that this has advantages in a paper on drones with two types of cameras, where the drones had to fly faster at prediction time and could only use the low-resolution camera.
In our work, we improved the accuracy of the LUPI implementation referred to as Knowledge Transfer, where privileged information is estimated from standard features using regression functions. Inspired by the cross-validation approach, we propose to partition the training data into K folds and use each fold for learning a transfer function and the remaining folds for approximations of privileged features. We evaluate the method using four different experimental setups comprising one synthetic and three real datasets. The results indicate that our approach leads to improved accuracy as compared to LUPI with standard knowledge transfer.
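The fold scheme described above can be sketched as follows. This is a simplified illustration under my own assumptions, not the implementation from the paper: there is a single standard feature and a single privileged feature, the transfer function is an ordinary least-squares line, and the approximations that each sample receives from the other folds are averaged.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def split_knowledge_transfer(xs, ps, k):
    """Approximate the privileged feature ps from the standard feature xs.

    Each fold is used to learn a transfer function, which is then applied
    to the samples in the remaining folds. Averaging the resulting
    approximations per sample is an assumption made for this sketch.
    """
    n = len(xs)
    folds = [list(range(start, n, k)) for start in range(k)]
    sums = [0.0] * n
    counts = [0] * n
    for fold in folds:
        slope, intercept = fit_line([xs[i] for i in fold],
                                    [ps[i] for i in fold])
        in_fold = set(fold)
        for i in range(n):
            if i not in in_fold:
                sums[i] += slope * xs[i] + intercept
                counts[i] += 1
    return [total / count for total, count in zip(sums, counts)]


# Toy data where the privileged feature is an exact linear function of x.
xs = [float(i) for i in range(12)]
ps = [3.0 * x + 1.0 for x in xs]
approx = split_knowledge_transfer(xs, ps, k=3)
print(approx[:3])
```

The approximated privileged features would then be used together with the standard features to train the final model, so that no sample is approximated by a transfer function that was fitted on that same sample.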
Gauraha, N., Söderdahl, F. and Spjuth, O.
Split Knowledge Transfer in Learning Under Privileged Information Framework
Proceedings of the Eighth Symposium on Conformal and Probabilistic Prediction and Applications, Proceedings of Machine Learning Research (PMLR) 105:43–52, 2019.
About the author: Ola Spjuth is Associate Professor at Uppsala University (research group website: https://pharmb.io) and Lead Scientist AI at Scaleout Systems. His research interests are mainly in applying AI and Machine Learning to automated high-throughput cell profiling with applications in drug discovery, but also conformal prediction and privacy-preserving machine learning.
Scaleout Systems is a Swedish company focused on developing next-generation AI systems for privacy-preserving Federated Learning.