This year’s 25th edition focused on data science, data mining and large-scale data analytics. We couldn’t miss it, so 11 of us made the long trip to attend the KDD conference in Anchorage, Alaska, from August 4 to August 8.
KDD’19 was a five-day conference with 34 workshops and 12 hands-on tutorials. It was organized around two main tracks, research and applied data science, alongside keynote talks and tutorials:
- The research track welcomed researchers from both academia and industry.
- The applied data science track focused on practitioners.
- Keynote talks were given by leading researchers and domain experts.
- Hands-on tutorials had a lesson format: the aim was to introduce attendees to (proprietary and open-source) technology stacks by running simple exercises or reproducing results presented in previous papers.
The main themes of this edition were:
- ethics and social impact of data science
- data science applied to healthcare
- deep learning
- explainable AI
The opening speech introduced two main topics to the audience: the community’s increasing focus on healthcare, and new ethical reflections arising from the increasing availability of data and applications. Fairness of machine learning algorithms is set to be a hot topic.
We cannot cover all the papers presented at this conference: more than 300 were accepted! Here you’ll find a summary of the highlights according to the Criteo delegation.
Criteo co-organized the AdKDD workshop, which highlights state-of-the-art advances in computational advertising. Topics included the following.
Incrementality in online advertising
Vincenzo D’Elia from Criteo gave a talk titled “On the causality of advertising”. He gave an overview of where Criteo sits in the ad ecosystem, described the problem of sales attribution and its lack of solid foundations in terms of incentive compatibility and pay-off for advertisers, and discussed A/B test protocols for incrementality.
Find his slides here.
Advertising systems and techniques
Taobao presented their advertising pipeline and the optimizations applied to each of its components: from ad matching to conversion likelihood estimation, ending with their impression allocation strategy. The key takeaway was that deep learning has been adopted in almost every component, with the team iterating over different network architectures.
Budget-constrained Auctions and ad-space allocation
There were several talks, with fairly different setups, on placement allocation and optimal bidding strategies. For instance, C. Pita presented a model of optimal bidding in RTB ecosystems based on a convex relaxation of the optimization problem. Another example, on ad-space allocation in search, came from Aranyak Mehta from Google, who presented a model that maximizes advertiser value in an internal auction under different constraints, together with a general optimization framework for modelling constraints such as budget limits or target CPA.
Best paper award: Graphing Crumbling Cookies
The “best paper” award of the AdKDD workshop went to Graphing Crumbling Cookies, by M. Mallow, J. Koller and A. Cahn. For the past few years, web browsers have increasingly limited the persistence of identifiers (cookies), making user tracking more difficult. A revealing example is Safari’s Intelligent Tracking Prevention. This paper presents a clever way to overcome the lack of persistent identifiers without infringing on user privacy, that is, without using browser fingerprinting. It uses community detection in the Device Graph to detect stable cohorts (person-level or household-level groupings). It is then possible to find the IP addresses associated with a cohort over time and thus define a persistent ID based on these IP addresses. This technique is called graph backfilling. It reaches its limits when many people share the same IP address or when IP addresses are dynamic. This is why it works like a charm in the US, but is more difficult to apply in China.
If you want to know more about it, check out their paper.
The full list of talks with the corresponding slides can be found at the following link.
Applied Data Science — guest talks
In his talk “Friends don’t let friends deploy black-box models”, Rich Caruana presented the importance of intelligibility and interpretability of machine learning models. Many machine learning researchers believe that if you train a deep net on enough data and it looks accurate on the test set, it is safe to deploy to production. In some contexts this is true, but in some specific settings it can be extremely risky. In a study he conducted in the nineties on predicting the risk of death from pneumonia, the most accurate model was a neural net. However, the team noticed that a much simpler rule-based algorithm had learned that asthma reduced the risk of death when pneumonia occurred. Doctors confirmed that asthmatics are in fact high risk, yet the pattern in the data was real: asthmatics notice symptoms sooner, get healthcare faster and receive more aggressive treatment. The neural net had presumably learned the same pattern, and possibly others that nobody could inspect. Eventually, they decided not to use the neural network in the US healthcare system, even though it performed best on test data.
Caruana advocated the use of GAMs (generalized additive models) in this context: their accuracy is comparable with that of neural nets on this task, and they are highly interpretable by domain experts. Given the application (in this case, a decision on the treatment of ill people), he also proposed manually editing the model based on domain experts’ knowledge.
Microsoft and Healthcare
In his talk “The Unreasonable Effectiveness, and Difficulty, of Data in Healthcare”, Peter Lee from Microsoft described the potential of ML in healthcare, as well as the challenges the company is facing.
Satya Nadella has defined a new strategy for the company, which is shifting more and more towards healthcare. Microsoft has many partnerships with medical institutions and hospitals to collect data and provide analytics on top of it, so it is natural to use machine learning techniques to create innovative products in this area.
The possibilities are endless, ranging from systems that assist radiologists in delineating tumours and computer vision systems that help diagnose them, to graph and knowledge extraction from medical papers (4000 new papers are published every day on PubMed!). ML can also bring value where you might not expect it: Merritt Hawkins found in a 2018 survey that 78% of doctors suffered from symptoms of burnout. Medical visits are a particularly stressful task: the doctor has to take accurate notes during the visit while maintaining empathy with the patient. Microsoft is building an assistant that automatically takes notes, so that the doctor can keep eye contact with the patient, and then interprets the text to extract medical concepts. The doctor can review the notes and ultimately remains in full control of the process. The system is there to assist, and it learns from past corrections, progressively reducing the number of interventions the doctor has to make.
The main challenge in the field is data collection, since modern standards for health data are lacking. Within a consortium that includes Google, IBM, Oracle and Salesforce, they are backing FHIR, a standard that provides data models, API specifications for exchanging data, and a set of tools and servers to build applications with. The US government is promoting FHIR as the data standard for health.
FHIR is a first-class citizen in Azure, and a server is published on GitHub. Retailers (pharmacies) are integrating it too.
He closed his talk with the message that, ultimately, we do not know how good AI is at prediction in medicine. Papers often have statistical and methodological issues, and we lack a real perspective.
The paper describes a variation of kd-trees for nearest neighbor search with favourable probabilistic guarantees. The method is inspired by random partition trees but is simpler. The algorithm applies a random rotation to the data and then builds an ordinary axis-aligned kd-tree. The search is a defeatist procedure: it looks for the nearest neighbor only in the leaf where the query would be inserted, without backtracking (each leaf contains a fixed number of points). Using multiple trees with different rotations reduces the probability of a miss, just as in random projection-based methods such as LSH. The authors also use approximate schemes for the random rotations to reduce the computation time of a search query. The final approximate algorithm runs in O(d log d + log n) time, with n the number of points in the database and d the dimension of the data. This is a better query complexity than that of vanilla kd-trees, which is logarithmic in n but exponential in d. The figure below, from the paper, illustrates the idea: the search space of three rotated search trees is the union of the individual search regions.
Check the reference paper here: Revisiting Kd-Trees for nearest neighbor search.
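To make the idea concrete, here is a minimal sketch of a forest of randomly rotated kd-trees with defeatist search, using only the Python standard library. It is our own illustration, not the authors’ code: the names `RotatedKdForest`, `random_orthogonal` and `leaf_of` are hypothetical, the rotation is an exact random orthogonal transform built with Gram-Schmidt (so it costs O(d²) per query rather than the O(d log d) of the approximate rotations used in the paper, and it may include a reflection), and the median split rule is an assumption.

```python
import math
import random


def random_orthogonal(dim, rng):
    """Random orthonormal basis (as rows), via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip (extremely unlikely) degenerate draws
            rows.append([a / norm for a in v])
    return rows


def rotate(rot, point):
    """Apply the orthogonal transform to one point."""
    return [sum(a * b for a, b in zip(row, point)) for row in rot]


def build(points, ids, depth, leaf_size):
    """Ordinary axis-aligned kd-tree over the (already rotated) points."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    axis = depth % len(points[0])
    ids = sorted(ids, key=lambda i: points[i][axis])
    mid = len(ids) // 2
    split = points[ids[mid]][axis]
    return ("node", axis, split,
            build(points, ids[:mid], depth + 1, leaf_size),
            build(points, ids[mid:], depth + 1, leaf_size))


def leaf_of(tree, query):
    """Defeatist descent: follow the splits down to one leaf, never backtrack."""
    while tree[0] == "node":
        _, axis, split, left, right = tree
        tree = left if query[axis] < split else right
    return tree[1]


class RotatedKdForest:
    def __init__(self, data, n_trees=3, leaf_size=4, seed=0):
        # leaf_size must be at least 1.
        rng = random.Random(seed)
        self.data = data
        self.trees = []
        for _ in range(n_trees):
            rot = random_orthogonal(len(data[0]), rng)
            rotated = [rotate(rot, p) for p in data]
            tree = build(rotated, list(range(len(data))), 0, leaf_size)
            self.trees.append((rot, tree))

    def query(self, q):
        """Index of the closest point among the leaves reached in each tree."""
        candidates = set()
        for rot, tree in self.trees:
            candidates.update(leaf_of(tree, rotate(rot, q)))
        return min(candidates, key=lambda i: math.dist(self.data[i], q))
```

Each query inspects at most `n_trees * leaf_size` candidates, so the answer is approximate: adding trees trades query time for a lower chance of missing the true nearest neighbor.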
The authors propose an extension of the k-means problem that relaxes the constraint that each point belongs to exactly one cluster. Instead, they assume that each point belongs to each centroid with some probability. The optimization problem seeks the solution that minimizes the loss over both the centroids and the probability vectors. The authors then propose an alternating optimization scheme and an equivalent formulation as a constrained bipartite partitioning problem. The main motivation for the method is that it can capture non-convex clusters.
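The alternating scheme can be illustrated with a generic soft k-means loop: fix the centroids and update the membership probabilities, then fix the probabilities and update the centroids. The sketch below is a stand-in, not the paper’s algorithm: the softmax membership rule and the `temperature` parameter are our assumptions, and the bipartite partitioning step is not shown.

```python
import math


def soft_assign(points, centroids, temperature):
    """Membership probabilities of each point over the centroids.

    Assumption: a softmax over negative squared distances.
    """
    probs = []
    for p in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        closest = min(d2)  # subtract the minimum for numerical stability
        weights = [math.exp(-(d - closest) / temperature) for d in d2]
        total = sum(weights)
        probs.append([w / total for w in weights])
    return probs


def update_centroids(points, probs, centroids):
    """Each centroid becomes the probability-weighted mean of all points."""
    dim = len(points[0])
    updated = []
    for j, old in enumerate(centroids):
        mass = sum(row[j] for row in probs)
        if mass == 0.0:  # centroid that attracts no mass: leave it in place
            updated.append(list(old))
            continue
        updated.append([
            sum(row[j] * p[t] for row, p in zip(probs, points)) / mass
            for t in range(dim)
        ])
    return updated


def soft_kmeans(points, centroids, temperature=1.0, n_iter=50):
    """Alternate between the two updates for a fixed number of iterations."""
    for _ in range(n_iter):
        probs = soft_assign(points, centroids, temperature)
        centroids = update_centroids(points, probs, centroids)
    return centroids, soft_assign(points, centroids, temperature)
```

A lower `temperature` makes the memberships sharper and brings the behaviour closer to ordinary k-means.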
Optimizing Impression Counts for Outdoor Advertising
This paper is interesting because it transposes retargeting to the physical world. The authors aim to solve the problem of deploying ads on billboards in order to maximize influence, or impression counts. The problem setup assumes a set of billboards and a set of trajectories, and models the influence of a billboard as a logistic function of the number of times the ad is seen by an individual driver on their journeys. In other words, the influence starts small, then increases rapidly, and then plateaus. Finding an assignment that maximizes the overall influence is NP-hard, so the authors propose a branch-and-bound scheme and use a submodular estimation of the logistic influence function.
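As a rough illustration of the setup, the sketch below scores a set of billboards with a logistic influence function and selects them greedily. The greedy loop is a simple baseline that we swap in for readability: it is not the authors’ branch-and-bound method, and because the logistic function is not submodular it carries no optimality guarantee. The parameters `steepness` and `midpoint` and the `coverage` data layout are hypothetical.

```python
import math


def influence(times_seen, steepness=1.0, midpoint=2.0):
    """Logistic (S-shaped) influence of seeing the ad `times_seen` times."""
    if times_seen == 0:
        return 0.0  # no exposure, no influence
    return 1.0 / (1.0 + math.exp(-steepness * (times_seen - midpoint)))


def total_influence(selected, coverage, **params):
    """Sum the influence over trajectories, given the selected billboards.

    `coverage` maps a billboard to the set of trajectories that pass by it.
    """
    seen = {}
    for billboard in selected:
        for trajectory in coverage[billboard]:
            seen[trajectory] = seen.get(trajectory, 0) + 1
    return sum(influence(n, **params) for n in seen.values())


def greedy_select(coverage, budget, **params):
    """Repeatedly add the billboard that yields the highest total influence."""
    selected = []
    for _ in range(min(budget, len(coverage))):
        remaining = sorted(b for b in coverage if b not in selected)
        best = max(
            remaining,
            key=lambda b: total_influence(selected + [b], coverage, **params),
        )
        selected.append(best)
    return selected


# Hypothetical toy data: billboard -> trajectories passing in front of it.
coverage = {"A": {1, 2, 3}, "B": {1, 2}, "C": {4}}
chosen = greedy_select(coverage, budget=2, midpoint=2.0)
```

With an S-shaped influence curve, showing the ad a second time to the same drivers can be worth more than reaching a new driver once, which is what separates this problem from plain coverage maximization.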
Deep Learning at Scale
In this session, we went from a single-node deep learning model to distributed model inference, and finally to distributed model training and productionization. We used Keras with a TensorFlow backend for the deep learning model. We leveraged Spark to distribute the computations across the workers and Horovod to distribute the model training. Furthermore, in order to track and reproduce our numerous experiments, we used MLflow, an open-source platform for managing the end-to-end machine learning lifecycle. One of the main advantages of MLflow is that it is library-agnostic: you can use it with any machine learning or deep learning model. It is even possible to mix different programming languages such as Scala, Python and R.
The slides are available here.
Concept to Code: Deep Neural Conversational System
This session showed a few deep learning algorithms for NLP. The repository with notebooks and paper references is available here.
Democratizing & Accelerating AI through Automated Machine Learning
This tutorial gave an introduction to AutoML tools in the Microsoft environment. Several reasons were given for using AutoML. It helps improve models without the AutoML tools needing access to the data, which stays with its owner. It democratizes ML, enabling domain experts and data scientists to focus on business problems, and developers to prototype ML-based products. Finally, it accelerates the work of data scientists, who can leave hyperparameter tuning to automatic smart tools and manage many more models than they can today.
Code for the tutorial is available here on GitHub.
“Tensorized Determinantal Point Process for Basket Completion” — our AI Lab poster session
There is a lot of exciting research happening at Criteo, and we contributed to this year’s KDD with our work on Tensorized Determinantal Point Process for Recommendation. This work focuses on learning to predict the next item that should be added to an online shopping basket.
More precisely, the objective of basket completion is to suggest to a user one or more items based on the items already in her cart. Early approaches involved computing a collection of rules, and recommending according to all the rules whose conditions are satisfied. This is computationally very heavy and does not scale to large catalogues. An alternative approach is based on determinantal point processes (DPPs), which model co-purchase probability using an item-item similarity kernel matrix and the determinants of its submatrices. The main contribution of this work is to generalize previous work on DPPs for basket completion using a tensorized approach enhanced by logistic regression. This new model allows us to capture ordered basket completion: we can leverage the order in which items are added to a basket to improve predictive quality.
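To show how determinants of kernel submatrices turn into recommendation scores, here is a minimal sketch of a plain (non-tensorized) DPP. A candidate item is scored by the ratio between the determinant of the kernel restricted to the basket plus the candidate and the determinant restricted to the basket alone. This is a textbook L-ensemble computation, not the tensorized model of the paper; the embeddings are hypothetical toy values and the function names are ours.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in matrix]
    n = len(a)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result  # a row swap flips the sign
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def kernel_from_embeddings(vectors):
    """Low-rank kernel L = V V^T built from item embeddings."""
    return [[sum(x * y for x, y in zip(u, v)) for v in vectors] for u in vectors]


def submatrix(kernel, items):
    """Kernel restricted to the rows and columns of `items`."""
    return [[kernel[i][j] for j in items] for i in items]


def completion_scores(kernel, basket):
    """Score each candidate i by det(L[basket + i]) / det(L[basket])."""
    base = det(submatrix(kernel, basket))
    if base <= 0.0:
        raise ValueError("the basket has zero probability under this kernel")
    return {
        item: det(submatrix(kernel, basket + [item])) / base
        for item in range(len(kernel))
        if item not in basket
    }


# Hypothetical embeddings: item 1 is a near-duplicate of item 0, item 2 differs.
embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
scores = completion_scores(kernel_from_embeddings(embeddings), basket=[0])
```

In this toy example the near-duplicate item gets a much lower score than the dissimilar one, since a DPP favours sets of items whose embeddings are not redundant.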
This year’s KDD was a really good opportunity to learn about the main themes of interest to the data science industry. Themes like interpretability and healthcare were excellent examples of where data mining is heading. We’re already looking forward to attending the 2020 edition in San Diego!
Authors: Vincenzo D’Elia, Nasreddine Fergani, Mathieu Lechine, Edouard Mulliez, Zofia Trstanova.