This article looks at recent trends in data science, ML, and AI and suggests subareas that data science groups should focus on.
Productionization of Machine Learning
This is going to be the most important focus area for 2018. Most enterprises have done proofs of concept on ML and are now looking to realize the full value of their data with full-fledged production implementations of the algorithms. A key technology in this space is Clipper, a state-of-the-art ML serving system from RISELab at UC Berkeley. It uses distributed computing concepts to scale models, containerized deployment to handle models created on any platform, and cross-framework caching and batching to exploit parallel hardware such as GPUs. Finally, Clipper can also perform cross-framework model composition using ML techniques such as ensembling and multi-armed bandits.
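To make the bandit idea concrete, here is a minimal sketch of choosing among several deployed models based on feedback. It is an illustration only and does not use Clipper's API: the class name, the model names, and the simulated accuracies are all hypothetical, and a simple epsilon-greedy policy is used here for brevity rather than whatever selection policy a production serving system would apply.

```python
import random


class EpsilonGreedySelector:
    """Pick one of several deployed models per query, learning from feedback."""

    def __init__(self, model_names, epsilon=0.1, seed=0):
        self.names = list(model_names)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {name: 0 for name in self.names}
        self.mean_reward = {name: 0.0 for name in self.names}

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.names)
        return max(self.names, key=lambda n: self.mean_reward[n])

    def update(self, name, reward):
        # Incremental mean of observed rewards (1.0 = correct, 0.0 = wrong).
        self.pulls[name] += 1
        self.mean_reward[name] += (reward - self.mean_reward[name]) / self.pulls[name]


def simulate(accuracies, rounds=5000, seed=1):
    """Simulate feedback where each (hypothetical) model is correct with a fixed probability."""
    rng = random.Random(seed)
    selector = EpsilonGreedySelector(accuracies, epsilon=0.1, seed=seed)
    for _ in range(rounds):
        name = selector.choose()
        reward = 1.0 if rng.random() < accuracies[name] else 0.0
        selector.update(name, reward)
    return selector


selector = simulate({"model_a": 0.70, "model_b": 0.85, "model_c": 0.60})
best = max(selector.pulls, key=selector.pulls.get)
print(best, selector.pulls)
```

Over many rounds the selector should route most traffic to the model with the highest observed accuracy while still sampling the others occasionally, which is what lets it notice when a different model becomes the better choice.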
Another interesting technology relates to model selection, referred to as AutoML, similar to Michelangelo from Uber. Many open source frameworks, such as TPOT and auto-sklearn, have emerged that automatically search the space of candidate models and hyper-parameters and select the best model for the task at hand. A related need is model management: the ability to keep track of hundreds of models in production, including the model and analytics lineage (which metrics were evaluated for each model, on which data sets, and with what results), essentially the whole cycle of model development, retraining, and refresh. ModelDB is one possible choice in this space.
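The core loop behind automated model selection can be sketched in a few lines: enumerate candidate (model family, hyper-parameter) pairs, score each on held-out data, and keep the best. The sketch below is a toy under stated assumptions, not how TPOT or auto-sklearn work internally (they use far more sophisticated search strategies); the two model families, the synthetic data, and all function names are hypothetical.

```python
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_predict(train, x):
    """Baseline: ignore x and predict the training mean."""
    return sum(y for _, y in train) / len(train)


def mse(predict, train, valid):
    """Mean squared error of a predictor on the validation set."""
    return sum((predict(train, x) - y) ** 2 for x, y in valid) / len(valid)


def search(train, valid, ks=(1, 3, 5, 9, 15)):
    """Evaluate every (model family, hyper-parameter) pair; return the best."""
    candidates = {("mean", None): mean_predict}
    for k in ks:
        candidates[("knn", k)] = lambda tr, x, k=k: knn_predict(tr, x, k)
    scores = {name: mse(fn, train, valid) for name, fn in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data (an assumption for illustration): y = 2x plus Gaussian noise.
rng = random.Random(0)
data = [(x, 2.0 * x + rng.gauss(0, 1.0))
        for x in (rng.uniform(0, 10) for _ in range(200))]
train, valid = data[:150], data[150:]
best, scores = search(train, valid)
print(best, round(scores[best], 3))
```

Recording `scores` alongside the data split used to produce them is also the kind of lineage information a model management tool would be expected to track.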
Another interesting piece of work in this space comes from MapR. It goes by the name of Machine Learning Logistics, and its key tenet is the Rendezvous architecture. Key elements of the Rendezvous architecture include stream-based microservices and containerization, as well as a DataOps style of design that facilitates canaries and decoys. A decoy is a model that performs no operation on the input data; it is used to persist a copy of the input streams. A canary is a model that provides a reference or baseline against which other models can be compared. The Rendezvous architecture allows models to be monitored closely and model drift to be measured, so that newer models can take over slowly and seamlessly.
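The roles of the decoy and the canary can be illustrated with a small in-process sketch. This is a simplification under stated assumptions: a real Rendezvous deployment would run each model as a separate stream-based microservice, whereas here plain function calls stand in for the streams, and the two threshold models, the `rendezvous` function, and the sample inputs are hypothetical.

```python
class Decoy:
    """Records every input it sees and returns no prediction."""

    def __init__(self):
        self.log = []

    def predict(self, x):
        self.log.append(x)
        return None


def canary(x):
    # Hypothetical stable baseline model: a fixed threshold rule.
    return 1 if x >= 0.5 else 0


def challenger(x):
    # Hypothetical newer model with a slightly different threshold.
    return 1 if x >= 0.6 else 0


def rendezvous(inputs, decoy, models, primary):
    """Send each input to every model; return the primary model's results
    and each other model's disagreement rate against the canary."""
    results = []
    disagreements = {name: 0 for name in models if name != "canary"}
    for x in inputs:
        decoy.predict(x)  # persist a copy of the input
        outputs = {name: model(x) for name, model in models.items()}
        for name in disagreements:
            if outputs[name] != outputs["canary"]:
                disagreements[name] += 1
        results.append(outputs[primary])
    rates = {name: count / len(inputs) for name, count in disagreements.items()}
    return results, rates


inputs = [0.1, 0.55, 0.58, 0.7, 0.9, 0.3, 0.62, 0.45]
decoy = Decoy()
models = {"canary": canary, "challenger": challenger}
results, rates = rendezvous(inputs, decoy, models, primary="canary")
print(results, rates, len(decoy.log))
```

Tracking the disagreement rate against the canary over time gives one simple signal of drift, and the decoy's log allows the exact production inputs to be replayed against a candidate model before it takes over.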
Deep Learning is Here to Stay
Some recent papers have exposed limitations of deep learning, including the well-known article by Gary Marcus and this article in KDnuggets. Marcus argues that deep learning’s limitations arise primarily from having only finite data to train on, while it may actually need infinite data for perfect generalization. The KDnuggets article shows how deep networks trained for classification can be easily fooled by perturbations in input images, as well as by random (nonsensical) images that generate high-confidence misclassifications.
However, deep learning is quite useful and can be combined with other known ML techniques to produce great results. The Google DeepMind team showed this with its Atari game-playing system, which combined deep learning with reinforcement learning, and with AlphaGo, the Go-playing system that combined deep learning with Monte Carlo tree search.
One of the common issues in using deep learning is hyper-parameter tuning. A recent approach has shown how reinforcement learning can be used to compose certain forms of recurrent networks that significantly outperform existing systems.
Another recent paper has shown how deep learning can be used to predict the quantum mechanical properties of small molecules. It shows that a specific form of deep learning known as message passing neural networks can be applied to graph-structured data and is invariant to graph isomorphism.
Implicit vs. Explicit Signals
People lie, especially in surveys, so the traditional method of understanding user behaviour through surveys seems to be falling flat. This was evident at Netflix, which encountered classic films that users rated very highly but did not actually watch. It is also evident in Google searches and was the subject of a recent Strata Data keynote. The Google researcher made the point that in a survey of Maryland graduates, only 2% said their CGPA was below 2.5, while in reality 11% had a CGPA below 2.5. Similarly, in another survey, 40% of engineers in an organization said they were in the top 5% of engineers in the company. Google searches are a digital truth serum: people are more truthful in them than in surveys or on other platforms. This tells us that we should look for implicit signals (such as what people actually watch on Netflix) rather than depend on explicit feedback through surveys to understand consumer behaviour. This is also evident in the recommendation systems developed by Pinterest, as documented in this keynote speech. They used implicit signals of how users interact with pins (saving pins, repinning other users’ pins, searching for pins, and disliking or ignoring recommendations) and were able to recommend appropriate personalized content through a graph-based recommendation engine.
Model interpretability or explainability is the ability of an ML algorithm to explain why it makes a particular prediction. For example, a system may conclude that a transaction is fraudulent because it resembles a number of fraud cases seen in the training data. Model interpretability is becoming important as more and more ML models are put into production in domains such as finance, insurance, retail, and even healthcare. Emerging technologies in this space include Skater and LIME, among others.
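One simple way to see what an explanation for the fraud example might look like is to perturb one feature at a time and watch how the model's score changes. The sketch below is an assumption-laden illustration and is not the algorithm used by LIME or Skater: the `fraud_score` model, its coefficients, the feature names, and the "typical" baseline transaction are all hypothetical.

```python
import math


def fraud_score(tx):
    """Hypothetical black-box model: returns a fraud probability for a transaction."""
    z = 0.004 * tx["amount"] + 1.5 * tx["foreign"] + 0.8 * tx["night"] - 3.0
    return 1.0 / (1.0 + math.exp(-z))


def explain(model, instance, baseline):
    """Score change when each feature is reset to its baseline value, one at a time."""
    full = model(instance)
    contributions = {}
    for feature in instance:
        perturbed = dict(instance)
        perturbed[feature] = baseline[feature]
        contributions[feature] = full - model(perturbed)
    return full, contributions


tx = {"amount": 900.0, "foreign": 1, "night": 0}
typical = {"amount": 50.0, "foreign": 0, "night": 0}
score, contributions = explain(fraud_score, tx, typical)
print(f"fraud score: {score:.3f}")
for feature, delta in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature}: {delta:+.3f}")
```

Features whose reset causes the largest drop in the score are the ones driving the prediction for this particular transaction, which is the kind of per-prediction explanation a reviewer in finance or insurance would typically ask for.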
AI Beyond Deep Learning
Libratus, another recent game-playing system, combined Nash equilibrium and game theory to solve imperfect-information games such as poker, with applications in dynamic competitive pricing and product portfolio optimization. The three main components of Libratus are:
- The game abstracter, which computes an abstraction of the game that is smaller and easier to solve, and also computes game-theoretic strategies for the abstraction.
- The second module, which constructs a fine-grained abstraction of the sub-game (the state of the game after a couple of rounds) and solves it using a technique known as nested sub-game solving.
- The self-improver, which enhances the blueprint strategy for the game by filling in missing parts of the blueprint abstraction and computing game-theoretic strategies for those branches.
Reinforcement learning is another important tool in the data scientist’s toolbox. It is now being combined with deep learning to develop deep reinforcement learning networks, such as the one from Google.
Privacy in the Age of Machine Learning
Data privacy and the consequent data protection laws, including the General Data Protection Regulation (GDPR), are important in this respect. GDPR, for instance, has rules stating that a user’s consent is necessary even to collect personal data. Users also have the right to ask questions about the data being collected, to modify or delete the data, and to object to the use of personal data for targeting if they so desire. Data processors also have certain obligations under GDPR. GDPR will influence the design of all data products in 2018.
It is important to secure not only the data pipelines and the infrastructure but also the BI and the analytics. This is where privacy-preserving analytics becomes relevant, and it is going to assume great importance in 2018. A relevant technology for securing BI is recent joint work between Uber and RISELab at UC Berkeley, which guarantees differential privacy for SQL queries based on what is known as elastic sensitivity, an approach that combines a local sensitivity mechanism with support for general equijoins.
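The basic building block of differentially private query answering is to add noise calibrated to the query's sensitivity. The sketch below shows the standard Laplace mechanism for a simple count, as an illustration only. It does not implement elastic sensitivity: the `sensitivity` parameter stands in for whatever bound such an analysis would supply (1 is the usual value for a count over a single table without joins), and the `trips` rows and function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(rows, predicate, epsilon, sensitivity=1.0, rng=None):
    """Return a noisy count; the noise scale is sensitivity / epsilon."""
    rng = rng or random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical table, roughly: SELECT COUNT(*) FROM trips WHERE city = 'SF'
trips = [
    {"city": "SF", "fare": 12.0},
    {"city": "NYC", "fare": 30.0},
    {"city": "SF", "fare": 8.5},
]


def is_sf(row):
    return row["city"] == "SF"


noisy = private_count(trips, is_sf, epsilon=0.5, rng=random.Random(42))
print(round(noisy, 2))
```

A smaller `epsilon` means stronger privacy and noisier answers. Queries with joins can have much higher sensitivity than 1, which is why a technique for bounding the sensitivity of general SQL queries matters in practice.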
Privacy-preserving analytics, especially for deep learning models, can be achieved using a technique called federated learning, popularized by Google. It is based on a centralized parameter server that hosts all the parameters required for learning. Each phone downloads the parameters, uses its local data to improve the model, composes a small update message, and sends it back to the parameter server. The server aggregates updates from several phones and updates the model parameters centrally. The interesting part of federated learning is that the data stays local while the learning is global. Each phone updates the model using its own local data and performs its computation locally, decoupling ML from storing data in the cloud. This is being used in Gboard (Google Keyboard) on Android.
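The download, local update, and aggregate cycle described above can be sketched with a tiny linear model. This is a minimal illustration in the style of federated averaging, not Google's production system: the model (y = w*x + b), the two simulated devices, the noise-free data, and the learning rate are all assumptions chosen to keep the example short.

```python
def local_update(w, b, data, lr=0.1, epochs=5):
    """Run a few epochs of gradient descent for y = w*x + b on one device's data."""
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def federated_round(w, b, devices):
    """One round: every device trains locally; the server averages the
    returned parameters, weighted by each device's number of examples."""
    total = sum(len(d) for d in devices)
    updates = [(local_update(w, b, d), len(d)) for d in devices]
    new_w = sum(uw * n for (uw, _), n in updates) / total
    new_b = sum(ub * n for (_, ub), n in updates) / total
    return new_w, new_b


# Each (simulated) device holds its own slice of data generated by y = 3x + 1.
device_a = [(x, 3 * x + 1) for x in (0.0, 0.25, 0.5, 0.75)]
device_b = [(x, 3 * x + 1) for x in (1.0, 1.25, 1.5, 1.75, 2.0)]

w, b = 0.0, 0.0  # parameters held by the server
for _ in range(300):
    w, b = federated_round(w, b, [device_a, device_b])
print(round(w, 3), round(b, 3))
```

Note that `federated_round` only ever sees the parameters returned by `local_update`, never the raw `(x, y)` pairs, which mirrors the point that the data stays local while the learning is global.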
Training a single model with federated learning as described above does have challenges, such as high communication overhead, stragglers, and fault tolerance, as well as the statistical challenges of fitting a model to the data. These are being addressed by recent work from CMU on federated kernelized multi-task learning, which addresses the statistical challenges by using multiple models and updating them simultaneously, while the systems challenges are addressed by MOCHA, a distributed optimization method.