The Future of Artificial Intelligence: 4 Trends from O’Reilly’s AI Conference

Paul Filitchkin
Building Creative Market
Oct 22, 2019

I had the opportunity to attend O’Reilly’s AI Conference in San Jose in September 2019. It was a well-organized conference that focused on high-level advances in the industry with just the right amount of technical detail to keep a data scientist like myself happy. Here are the top AI trends that came up at the conference.

Image Credit: O’Reilly Conferences

1. Trusting AI in the Real World

A key set of topics at the conference focused on trusting AI models in the wild. Many tools and techniques were discussed around interpretability, fairness, and maintaining accuracy in production (i.e. handling model drift).

Most of today’s cutting-edge models still operate as black boxes and, on their own, cannot explain why a loan application was denied or why a product was recommended to a user. Top AI companies have built their own mechanisms to interpret results, but there is an emerging trend of modified model architectures as well as stand-alone services that help the rest of us understand what’s going on under the hood.

“I’m sorry, Dave. I’m afraid I can’t tell you why I did that”

One keynote presenter talked about Microsoft’s open-source package, InterpretML, which uses an Explainable Boosting Machine architecture to help interpret results while maintaining state-of-the-art accuracy¹. A keynote presentation from IBM showed off their cloud-based tool, Watson OpenScale, which offers a set of dashboards that provide visibility into model decisions. The speaker presented a case study in which a bank was able to explain why a certain loan application was denied by identifying the input features with the highest impact on the decision. The bank was then able to tell the applicant that, in this case, saving more money would result in an approval in the future².
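To make the interpretability idea concrete, here is a minimal sketch of fitting an Explainable Boosting Machine with the open-source interpret package. The loan-application data, file name, and column names below are hypothetical placeholders, not anything shown at the conference.

```python
# Minimal sketch: a glass-box Explainable Boosting Machine on tabular data.
# The dataset, file name, and column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

df = pd.read_csv("loan_applications.csv")             # hypothetical loan data
X, y = df.drop(columns=["approved"]), df["approved"]   # binary approval label
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ebm = ExplainableBoostingClassifier()  # additive model built from boosted per-feature terms
ebm.fit(X_train, y_train)

show(ebm.explain_global())                        # which features drive decisions overall
show(ebm.explain_local(X_test[:5], y_test[:5]))   # why individual applications were approved or denied
```

The local explanations are what would let a bank point to the specific features behind a denied application, much like the Watson OpenScale case study above.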

In the expo hall, a rising startup, Fiddler AI, showed off their interpretability tech and how it can be applied to many existing machine learning stacks. Fiddler is able to add a layer of explainability without requiring any special changes to a model’s architecture. Their product works with a wide range of frameworks and uses training data together with unlabeled (production) data to answer questions such as why the model did (or did not) take an action and where the model succeeded or failed³.

In addition to interpretability, fairness continues to be an important but difficult problem to tackle. The issue stems from historical data that encodes biases from past decisions and behaviors, which the model then inherently learns. Addressing it requires both technologists and decision-makers to first identify the bias and then take corrective steps to ensure fair decisions are made. For a comprehensive deep-dive on fairness in AI, see the Data Show episode “Why it’s hard to design fair machine learning models”⁴.

Watson OpenScale came up again on this topic. The platform is able to identify bias in production systems and correct for it using an additional layer of machine learning, a technique they dub “de-biasing”⁵. Unlike many open-source packages, which rely on randomly changing predictions, their machine learning model learns whether any inputs linked to bias (e.g. gender, age, zip code) cause the model to act unfairly. With this knowledge, the system changes the offending input feature to the majority value and uses the new prediction. A practical example: if a certain zip code frequently causes loan applications to be denied (regardless of other features), the system will substitute the majority zip code to get a new prediction. This ensures that important features, such as credit score, are not ignored while still maintaining fairness.
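That description boils down to a counterfactual test. Below is a toy sketch of that idea only; it is not IBM’s implementation, and the model object, column name, and majority value are assumptions for illustration.

```python
# Toy sketch of majority-value substitution for de-biasing -- not IBM's actual
# implementation. `model` is any fitted classifier with predict(); the
# protected column and majority value are illustrative assumptions.
import pandas as pd

def debiased_predict(model, row, protected_col, majority_value):
    """Predict on the row as-is, then with the protected feature swapped to the
    majority value; if the decision flips, the protected feature was driving it,
    so return the substituted prediction instead."""
    original = model.predict(row)[0]
    substituted = row.copy()
    substituted[protected_col] = majority_value
    counterfactual = model.predict(substituted)[0]
    return counterfactual if counterfactual != original else original

# e.g. debiased_predict(loan_model, application_row, "zip_code", "94103")
```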

In virtually any real-world application, shifting trends will expose your model to data that differs from its training examples, and over time your model’s accuracy may degrade. This phenomenon, commonly referred to as model drift, was another big topic at the conference. Naturally, the following question arises: without hand-labelling new data and re-evaluating the model on a fixed schedule, how can you ensure your model is performing well?

In several domains, such as fraud detection and user recommendations, the system will naturally produce labelled data. Certain transactions will be reported as fraudulent by users, and likewise certain recommendations will be clicked on or dismissed. In these cases it’s possible to continually monitor accuracy (albeit with some delay). It’s also a good idea to set up real-time monitoring around key business metrics that your model directly impacts, as a proxy for model accuracy. Many monitoring platforms have easy-to-use anomaly detection that can alert engineers if click-through rates, for example, suddenly dip.
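As a concrete illustration of that last point, here is a minimal proxy-metric monitor, assuming a hypothetical hourly click-through-rate series; the window size and threshold are illustrative, and most monitoring platforms give you this out of the box.

```python
# Minimal sketch of proxy monitoring: flag an abnormal dip in a model-driven
# business metric (here, a hypothetical hourly click-through rate).
import numpy as np

def ctr_dip_alert(ctr_history, window=24, z_threshold=3.0):
    """Return True if the latest click-through rate is an anomalous dip
    relative to the trailing window (a simple z-score heuristic)."""
    if len(ctr_history) <= window:
        return False  # not enough history to establish a baseline
    baseline = np.asarray(ctr_history[-window - 1:-1])
    z = (ctr_history[-1] - baseline.mean()) / (baseline.std() + 1e-9)
    return z < -z_threshold  # only dips trigger an alert
```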

In applications such as image detection and sentiment analysis, the input features, internal model weights, and output predictions are the only direct indicators of model performance. One set of techniques, which works particularly well for structured data, compares the statistical properties of real-world inputs against those of the training data. A measure of statistical divergence (such as KL divergence) is then computed to represent the level of surprise the model encounters when processing new data. Decision-makers can monitor this value in near real time and take corrective action if it becomes too high. Depending on the perceived impact to the business, the model may need to be taken offline or simply undergo a new round of labelling and re-training.
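A minimal version of that divergence check for a single structured feature might look like the sketch below; the bin count and alert threshold are illustrative assumptions.

```python
# Minimal sketch of input-drift detection for one feature: compare histograms
# of training vs. production values with KL divergence.
import numpy as np
from scipy.stats import entropy

def feature_kl_divergence(train_values, prod_values, bins=20):
    """KL(production || training) over a shared histogram binning."""
    edges = np.histogram_bin_edges(np.concatenate([train_values, prod_values]), bins=bins)
    counts_train, _ = np.histogram(train_values, bins=edges)
    counts_prod, _ = np.histogram(prod_values, bins=edges)
    eps = 1e-9  # avoid empty-bin divisions; entropy() normalizes the counts
    return entropy(counts_prod + eps, counts_train + eps)

# e.g. alert when the divergence exceeds a threshold tuned on historical data:
# if feature_kl_divergence(train_ages, prod_ages) > 0.5: notify_on_call()
```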

Overall, it was very refreshing to hear about topics which go beyond model accuracy on benchmark datasets. For practitioners like myself, it’s very important to understand all of the real-world implications of running models in the wild and how to ensure they provide business value while adhering to high ethical standards.

2. AI Compute, Generalization, and Sustainability

No academic or industry AI conference would be complete without a discussion of computational power. As OpenAI’s recent study showed, “the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.5-month doubling time”⁶. This is much faster than Moore’s law at its peak, and the trend does not seem to be waning.

For some considerable shock and awe, Cerebras unveiled the largest physical chip ever made during their keynote⁸. It contains 1.2 trillion transistors, compared to the current largest GPU, which has a measly 21.1 billion. The chip was designed specifically for deep learning workloads, which run highly parallel computations. Currently, most large deep learning models are trained on multiple GPUs, but off-chip communication is tens of thousands of times slower than on-chip communication. Cerebras’s design addresses this problem and effectively uses an entire wafer to achieve “cluster-scale performance on a single chip”⁷.

At one of the seminars, the presenter talked about how XLNet, a state-of-the-art language model from CMU and Google Brain, was trained at a cost of $245k⁹. Not only is this a prohibitive cost for most organizations, but it also raises the important question of sustainable AI (see this MIT Tech Review article¹⁰ on the considerable carbon footprint of model training). In other words: how do we ensure that training is done using sustainable power sources, and how do we avoid repeating computation for each new application?

One solution is to start where researchers left off. XLNet, for example, demonstrates some remarkable generalization properties (try an interactive online demo to appreciate its predictive power). In the computer vision realm, single-shot and R-CNN-based architectures also continue to improve their generalization abilities, and there is no shortage of freely available pre-trained models. For deep learning models like these, transfer learning (i.e. fine-tuning an existing model for a specific application) continues to be a good starting point for new applications. In addition, more frameworks now provide tutorials and example hyperparameters to help developers use transfer learning effectively. The technique is both easy on the wallet and helps eliminate computational waste.
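As a minimal sketch of what transfer learning looks like in practice (assuming PyTorch/torchvision and a hypothetical five-class image task; the hyperparameters are illustrative, not recommendations):

```python
# Minimal transfer-learning sketch: reuse ImageNet-pretrained ResNet-18 weights
# and train only a new classification head for a hypothetical 5-class task.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(pretrained=True)   # start from pretrained weights
for param in model.parameters():
    param.requires_grad = False            # freeze the backbone

num_classes = 5                            # hypothetical target task
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head

optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# ...train only the head on the task-specific images, then optionally unfreeze
# the backbone and fine-tune everything at a lower learning rate.
```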

3. AI Algorithms for Tomorrow

Natural language applications are dominating much of AI. Image credit: https://creativemarket.com/Vige/3571909-Natural-language-processing-concept

As consumers demand more intelligence from search engines and voice assistants, natural language processing (NLP) continues to be a rapidly growing area of research. In fact, many have argued that “NLP is having its ImageNet moment,” referring to the large number of language models coming out in rapid succession: ELMo, BERT, potentially other Sesame Street characters I’m not aware of, XLNet, and many others. These language models use self-supervised learning to predict words within a context, building nuanced representations of text along the way. Their ultimate goal is to help us interpret human communication.
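To see the word-prediction objective in action, the fill-mask pipeline from the Hugging Face transformers library (my choice of tooling here, not something covered at the conference) lets a pretrained BERT guess a masked word:

```python
# Quick demo of masked-word prediction with a pretrained BERT model.
# Downloads the bert-base-uncased weights on first run.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for guess in unmasker("The AI conference was held in San [MASK] this year."):
    print(guess["token_str"], round(guess["score"], 3))
```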

A more forward-looking perspective on applied AI was given by UC Berkeley professor Michael Jordan during his keynote¹¹. His big bet is that we will need better algorithms to address complex multi-agent systems, specifically in matching markets. This is a combination of the multi-armed bandit problem (where an agent must decide whether to explore or exploit one of several choices) and a two-sided marketplace with finite resources (where many agents compete for something in fixed supply). Jordan argues that today’s AI algorithms do not consider how their predictions will impact such complex ecosystems. As a concrete example, imagine a food recommendation app that sends nearly every user to the same restaurant. This overwhelms the restaurant and strains its resources, which ultimately makes customers unhappy. Jordan’s lab has produced some early work that attempts to maximize agent happiness in these finite-resource scenarios, and he has hinted that there is much more to come.
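For readers unfamiliar with the multi-armed bandit piece of that picture, here is a toy epsilon-greedy bandit; it deliberately ignores the finite-capacity, market side of the problem that Jordan argues is missing from today’s algorithms, and the parameters are illustrative.

```python
# Toy epsilon-greedy bandit illustrating the explore/exploit trade-off only;
# it does not model the finite-capacity marketplace Jordan describes.
import random

class EpsilonGreedyBandit:
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms      # running mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))                          # explore a random arm
        return max(range(len(self.counts)), key=lambda a: self.values[a])      # exploit the best arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# e.g. a restaurant recommender: each arm is a restaurant, reward is whether the
# user was happy -- without capacity limits, it will happily overload one place.
```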

Kenneth Stanley talking about open-endedness in AI. Image Credit

Kenneth Stanley, from Uber AI Labs and University of Central Florida, presented on another fascinating research topic: open-endedness in AI. He argues that the evolution of life is the only known process that has produced human-level intelligence, and that the process did not result from trying to solve one narrow task or even optimizing for the outcome itself. Moreover, the ingenuity of the human species has come from many serendipitous moments of learning¹².

He went on to show some of his work on open-ended learning, applied to bipedal agents in a simulated environment. The agents learned policies through open-ended objectives and then successfully completed specific tasks such as jumping over pits and ducking under obstacles. He compared these to agents that failed to accomplish the same tasks when trained directly with reinforcement learning.

Open-ended learning, which encompasses Stanley’s work on neuroevolution (genetic algorithms for training neural networks), offers a starkly different approach from today’s popular deep learning techniques. These non-traditional methods have already shown a surprising number of successes, as outlined in this Nature article¹³. The work raises the question of whether the AI community will one day favor these methods over gradient-descent-based optimization of narrow objectives.
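As a rough illustration of how different neuroevolution is from gradient descent, the toy sketch below evolves the weights of a tiny network with mutation and selection alone; the XOR objective and all parameters are placeholders, and real systems (e.g. NEAT) also evolve the network topology.

```python
# Toy neuroevolution sketch: evolve tiny-network weights by mutation and
# selection instead of backpropagation. Objective and parameters are placeholders.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)        # XOR as a stand-in task

def forward(weights, x):
    w1, w2 = weights
    return np.tanh(np.tanh(x @ w1) @ w2)               # two-layer network

def fitness(weights):
    return -np.mean((forward(weights, X) - y) ** 2)     # negative mean squared error

population = [(rng.normal(size=(2, 8)), rng.normal(size=(8, 1))) for _ in range(50)]
for generation in range(200):
    parents = sorted(population, key=fitness, reverse=True)[:10]   # keep the fittest
    population = [(w1 + 0.1 * rng.normal(size=w1.shape),           # mutate offspring
                   w2 + 0.1 * rng.normal(size=w2.shape))
                  for w1, w2 in parents for _ in range(5)]

print("best fitness:", fitness(max(population, key=fitness)))
```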

4. More Mileage out of your Data

Over the last five years the AI community has been flooded with new frameworks and pre-trained models, and computational power has increased by leaps and bounds. However, acquiring domain-specific data is just as hard as it was five years ago. Sure, there are more crowdsourcing platforms, but they come at a cost and in many cases require additional effort to maintain consistent quality. Crowdsourcing platforms also do not directly help with specialized data labelling (e.g. medical diagnosis). There are also many privacy concerns to navigate in order to maintain various levels of compliance (e.g. GDPR). Even companies like Google are interested in more automated data labelling schemes because of how quickly their data changes.

Snorkel is a framework, originating from Stanford, that is aimed at just that. It uses several techniques to give new life to unlabeled data by making use of weak supervision¹⁴. One technique allows labelers with a limited degree of programming knowledge to write labelling functions: simple Python methods that assign labels based on available meta attributes¹⁵. For example, when classifying radiology images for signs of a disease, a labelling function could simply assign the “no disease” class when the doctor’s notes are relatively short. Of course, this will not always be correct, but as long as the labelling function is right most of the time (theoretical bounds are available in the authors’ paper¹⁶) the data can be successfully used to improve model accuracy. Specifically, the Snorkel pipeline combines many high- and low-quality labelling sources in an intelligent way to help train remarkably accurate models. This is an incredible advance in a world with such a shortage of high-quality labelled data.
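Here is a minimal sketch of that labelling-function pattern, following the style of the Snorkel tutorials; the radiology data, column names, and heuristics are hypothetical, and import paths may differ slightly between Snorkel versions.

```python
# Minimal weak-supervision sketch in the style of the Snorkel tutorials.
# The data file, column names, and heuristics are hypothetical.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NO_DISEASE, DISEASE = -1, 0, 1

@labeling_function()
def lf_short_note(x):
    # heuristic from the article: a short doctor's note usually means no findings
    return NO_DISEASE if len(x.doctor_note) < 100 else ABSTAIN

@labeling_function()
def lf_mentions_lesion(x):
    return DISEASE if "lesion" in x.doctor_note.lower() else ABSTAIN

df_unlabeled = pd.read_csv("radiology_reports.csv")           # hypothetical data
applier = PandasLFApplier(lfs=[lf_short_note, lf_mentions_lesion])
L_train = applier.apply(df_unlabeled)                          # matrix of noisy votes

label_model = LabelModel(cardinality=2)      # learns how much to trust each noisy source
label_model.fit(L_train)
probabilistic_labels = label_model.predict_proba(L_train)      # training labels for a downstream model
```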

There were many other great topics from the conference that I did not get a chance to discuss, and I encourage readers to view select video highlights here. Overall, I had a great time at the conference and I look forward to attending more events like this in the future.

Sources

[1] “Interpret ML”, https://github.com/microsoft/interpret

[2] “Unlocking the value of your data”, https://www.oreilly.com/radar/unlocking-the-value-of-your-data/

[3] https://blog.fiddler.ai/2019/09/series-a-why-now-and-whats-next-for-fiddler/

[4] “Why it’s hard to design fair machine learning models”, https://www.oreilly.com/radar/why-its-hard-to-design-fair-machine-learning-models/

[5] “De-Biasing in IBM Watson OpenScale”, https://medium.com/trusted-ai/de-biasing-in-ibm-watson-openscale-95ca89fa2072

[6] “AI and Compute”, https://openai.com/blog/ai-and-compute/

[7] “Cerebras Wafer Scale Engine”, https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/

[8] Andrew Feldman’s (Cerebras) Keynote: “Enabling AI’s potential through wafer-scale integration”, https://www.oreilly.com/radar/enabling-ais-potential-through-wafer-scale-integration/

[9] “CMU and Google researchers present XLNet: a new pre-training method for language modeling that outperforms BERT on 20 tasks” https://hub.packtpub.com/google-researchers-present-xlnet-a-new-pre-training-method-that-outperforms-bert-on-20-tasks/

[10] “Training a single AI model can emit as much carbon as five cars in their lifetimes”, https://www.technologyreview.com/s/613630/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/

[11] Michael Jordan’s Keynote, “On gradient-based methods for finding game-theoretic equilibria” https://www.oreilly.com/radar/on-gradient-based-methods-for-finding-game-theoretic-equilibria/

[12] Kenneth Stanley’s Keynote, “Open-endedness: A new grand challenge for AI” https://www.oreilly.com/radar/open-endedness-a-new-grand-challenge-for-ai/

[13] “Designing neural networks through neuroevolution”, https://www.nature.com/articles/s42256-018-0006-z

[14] “Snorkel: Programmatically Building and Managing Training Data”, https://www.snorkel.org/

[15] “Snorkel Intro Tutorial: Data Labeling”, https://www.snorkel.org/use-cases/01-spam-tutorial

[16] “Data programming: Creating large training sets, quickly”, https://papers.nips.cc/paper/6523-data-programming-creating-large-training-sets-quickly.pdf
