Universal Sentence Embeddings; PyMC; Judea Pearl on AI

Weekly Reading List #5

Issue #5: 2018/05/14 to 2018/05/20

This is an experimental series in which I briefly introduce the interesting data science stuff I read, watched, or listened to during the week. Please give this post some claps if you’d like this series to continue.

State-of-the-Art Word and Sentence Embeddings

An excellent overview of state-of-the-art word and sentence embeddings by Thomas Wolf @ Huggingface:

I found Mr. Wolf’s post through a Google search after reading the technical notes of Talk to Books. Talk to Books is an experimental AI that lets you use natural-language queries to find relevant quotes from books. It’s far from perfect, but it certainly provides a quite different search experience:

Semantic Similarity Using the Universal Sentence Encoder

So I downloaded the Universal Sentence Encoder from TensorFlow Hub and played with it a bit. The paper introduces two types of encoders: a Transformer and a Deep Averaging Network (DAN). On TensorFlow Hub, the large version of the encoder uses the Transformer, and the regular one uses the DAN. (They only added that specification to the description very recently; it was really confusing before that.)
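A minimal sketch of how this experiment might look. The TF Hub module URL and the TF1-style session code reflect the API available at the time of writing; the cosine-similarity helper below is plain NumPy and is what actually scores the embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Obtaining the embeddings (requires tensorflow and tensorflow_hub;
# drop "-large" from the URL to get the DAN version instead of the
# Transformer version):
#
#   import tensorflow as tf
#   import tensorflow_hub as hub
#
#   embed = hub.Module(
#       "https://tfhub.dev/google/universal-sentence-encoder-large/3")
#   sentences = ["How do I find quotes from books?",
#                "Search passages in literature by meaning."]
#   with tf.Session() as sess:
#       sess.run([tf.global_variables_initializer(),
#                 tf.tables_initializer()])
#       vectors = sess.run(embed(sentences))  # shape: (2, 512)
#
# Then: cosine_similarity(vectors[0], vectors[1]) gives the
# semantic similarity of the two sentences.
```

Since the encoder outputs approximately unit-length vectors, a plain dot product (e.g. `np.inner(vectors, vectors)`) gives nearly the same pairwise scores.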

It seems that passing trainable=True when loading the encoder does not work, so fine-tuning is not officially supported. Someone provided a workaround that involves replicating the entire graph. However, I’d imagine there will be some complications when tuning the encoder; techniques like differential learning rates might be needed.

Big News on PyMC4 and PyMC3

We’ve seen PyMC3 previously in a post about March Madness prediction, and mentioned a potential problem: its back end, Theano, has ceased development and maintenance. Recently the PyMC team announced that they will take over Theano maintenance in order to continue developing PyMC3. In the meantime, PyMC4 will be built on TensorFlow Probability.

Judea Pearl on AI

Turing Award winner Judea Pearl, whose specialty is probabilistic and causal reasoning, points out that the recent success in AI development has serious limitations:

“All the impressive achievements of deep learning amount to just curve fitting.”

On how to create machines that share our intuition about cause and effect:

We have to equip machines with a model of the environment. If a machine does not have a model of reality, you cannot expect the machine to behave intelligently in that reality.

Clickbait Detector(s)

An interesting project with a Chrome extension to detect clickbait headlines:

The training dataset (12,000 headlines) came from several news outlets. The headlines are labeled based on their source (I’m not sure whether there was additional filtering within each source).
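The source-based labeling scheme can be sketched as below. This is purely illustrative: the outlet names and the `label_headline` helper are my assumptions, not the project’s actual source list or code.

```python
# Hypothetical source-based weak labeling: every headline simply
# inherits the label assigned to its outlet.
CLICKBAIT_SOURCES = {"buzzfeed"}   # assumed clickbait-heavy outlets
NEWS_SOURCES = {"reuters"}         # assumed conventional outlets

def label_headline(headline, source):
    """Return (headline, label) where 1 = clickbait, 0 = not."""
    if source in CLICKBAIT_SOURCES:
        return headline, 1
    if source in NEWS_SOURCES:
        return headline, 0
    raise ValueError("unknown source: %s" % source)
```

The obvious weakness of this scheme is label noise: serious outlets publish the occasional clickbait headline and vice versa, which ties directly into the generalization question below.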

It’ll be interesting to evaluate whether the model generalizes well outside the known sources, e.g. YouTube video titles, and how we can make the training dataset more representative/general.

A Useful SSH Tip

Twitter Fake Follower Detector