NER is a sub-task of information extraction that seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

NER is one of the features of TextSpace. To see how it compares on the intent classification feature, read this post.

We have created our own small test data set with 11 examples taken from Google’s Taskmaster 2 data set, which was just released in February 2020. …

This time, we will examine what **homoscedastic, heteroscedastic, epistemic, and aleatoric uncertainties** actually tell you. In my opinion, this is an up-and-coming research field in Bayesian deep learning, and it has been greatly shaped by Yarin Gal’s contributions. Most illustrations here are taken from his publications. But also see one of the field’s latest contributions, where we propose a new, reliable and simple method for computing uncertainties.

As background: in Bayesian deep learning, we have probability distributions over weights. Since most of the time we assume these probability distributions are Gaussian, we have a mean *μ* and a variance *σ²*. …
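To make the idea of a distribution over a weight concrete, here is a minimal sketch in Python. It assumes a single weight with a Gaussian distribution and draws samples as *μ* + *σ*·*ε* with *ε* drawn from a standard normal. The function name `sample_weight` and the values of `mu` and `sigma2` are hypothetical and chosen purely for illustration.

```python
import math
import random


def sample_weight(mu, sigma2, rng):
    """Draw one weight from a Gaussian with mean mu and variance sigma2."""
    eps = rng.gauss(0.0, 1.0)            # standard normal noise
    return mu + math.sqrt(sigma2) * eps  # shift and scale the noise


rng = random.Random(0)   # fixed seed so the sketch is reproducible
mu, sigma2 = 0.5, 0.04   # hypothetical mean and variance of one weight

samples = [sample_weight(mu, sigma2, rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(f"empirical mean: {mean:.3f}, empirical variance: {var:.3f}")
```

With enough samples, the empirical mean and variance should land close to the chosen *μ* and *σ²*.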

So far, we have elaborated how *Bayes by Backprop* works on a simple feedforward neural network. In this post, I will explain how you can apply exactly this framework to any convolutional neural network (CNN) architecture you like.

You might have seen Gal & Ghahramani’s (2015) publication of a Bayesian CNN, but that’s an entirely different approach and, in my opinion, not comparable with Bayes by Backprop. Personally, I would not even speak of a Bayesian CNN with their method, but we won’t cover the differences in detail here. Feel free to read Shridhar et al. …

Some of you might have heard the term *Occam’s razor*, sometimes spelled *Ockham’s razor*, together with Bayesian methods. It is indeed a great concept which is very useful for many applications. This post is just another argument why Bayesian methods are so widely applicable and must be applied. In a fairly short post, I will explain in intuitive language how Bayesian methods naturally embody *Occam’s razor*, and relate it directly to deep learning. Much of my post is taken from David MacKay’s PhD thesis (1992). I highly recommend giving it a read.

Imagine you work in a supervised learning setting, let’s say CIFAR-100, and you are quite new to computer vision, but already know that a convolutional neural network (CNN) is the way to go. But you do not really know how large your CNN needs to be to achieve the desired validation accuracy. In fact, even experienced deep learning researchers and practitioners still need to follow some rules of thumb or trial and error to find a CNN architecture which performs well enough when they work with a new data set. …

By now, all of you have probably followed deep learning research for quite a while. In 1998, LeCun et al. proposed the very simple MNIST data set of handwritten digits and showed with their LeNet-5 that we can achieve a high validation accuracy on it. The data sets subsequently proposed became more complex (e.g., ImageNet or Atari games), hence the models performing well on them became more sophisticated, i.e. complex, as well. Simultaneously, the tasks these models can perform also became more complex, as shown by, e.g., Goodfellow et al.’s GANs (2014) or Kingma & Welling’s VAEs (2014). One of my personal highlights is Eslami et al.’s Neural scene representation and rendering (2018), which clearly shows that neural networks can by now perform fairly complex tasks. …

In my last post, we explored what possibilities we have to infer the intractable posterior probability distribution. It was rather generic, because we spoke of probabilistic models — which might be neural networks, but not necessarily.

How Bayesian inference is applied to neural networks is covered in this post.

**Here, we reflect on Bayesian inference in deep learning, i.e. Bayes by Backprop.**

In essence, interpreting neural networks from a Bayesian perspective means introducing uncertainty into the neural network. This uncertainty is not necessarily placed on the model parameters, i.e. the weights, as is done in *Bayes by Backprop*. It can also be placed on the number of hidden layers and hidden units, or on the activation function. These factors of potential uncertainty may be summarised as *structure parameters*, and learning their posterior probability distribution may be construed as *structure learning*. But there are also many more potentially uncertain factors like the number of epochs, weight initialisations, loss function, batch size, etc. …

In this post, we’ll elaborate on the two major schools of thought on how we can approximate an intractable function in probabilistic models. My previous post about inference in deep neural networks elaborated one particular method, but there are many more which are widely used in other forms of probabilistic models, e.g. probabilistic graphical models (PGMs), but not yet in neural networks. David MacKay’s book is the main source for this post; it’s excellent, so please consider reading the chapters you’re interested in. …

Having understood all the fundamentals, we can now proceed and apply them to deep learning. If you cannot follow my steps here, please go back to my previous posts, or comment on this post and I’ll answer you. We’ll have some examples, questions and neat graphics on the way, so it won’t be too dry.

We will not explain what deep learning or neural networks are, but if you don’t feel you have a solid understanding, read Shridhar’s post as an introduction or attend Andrew Ng’s Coursera course for more details.

The very basis of probabilistic deep learning is understanding a neural network as a conditional model *p* that is parameterised by the parameters or weights *θ* of the network and outputs *y* when some input *x* is given. …
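As a rough illustration of a network viewed as a conditional model, the sketch below computes a distribution over outputs *y* given an input *x* and weights *θ*. It assumes the simplest possible case, a single linear layer followed by a softmax. The function name `conditional_model` and all numbers are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_model(x, theta):
    """Return p(y | x, theta) for one linear layer followed by softmax."""
    weights, biases = theta
    logits = [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


# Hypothetical parameters theta: three output classes, two input features.
theta = ([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])
probs = conditional_model([2.0, 0.0], theta)

print(probs)  # one probability per possible output y
```

Changing *θ* changes the distribution the network assigns to *y* for the same *x*, which is the sense in which the model is parameterised by its weights.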

The intersection of probabilistic graphical models (PGMs) and deep learning is a very hot research topic in machine learning at the moment. I collected different sources for this post, but Daphne Koller’s Coursera course is an outstanding one. Everything you need as background is given in my first two posts (first, second). Please revisit them if you don’t understand a point, or just comment at the bottom of this page and I’ll answer.

A probabilistic graphical model is a representation of the conditional independence relationships between its nodes.

Nodes are random variables. When we shade them, it means we have observed them and hence have data for them. If nodes are blank, they are unknown and we call them latent or hidden variables. …
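A small sketch may help here. It assumes a hypothetical two-node model in which a latent (blank) binary node `z` points to an observed (shaded) binary node `x`, so that the joint distribution factorises as p(z)·p(x | z). All probability values are made up for illustration.

```python
# Hypothetical two-node graphical model: latent z -> observed x, both binary.
p_z = {0: 0.7, 1: 0.3}               # prior over the latent node z
p_x_given_z = {                      # conditional table p(x | z)
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.2, 1: 0.8},
}


def joint(z, x):
    """Joint probability p(z, x) = p(z) * p(x | z), following the graph."""
    return p_z[z] * p_x_given_z[z][x]


def posterior_z(x_observed):
    """Posterior p(z | x) over the latent node once x has been observed."""
    unnormalised = {z: joint(z, x_observed) for z in p_z}
    evidence = sum(unnormalised.values())  # p(x), summing out the latent z
    return {z: v / evidence for z, v in unnormalised.items()}


post = posterior_z(1)  # we observe x = 1 and ask what z is likely to be
print(post)
```

Observing the shaded node changes our belief about the blank one: before seeing `x`, `z = 1` has probability 0.3, and after seeing `x = 1` it becomes the more likely value.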

This is the second part of our fundamentals. Keep in mind: if you do not already know some of these definitions, write them down on flashcards and learn their definitions and formulas as if they were vocabulary.

This time, we reuse the *context for our examples* that we have already seen in Fundamentals 1 for some definitions. We simply look at the history, say starting from 1945, of the daytime temperature of a typically warm spring day, say May 23rd, in Copenhagen.

**Sampling:** Randomly choosing a value such that the probability of picking any particular value is given by a probability distribution. *Example:* We randomly pick a temperature from all the temperatures ever recorded in Copenhagen on May 23rd. This could be 18 °C. Note that the chance of picking a temperature that has occurred more often than others is higher. …
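The sampling example can be sketched in a few lines of Python. The list `recorded` below is a hypothetical, made-up stand-in for the real temperature history, and the function names are chosen for illustration only.

```python
import random
from collections import Counter

# Hypothetical temperatures (in °C) recorded on May 23rd; made-up values.
recorded = [14, 16, 16, 18, 18, 18, 19, 21]


def empirical_distribution(values):
    """Probability of each value, given by how often it was recorded."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}


def sample_temperature(values, rng):
    """Pick one recorded temperature uniformly at random from the history."""
    return rng.choice(values)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
dist = empirical_distribution(recorded)
draws = [sample_temperature(recorded, rng) for _ in range(10_000)]
freq = Counter(draws)

print(dist)                 # probability of each temperature
print(freq.most_common(3))  # the most frequently sampled temperatures
```

Because 18 °C appears three times in the hypothetical history and 14 °C only once, 18 °C is drawn roughly three times as often.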
