Probabilistic Programming Possibilities

Composing distributions with Edward and TensorFlow

Nicholas Teague
From the Diaries of John Henry
25 min read · Jan 28, 2020

--

Took a short break this week to conduct a little research on probabilistic programming, although I’m not sure if research is the right word since I’m not like running experiments or anything — does it count as research when you just read a bunch of papers? Hmmm. Well given how hard the papers were I’m just going to go ahead and call it research. Executive decision.

Basically I was trying to get caught up on some concepts that I had kind of gotten introduced to at NeurIPS in conversation but didn’t really get to sit in on any talks (it’s hard to see everything). Specifically, I had seen on the agenda a meetup for “probabilistic programming”, which tbh I kind of went into without doing a whole lot of diligence in advance, so yeah I was basically like “I’ve studied probability, I’ve studied neural networks, how hard could the intersection be?” Well it turns out kind of hard.

Although this blog has addressed some of the potential applications of probabilistic programming for neural networks in prior essays, like for the various types of generative models, I had only really explored the algorithmic considerations, which without actually setting formulas to code can be kind of hard to conceptualize beyond the abstract. So yeah turns out this was a useful exercise because it kind of helped connect a few dots between theory and practice, which hey just because I don’t have an immediate application in mind for Automunge or something doesn’t mean one might not present itself down the road. Anyhoo without further ado.

Oh and since this essay might get a little dry at times here is a soundtrack to lighten the mood.

Beethoven’s Sonata Op 79 — Nicholas Teague

Part 1: Composing with Edward

Based on review of “Edward: A library for probabilistic modeling, inference, and criticism” by Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei (2017) arXiv:1610.09787

I’ll go ahead and save the best part for first. The Edward library will be the primary focus of this essay, as it serves as a platform for probabilistic programming. Built on top of TensorFlow’s Distributions library for modeling distributions, it also allows the use of TensorFlow for incorporation of neural network elements. The library is at a similar layer of abstraction as say Keras, but the paper notes the applications of these two are kind of orthogonal, such that they are intended for different tasks — in fact Keras can also serve as an input for the neural network elements of an Edward probabilistic model, to give you an idea. Of course probabilistic programming isn’t new, but what is new with the Edward platform are a few pieces: first is the enabling of free GPU/TPU acceleration of operations or even distributed training by being built on top of TensorFlow, second is the composability of not just the probabilistic models but also the ability to compose inference elements in a comparable fashion (more on that to follow). Anyhoo let’s jump into it. The probabilistic programming loop follows a simple convention, in fact originating from the same George Edward Pelham Box after whom the library was named.

(spared no expense on this essay ;)

As an introduction, let’s try to discuss this loop as an analogue to a vanilla supervised machine learning workflow.

  • In its purest form the ‘data’ segment will actually be pretty much comparable in composition to what would be expected for a supervised learning problem. We’ll have a training set, let’s call it X, and some corresponding label set y, where the goal will be to model the relationship between X and y, which since this is a probabilistic framing we can think of as modeling p(y|X).
  • One way to think about the composing of a probabilistic model is kind of like creating and initializing the layers of a neural network. That’s a loose analogy though: where in a neural network we’re creating an architecture of neurons, in a probabilistic model composition we’re instead creating a model of one or more distributions, whose parameters we will then try to derive. Am going to go into a little more detail here shortly, just trying to lay the groundwork.
  • Composing the inference is kind of, in a way, like composing the hyperparameters of a neural network — like for instance selecting a loss function or an optimization algorithm — such that applying the iterations of inference are then kind of like iterating through a training operation. There are a few options for inference methods, we’ll get into that.
  • The criticism segment you can think of as like the validation stage of a supervised learning operation. Here we’re evaluating how well our model performs, such as to identify whether the model is sufficient or further experimentation in distribution composition may be appropriate — which, outside of AutoML frameworks, is generally a manual analysis even in neural networks, so any update based on criticism is generally going to include a human in the loop I believe.

Great you are now officially in the loop. Let’s dive a little deeper.

Model Composition

Am kind of wondering to what extent this may be intuitive to the uninitiated, what it means to compose a model of a probabilistic distribution. Let me see if I can talk my way through it, worth a try. So do we all know what a probability distribution is? I mean just in case let’s at least touch on the ground floor. A single variable probability distribution is a shaped curve representing probabilities of drawing a value from a random selection, where the x-axis represents the potential realized value from a random draw, and the y-axis represents the corresponding probability of realizing that x value. Here’s a few simple single variable distributions to illustrate, a key point to keep in mind being that theorists over time have catalogued a whole zoo of distributions, each of whose formulas are governed by their own set of parameters. I mean I’m sure we’ve all heard of a Normal Distribution, if not you might need to open Wikipedia, but let’s just put a couple of simple distributions in mind to support the narrative, one for a continuous distribution and one for a discrete:

Fair warning: this is as good as the handwriting gets

So what’s important to keep in mind is that there’s no need to constrain probability distributions to a single variable, in fact one simple kind of probability model composition could be achieved in Edward by assigning a comparable class of distribution across a set of variables, each with their own “x-axis” and associated parameters (which would make this kind of visualization a little trickier). In the context of Edward we can achieve this by defining say (hypothetically) a Normal distribution in which the parameters are passed as tensors instead of scalars. For example, if we wanted to initialize a two variable Normal distribution, we could initialize a Normal parameter set with vectorized mu=[0,0], sigma=[1,1]; noting that we are using TensorFlow tensors for these vectors, e.g. initializing with tf.zeros(2) and tf.ones(2), and for more variables we could potentially be passing higher order tensors such as a matrix etc. In fact this isn’t even an atypical way to initialize our distribution, since (spoiler alert) the whole point of the inference operation will then be to update the initialized parameters (in statically defined distribution classes) based on properties of the data, so really just initially defining parameters as say 0’s or 1’s is all that’s needed — bonus, this is a little more straightforward than the considerations for initializing weights of a neural network.
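
To make that a little more concrete, here is a minimal sketch of what that kind of initialization might look like (assuming Edward 1.x on top of TensorFlow 1.x; note that depending on the Edward version the parameter names are either mu/sigma or loc/scale, I’ll use the latter here):

    import tensorflow as tf
    from edward.models import Normal

    # a two variable Normal: each of the 2 components gets its own mean and scale,
    # initialized to the arbitrary values of 0 and 1 (inference updates these later)
    z = Normal(loc=tf.zeros(2), scale=tf.ones(2))

    # the same idea extends to higher order tensors as parameters, e.g. a [3, 2]
    # matrix of means and scales for a batch of three two-variable Normals
    z_batch = Normal(loc=tf.zeros([3, 2]), scale=tf.ones([3, 2]))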

This kind of two variable distribution is only one of the ways that we can blend and shape our probabilistic models. The paper notes that Edward is actually a Turing-complete probabilistic programming language, which means it can model any computable probability distribution. If we want to get to some of the more exotic flavors of model composition, consider that we can combine multiple distributions in several fashions. For example, we could say that a master distribution for p(y|X) is the result of algebraic combinations of component distributions. Another source of composition could be to assume that the parameters of a distribution have their own probabilistic distribution, which can easily be modeled in Edward by embedding a distribution as a parameter for another distribution. Another source of composition could be to embed a control flow operation in the distributions, e.g. ‘if’ statements or ‘while’ loops. Oh and certainly another interesting means of composition is to incorporate defined neural networks trained in conjunction with the inference operation, such as for instance neural network derived parameters or other combinations between a neural network and the modeled random variable as a function of the data (I’ll walk through an example demonstration in Part 2 of this essay). All of these different composition methods are demonstrated as visualizations in the paper by “computational graphs” depicting the directions of parameter and data seedings into the various components of distributions (basically a collection of circles and arrows); in the interest of brevity I’ll save an example for Part 2 of this essay.
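
As a toy illustration of that “distribution as a parameter of another distribution” flavor of composition, here is a hedged sketch (same Edward 1.x assumptions as above) of a Beta distribution feeding the theta parameter of a Bernoulli, i.e. a coin whose bias is itself uncertain:

    import tensorflow as tf
    from edward.models import Bernoulli, Beta

    # theta, the coin's bias, is itself a random variable with a Beta(1, 1) prior
    theta = Beta(1.0, 1.0)

    # the Bernoulli's probability parameter is seeded by theta, so a sample of x
    # implicitly draws a theta first and then flips ten coins with that bias
    x = Bernoulli(probs=tf.ones(10) * theta)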

Oh and just a quick clarification. The whole point of these probabilistic models is to facilitate generation of returned data point samples whose distribution characteristics are based on the model. For example for a modeled simple Normal distribution the sample returned will be data points whose average will fall around the passed mean parameter, or for a Bernoulli distribution the model will return samples of 0’s and 1’s with a proportion of 1’s around the theta parameter. (And of course as we increase the number of ‘draws’ the distribution of the returned sample will increasingly approach that of the ideal model.)
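
Just to ground that parenthetical, here is a quick sketch of drawing samples and watching the empirical ratio of 1’s approach the theta parameter as the number of draws grows (shown here with the current tensorflow_probability packaging of TensorFlow Distributions rather than Edward, purely for brevity, and assuming TensorFlow 2.x eager execution):

    import tensorflow as tf
    import tensorflow_probability as tfp
    tfd = tfp.distributions

    bern = tfd.Bernoulli(probs=0.3)

    # with more draws the proportion of 1's converges toward theta = 0.3
    for n in (10, 100, 10000):
        draws = bern.sample(n)
        print(n, float(tf.reduce_mean(tf.cast(draws, tf.float32))))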

Inference Composition

The inference operation then takes the static model definition and derives populated parameters through an iterative operation. As noted earlier one of the points of novelty for the Edward library is the ability to compose the stages of inference — the paper implies that other probabilistic programming libraries basically treat this as a black box operation. Now I’ll be honest, I found the mechanics of the inference composition somewhat less straightforward than the probabilistic modeling — this is partly due to there being several options for methods of inference I suspect. I was a little unclear on whether each of these objective function options is plug and play or requires its own conventions in inference model composition.

The two key paradigms of inference fall under the headings Variational Inference and Monte Carlo. These are illustrated as a hierarchy consistent with the library class inheritances, e.g. each of the two classes (‘VariationalInference’ and ‘MonteCarlo’) inherits from a master ‘Inference’ class, and similarly on down the nesting, where the layer below these two methods I believe serves as the objective functions, for example for Variational Inference one can choose between directions of minimizing KL divergence aka relative entropy (such as minimizing in direction of q|p or p|q) — actually the KLpq labeling convention is helpful for this narrative so I’ll just expand. The probabilistic model defined in the prior section is what we’ll refer to as the probability distribution p, and then the probability distribution that we’re composing for inference is a comparable distribution that we’re referring to as q. I think the difference between p and q is the direction of inference — in the probabilistic model p we’re starting at inputted distribution parameters and trying to produce samples based on the associated probability of returned data points, whereas in the inference model q we’re starting at ‘training’ data points and trying to infer corresponding parameters — so this is almost kind of like a forward and backward pass of backpropagation in a way, but I don’t want you to get too caught up in that neural network analogy because it is just that, an analogy. Also note that those same parameters I believe are generally input as plain tensors for the probabilistic model p, and then as TensorFlow Variables for the inference model q — the difference being that the plain tensors are fixed initialization values, whereas the Variables are mutable state that the inference operation explicitly updates. In fact in each iteration of inference we’ll be slightly improving the inference model q Variable parameters using information derived from data inputted through TensorFlow placeholders, which we can then think of as feeding back to the probabilistic model p as updates to those corresponding inputted parameter tensors (the ones that we had initialized with just 0’s and 1’s in the last section).

In other words, each of the different objective functions is using some method to compare the probabilistic model distribution p and the inference model distribution q, where more specifically I believe with each iteration of the inference operation they’re updating the inference model q parameter Variables such as to make the distribution of the sampled “training data” used as input to the inference operation more closely match the distribution of a returned sample from the probabilistic model p. Then after an iteration of inference there is an update to the parameters of the probabilistic model p with those improved parameters of the inference model q. In this way the validity of the probabilistic model is iterated to better match the characteristics of the training data. When I refer to the updating of ‘parameters’ I mean parameters comparable to those demonstrated above, such as mu and sigma (μ, σ) for a Normal distribution or theta (ϴ) for a Bernoulli distribution. The classes of distributions defined in the models are static, all that is updated through inference are the parameters, so if we have a bad model for the classes and combinations of distributions we’ll need to address that after the criticism stage. Oh and the update to the parameters is accomplished differently based on the different objective functions available for Variational Inference and Monte Carlo — which am not going to try and act like I completely understand, but the gist I got was that in Variational Inference we’re basing our evaluation of the comparison of distributions between p and q on the assumed family of distributions that we spec’ed out in the probabilistic model, and I think the Monte Carlo distribution analysis is more ‘empirical’ in that a summation of (a specified number of) Monte Carlo samples is taken and empirically evaluated — I know that’s a little bit of a hand wave there, apologies. There are a lot of great resources to learn about Monte Carlo out there, and for Variational Inference the paper cites (Jordan, 1999).
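
To make that p versus q distinction a little more concrete, here is a hedged sketch (assuming Edward 1.x; the names qmu and x_train are just illustrative) of a simple model of an unknown mean, where the q parameters are TensorFlow Variables that the KLqp objective iteratively updates:

    import numpy as np
    import tensorflow as tf
    import edward as ed
    from edward.models import Normal

    # probabilistic model p: an unknown two-variable mean mu with a fixed-parameter
    # prior, generating 50 observed data points
    mu = Normal(loc=tf.zeros(2), scale=tf.ones(2))
    x = Normal(loc=tf.ones([50, 2]) * mu, scale=tf.ones([50, 2]))

    # inference model q: the parameters are tf.Variables, the mutable state that
    # each iteration of inference nudges; softplus keeps the scale positive
    qmu = Normal(loc=tf.Variable(tf.zeros(2)),
                 scale=tf.nn.softplus(tf.Variable(tf.zeros(2))))

    # stand-in 'training data' to bind to the modeled x
    x_train = np.random.randn(50, 2).astype(np.float32)

    # variational inference: update qmu's Variables so that q approaches the
    # posterior p(mu | x_train)
    inference = ed.KLqp({mu: qmu}, data={x: x_train})
    inference.run(n_iter=500)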

So yeah once we’ve set up our inference model in a manner corresponding to the probabilistic model, we can run our inference operation through Edward, using one of the objective functions from the inference methods hierarchy noted above. Voila. (Quick asterisk — the paper also notes a third kind of inference paradigm, not covered here, called ‘Exact Inference’, based on symbolic algebra on the nodes of the computational graph to identify conjugacy relationships between variables; deferring to the paper for that treatment.)
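
And to illustrate the composability of inference, swapping the variational objective from the sketch above for a Monte Carlo method is (at least in simple cases) roughly a two-line change, with the approximating distribution switched to an Empirical object holding the collection of drawn samples. A hedged sketch, reusing the mu, x, and x_train definitions from the previous block:

    import tensorflow as tf
    import edward as ed
    from edward.models import Empirical

    T = 1000  # number of Monte Carlo samples to collect

    # the approximation over mu is now just an empirical collection of T draws
    qmu = Empirical(params=tf.Variable(tf.zeros([T, 2])))

    # Hamiltonian Monte Carlo in place of KLqp, same model p and same data binding
    inference = ed.HMC({mu: qmu}, data={x: x_train})
    inference.run()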

Criticism

I noted above that the criticism stage is loosely analogous to the validation of a supervised learning operation. We’re evaluating model accuracy and performance characteristics. How, you might ask, does one go about evaluating accuracy of a probabilistic model against a non-probabilistic validation set carved out from the training data? Consider that for the example of a classification task the output of a probabilistic model p(y|X) is loosely analogous to a sigmoid activation function, where we have some range between 0–1, and just like in supervised learning where we can compress an activation to one of two values based on some threshold, we can simply use our probability estimates to derive a most likely classification and compare to the validation set labels. Similarly, in say a regression problem, although slightly less rigorous, we can use a mean of predictions with an associated accuracy metric (something like mean squared error or etc) — noting too that a graphical depiction of distributions may supplement a metric evaluation for better explainability. And as illustrated in Box’s loop above, the intent should be to potentially revisit your composition of predictive model distribution architectures based on the results of the criticism, although the paper does not go into a great deal of detail on what that might entail.
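
For what it’s worth, Edward exposes a couple of helpers for this stage. Here is a hedged sketch of what criticism might look like for the little unknown-mean model from the inference sketches above, where ed.copy swaps the prior for the learned approximation and ed.evaluate scores the resulting predictive distribution against held-out data (x_test here is a hypothetical held-out array of the same shape as x_train):

    # posterior predictive: replace mu in the model with its learned approximation qmu
    x_post = ed.copy(x, {mu: qmu})

    # score predictive samples against held-out data with a point metric
    print(ed.evaluate('mean_squared_error', data={x_post: x_test}))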

Part 2: Learning Through the Generations

Based on review of “Deep Probabilistic Programming” by Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei (2017) arXiv:1701.03757

I briefly noted in the last section, when discussing the potential avenues for model composition, the various ways that we could shape the randomness in our specifications of distributions. For example, we could craft a distribution as an algebraic composition of multiple distributions, or introduce a control loop such as with ‘if’ or ‘while’ statements, and more interestingly we can also assign distributions to the parameters of other distributions (or heck, similarly assign distributions to the parameters of those distributions, eat your heart out Synecdoche, New York). But the key method that is the subject of this second paper further involves the intersection of probabilistic programming and neural networks, which for this demonstration means the definition of distribution parameters as the output of a neural network trained adjacent to and coupled with the inference operation. We’ll demonstrate one such composition here by way of walking through the probabilistic programming components of a Variational Auto-Encoder (VAE), a type of generative model that has the ability to advance image characteristics along latent vectors identified from a training operation, which it accomplishes by making use of some probabilistic inference. Actually if you’d like a good primer on VAEs as some background I can point you again to another paper review I wrote titled The Choices of a New Generation. Anyhoo as part of this demonstration let’s first start with the presentation of a vanilla VAE’s computational graph, depicting the directions of parameter and data seedings into the various components of distributions.

This computational graph depiction will be a handy reference as we get into the coding of the probabilistic model and inference models below, so let’s take a second to walk through the various components of this illustration.

  • The big box with the N at the bottom simply represents a data set with N number of samples, such that when we later talk about the sets of Xn and Zn those will correspond to this same data set, as can be inferred from the n subscript. Since this is image data, let’s for simplicity neglect RGB color designations and just assume a black and white image, such that each data point from the set N will be a 28x28 pixel image with binary activations (so simpler than MNIST, which has grayscale pixel activations). [update/correction: I believe the N for this image actually indicates the number of axes to the Xn distribution.]
  • The circles represent the distributions at play. Here we see two distributions, the distribution of the latent vectors Zn and the distribution of the corresponding data points Xn (which as a reminder are a collection of pixels). Notice the shading of the Xn distribution, I believe that is the signal to indicate that this distribution has its parameters as a function of a neural network. (Although as an interesting aside, when we later compose our distributions we’ll find that the point of application for the neural network “h” varies between the probabilistic model p and the inference model q. A little confusing to describe, just something to look for when we get to the code below.) [update/correction: I believe the shading indicates that this distribution is observable, and the inclusion of a neural network is not depicted in the graph.]
  • The Greek symbols are the parameters which indicate the classes of distributions. For example we saw earlier that the Bernoulli distribution is parameterized by a theta, and in fact that’s what we’ll define for the Xn distribution (this is Bernoulli because our pixels follow a discrete distribution, meaning each pixel is either on/off, 1/0). The phi (φ) parameter feeding into the latent vector distribution Zn I believe is shorthand for the set of parameters associated with a Normal distribution, which we saw earlier was parameterized by a mu and sigma — so when you see phi in this image just read it as the set of mu and sigma. We apply a continuous (Normal) distribution to the latent vectors Zn because these are realized as a continuous (potentially unbounded) range — this is not to say that Normal is always the best distribution for each application, but I expect it is a common starting point.
  • The arrows of the diagram differentiate between directions of distribution parameter seedings, such as between the probabilistic model for production of samples and the inference model for derivation of parameters. As a reminder, the output of a probabilistic model p is a collection of data point samples fitting the distribution of the model, which here will be derived as a function of the latent vector distributions Zn and the theta parameter of the Xn pixel Bernoulli distributions — which is what’s demonstrated by the solid line arrows. The output of an inference model is then derived in the other direction by applying an objective function of inference (such as either a variational inference or a Monte Carlo assessment), in the directions of the dashed line arrows. (There’s a little added complexity for this example since there is a neural network in the mix, which to be honest I think the computational graph could do a better job of illustrating; again I believe this is related to the shaded circle for the Xn distribution.)

Great so you are now caught up with the computational graph, let’s dive a little deeper by way of code demonstrations of the distribution compositions.

Model Composition

We’ll demonstrate here the composition of the probabilistic model, again for our VAE demonstration applied to a set of N data points of 28x28 binary pixels, with distribution Xn and latent vectors Zn. As a heads up, the “h” that you’re going to see referred to here is a layer in the neural network that will be trained as part of the inference operation. As a review, the output of our probabilistic model will be a collection of data points Xn with sample characteristics based on the distribution of the model. Great so let’s get started.

The first input into the probabilistic model (z) will be representing the latent vector distributions. As indicated in the computational graph by the phi parameter, this Zn distribution is modeled by a Normal distribution, a designation which will be static through inference; it is just the parameters of the specified Normal distribution that will be adjusted by inference (sorry, getting a little ahead of myself). Since we will be deriving the parameters of this Normal distribution through inference, the initialization is relatively simple, just passing TensorFlow tensors of arbitrary values (here we set zeros for the mu and ones for the sigma), and the only real complexity comes from the determination of the dimensions of those initialized parameter tensors. The dimensions of [N,d] represent N for the number of data points in our set and d for the number of evaluated latent vectors. (As an illustration of what could be meant by a latent vector, in say an MNIST handwriting recognition application one latent vector could represent the neatness / sloppiness of the handwriting, or the straightness of the lines, or you know like to what extent letters are evenly spaced, stuff like that.)

The second input into the probabilistic model (h) is our first coded representation of a neural network layer. Here we’re passing a densely connected 256 cell layer with relu activation. To be honest I’m not positive if the ‘Dense’ call is native to Edward and built on top of TensorFlow or if that is a direct TensorFlow call (the full demonstration of this code, in the paper’s Appendix C, makes use of TensorFlow Slim, which am not well-versed on, but I think it’s the latter). The input to this Dense layer is the tensor returned from the Normal distribution samples of the latent vectors Zn, where one way to think about it is that each row of this model composition is its own distinct probabilistic model which generates its own set of samples per the governing distribution, so in other words the neural network Dense layer h will transform the set of latent vector samples via relu activation functions (to be fed into the next layer’s input).

Then the final layer of our probabilistic model (x) is a Bernoulli distribution whose theta parameter is derived as a function of the output of the h layer relu activations into one more Dense layer. To be honest I was at first a little confused by the lack of activation on that final Dense layer, but as you can see the neural network output feeds a logits input to the Bernoulli distribution, which represents the log-odds of a 1 event (looked that one up) — since logits are unconstrained log-odds, with the squashing to a probability handled inside the Bernoulli parameterization, no explicit activation function is needed. Such that this layer outputs a joint distribution of a 28x28 pixel image, or more specifically generates a set of samples of 28x28 pixel images meeting these joint Bernoulli distributions of pixel activations.
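
Pulling those three pieces together, the probabilistic model reads roughly as follows (a paraphrase of the paper’s listing rather than a verbatim copy, assuming Edward 1.x with Keras supplying the Dense layers, and with N and d as hypothetical stand-ins for the data set size and latent dimension):

    import tensorflow as tf
    from edward.models import Bernoulli, Normal
    from keras.layers import Dense

    N = 55000  # number of data points (hypothetical, e.g. a binarized MNIST train set)
    d = 50     # number of latent dimensions (hypothetical)

    # (z): latent vectors, a Normal initialized with arbitrary 0/1 parameters
    z = Normal(loc=tf.zeros([N, d]), scale=tf.ones([N, d]))

    # (h): a 256 cell densely connected relu layer fed by samples of z
    # (.value() pulls out the sample tensor underlying the random variable)
    h = Dense(256, activation='relu')(z.value())

    # (x): Bernoulli pixel activations, with logits (log-odds) from one more Dense layer
    x = Bernoulli(logits=Dense(28 * 28)(h))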

Inference Composition

Great so we’ve got our forward pass of the probabilistic model p, now let’s demonstrate the backwards pass, so to speak, for the inference model q. Of course the composition is just the first step of inference; after that we have to actually run the objective function, which isn’t shown here. (It’s definitely worth checking out Appendix C of the paper if you want to see the full code demonstration, which includes an interplay of joint inference and training (simpler than it sounds); I’m just going to focus on the inference model composition here.) Note that the paper refers to this as a “variational model” instead of an inference model, I believe due to the use of variational inference as the objective function (specifically KLqp); I’m not sure if this composition would need to be changed for use of Monte Carlo inference for instance, leaving that question as a reader exercise.

The first input to the inference model q (qx) is associated with the Xn distribution, which if you take a look above you’ll see was the final layer of the probabilistic model p, which just goes to show we’re moving in opposite directions. The inference models seem to follow the naming convention of a ‘q’ letter added to the corresponding layer of the probabilistic model (i.e. ‘qx’ corresponds to ‘x’). The input seeding is just a TensorFlow placeholder — which means we’re not actually initializing with values, a placeholder is just that, a placeholder for passed values, which in this case will be the passed values of our data set N. We’ll initialize the placeholder to accept the points in our data set N with a given float precision (32 bit float) and input dimensions [N, 28*28], for N number of samples in our data set, each of which is a 28x28 collection of binary pixel activations.

The second input to the inference model is that tricky h layer again (qh), which remember is kind of not well represented in the computational graph flowchart presented above, and turns out it gets even a little trickier because while in the probabilistic model we had a neural network feed from a Dense layer in h to a Dense layer in x, here it’s the opposite, where we’re feeding a Dense layer from qh into a Dense layer in the distribution parameters of qz. I think it’s implied by the dashed arrow of the parameters passed to the Z distribution that any neural network incorporated into inference, if not aggregated beforehand, will feed through to the final layers of the inference. A little confusing I know. But getting on with it, you can see we’re taking as input the data set tensors passed from the qx layer and feeding them into a Dense layer of 256 neurons — same as the h layer in the probabilistic model actually, come to think of it, just with different inputs.

Then the final layer of the inference model (qz) is the qz associated with the Zn distribution. Remember how in the probabilistic model we had derived our Xn distribution parameters as a function of a neural network? Well here in the backwards pass we’re instead deriving our Zn distribution parameters as a function of the backward pass neural network. It’s a little different because the Zn distribution is a Normal, which means we have two parameters to derive (mu and sigma), in comparison to the forward pass where the Xn distribution only had one parameter to derive. The input to these Zn distribution neural network models are the values returned from the qh layer, and they have a softplus activation on the sigma parameter to deal with constraints on allowable values (a standard deviation must be > 0). Oh and I presume the dimensions of the parameters tensor are partly inherited from qh (which inherits from qx) for the N value, and the d for number of latent vectors is passed to the layer explicitly.
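
Putting the backwards pass together, the variational model then reads roughly as follows (again a paraphrase of the paper’s listing, with the same Edward 1.x and Keras assumptions as above; note the softplus sits on the scale parameter to keep it positive):

    # (qx): a placeholder to be fed the N observed 28x28 binary images
    qx = tf.placeholder(tf.float32, [N, 28 * 28])

    # (qh): a 256 cell relu layer, this time fed by the observed data rather than z
    qh = Dense(256, activation='relu')(qx)

    # (qz): a Normal over the latent vectors whose loc and scale parameters are each
    # the output of a Dense layer; softplus constrains the scale to be > 0
    qz = Normal(loc=Dense(d)(qh),
                scale=Dense(d, activation='softplus')(qh))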

Training

The training of the neural network and inference of the distribution parameters take place in parallel iterations, which am not sure if that generalizes to all applications or is a quirk of VAEs. This is the phase where the application of a KLqp inference objective function will take place to derive parameters, and the paper’s demonstration also makes use of an RMSProp optimizer for the neural network training. I do recommend taking a look at the full code demonstration in the paper’s Appendix C.1 (referring to the paper that I am reviewing), which includes further details like imports, parameters, and the training operation.
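
For completeness, here is a hedged sketch of how that joint training loop might be wired up (my paraphrase rather than the paper’s exact Appendix C listing, which additionally handles mini-batching; x_train here is a hypothetical [N, 28*28] array of binary pixel values):

    import edward as ed

    # bind the latent z to its variational approximation qz, and the observed x to
    # the qx placeholder (cast to match the Bernoulli's integer dtype); KLqp is the
    # variational objective from Part 1
    inference = ed.KLqp({z: qz}, data={x: tf.cast(qx, tf.int32)})

    # the neural network weights inside the Dense layers are trained by the same
    # optimizer, in parallel with the inference over the distribution parameters
    optimizer = tf.train.RMSPropOptimizer(learning_rate=0.01, epsilon=1.0)
    inference.initialize(optimizer=optimizer, n_iter=1000)

    sess = ed.get_session()
    tf.global_variables_initializer().run()
    for _ in range(inference.n_iter):
        info = inference.update(feed_dict={qx: x_train})
        inference.print_progress(info)

Yep, moving on.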

Part 3: At the Root of It All

Based on review of “TensorFlow Distributions” by Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A. Saurous (2017) arXiv:1711.10604

At the heart of the Edward library is a foundation built from stones of TensorFlow Distributions. Put simply, the Distributions library is a tool for modeling distinct classes of distributions and generating samples based on those distribution properties. Of course since this is a TensorFlow library it includes all of the built-in support for operations like GPU integration or distributed operations, but by itself the Distributions library doesn’t support Edward operations like inference or integration with neural networks for instance. There are actually two key abstractions available in the library, ‘Distributions’ for defining distributions, and ‘Bijectors’ for applying transformations to distributions. I’ll start this brief dialogue on the library with a quick illustration of the shape semantics associated with generated samples, as understanding these mechanics goes a long way toward intuiting the operation of Distributions, a key mechanism behind the Edward library.

A sample consists of a series of Monte Carlo draws, where a collection of events is aggregated into a set of batches and batches are aggregated into a sample. The shape parameters shown here of n / b / s are not necessarily scalars, they represent shapes after all, so the representation could also be a list of scalars, each representing the number of dimensions along some axis (for example similar to how we initialized dimensions to the latent vectors in the VAE demonstration as a list [N,d] to give you an idea). Roughly speaking, the event_shape is the shape of a single draw from the distribution, whose dimensions may be dependent on one another; the batch_shape indexes a collection of independent distributions which are not necessarily identically distributed (e.g. they can carry different parameters); and the sample_shape indexes the identically distributed Monte Carlo draws taken across the whole batch. Tying back to the VAE demonstration, one draw would be a 28x28 pixel image, though strictly speaking, since we composed that model from the scalar Normal and Bernoulli classes parameterized by tensors, the [N,d] dimensions of the Zn distribution and the [N, 28*28] dimensions of the Xn distribution register as batch dimensions with scalar events; a non-trivial event_shape shows up for distribution classes whose single draw is itself multi-dimensional (such as a multivariate Normal).
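
Here is a small sketch of those three shapes in code (using the current tensorflow_probability packaging of the Distributions library; at the time of these papers the same classes lived under tf.contrib.distributions):

    import tensorflow_probability as tfp
    tfd = tfp.distributions

    # three independent scalar Normals with different means:
    # batch_shape=[3], event_shape=[] (scalar events)
    d = tfd.Normal(loc=[0., 1., 2.], scale=[1., 1., 1.])
    print(d.batch_shape, d.event_shape)

    # a single two dimensional Normal: batch_shape=[], event_shape=[2]
    mvn = tfd.MultivariateNormalDiag(loc=[0., 0.])
    print(mvn.batch_shape, mvn.event_shape)

    # sample_shape is supplied at draw time; the returned tensor's shape is
    # sample_shape + batch_shape + event_shape
    samples = d.sample(5)             # shape (5, 3)
    mvn_samples = mvn.sample([4, 2])  # shape (4, 2, 2)

Moving on.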

As noted in the intro to this section, the TensorFlow Distributions library has two primary abstractions: the Distributions for modeling classes of distributions, and the Bijectors for applying transformations to distributions. For the Distributions methods, there is actually a library of distribution classes to choose from, each with their own parameters and characteristics (such as the difference between continuous and discrete distributions demonstrated earlier in this essay). The Bijectors library I find very interesting, though it’s not really obvious to me how Bijectors ties into the Edward library — that might all be under the hood. This might be slightly less intuitive so I’ll try to offer a little clarification: when we are applying a transformation to a distribution, that’s loosely analogous to applying say a z-score normalization to a single variable in tabular data preprocessing. But consider that the z-score normalization primarily changes the scaling of values returned from a distribution by way of offset and multiplier; I expect the Bijector library can add more sophistication by updating the shape of the distribution curve itself — applying a transform to the values returned from a sampling by way of applying a transformation to the curve of probabilities associated with the potential returned values from a distribution. Of course it is important to keep in mind that not all distributions share enough in common that you can directly translate from one to another — for example a distribution with a bounded left tail can’t be derived from a distribution with an unbounded left tail without losing some information in the process, or more dramatically a continuous distribution cannot be transformed to a discrete distribution without even more loss of information. Most importantly, and I expect an easy ‘gotcha’ here, might be that the transformation of a distribution relies on the assumption that the properties of the source distribution are sufficiently understood — I believe there are some distributions where estimating parameters can require orders of magnitude more data than others, such as with fat-tailed distributions.
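
As a concrete illustration of a Bijector style transform (again via the tensorflow_probability packaging), pushing a standard Normal through an Exp bijector yields a log-normal shaped distribution, which is a genuinely different curve rather than just a rescaling of the samples:

    import tensorflow_probability as tfp
    tfd = tfp.distributions
    tfb = tfp.bijectors

    # base distribution: a standard Normal
    base = tfd.Normal(loc=0., scale=1.)

    # transformed distribution: exp() applied to the Normal, i.e. log-normal shaped;
    # the bijector reshapes the whole probability curve, not just an offset/multiplier
    log_normal = tfd.TransformedDistribution(distribution=base, bijector=tfb.Exp())

    samples = log_normal.sample(1000)    # draws are all positive
    density = log_normal.log_prob(2.0)   # densities account for the change of variables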

I’ll briefly offer in conclusion that this exploration into the Edward and TensorFlow Distributions libraries was a very enjoyable and worthwhile experience. I hope you the reader might have gotten some value out of this review. I know this was kind of challenging subject matter; the goal here was to try and provide a plain language treatment, at a level of detail sufficient for a reader to gain an understanding that took me several days of reviewing papers to achieve. If you enjoyed or got some value out of this review, I hope you might consider checking out Automunge, which is an open source library built as a platform for feature engineering and automated data preparation of tabular data for machine learning. It’s very useful and I’m very hungry for feedback from users who might help me identify points that are well-done and/or might need clarification. Wishing the best to the impressive probabilistic programming facilitators of the Edward and TensorFlow libraries. I expect industry will continue to find new high-value use-cases for these tools as a timeless foundation for the intersections between neural networks and probability. Cheers.

Books that were referenced here or otherwise inspired this post:

From the Diaries of John Henry — Nicholas Teague


References

[1] Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei (2017) Edward: A library for probabilistic modeling, inference, and criticism arXiv:1610.09787

[2] Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei (2017) Deep Probabilistic Programming arXiv:1701.03757

[3] Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A. Saurous (2017) TensorFlow Distributions arXiv:1711.10604

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com
