Optimising for beauty - the erasure of heterogeneity and an appeal for a Bayesian world view

Part of a series of articles I’ll be posting over the next few days & weeks regarding thoughts from the past few years I’m only just getting round to writing about.

I believe many of the situations we face are vastly complex and high-dimensional, such that it is impossible for us to see any situation in its entirety. All we are able to perceive at any given time is perhaps a tiny slice from a very specific perspective. And the best we can hope for is to widen that slice by a tiny amount, and maybe even occasionally shift our perspective slightly. Unfortunately, contrary to popular belief, it’s not always being exposed to new ‘facts’ that helps us shift perspective; or at least, facts alone are definitely not enough. Instead we seem to need some kind of mental gymnastics to navigate to a new location inside our heads, and gain these new viewpoints. These mental gymnastics can often benefit from external nudging, and from mental frameworks that act as structures around which we can build our thoughts. Traditionally this has been aided by natural language, evolved over thousands of years; but language isn’t always sufficient, and art, for example, is also critical to this end. I believe borrowing from other, more recent languages, such as computation, might also contribute to these mental frameworks.

In the spirit of the subject matter, this series doesn’t have to be read in any particular order:

http://www.memo.tv/portfolio/optimising-for-beauty/, 2017

This video shows an artificial neural network training on a well-known dataset containing hundreds of thousands of images of faces, specifically faces of celebrities. Every face that appears in this video is fictional, dreamt up by the neural network whilst it’s training.

The politics of this dataset (who is in it, who isn’t, how it was collected, what it’s used for, the consequences of all this, etc.) is in itself a crucial topic, but not the direct subject of this text. I’d like to draw attention to a related but different point.

When I first started seeing these kinds of results, I was fascinated by how everything was so smoothed out. Even though the original dataset has a certain level of diversity (though arguably not enough), in these output images even that level of diversity is lost. Variations in detail, face shapes, individual characteristics, blemishes etc. are all erased. Everything is normalized, averaged out, with the most common attributes dominating the results. This is not a behavior that I am explicitly programming in; it is an inherent property of the learning algorithm I'm using to train the neural network. The network is learning an idealized sense of hyper-real beauty, a race of 'perfect', homogeneous specimens.

This ‘blurriness’ is a well-known problem of generative neural networks. Though I should mention, the method I’m using here is by no means state-of-the-art in image generation. Already today Generative Adversarial Networks (GANs) are producing more photo-realistic results.

Nevertheless, I’m intentionally using a very particular method here — a method that is still very widespread in many areas of machine learning and statistical inference. This is what’s known as Maximum Likelihood Estimation (MLE). That means — quite intuitively speaking — given a set of observations (i.e. data points), out of all possible hypotheses, find the hypothesis that has the maximum likelihood of giving rise to those observations. Or in fewer words: given some data, find the hypothesis which is most likely to have produced that data.
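At its core, MLE is just an argmax over candidate hypotheses. Here is a minimal sketch in Python; the dice, names and numbers are my own illustration, not from the text:

```python
from math import prod

# Two hypothetical dice: a fair one, and a loaded one that favours sixes.
dice = {
    "fair":   [1/6] * 6,
    "loaded": [1/12, 1/12, 1/12, 1/12, 1/12, 7/12],
}

def likelihood(rolls, faces):
    # Probability of this exact sequence of rolls under one hypothesis.
    return prod(faces[r - 1] for r in rolls)

def mle(rolls):
    # Pick the single hypothesis under which the data is most probable;
    # every other hypothesis is simply discarded.
    return max(dice, key=lambda name: likelihood(rolls, dice[name]))

print(mle([6, 6, 2, 6, 6]))   # -> "loaded"
print(mle([1, 2, 3, 4, 5]))   # -> "fair"
```

The important point is the `max`: whatever nuance is in the data, the output is a single committed answer.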

It makes a lot of sense right?

But it has some shortcomings.

Imagine we find a coin on the street. And we’d like to know whether it’s a fair coin, or weighted (i.e. ‘biased’) towards one side, possibly favoring heads or tails. We have no way of determining this by physically examining the coin itself. So instead, we decide to conduct an experiment. We flip the coin ten times. And let’s say we get 7 heads and 3 tails.

A maximum likelihood approach would lead one to conclude that the most likely hypothesis to give rise to these observations is that the coin is biased in favor of heads. In fact, such a ‘frequentist’ might conclude that the coin is biased 7:3 in favor of heads (with a 26.7% probability of throwing 7 heads). An extreme frequentist might even commit to that hypothesis, unable to consider the possibility that there are many other alternative hypotheses, one of which, though less likely, might actually be the correct one. E.g. it is quite possible that the coin is in fact fair, and that simply by chance we threw 7 heads (for a fair coin, the probability of this is 11.7%, not negligible at all). Or maybe the coin is biased only 6:4 in favor of heads (the probability of throwing 7 heads is then 21.5%, still quite likely). In fact the coin might have any kind of bias, even in favor of tails. In that case it is very unlikely that we would have thrown 7 heads, but it is not impossible.
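These percentages come straight from the binomial formula, and can be checked in a few lines of Python (a sketch; the function name is mine):

```python
from math import comb

def likelihood(p_heads, heads=7, tails=3):
    """Binomial probability of seeing `heads` heads and `tails` tails
    if the coin lands heads with probability p_heads."""
    n = heads + tails
    return comb(n, heads) * p_heads**heads * (1 - p_heads)**tails

print(round(likelihood(0.7), 3))  # coin biased 7:3 -> 0.267
print(round(likelihood(0.5), 3))  # fair coin      -> 0.117
print(round(likelihood(0.6), 3))  # coin biased 6:4 -> 0.215
```

Note that every hypothesis assigns *some* probability to the observed data; MLE simply keeps the largest and discards the rest.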

Some of you might be critical of this example, pointing out that 10 throws is far too small a sample size from which to infer the fairness of the coin. “Throw it 100 times, or 1000 times” you might say. Heck, let’s throw it a million times.

If we were to get 700,000 heads out of 1 million throws, then can we be sure that the coin is biased 7:3 in favor of heads?

For practical reasons, we might be inclined to assume so. We have indeed quite radically increased our confidence in that particular hypothesis. But again, that does not eradicate the possibility that another hypothesis might actually be the correct one.
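This “radical increase in confidence, but never certainty” can be made concrete by comparing hypotheses on a log scale; a sketch under my own naming, using the fact that the binomial coefficient cancels when comparing two hypotheses on the same data:

```python
from math import log, exp

def log_likelihood(p_heads, heads, tails):
    # log of p^heads * (1 - p)^tails; the shared binomial coefficient
    # cancels out when we take ratios between hypotheses.
    return heads * log(p_heads) + tails * log(1 - p_heads)

# 7 heads out of 10: the 7:3 coin is only ~2.3x more likely than a fair coin.
small = exp(log_likelihood(0.7, 7, 3) - log_likelihood(0.5, 7, 3))
print(round(small, 2))   # -> 2.28

# 700,000 heads out of 1,000,000: the log-likelihood gap grows to ~82,000,
# so the fair-coin hypothesis becomes astronomically unlikely --
# yet its probability never reaches exactly zero.
big = log_likelihood(0.7, 700_000, 300_000) - log_likelihood(0.5, 700_000, 300_000)
print(round(big))        # -> 82283
```

More data shifts the weight of evidence enormously, but it only ever reweights hypotheses; it never proves one.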

Evidence is not proof. It is merely evidence. I.e. a set of observations that increase or decrease the likelihood of — and our confidence in — some hypotheses relative to others.

The maximum likelihood approach is unable to deal with uncertainty. It is unable to deal with the possibility that a less likely, a less common hypothesis might actually be the correct one.

The maximum likelihood approach is binary. It has no room for alternative hypotheses. The hypothesis with the highest likelihood is assumed to be unequivocally true.

The maximum likelihood approach commits to a dominant truth, and everything else is irrelevant, incorrect, and ignored. Any variations, outliers, blemishes etc., are erased from existence; blurred and bent to conform to this dominant absolute truth, this binary world view.

So are there alternatives?

At the opposite end of the spectrum, one could maintain a distribution of beliefs over all possible hypotheses, with varying levels of confidence (or uncertainty) assigned to each hypothesis. These levels of confidence might range from “I’m really very certain of this” (e.g. the sun will rise again tomorrow from the east) to “I’m really very certain this is wrong” (e.g. I will let go of this book and it will stay floating in the air) — and everywhere in between. Applying this form of thinking to the coin example, we would not assume absolute truth in any single hypothesis regarding how the coin is biased. Instead, we would calculate likelihoods for every possible scenario, based on our observations. If need be, we can also incorporate a prior belief into this distribution. E.g. since most coins are not biased, we’d be inclined to assume that this coin too is not biased, but nevertheless, as we make observations and gather new evidence, we will update this distribution, and adjust our confidences and uncertainties for every possible value of coin bias.

And most critically, when making decisions or predictions, we don’t act assuming a single hypothesis to be true, but instead we consider every possible outcome for every possible hypothesis of coin bias, weighted by the likelihood of each hypothesis. (Those interested in this line of thinking can look up Bayesian logic or statistics).
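The paragraphs above can be sketched as a few lines of Python; this is a minimal grid approximation of Bayesian updating, with names and grid resolution of my own choosing:

```python
def posterior(heads, tails, grid, prior):
    # Bayes' rule on a grid of hypotheses: posterior is proportional to
    # prior times likelihood, then normalized to sum to 1.
    unnorm = [pr * (p ** heads) * ((1 - p) ** tails)
              for p, pr in zip(grid, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

grid = [i / 100 for i in range(101)]     # candidate biases 0.00 .. 1.00
flat = [1 / len(grid)] * len(grid)       # no prior opinion about the coin
post = posterior(7, 3, grid, flat)

# We keep the whole distribution: bias 0.7 is the peak of our belief,
# but bias 0.5 retains real probability mass rather than being declared false.
# A prediction then weights every hypothesis by how much we believe it:
p_next_heads = sum(p * w for p, w in zip(grid, post))
print(round(p_next_heads, 2))   # -> 0.67, not the MLE answer of 0.7
```

With a flat prior the peak of the posterior sits at 0.7 (the MLE answer), but the prediction hedges toward 2/3, because the less likely hypotheses still get a vote. Replacing `flat` with a prior concentrated around 0.5 would encode the belief that most coins are fair, and the same update rule would apply.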

But, as I’ve mentioned in my other works and texts, my main interest is not necessarily these algorithms themselves. In a real-world situation, no one would be foolish enough to settle on a maximum likelihood solution with only 10 samples (one would hope!).

My main interest is in using machines that learn as a reflection on ourselves, and how we navigate our world, how we learn and 'understand', and ultimately how we make decisions and take actions.

I’m concerned we’re losing the ability to safely consider and parse multiple views within our social groups, let alone within our own minds. I’m concerned we’re losing the ability to recognize and navigate the complexities of situations which may require acknowledging, or even understanding, the existence of lines of reasoning that lead to views different from our own; maybe even radically different, opposing views. I’m concerned that ignoring these opposing lines of reasoning, and pretending that they don’t exist, might be more damaging in the long run than acknowledging their existence, trying to identify at what point(s) our lines of reasoning diverge, and how, and why, and trying to tackle those specific points of divergence, whether they stem from what we consider to be incorrect premises, flawed logic, or differing priorities and desires.

I’m concerned we’re becoming more and more polarized and divided on so many different points on so many different topics. Views, opinions and discourse in general seem to be becoming more binary, with no room to consider multiple or opposing views. I’m concerned we want to ignore the messy complexities of our ugly, entangled world; to erase the blemishes; and to commit unequivocally to what seems (to us) to be the most likely truth — the one absolute truth, in a simple, black-and-white world of unquestionable rights and wrongs.

Or maybe that’s just my over-simplified view of it.