Self-Organising Feature Maps for fun and profit

This is Part 2 of a three-part series on competitive neural networks. You can find Part 1, an introduction to competitive neural networks, here. Part 3, which looks at a different competitive algorithm called Growing Neural Gas, will be released on 27 January.

There is a companion Jupyter notebook to this post, which includes some of the mathematical details as well as the code generating everything you see here. The companion notebook for all three Parts will be released with Part 3.


A typical Kohonen map with a 4x4 map layer and three input neurons. The map layer can be seen as a two-dimensional layer where each node can be identified by their (x, y) coordinates. The input layer accepts a three-dimensional data set, row by row, each row consisting of three values. Note that in the map layer, every node is connected to its adjacent nodes. This connection is relevant as it conditions how the nodes in the ‘neighbourhood’ of the best-performing node behave, and which nodes are going to be dragged along with it.

In the previous chapter, we looked at the general idea of competitive neural networks and vector quantization. We saw, among others, that competitive neural networks were quite different from the ‘classical’ feed-forward neural networks we all know and love (sometimes called ‘error connecting networks’). We have also seen that competitive neural networks can reproduce topologies quite well. In this post, we’ll be looking at Self-Organising Feature Maps (SOFMs) or, as they’re also known, Kohonen maps (named after Finnish computer scientist Teuvo Kohonen, who first implemented them for vector quantization for voice signals).

A 7x7 SOFM’s map layer coloured by the distance from the node at position (2, 4). Lighter colours indicate higher distance/lower adjacency.

A SOFM is a network of neurons, typically arranged initially in a square shape, with each neuron connected to adjacent neurons (toroidal or wrap-around Kohonen networks additionally connect edges, so that the topmost neuron of each column is connected to the bottom-most, and the leftmost neuron is connected to the rightmost of each row). This means that for every neuron of the map layer, their proximity to other neurons can be calculated. This will be quite important when we consider the process of training a SOFM. The training algorithm makes use of the concept of a gradually decreasing ‘learning radius’. Like other forms of Hebbian learning, in a SOFM, the neurons ‘compete’ to respond to the data point being considered in the current iteration. The ‘winner’ is the neuron that is already closest to the data point (this is sometimes expressed in the language of having an initial weight closest to the node, but I find that this only serves to obscure what this really means in a spatial sense and is intended to make these anomalous neural networks fit in with error-correction networks). Its ‘reward’ is being moved, along with the neurons within the learning radius, according to an update formula which ensures that the neuron closest to the data point (referred to as the ‘Best Performing Unit’ or BPU) moves most, and the amount of move decreases with distance from the BPU.

Training a SOFM

A single iteration of a SOFM: a data point is chosen from the data set, and the point of the initialised grid (displayed here as an idealised square grid but in reality having a rather more irregular shape if initialised with random values) closest to an arbitrarily selected point from the source data is identified. This is the ‘best performing unit’ or BPU. Neurons within the learning radius of the BPU (denoted in Step 3 as purple neurons) are also moved, although they are moved less as the distance increases.

Unlike most methods of unsupervised learning that determine topology and clustering, SOFMs do not require you to know the number of clusters in advance. Other than determining a distance function (Euclidean, cosine, &c.), SOFMs only require that you specify map layer dimensions, initial learning radius and learning radius decay (the number of epochs after which the learning radius is decreased). The initial grid is then initialised with initialisation weight vectors of the same number of elements as the input data has variables, denoted usually with 𝛿— Kohonen’s original paper suggested random weights, but recently, approaches that used principal components have also been used, with more success for normally distributed data than otherwise. To clarify, in the context of a SOFM’s map layer, ‘weights’ quite simply refer to a 𝛿-dimensional vector associated with each neuron that describe a position corresponding to that neuron in 𝛿-dimensional ℝ-space.

The network is trained using an iterative process outlined in the figure above.

  1. After initialising weights for each neuron of the map layer, a 𝛿-length vector v is selected from the data set.
  2. The distance between the 𝛿-element vector v and each of the weights associated with each neuron of the SOFM is calculated. This happens using the distance function specified in the beginning (more often than not, this is Euclidean distance).
  3. The neuron closest to v — the best-performing unit or BPU––is moved closer to v.
  4. Neurons that fall within the learning distance (or, in weights language, the weights are adjusted to be closer to the weights of v) are also moved towards v. The amount to which the neuron is moved (or, in weights terms, the weights are adjusted) is a function of the affected neuron’s distance from the BPU.
  5. This is iterated until a given number of iterations, described as the limit 𝜆, have been performed.

Identifying similar faces with SOFMs

One unique application of clustering pertains to the identification of similar images. For this, we are going to use the Olivetti Faces data set, consisting of ten photos each of forty employees at AT&T Labs, Cambridgez, with different lighting, facial expressions and pose, as 64x64 black-and-white pictures stored as a single 4,096-element ndarray.

For this example, an ‘out of the box’ 20x20 map grid was used with Euclidean distance and a rectangular SOFM grid over 100 iterations. Training time was approximately five, and what is quite remarkable is the semanticity of the results. Most clustering algorithms, for instance, would be thrown off by differences in pose and lighting, whereas the clusters that are created by the SOFM are very similar to the way we humans would understand facial similarity. So for instance the faces of young men and young women were closer to each other than counterparts of the same gender but more advanced age. Each photo shows an image representative of that cluster (specifically, the first) from a cluster corresponding to the map neuron, while blank spaces mean no neuron in that location. You may notice two things: first, that some faces appear multiple times. This is not unexpected, sometimes individual photos are sufficiently different for the SOFM to allocate them to different neurons. This can be cut down by reducing the number of neurons or by calculating the precise distance between individual values (which is somewhat obscured here — in order to adequately display the photos, the map does not show the exact distances).

Olivetti Faces dataset clustered by 100 iterations of a 20x20 SOFM with Euclidean distance, coloured by 1-distance von Neumann U-matrix (the Euclidean distance of the image from its top, bottom, left and right neighbours, averaged). Warmer colours indicate closer proximity to the von Neumann neighbourhood, cooler colours indicate larger differences.

An interesting tool for the analysis of SOFMs is the U-matrix. The U-matrix, quite simply put, is a matrix of the Euclidean distance between the weight of two adjacent neurons. This allows us to visualise not only where borders between particular clusters lie, but how deep those borders are. For instance, an algorithm might correctly identify that a blue, teal and red stripe are three different areas, but the difference between red and blue or teal will be much larger than the difference between blue or teal. This allows SOFMs to be developed further, for instance, to ‘merge’ clusters that are not sufficiently different––using the aforementioned blue-teal-red example, reducing the threshold of separation would reduce it to only two clusters, ‘blues’ and ‘reds’. The U-matrix colours are overlaid on the faces, so that images that are relatively close to their surroundings (using a cross-shaped, i.e. von Neumann neighbourhood for a Manhattan distance of 1).

Tweaking SOFMs

Olivetti Faces dataset clustered over 50 iterations of a 40x40 SOFM with Euclidean distance, coloured by 1-distance von Neumann U-matrix (the Euclidean distance of the image from its top, bottom, left and right neighbours, averaged). Warmer colours indicate closer proximity to the von Neumann neighbourhood, cooler colours indicate larger differences.

Are large feature maps necessarily better? How about more iterations? With SOFMs, this often depends on the particular data set. In the example left, we have reduced training time to half but doubled each dimension (quadrupled the area, that is) of the SOFM map layer. As you can see, the results approached a higher degree of coherence. There are more clusters, but many are strongly correlated with their neighbourhood (warm colours), and those most correlated with their neighbourhood are at a considerable distance from those that correlate less well. It is a characteristic of SOFMs that as the number of neurons in the map layer increases, the distance-based representation gradually assumes a circular shape.

Clustering real-world data with SOFMs

SOFMs can be used for clustering real-world data, too. To illustrate this, we’ll be looking at two indicators from the Wine dataset from UCI’s Machine Learning Repository.This data set contains thirteen chemical parameters of three types of wines, which we are going to call by their arbitrarily assigned dot colours––”green”, “orange” and “yellow”. Of this, to illustrate SOFMs in clustering, we will be only looking at two — alcohol content (column 1, y-axis) and malic acid content (column 2, x-axis).

Cluster centers of the Wine dataset as determined by a SOFM with a 3x1 rectangular map layer over 200 iterations. The dark points denote cluster centres. It is quite evident that the cluster centres are fairly good approximations of the clusters themselves. It is notable how resilient this approach is to outliers, especially visible in the case of “orange”, where the higher malic acid outliers are almost completely ignored. This may or may not be a desirable feature.

In general, there exist two approaches to SOFMs in clustering. If the number of clusters is known, then an n-by-1 SOFM will quite reliably generate cluster centres. This is often more reliable than what most clustering algorithms would come up with. In the example to the left, a 3x1 rectangular map was used over 200 iterations, and generated three cluster coordinates that are fairly good approximations of what points do a good job at representing a ‘typical’ case of the cluster, even with the relatively large spread, both in alcohol content and malic acid content, of ‘green’. One remarkable feature is the insensitivity to outliers: despite a number of high (more than 2.5g/L) malic acid content wines, the cluster centre remained relatively unaffected.

A 7x7 SOFM creates 49 representation points that together are the best possible compression of the data. Note the density contingency: the density of the points in the 1–2.5g/L malic acid range reflects the density of the points in this range. A convex hull built using these points would cover more of the data than a convex hull using the cluster centroids of traditional clustering algorithms.

A slightly different clustering approach looks less at different clusters and more at a number of points that together give a good representation of an entire dataset or an already discriminated cluster within. An example is presented below, where a 7x7 grid chose 49 ‘representation points’, points which together give the best representation of the data. Note the difference between clustering algorithms, which would seek to represent data using averages, and this algorithm, which looks at it from a more information theoretical approach and tries to discern a positioning for points that represents as many points as possible as accurately as possible.


The next part of this series, Part 3: Growing Neural Gas, will appear on 27 January 2019. Watch this space!