Matrix processing with nanophotonics

Martin Forsythe
Lightmatter
Aug 28, 2019

At the heart of the computations that power deep learning and many other numerical scientific computing tasks is a mathematical operation called general matrix multiplication (GEMM). Today, the most advanced neural networks — used in self-driving cars, cancer diagnostics, and voice-assisted search — are powered by GEMM (see reference [1]), so a great deal of effort has been devoted to designing specialized hardware to accelerate this operation.

Different hardware platforms can be used to perform the GEMM operations for a deep learning prediction. The box titled LM indicates Lightmatter’s photonic processor.

At Lightmatter, we are building a photonic processor to accelerate deep learning. In a previous post we discussed the history of optical chips and the advances that made Lightmatter’s technology possible today. This post will dive into some of the technical differences between optical and electronic accelerators with the goal of explaining why nanophotonics offers a route to higher computational performance while requiring significantly less energy, reducing both cost and heat production. The post will first explain how general matrix multiplication works and how it is implemented in accelerator hardware today. Then, we will explain how Lightmatter’s nanophotonic processor uses light rather than digital electronics to implement the same GEMM calculations more efficiently.

Matrix multiply-accumulate operations

A matrix is a two-dimensional array of values — its size is quantified by the number of rows and columns, which is often written as [rows 𝗑 columns]. When two matrices, A and B, are multiplied together, the result is a new matrix D. If A and B are of sizes [I 𝗑 J] and [J 𝗑 K] respectively, the matrix D is of size [I 𝗑 K]. Each element of D is computed by multiplying, element by element, a row vector of A with a column vector of B and accumulating the products. Consequently, matrix multiplication “contracts” the inner dimension J.

The highlighted element of D is computed from the highlighted row of A and column of B.

In code this looks like:
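Here is a minimal Python sketch (assuming NumPy arrays; illustrative rather than optimized):

import numpy as np

def matmul(A, B):
    # A is [I x J], B is [J x K]; the result D is [I x K].
    I, J = A.shape
    J2, K = B.shape
    assert J == J2, "inner dimensions must match"
    D = np.zeros((I, K))
    for i in range(I):
        for k in range(K):
            for j in range(J):  # contract the inner dimension J
                D[i, k] += A[i, j] * B[j, k]  # one multiply-accumulate (MAC)
    return D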

When designing hardware to perform matrix multiplication there are a number of optimizations that are possible. In digital electronics, one such optimization is to implement a hardware instruction called a multiply-accumulate (MAC), which computes d←ab+c for three numbers, a, b, and c. From the code above we can recognize that a matrix multiplication can be decomposed into a series of MACs. There are also a number of optimizations involving data access/memory management tricks, but for the purposes of this post we’ll just concern ourselves with the MACs since they are the fundamental unit of computation.

If you look back at the first figure, you will notice that the elements in A, B, and D are grouped suggestively. It turns out that you can express the multiplication of large matrices in terms of the products of smaller matrices if you split the larger matrices into sub-matrices, called tiles or blocks.

On a matrix processor the green tile is computed by accumulating products of the blue and orange tiles; the red tile is accumulated from the products of the purple and orange tiles. Since there are two rows of tiles in the matrix A, the orange tiles from B are used twice in computing the output.

On a matrix processor, such as Google’s TPU [2] or MIT’s Eyeriss [3], rather than operating on a stream of single values in row and column vectors, the computation is parallelized by processing products of tiles. The tile products are accumulated in exactly the same row-column pattern pictured above, except that the unit element is now a tile rather than a single entry.

So far, we have explained the matrix-multiply portion of the GEMM operation. The word “general” in the acronym comes from allowing the matrix product (A 𝗑 B) to be summed with an initial value matrix C [4], forming a matrix multiply-accumulate (MMAC).

D ← (A 𝗑 B) + C

In this equation, the left arrow indicates that the result may optionally be accumulated in place, i.e. into the same memory that was storing C, just like the += notation in the code above. Thinking in terms of MMAC operations, we can visualize the matrix multiplication of the tiles as follows:

+= indicates accumulation, x indicates matrix multiplication
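To make the tile-level accumulation concrete, here is a small NumPy sketch (the tile size and matrix shapes are illustrative, chosen so that every dimension divides evenly into tiles):

import numpy as np

T = 2                         # tile (block) size
A = np.random.rand(4, 6)      # [I x J]
B = np.random.rand(6, 8)      # [J x K]
D = np.zeros((4, 8))          # [I x K]

# Accumulate products of tiles, mirroring the row-column pattern in the figure.
for i in range(0, 4, T):
    for k in range(0, 8, T):
        for j in range(0, 6, T):
            D[i:i+T, k:k+T] += A[i:i+T, j:j+T] @ B[j:j+T, k:k+T]

print(np.allclose(D, A @ B))  # True: tiling does not change the result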

Matrix processors — mesh-based parallel computing

To understand how nanophotonics and digital electronics differ for making GEMM accelerators, it is helpful to first discuss the mesh diagrams that are used to describe parallel dataflow processors. In a dataflow processor, different processing elements (PEs) independently perform partial computations, and data moves between processing elements on each tick of the processor’s timing clock. Computation at each PE depends on the upstream elements, i.e. in the diagram below b depends on a, and c depends on b (thus indirectly on a).

Data moves one arrow during each clock period (T), no matter the length of the arrow.

Thus, the arrows that connect PEs in a mesh diagram indicate both a temporal and an arithmetic dependency structure. Each compute node also plays a role in the distributed storage of data by having a small local memory. However, due to the local topology of the mesh, moving data from main memory to an internal PE requires the data to first pass through intermediate PEs; for example, it takes two clock cycles to move data from memory to the central PE in the diagram below.

Data moves between processing elements (PEs) in the direction of the arrows.

Depending on the dataflow processor in question, each node may have varying levels of complexity; a PE might be as complicated as a general purpose CPU or as simple as a single multiply-accumulate (MAC) unit with data routing. When the mesh of PEs is homogeneous, i.e. each PE performs the same partial computation, the mesh is called a “systolic array” [5].

PEs for matrix multiplication with a 2D systolic array. In a spatial accumulation array the values for a are broadcast to the array prior to beginning computation, which is why the PE stores a as internal state. In a temporal accumulation array, the result is stored as internal state, so a readout broadcast must be used.

The pipeline parallelism that is used to feed a systolic matrix multiplier can be a little surprising if you are not used to dataflow meshes. In a spatial accumulation mesh, the values for a tile of A are broadcast to the mesh prior to beginning the computation. Column vectors from the tile of B are then fed into the mesh in a staggered pipeline; on each clock cycle, data flows between the processing units, accumulating the matrix-vector product as it passes through. The animation below shows this pipeline for a [3 𝗑 3] systolic array processing a [3 𝗑 9] input matrix.

Animation of computing a [3 𝗑 3][3 𝗑 9] matrix product with a spatial accumulation systolic array.

The animation above shows that computing this [3 𝗑 3][3 𝗑 9] matrix product takes 14 clock cycles. In general, it takes N timesteps to load the tile of A onto an [N 𝗑 N] mesh and (2*N - 1 + K) clock cycles to compute an [N 𝗑 N][N 𝗑 K] matrix product. In contrast, a processor with a single MAC unit would take N*N*K clock cycles (81 in this example).
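As a quick sanity check on these counts, a small Python helper using the formulas stated above:

def systolic_cycles(N, K):
    # Clock cycles for an [N x N][N x K] product on an N x N spatial accumulation array.
    return 2 * N - 1 + K

def single_mac_cycles(N, K):
    # Clock cycles for the same product with a single MAC unit (one MAC per cycle).
    return N * N * K

print(systolic_cycles(3, 9))    # 14, matching the animation above
print(single_mac_cycles(3, 9))  # 81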

Weight stationary processing

In deep learning applications, we are frequently interested in computing matrix products where one of the dimensions is significantly larger than the other. For instance, computing a 2D convolution as a matrix product using the so-called im2col trick will result in a small “weight” matrix A and a wide “data” matrix B. In such a situation it is useful to perform as much computation as possible while the same weight matrix tile is loaded on the matrix processor, i.e. keeping the “weights stationary” while streaming “data vectors” through the processor. This weight stationary scheme is typically used in spatial accumulation architectures, while an “output stationary” scheme is used in temporal accumulation architectures (see references [3, 7, 8]). In the context of deep learning, this pattern of holding the weights static is especially useful for “inference” workloads, where a pre-trained neural network is used to make predictions on millions of data samples.
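To illustrate why the data matrix ends up wide, here is a minimal im2col sketch for a single-channel image with stride 1 and no padding (the function and shapes are assumptions made for this example):

import numpy as np

def im2col(x, kh, kw):
    # Unroll every kh x kw patch of the 2D image x into a column of the output matrix.
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.zeros((kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i+kh, j:j+kw].reshape(-1)
    return cols

x = np.random.rand(32, 32)      # input image
weights = np.random.rand(8, 9)  # 8 filters of size 3 x 3: a small [8 x 9] weight matrix A
data = im2col(x, 3, 3)          # a wide [9 x 900] data matrix B
out = weights @ data            # the convolution becomes a GEMM; the weights stay stationary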

Nanophotonic matrix processors

So far, we have discussed matrix multiply accumulate operations and how they can be implemented with conventional digital electronics. Now we are going to dive into how Lightmatter’s processor uses light instead of electronics. The word “nanophotonic” — “nano” meaning small and “photonic” meaning having to do with light — is used to describe optical computing elements that are implemented with integrated optical circuits, which can be fabricated using traditional semiconductor/oxide processes.

Computing a matrix product on a nanophotonic matrix processor is quite a bit different from the operation of a digital electronics-based 2D systolic array. However, the pattern of using a weight stationary scheme to compute a stream of products between “weight” tiles of the A matrix and “data” vectors from the B matrix remains the same. In fact, we can represent both a systolic array and a nanophotonic processor with the following dataflow diagram, in which the processing element (PE) computes the product between an [N 𝗑 N] matrix tile and an N-element data vector. The only difference is that with a systolic array the data pipelines are staggered across N cycles, while a nanophotonic processor computes all N elements synchronously.

Dataflow for a matrix-vector processor with an array size of N=6. In a nanophotonic processor the PE computes the matrix-vector product synchronously, while for a systolic array the PE computes with an N-step pipeline.

In digital electronics, data is encoded as a binary string in the on/off state of transistors. This means that the representable numbers are discrete, e.g. integers or floating-point values. In photonics, data is encoded by modulating the amplitude (or phase) of pulses of laser light; this allows representing continuous values, i.e. real numbers. Changing the strength (or direction) of the optical field changes the real number that is represented. Instead of wires guiding electronic currents, silicon structures called waveguides carry the laser light — this is similar to how an optical fiber guides telecommunications signals. In place of the integrated transistor circuits that perform a MAC, nanophotonics offers two primitive computational elements for manipulating this representation — a Mach-Zehnder interferometer, which rotates vectors, and an attenuator, which scales them.

Comparison between digital electronics and photonics representations of operations and numbers.

The Mach-Zehnder interferometer (MZI) is the most important element to understand because it can be modified to build an attenuator. An MZI operates on the light in two waveguides. By interfering the light from the two “arms” of the interferometer, an MZI computes the product between a [2 𝗑 2] unitary matrix, U, and the 2-element data vector, x, represented by the light in the two incoming waveguides. This product Ux rotates the x vector without changing its length. You can think about this as shuffling some of the light from one waveguide to another while still preserving the total amount of light. The rotation angle, θ, determines how much light moves from one arm to another and is controlled by the voltage applied to two adjustable phase shifters.

An MZI applies a rotation matrix to the 2-element data vector carried by the two incoming waveguides.

There are many ways to parameterize a unitary matrix (see reference [6]), and correspondingly there are a number of different physical layouts for building an MZI. The parameterization used above highlights the relationship with the rotation matrices that you may have encountered in a math class.

The second primitive computational element in nanophotonics is the attenuator. One way to implement this is by blocking off the waveguides coming into/out of one arm of an MZI. Taking the light out of one arm allows this attenuator element to scale the data value in a waveguide by a factor of cos θ, i.e. by factors in the range [-1, 1].

An attenuator scales the data value in one incoming waveguide by a factor of cos θ.
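A short NumPy sketch of both primitives, using the rotation-matrix parameterization above (purely illustrative; the hardware performs these operations directly on optical amplitudes):

import numpy as np

theta = 0.3
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 2 x 2 unitary (a rotation)

x = np.array([0.8, 0.5])  # data vector carried by the two incoming waveguides
y = U @ x                 # MZI: rotate the vector without changing its length
print(np.allclose(np.linalg.norm(y), np.linalg.norm(x)))  # True: total light is preserved

a = np.cos(theta) * x[0]  # attenuator: scale one waveguide's value by cos(theta), in [-1, 1]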

So, how do you use MZIs and attenuators to perform the matrix multiplication between a square tile of A and the corresponding tile of B? Researchers in quantum computation / optics have devised a number of different schemes for making meshes of MZIs that can be programmed to perform any [N 𝗑 N] unitary matrix, U, on N waveguides (see references [9, 10, 11]). You can think of this as building up an N-dimensional rotation by performing a series of [2 𝗑 2] rotations in different pairs of directions, i.e. 2D subspaces. If you view each column of B as specifying the coordinates of a point in a high dimensional space, the act of multiplying this vector by the matrix A can be regarded as transforming the coordinate system.
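A sketch of this idea in NumPy: embed a [2 𝗑 2] rotation in a chosen pair of directions and compose several of them; the product remains unitary. (This only illustrates the principle, not the specific mesh layouts of references [9, 10, 11].)

import numpy as np

def planar_rotation(n, i, j, theta):
    # Identity matrix with a 2 x 2 rotation embedded in the (i, j) plane.
    R = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    R[i, i], R[j, j] = c, c
    R[i, j], R[j, i] = -s, s
    return R

rng = np.random.default_rng(0)
U = np.eye(4)
for i, j in [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3)]:
    U = planar_rotation(4, i, j, rng.uniform(0, 2 * np.pi)) @ U

print(np.allclose(U.T @ U, np.eye(4)))  # True: the composite is still a rotation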

By using a factorization called the singular value decomposition, it is possible to interpret any matrix multiplication as first performing a rotation, then scaling each axis, and finally performing another rotation. Since a unitary matrix performs a rotation and a diagonal matrix scales each row independently, this decomposition allows us to rewrite any matrix tile, M, as a product of two unitary matrices (U, V) and a diagonal matrix (Σ).

M = U Σ V†: singular value decomposition of a matrix tile M.

By choosing an appropriate scaling for our number representation, we can implement the diagonal matrix Σ with attenuators. Thus, any matrix tile M can be implemented with two MZI meshes and a column of attenuators. In the diagram below we assume that light propagates from left to right, so the order of the MZI meshes corresponding to U and V is reversed relative to the equation above.
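A NumPy sketch of this decomposition; the checks confirm that the two factors we would map onto MZI meshes are unitary and that the diagonal factor carries the scaling:

import numpy as np

M = np.random.rand(6, 6)     # an arbitrary [6 x 6] weight tile
U, s, Vt = np.linalg.svd(M)  # M = U @ diag(s) @ Vt

print(np.allclose(U @ np.diag(s) @ Vt, M))  # True: the tile is reconstructed
print(np.allclose(U.T @ U, np.eye(6)))      # True: U is unitary (one MZI mesh)
print(np.allclose(Vt @ Vt.T, np.eye(6)))    # True: V is unitary (the other MZI mesh)
# diag(s) is diagonal: with an appropriate overall scaling it maps onto the column of attenuators.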

Example array of photonics elements that can be programmed to implement any [6 𝗑 6] matrix. A careful observer will note that there are N*(N-1)/2 MZIs for each unitary matrix, for a total of 2*N*N phase shifters. However, since the two phase shifters in each MZI are set by a single angle θ, there are only N*N unique phases, the same number of free parameters as the original matrix tile, M.

It’s important to note that the lines and boxes in the diagram above have a very different meaning than in the previous dataflow processor diagrams; the entire computation is performed as the light travels from the input to the output of the MZI/attenuator mesh. Since this “time-of-flight” is only about 100 picoseconds, we are able to compute the full output for a matrix-vector product in a single clock cycle!

Dataflow in a nanophotonic matrix processor.

The future of AI hardware accelerators

ML is both broad enough to apply to many domains and narrow enough to benefit from domain-specific architectures, such as Google’s Tensor Processing Unit (TPU). Moreover, the growth in demand for ML computing exceeds Moore’s law at its peak, just as it is fading. Hence, ML experts and computer architects must work together to design the computing systems required to deliver on the potential of ML.
— Jeff Dean, David Patterson, and Cliff Young, IEEE Micro 38 (2), 21 (2018)

Deep learning is revolutionizing the way that problems are solved in many fields, from enhancing diagnostic medical imaging to powering speech recognition systems and search engines. At the same time, the need for ever more powerful computers to train the largest AI systems has been increasing exponentially — more than 5x faster than Moore’s law! This exceptional computational resource requirement is one of the factors that limits advances in artificial intelligence; all this computation demands considerable energy with correspondingly high financial and environmental costs [12].

The computational throughput achievable with digital electronics is limited not just by how many transistors can be put in a given area of silicon, but by how quickly the chip can be cooled. One of the largest operational costs of a data center is the electricity used to cool the processors. With modern transistor densities, it is not physically possible to cool a chip fast enough to prevent it from melting if all of the transistors on the chip are simultaneously powered on (leading to so-called dark silicon). This also limits how fast the processor’s clock frequency can be (reference [13] offers an approachable discussion). Thus, thermal management is one of the central factors that limits how much computation a chip can perform.

Unlike switching electronic transistors on and off, passing light through the optical elements in a nanophotonic processor does not generate heat. Transistors are still required for the memory used to store data as well as the control electronics that manage the processor. However, the key is that we are able to replace all of the transistors for N*N MACs with just 2N elements for converting the data vector into/out of the optical signal. (There are also electronics for re-programming the phase shifters whenever the weight matrix tile is changed, but these modulate less frequently and are thus less of a power constraint). As a result, our processor uses orders of magnitude less energy per matrix calculation. Since a nanophotonic matrix processor does not generate nearly as much heat, it is theoretically possible to run the processor much faster than the latest electronic chips. By building a processor that is both faster and more energy-efficient than electronic accelerator hardware, we hope to make a positive contribution to the advancement of artificial intelligence and reduce the carbon footprint in the process!

This article contains contributions by Martin Forsythe, Siddharth Trehan, Victoria Pisini, Michael Gould, Tomo Lazovich, Tyler Kenney, Scott McKenzie, Darius Bunandar, and Nicholas Harris.

References, footnotes, and further reading

[1] P. Warden, Why GEMM is at the heart of deep learning?
[2] Google’s TPU: arXiv:1709.03395
[3] MIT’s Eyeriss project: eyeriss.mit.edu & arXiv:1807.07928
[4] Strictly speaking the BLAS standard for GEMM also allows scaling by numbers ⍺ and β, D←⍺(A𝗑B)+βC, and optionally allows transposing or conjugating the matrices A and B, but deep learning accelerators are typically designed to perform a simple MMAC.
[5] H.T. Kung, Why systolic architectures? IEEE Computer 15 (1), 37 (1982)
[6] H. Führ and Z. Rzeszotnik (2018) doi:10.1016/j.laa.2018.02.017
[7] C.-P. Lu (2017) arXiv:1705.05983
[8] C.-P. Lu Should we all embrace systolic arrays?
[9] M. Reck et al. (1994) doi:10.1103/PhysRevLett.73.58
[10] W. Clements et al. (2016) arXiv:1603.08788
[11] H. de Guise et al. (2018) doi:10.1103/PhysRevA.97.022328
[12] E. Strubell et al. (2019) arXiv:1906.02243
[13] P.P. Mattsson, Why Haven’t CPU Clock Speeds Increased in the Last Few Years?
