We believe that AI has the potential to change the world in fundamental ways that exceed anything seen so far. To date, however, that potential has been restricted by the high cost and technical difficulty of AI development, limiting innovation to a few powerful and well-resourced companies. We take a completely new approach, creating AI-optimized chips that rely on analog computation to deliver revolutionary power and performance that unlocks human innovation. To bring this vision to life, we are building a unified hardware and software platform, which makes it much easier and more affordable to deploy accurate, consistent AI in the real world, from the data center to the edge device.
This article describes the types of problems that our software engineers are solving. It illustrates what the end product is and what steps are needed to get there. Hopefully, after reading it, you will know whether your skills and interests line up with one of these areas, and you will reach out to our recruiting team.
Programming Mythic’s chip is like programming a supercomputer: amazingly parallel, with all of the fun that entails. There are many parallel RISC-V processors (tens to thousands, depending on the product) managing a sea of compute resources and accelerators. Coordinating these processors to work together is difficult: we need to avoid deadlock, and we need to balance resources to keep the system performing near peak theoretical efficiency. Instead of programming this supercomputer to tackle a typical problem like simulating weather events or a nuclear reaction, however, we program it to perform neural network inference quickly and efficiently. To ensure that the platform is accessible to the broadest audience possible, there is an additional wrinkle: the code for all of these processors needs to be automagically generated from a customer’s neural network description, so that the customer does not need to program the supercomputer themselves.
Let’s get into the details.
Mythic’s chip is an artificial intelligence inference accelerator with statically configured weight memory. This means that:
- It is a chip sitting on a PCIe card inside a workstation or server, or it is a chip sitting next to a microcontroller in a system like a smart camera
- We are running already-trained algorithms like neural networks (inference-only)
- The chip is configured to run a single set of applications at a time, and those applications change infrequently, like a firmware update for your phone. Infrequent changes are the common use case in most systems; for example, a home security system watching for intruders will not suddenly need to play Go.
To run an application, our chip needs a firmware binary that contains all of the code and configuration of that application. When new firmware is loaded, the program code goes into SRAM and the neural network weights go into the matrix multiplier units’ integrated flash memory. When booting a firmware binary for a second (or nth) time, we still have the weights from before since flash memory is non-volatile, so only the program code needs to be loaded. Creating that firmware binary from a neural network description is difficult, but we shoulder that burden so that our customers don’t need to. While the chip is a fancy piece of glass hidden in a piece of plastic hidden away somewhere in the system, the compiler tool-chain is something our customers will actually touch each day. It needs to be intuitive, simple, and powerful so that our customers enjoy using our product (and keep using it). Luckily, our system has a few properties that will make that goal manageable.
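The boot behavior described above can be sketched in a few lines. This is only a toy model, not the actual loader: the `Firmware` and `Chip` classes are hypothetical, and identifying a weight set by its hash is an assumption made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Firmware:
    """Hypothetical firmware image: program code plus neural network weights."""
    code: bytes
    weights: bytes

    @property
    def weights_id(self) -> str:
        # Assumed mechanism: identify a weight set by its hash.
        return hashlib.sha256(self.weights).hexdigest()


@dataclass
class Chip:
    """Toy chip model: volatile SRAM for code, non-volatile flash for weights."""
    sram: bytes = b""
    flash: bytes = b""
    flash_id: str = ""
    flash_writes: int = 0

    def power_cycle(self) -> None:
        # SRAM loses its contents; flash keeps the weights.
        self.sram = b""

    def boot(self, fw: Firmware) -> None:
        # Program code is always loaded into SRAM.
        self.sram = fw.code
        # Weights are written only when they are not already in flash.
        if self.flash_id != fw.weights_id:
            self.flash = fw.weights
            self.flash_id = fw.weights_id
            self.flash_writes += 1


chip = Chip()
fw = Firmware(code=b"riscv-program", weights=b"\x01\x02\x03")
chip.boot(fw)        # first boot: code and weights are loaded
chip.power_cycle()
chip.boot(fw)        # second boot: only the code is reloaded
print(chip.flash_writes)  # 1
```

In this model the second boot touches only SRAM, which is the property that makes repeated boots cheap.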
The chip architecture is tile-based, where each tile includes multiple units: a Matrix Multiply Accelerator (MMA), a RISC-V processor, a SIMD engine, SRAM, and a Network-on-Chip (NoC) router. Depending on the product, there will be dozens to hundreds of tiles connected together in a 2D mesh, and multiple chips connected via PCIe. The MMA provides a competitive advantage: by using analog computing coupled with embedded flash memory, each MMA delivers a huge amount of performance, approximately 250 billion multiply-accumulate operations per second, at a very low energy cost (0.25 pJ/MAC, see here for details). The SIMD unit provides digital operations that the MMA cannot perform, such as MaxPool or AvgPool. The SRAM holds program code and data buffers. The RISC-V processor manages the scheduling for the tile. To connect off-chip, there is also a PCIe interface. For more details, see our Hot Chips 2018 presentation.
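To put the two MMA figures together, here is a small back-of-the-envelope calculation. The 250 GMAC/s and 0.25 pJ/MAC values come from the text; the tile count of 50 is a hypothetical example, and the result covers only MMA compute energy, not SRAM, RISC-V, SIMD, or I/O power.

```python
def power_mw(gmacs_per_s: float, pj_per_mac: float) -> float:
    """Power in milliwatts.

    (1e9 MAC/s) * (1e-12 J/MAC) = 1e-3 J/s, so GMAC/s times pJ/MAC gives mW.
    """
    return gmacs_per_s * pj_per_mac


MMA_GMACS = 250         # from the text: about 250 billion MAC/s per MMA
MMA_PJ_PER_MAC = 0.25   # from the text: 0.25 pJ/MAC
N_TILES = 50            # hypothetical tile count, one MMA per tile

per_mma_mw = power_mw(MMA_GMACS, MMA_PJ_PER_MAC)
total_w = per_mma_mw * N_TILES / 1000
total_tmacs = MMA_GMACS * N_TILES / 1000

print(f"per MMA: {per_mma_mw} mW")                      # per MMA: 62.5 mW
print(f"{N_TILES} tiles: {total_tmacs} TMAC/s, {total_w} W")  # 50 tiles: 12.5 TMAC/s, 3.125 W
```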
To have a fully functioning neural network on the chip, we need to configure each of these units across all 50+ tiles. The neural network weights need to be allocated into the MMA flash memory space, the intermediate data buffers between the neural network stages need to be allocated to the SRAM space, and parallelized RISC-V program code must be generated for each node of the neural network graph. Additionally, there are many knobs, such as the accuracy of each stage/unit (1-bit to 8-bit) and which activation function to perform, which allow us to optimize for speed, power, and memory space.
The system is designed to execute a dataflow graph, which (not coincidentally) is how AI inference is organized. Each neural network has an input matrix (e.g., an image), an output matrix (e.g., a set of labels), and a series of matrix operations in between. We have template RISC-V programs for executing each of the basic operations in between, for instance, a dense matrix multiply, a convolutional matrix multiply, or a MaxPool operation. In each case, the RISC-V program needs to read from some buffer, perform the operation, and write the result to a different buffer. Using these RISC-V programs as building blocks, we can draw a new execution graph representing the data buffers and the units (MMA, SIMD, etc.) involved in those operations.
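The "read a buffer, perform the operation, write a different buffer" pattern can be illustrated with a toy dataflow executor. This is a sketch in plain Python rather than RISC-V code; the `Op` record, the `dense` and `max_pool` templates, and the buffer names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


def dense(weights: List[Vector]) -> Callable[[Vector], Vector]:
    """Template for a dense layer, y = W x (the kind of work an MMA would do)."""
    def run(x: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return run


def max_pool(window: int) -> Callable[[Vector], Vector]:
    """Template for 1-D max pooling (the kind of work a SIMD unit would do)."""
    def run(x: Vector) -> Vector:
        return [max(x[i:i + window]) for i in range(0, len(x), window)]
    return run


@dataclass
class Op:
    name: str
    unit: str   # which unit executes the operation
    src: str    # buffer to read from
    dst: str    # buffer to write to
    run: Callable[[Vector], Vector]


def execute(ops: List[Op], buffers: Dict[str, Vector]) -> Dict[str, Vector]:
    """Run the operations in order, each reading one buffer and writing another."""
    for op in ops:
        if op.src == op.dst:
            raise ValueError(f"{op.name}: source and destination must differ")
        buffers[op.dst] = op.run(buffers[op.src])
    return buffers


ops = [
    Op("dense0", "MMA", "in", "buf0",
       dense([[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]])),
    Op("pool0", "SIMD", "buf0", "out", max_pool(2)),
]
buffers = execute(ops, {"in": [1.0, 2.0, 3.0, 4.0]})
print(buffers["buf0"])  # [1.0, 2.0, 3.0, 7.0]
print(buffers["out"])   # [2.0, 7.0]
```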
The first step is converting the neural network graph to an equivalent execution graph. At the beginning, we need a few pieces of information from the user:
- A description of the network (Keras, TensorFlow, etc)
- What chip(s) are being used (what resources are available?)
- What are the power and/or performance goals (5W limit? need 30fps?)
- What interfaces are being used (PCIe? USB? I2C? HDMI?)
The pieces used right away are the network description, the chip(s) being used, and the interfaces being used: each of those factors into the initial execution graph. This step is more or less a direct conversion from one graph to the other. The power and/or performance goals will be used later during optimization.
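A minimal sketch of such a direct conversion is shown below. It assumes a purely linear network given as a list of `(name, type)` pairs, which is a stand-in for a real Keras or TensorFlow description; the `ExecNode` record, the `UNIT_FOR_LAYER` table, and the buffer naming are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed mapping from layer type to the unit that executes it.
UNIT_FOR_LAYER = {"conv": "MMA", "dense": "MMA", "maxpool": "SIMD", "avgpool": "SIMD"}


@dataclass(frozen=True)
class ExecNode:
    name: str
    unit: str
    src_buffer: str
    dst_buffer: str


def to_execution_graph(layers: List[Tuple[str, str]], interface: str) -> List[ExecNode]:
    """Convert a linear list of layers into a chain of execution nodes.

    The chosen interface (for example PCIe) appears as the first and last node,
    moving data between the host and the on-chip buffers.
    """
    nodes = [ExecNode("input", interface, "host", "buf0")]
    for i, (name, kind) in enumerate(layers):
        if kind not in UNIT_FOR_LAYER:
            raise ValueError(f"unsupported layer type: {kind}")
        nodes.append(ExecNode(name, UNIT_FOR_LAYER[kind], f"buf{i}", f"buf{i + 1}"))
    nodes.append(ExecNode("output", interface, f"buf{len(layers)}", "host"))
    return nodes


network = [("conv1", "conv"), ("pool1", "maxpool"), ("fc", "dense")]
graph = to_execution_graph(network, interface="PCIe")
for node in graph:
    print(f"{node.src_buffer} -> [{node.name} on {node.unit}] -> {node.dst_buffer}")
```

Networks with branches and merges would need a real graph structure rather than a list, but the one-to-one flavor of the conversion stays the same.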
Now that we are in our execution graph, there is a lot to do. We need to convert the weights from floating point to the integer math our system uses, we need to map the elements of the execution graph to physical locations on the chip(s), we need to optimize for power and performance, and we need to generate the final firmware. Additionally, we need a few tools to help us: a validity checker to confirm that the current mapping is physically possible, an equivalence checker to confirm that the execution graph is equivalent to the original network graph (formally or empirically), and a performance checker to estimate the performance and power consumption of the system.
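As an illustration of what a validity checker might test, the sketch below verifies that a mapping fits the per-tile flash and SRAM limits. The capacities, the `Placement` record, and the error messages are all hypothetical; a real checker would cover many more constraints.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TileLimits:
    flash_weights: int  # weights that fit in one MMA's flash (made-up unit)
    sram_bytes: int     # SRAM available for buffers and code


@dataclass(frozen=True)
class Placement:
    node: str
    tile: int
    weights: int
    sram_bytes: int


def validate(placements: List[Placement], n_tiles: int, limits: TileLimits) -> List[str]:
    """Return a list of problems; an empty list means the mapping is physically possible."""
    errors: List[str] = []
    flash: Dict[int, int] = defaultdict(int)
    sram: Dict[int, int] = defaultdict(int)
    for p in placements:
        if not 0 <= p.tile < n_tiles:
            errors.append(f"{p.node}: tile {p.tile} does not exist")
            continue
        flash[p.tile] += p.weights
        sram[p.tile] += p.sram_bytes
    for tile in sorted(set(flash) | set(sram)):
        if flash[tile] > limits.flash_weights:
            errors.append(f"tile {tile}: flash over capacity ({flash[tile]} > {limits.flash_weights})")
        if sram[tile] > limits.sram_bytes:
            errors.append(f"tile {tile}: SRAM over capacity ({sram[tile]} > {limits.sram_bytes})")
    return errors


limits = TileLimits(flash_weights=1000, sram_bytes=4096)
bad = [
    Placement("conv1", tile=0, weights=800, sram_bytes=2048),
    Placement("conv2", tile=0, weights=300, sram_bytes=1024),
    Placement("fc", tile=7, weights=500, sram_bytes=512),
]
good = [
    Placement("conv1", tile=0, weights=800, sram_bytes=2048),
    Placement("conv2", tile=1, weights=300, sram_bytes=1024),
    Placement("fc", tile=2, weights=500, sram_bytes=512),
]
errors = validate(bad, n_tiles=4, limits=limits)
print(errors)
print(validate(good, n_tiles=4, limits=limits))  # []
```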
Similar to how we co-design our hardware and software, there are many opportunities inside the compiler to have these tools work together to create interesting outcomes. They form an iterative search process: they modify the execution graph, check how well it performs, and either keep the changes or back out to try another. Rinse (equivalency check) and repeat. This is possible since inference uses an already trained graph as its input, and the entire process is performed offline. Now, we will look at the individual boxes shown above.
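The modify, check, keep-or-back-out loop can be sketched as a simple greedy search. This is one possible strategy under stated assumptions, not the actual optimizer: the `search` function and its callbacks are hypothetical, the "graph" in the toy example is just a tuple of replica counts per layer, and the equivalence check is a trivial stand-in.

```python
from typing import Callable, Iterable, TypeVar

G = TypeVar("G")


def search(graph: G,
           propose: Callable[[G], Iterable[G]],
           is_valid: Callable[[G], bool],
           is_equivalent: Callable[[G, G], bool],
           estimate: Callable[[G], float],
           max_rounds: int = 100) -> G:
    """Greedy search: keep a change only if it is valid, equivalent, and faster."""
    original = graph
    best_cost = estimate(graph)
    for _ in range(max_rounds):
        improved = False
        for candidate in propose(graph):
            if not is_valid(candidate):
                continue  # back out: mapping is not physically possible
            if not is_equivalent(original, candidate):
                continue  # back out: result would differ from the original network
            cost = estimate(candidate)
            if cost < best_cost:
                graph, best_cost, improved = candidate, cost, True
                break     # keep the change and start a new round
        if not improved:
            break
    return graph


# Toy problem: choose how many replicas each layer gets.
WORK = (8, 2, 2)   # made-up work units per layer
N_TILES = 6        # made-up resource limit


def propose(replicas):
    for i in range(len(replicas)):
        yield replicas[:i] + (replicas[i] + 1,) + replicas[i + 1:]


def is_valid(replicas):
    return sum(replicas) <= N_TILES


def is_equivalent(original, candidate):
    return len(original) == len(candidate)  # stand-in for a real equivalence check


def estimate(replicas):
    # The slowest layer sets the pace of the pipeline.
    return max(w / r for w, r in zip(WORK, replicas))


best = search((1, 1, 1), propose, is_valid, is_equivalent, estimate)
print(best)  # (4, 1, 1)
```

A greedy loop like this can stall on plateaus where no single change helps, which is one reason a real optimizer would likely need smarter proposals.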
Our quantizer tool handles the conversion of floating point to fixed-point math, as well as hardening the network to 8b analog noise and analyzing where the accuracy can be reduced from 8b. There are a number of options for conversion. The simplest version of this process is feed-forward, where floating point ranges are re-scaled to integer values. A more complex process uses re-training with the original training dataset, which also allows the user to assess the benefits of using 1–8b for each layer, which can more than double the performance or halve the power consumption of the application. Once either of these processes is complete, the network is in the correct number format and hardened against analog noise.
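The feed-forward option, re-scaling a floating point range to integer values, might look roughly like the sketch below. It assumes symmetric signed quantization with a single scale per tensor, which is one common scheme and not necessarily the one our quantizer uses; the function names are hypothetical, and 1-bit values would need a different scheme, so the sketch supports 2 to 8 bits only.

```python
from typing import List, Tuple


def quantize(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-qmax, qmax] using one scale factor."""
    if not 2 <= bits <= 8:
        raise ValueError("this sketch supports 2 to 8 bits")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integers."""
    return [q * scale for q in quantized]


def max_error(a: List[float], b: List[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


weights = [-1.0, -0.3, 0.0, 0.25, 1.0]
q8, s8 = quantize(weights, bits=8)
q4, s4 = quantize(weights, bits=4)
err8 = max_error(weights, dequantize(q8, s8))
err4 = max_error(weights, dequantize(q4, s4))
print(q8)           # [-127, -38, 0, 32, 127]
print(err8 < err4)  # True: fewer bits means a larger rounding error
```

The error comparison at the end hints at the trade-off the re-training flow explores: lower precision saves resources but costs accuracy unless the network is adapted to it.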
The mapping process needs to allocate execution graph elements to physical resources in the system, which are limited both for the system as a whole and for each tile. They include the MMA memory space (which holds the neural network weights) and the SRAM buffer and code space. Some operations may be too big to fit on a single tile, so we need to split them into multiple pieces, map them to multiple tiles, and use our equivalence checker to confirm that the mapping is correct. Engineers who have worked on compilers targeting many-core systems might be familiar with these types of problems.
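Splitting an oversized operation and then checking equivalence empirically can be sketched as follows. The example splits a dense layer's weight matrix by rows so that each piece fits a hypothetical per-tile row limit, then compares the split result against the original on random integer inputs. The function names and the row-based capacity model are assumptions.

```python
import random
from typing import List

Matrix = List[List[int]]


def matvec(w: Matrix, x: List[int]) -> List[int]:
    """Reference dense layer: y = W x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def split_rows(w: Matrix, max_rows_per_tile: int) -> List[Matrix]:
    """Cut the weight matrix into pieces that each fit on one tile."""
    return [w[i:i + max_rows_per_tile] for i in range(0, len(w), max_rows_per_tile)]


def run_split(pieces: List[Matrix], x: List[int]) -> List[int]:
    """Each piece would run on its own tile; outputs are concatenated in order."""
    out: List[int] = []
    for piece in pieces:
        out.extend(matvec(piece, x))
    return out


def empirically_equivalent(w: Matrix, pieces: List[Matrix],
                           trials: int = 100, seed: int = 0) -> bool:
    """Compare the split version against the original on random inputs."""
    rng = random.Random(seed)
    n_inputs = len(w[0])
    for _ in range(trials):
        x = [rng.randint(-128, 127) for _ in range(n_inputs)]
        if matvec(w, x) != run_split(pieces, x):
            return False
    return True


rng = random.Random(1)
weights = [[rng.randint(-8, 7) for _ in range(3)] for _ in range(5)]
pieces = split_rows(weights, max_rows_per_tile=2)
print(len(pieces))                              # 3
print(empirically_equivalent(weights, pieces))  # True
```

An empirical check like this only samples the input space; a formal check would prove equivalence for all inputs.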
To illustrate this process, we can show the logical graph of ResNet-18:
The network has 18 neural network layers, organized into six groups. The first and last groups have only a single layer each, with the final layer being fully connected. The middle four groups each have four layers. We can visualize a basic mapping of this by putting each layer on its own tile and showing the dataflow between them (below). This figure also highlights why a network-on-chip performs so well for this type of application: each of the data transfers uses a single link and does not collide with any other transfer.
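The single-link property can be checked on a simple model of this mapping. The sketch below treats the network as a plain chain of 18 layers (ignoring ResNet's skip connections) and lays the layers out in a serpentine pattern on a mesh that is 5 tiles wide. Both the serpentine layout and the mesh width are assumptions for illustration, not the layout in the figure.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def serpentine_placement(n_layers: int, mesh_width: int) -> List[Coord]:
    """Place consecutive layers on neighboring tiles, reversing every other row."""
    coords = []
    for i in range(n_layers):
        row, col = divmod(i, mesh_width)
        if row % 2 == 1:
            col = mesh_width - 1 - col  # odd rows run right to left
        coords.append((row, col))
    return coords


def hops(a: Coord, b: Coord) -> int:
    """Number of mesh links between two tiles (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


GROUPS = [1, 4, 4, 4, 4, 1]  # layers per group, as described in the text
n_layers = sum(GROUPS)
placement = serpentine_placement(n_layers, mesh_width=5)
transfers = [(placement[i], placement[i + 1]) for i in range(n_layers - 1)]

single_link = all(hops(src, dst) == 1 for src, dst in transfers)
no_shared_links = len(set(transfers)) == len(transfers)
print(n_layers, single_link, no_shared_links)  # 18 True True
```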
To improve the performance of our basic mapping, the tools first need to find where the bottlenecks are. We begin with static analysis (the static analyzer) and, once that looks good, move to simulation-based analysis (the simulator). For static analysis, we know how many MMA operations are scheduled for each tile, how many SRAM operations, how many RISC-V cycles are needed, and so on, so we can find places where units are too over-subscribed to hit our performance target and spread the workload to neighboring units. The simulator uses an architectural model of the system to add in the element of time: where the static analyzer looks at average loading, the simulator looks for congestion at various points in time. Both feed the optimizer, which then modifies the graph to achieve higher performance. After the modifications, we need to call the equivalence checker again.
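A static analysis of this kind boils down to comparing the work scheduled on each unit with what that unit can deliver at the target frame rate. In the sketch below, the MMA capacity comes from the text, while the RISC-V and SRAM capacities and all of the workload numbers are made up; `find_oversubscribed` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def find_oversubscribed(ops_per_frame: Dict[str, Dict[str, float]],
                        capacity_per_s: Dict[str, float],
                        target_fps: float) -> List[Tuple[str, str, float]]:
    """Return (tile, unit, utilization) for every unit loaded above 100%."""
    hot = []
    for tile, units in ops_per_frame.items():
        for unit, ops in units.items():
            utilization = ops * target_fps / capacity_per_s[unit]
            if utilization > 1.0:
                hot.append((tile, unit, utilization))
    # Worst offender first, so the optimizer knows where to start.
    return sorted(hot, key=lambda item: item[2], reverse=True)


capacity = {
    "MMA": 250e9,   # MAC/s, from the text
    "RISCV": 1e9,   # cycles/s, made-up
    "SRAM": 10e9,   # accesses/s, made-up
}
workload = {        # operations per frame, made-up
    "tile0": {"MMA": 10e9, "RISCV": 20e6, "SRAM": 100e6},
    "tile1": {"MMA": 4e9, "RISCV": 10e6, "SRAM": 50e6},
}
hot = find_oversubscribed(workload, capacity, target_fps=30)
for tile, unit, utilization in hot:
    print(f"{tile} {unit}: {utilization:.0%} of capacity")
```

Because it uses only totals, this analysis sees average loading; congestion that appears only at particular moments needs the simulator.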
As an example of this process, imagine that the static analyzer found that the first layer was our performance bottleneck, which is expected since it tends to have the highest number of convolutions. To mitigate that bottleneck, we can duplicate the first layer onto two MMAs and split the workload between them. To make that modification, we duplicate the node in the execution graph, modify the RISC-V programs to split the workload between the two copies, and then re-merge the data before the next stage. This requires a modification of the physical mapping and another call to the equivalence checker.
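The duplicate, split, and re-merge idea can be shown on a 1-D convolution. In this sketch both replicas hold the same kernel, each computes half of the output positions, and the results are concatenated. It is a simplified stand-in for a real convolutional layer, and the function names are hypothetical.

```python
import random
from typing import List


def conv1d(x: List[int], kernel: List[int]) -> List[int]:
    """Reference 1-D convolution (valid positions only)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]


def conv1d_two_replicas(x: List[int], kernel: List[int]) -> List[int]:
    """Split the output positions between two copies of the layer, then merge."""
    k = len(kernel)
    n_out = len(x) - k + 1
    mid = n_out // 2
    # Replica A computes outputs [0, mid); it needs k - 1 extra input samples.
    first = conv1d(x[:mid + k - 1], kernel)
    # Replica B computes outputs [mid, n_out).
    second = conv1d(x[mid:], kernel)
    return first + second  # re-merge before the next stage


rng = random.Random(0)
signal = [rng.randint(-128, 127) for _ in range(32)]
kernel = [1, 0, -1]
equivalent = conv1d(signal, kernel) == conv1d_two_replicas(signal, kernel)
print(equivalent)  # True
```

Note that the two replicas need overlapping input at the split point, which is the kind of detail the modified RISC-V programs have to handle.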
This continues until we either hit our performance target or we run out of resources. Other resources on the chip may also be overloaded and require modifications to the execution graph or its mapping. For instance, a network connection may be overloaded and a re-shuffling of the mapping may alleviate the congestion.
In the case of optimizing for maximum performance, we know whether we have done a good job by seeing what percentage of peak performance we were able to achieve. Many systems achieve about 10% of peak performance when running real applications, as can be seen in Google’s TPU paper (below). For a well-optimized application, we believe that we can achieve much higher than that. If the application were designed specifically for the chip, it could theoretically hit ~100% of peak performance.
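The percentage-of-peak metric is a simple ratio of delivered to theoretical throughput. In the sketch below, the per-tile peak of 250 GMAC/s comes from the text; the tile count, the MACs per inference, and the inference rate are made-up numbers chosen to land on 10%.

```python
def fraction_of_peak(macs_per_inference: float,
                     inferences_per_s: float,
                     n_tiles: int,
                     peak_macs_per_tile: float = 250e9) -> float:
    """Achieved MAC throughput divided by the theoretical peak."""
    achieved = macs_per_inference * inferences_per_s
    peak = n_tiles * peak_macs_per_tile
    return achieved / peak


# Made-up example: 2 GMAC per inference at 625 inferences/s on 50 tiles.
utilization = fraction_of_peak(2e9, 625, n_tiles=50)
print(f"{utilization:.0%} of peak")  # 10% of peak
```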
Making this whole process automatic is challenging. Although it looks simple enough for a linear network like ResNet, we expect networks to get increasingly complex over time. For instance, check out GoogLeNet (which actually came out before ResNet): it has some interesting expansions and merges.
After we are satisfied with our mapping, we need to generate the final firmware binary. This includes all of the RISC-V programs, the MMA weights, the configurations for various units, and so on. It is wrapped up in boot code that allows it to be loaded onto the system. The final binary also comes with drivers for the host system.
The most exciting part of this system is how vertically integrated it is. The things we learn from the compiler team and the deep learning team feed back into the chip design. Highly parallelized analog computing is a new area, and there is plenty of low-hanging fruit for improvement.
To wrap up, we have lots of development areas, including mapping, performance simulation, optimization, equivalence checking, RISC-V compilation, and parallel computing. These are the tools that our customers will be using on a daily basis, so they need to be rock solid and intuitive. To that end, we are looking for engineers for quality assurance, infrastructure, user interfaces, and applications. For more information, check out our job listings, or email firstname.lastname@example.org