A homomorphic FPGA implementation of the Intel 4004 — Part 1

Thomas De Cnudde
Zama
Nov 15, 2021


Exactly 50 years ago, on 15 November 1971, Intel launched a product that would radically change the computing industry. The Intel 4004, today known as the world’s first commercial microprocessor, was the starting point for everything smart or connected that we use today.

The first ad for a microprocessor ran in Electronics News magazine, November 15, 1971. It introduced Intel’s first microprocessor: the 4004.

Inspired by the work of these pioneers, our team at Zama set out to apply our technology to this first computer-on-a-chip in order to make it fully homomorphic.

If we successfully complete this journey of creating a homomorphic processor inspired by the Intel 4004, we will be able to evaluate programs on private data. This computer would only ever see encrypted values and would never be able to decrypt them. If such a computer were deployed in the cloud, any information leaked in a data breach would be useless to the attacker. This cloud scenario raises an obvious question: why would we offload computations on 4-bit values to a cloud if we can easily perform them on our own devices, away from possibly malicious third parties and hackers? Let’s start with a bit of history and then loop back.

Computers 50 years ago

The Intel 4004 microprocessor is part of the “Micro Computer Set 4” (MCS-4) family of chips. At the heart of the MCS-4 is the 4004 central processing unit (CPU), a 3 x 4 mm chip (12 mm²) integrating about 2300 transistors, with a 4-bit datapath and a clock frequency of about 750 kHz. Compared to a CPU in a modern-day mobile system, the numbers look quite unimpressive: a modern chip can hold 15 billion transistors (x6.5 million) on an 88 mm² area (about x7) and computes with a 64-bit datapath (x16) at a 3 GHz clock frequency (x4000). The number of cores has also grown from 1 to as many as 8 on a mobile chip, permitting more work to be done simultaneously.

The cosmic scaling in transistor count is attributable to Moore’s Law: the number of transistors that can be integrated on a chip doubles about every two years. The smaller but still impressive factors are limited by the dictates of nature: at clock frequencies much above 3 GHz, so much heat is generated on the chip that it can no longer be cooled efficiently. To keep performance increasing steadily from generation to generation, hardware designers therefore increase the number of cores rather than the clock frequency. With modern fabrication processes, the whole MCS-4 could easily be integrated on a single chip, and manufacturers now do exactly this for small computers. Such chips are branded as microcontrollers rather than microprocessors.

The Micro Computer Set-4 Family (MCS-4)

The MCS-4 was conceived out of an assignment by Busicom, a Japanese calculator manufacturer. Busicom contacted Intel, at the time a young company producing computer memory, to design a set of 12 chips for a broad range of calculators. In the early days of silicon computing, chips had to remain small and simple to make their design, manufacturing, and programming easy. As a guiding rule, a system is ideally composed of a small number of chips, as each chip requires a substantial investment to create. Commercially, the more flexible the functionality of a chip, the more sets could be sold for integration in a variety of applications and systems. With these constraints, the team at Intel proposed a family of just four chips.

The Intel MCS-4 family of chips. For now, only the elements we discuss here are shown inside the 4004 CPU.

The 4001 chip is a 256-byte Read-Only Memory (ROM) for storing application programs. It also functions as the input/output (I/O) port of the system. An application would be loaded in the factory by Intel and once programmed, it could not be changed.

The 4002 chip is a 320-bit Random-Access Memory (RAM) for storing processed data. It additionally provides a 4-bit output port for the system.

The 4003 chip is a shift register and is used to expand the number of I/O ports.

All three support the 4004 CPU chip, which coordinates the whole system through 45 instructions.

Instructing the CPU to Add Numbers

An instruction set is like a dictionary of commands containing all the operations we can instruct the CPU to execute. The instructions of the 4004 can broadly be categorised into these classes: compute, control, memory, and input/output. Most instructions consist of two parts: an operation code and an operand. The operation code, or opcode, specifies the operation to be performed, whereas the operand specifies the data that will be processed. An example is shown here:

An instruction is translated into computer-readable code by an assembler. 'wxyz' is the binary value of 'i'.
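The translation in the caption can be sketched in a few lines of Rust. This is a minimal illustration, not the original assembler: the function names are hypothetical, and the opcode value 1000 for ADD is taken from the published 4004 instruction set rather than from this post, so treat it as an assumption.

```rust
/// Opcode nibble for ADD on the 4004 (binary 1000). Assumption of this
/// sketch, based on the published instruction set.
const OPCODE_ADD: u8 = 0b1000;

/// Assemble "ADD i" into one machine-code byte: the opcode goes in the
/// upper nibble, the register index `i` (the bits wxyz) in the lower one.
/// Returns None when `i` does not fit in 4 bits.
fn assemble_add(i: u8) -> Option<u8> {
    if i > 0b1111 {
        return None;
    }
    Some((OPCODE_ADD << 4) | i)
}

/// Split a machine-code byte back into its (opcode, operand) nibbles.
fn split_instruction(byte: u8) -> (u8, u8) {
    (byte >> 4, byte & 0b1111)
}
```

For example, `assemble_add(5)` yields the byte `1000 0101`, and `split_instruction` recovers the opcode `1000` and the operand `5` from it.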

This particular instruction tells the 4004 to add two 4-bit integers. Both operands and the result are stored on the chip itself. One operand resides in the accumulator, a small memory inside the arithmetic logic unit (ALU), which is the computational centre of the 4004. The other operand is fetched from one of the 16 registers that together form the register file. When we give the CPU the instruction “ADD i”, it fetches the iᵗʰ value stored in the register file, adds it to the value in the accumulator, and stores the result back in the accumulator.
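The behaviour of “ADD i” can be modelled with a small piece of Rust. This is a sketch under stated assumptions: the `Cpu` struct and its field names are hypothetical, and the handling of the carry flag goes beyond what the text describes (on the real chip, ADD also adds the incoming carry and updates it), so that part reflects the published instruction set rather than this post.

```rust
/// Minimal model of the state that "ADD i" touches (hypothetical names).
struct Cpu {
    accumulator: u8,     // 4-bit value, kept in the low nibble
    carry: bool,         // carry flag of the ALU
    registers: [u8; 16], // register file: sixteen 4-bit values
}

impl Cpu {
    /// Execute "ADD i": accumulator <- accumulator + registers[i] + carry.
    /// All stored values are assumed to be in the range 0..=15.
    fn add(&mut self, i: usize) {
        let sum = self.accumulator + self.registers[i & 0xF] + self.carry as u8;
        self.accumulator = sum & 0xF; // keep only the low 4 bits
        self.carry = sum > 0xF;       // a result above 15 sets the carry
    }
}
```

With 9 in the accumulator and 8 in register 3, `cpu.add(3)` leaves 1 in the accumulator and sets the carry, since 17 does not fit in 4 bits.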

An addition of numbers between 0 and 15 is not very impressive by today’s standards, but many complex functionalities can be created by operating on these small values. When 4004 programmers finished writing their list of instructions, they converted their human-readable assembly code to computer-readable machine code using an assembler. The resulting machine code, essentially a bunch of zeros and ones, would then be shipped to Intel, which would hard-code it into the ROM.

A 4-bit adder is built from one half adder and three full adders. Carry₃ is 1 if the Sum of A and B is larger than 15, the maximum value representable by 4 bits.

Inside the ALU, the two 4-bit values go through what’s called a 4-bit adder. This adder is a circuit composed of one half adder and three full adders, themselves composed of Boolean AND, XOR, and OR gates. It just so happens that there’s an easy way to convert a Boolean circuit to a homomorphic circuit: the Concrete Boolean library. So, let’s get coding!
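As a warm-up, the adder described above can be written gate by gate in plain Rust. This is a minimal sketch and not the original code of this post: the `Gates` trait, the `Plain` backend, and all function names are hypothetical. The gates here operate on ordinary `bool` values; the trait only marks the place where a homomorphic backend, whose `Bit` would be an encrypted Boolean, could be plugged in. The Concrete Boolean API itself is deliberately not reproduced.

```rust
/// The three gates the adder needs. `Bit` is a plain bool below; a
/// homomorphic backend would use a ciphertext type instead.
trait Gates {
    type Bit;
    fn and(&self, a: &Self::Bit, b: &Self::Bit) -> Self::Bit;
    fn xor(&self, a: &Self::Bit, b: &Self::Bit) -> Self::Bit;
    fn or(&self, a: &Self::Bit, b: &Self::Bit) -> Self::Bit;
}

/// Plaintext gates, used to check that the circuit is wired correctly.
struct Plain;

impl Gates for Plain {
    type Bit = bool;
    fn and(&self, a: &bool, b: &bool) -> bool { *a & *b }
    fn xor(&self, a: &bool, b: &bool) -> bool { *a ^ *b }
    fn or(&self, a: &bool, b: &bool) -> bool { *a | *b }
}

/// Half adder: returns (sum, carry).
fn half_adder<G: Gates>(g: &G, a: &G::Bit, b: &G::Bit) -> (G::Bit, G::Bit) {
    (g.xor(a, b), g.and(a, b))
}

/// Full adder: two half adders plus an OR on their carries.
fn full_adder<G: Gates>(g: &G, a: &G::Bit, b: &G::Bit, c: &G::Bit) -> (G::Bit, G::Bit) {
    let (s1, c1) = half_adder(g, a, b);
    let (sum, c2) = half_adder(g, &s1, c);
    (sum, g.or(&c1, &c2))
}

/// 4-bit ripple-carry adder: one half adder followed by three full adders.
/// Bit 0 is the least significant. Returns the four sum bits and Carry3.
fn add4<G: Gates>(g: &G, a: &[G::Bit; 4], b: &[G::Bit; 4]) -> ([G::Bit; 4], G::Bit) {
    let (s0, c0) = half_adder(g, &a[0], &b[0]);
    let (s1, c1) = full_adder(g, &a[1], &b[1], &c0);
    let (s2, c2) = full_adder(g, &a[2], &b[2], &c1);
    let (s3, c3) = full_adder(g, &a[3], &b[3], &c2);
    ([s0, s1, s2, s3], c3)
}

/// Helpers for the plaintext check: integer <-> little-endian bits.
fn to_bits(x: u8) -> [bool; 4] {
    [(x & 1) != 0, (x & 2) != 0, (x & 4) != 0, (x & 8) != 0]
}

fn from_bits(bits: &[bool; 4]) -> u8 {
    bits.iter().enumerate().map(|(i, &b)| (b as u8) << i).sum()
}
```

Calling `add4(&Plain, &to_bits(9), &to_bits(8))` returns the bits of 1 together with a Carry₃ of `true`. Because the circuit only ever calls `and`, `xor`, and `or`, replacing `Plain` with a backend that evaluates those gates on encrypted bits would turn the same wiring into a homomorphic adder.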

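Built from Concrete Boolean gates, the homomorphic version of this circuit adds two encrypted 4-bit values with 128-bit security in 1.3 seconds on a machine with an Intel i5 processor. In comparison, one addition on the Intel 4004 takes 10.8 microseconds, which makes the homomorphic addition roughly 120,000 times slower.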

CPUs are designed to make “the common case fast”: operations that are expected to occur often in programs are designed to execute faster than operations that are rare. In the Intel 4004, 4-bit additions were designed with speed in mind, and similarly, in our Intel i5, 64-bit additions were designed to execute fast. Unfortunately, the “common case” for us, the list of instructions needed to evaluate our homomorphic Boolean gates, differs from the “common case” in regular programs. This leads to the performance gap we observe when comparing an addition on the 4004 with a homomorphic addition on the i5.

To really leverage modern technology, we will have to design our own circuits and our own computer architecture. This design effort will result in a dedicated hardware accelerator tailored to running the operations behind the Concrete Boolean gates. Rather than executing the code on a CPU as we did here, we will execute it on an FPGA, a device that allows us to program a custom computer architecture.

So, what’s next? Revisiting the past …

This blog post is the first in a series in which we will investigate how to make the Intel MCS-4 family of chips fully homomorphic. We will keep our homomorphised instruction set as close as possible to the original, while giving ourselves all the advantages of modern technology to match the speed and functionality of the original i4004. Over the course of the 50th anniversary year of this landmark microprocessor, we will regularly write about our progress and the challenges we face.

To create a future where privacy is the default, we need to understand how we got to a present where electronic computation is as pervasive as it is, and learn from its success. This is why scientists and engineers should study the history of their fields. As the old sayings go, “to predict the future we have to learn from the past” and “the best way to predict the future is to create it”.

As nascent digital archeologists, we’ll have to put dates and years on artefacts. If we can achieve a similar speed to the i4004 with our homomorphic z4004 processor leveraging 50 years of technological advancements, we can ultimately answer the question: What year is it homomorphically?

We are not the first to commemorate the Intel 4004, and several other projects are well worth checking out. In 2006, on the 35th anniversary, Intel released the schematics of the 4004 circuit, which made it possible to build replicas of the microprocessor’s circuits in different technologies. The building of a “macrocomputer” with individual components rather than components integrated on one chip is an ongoing project that can be followed here. A very handy emulator, assembler, and disassembler were launched as a web application in 2007. More recently, the RetroShield 4004, an add-on card for Arduino microcontrollers, was released to drive an actual 4004 chip from software. Also, Federico Faggin, the MCS-4’s project lead, released his autobiography earlier this year.

If you’re interested in following our progress, you can subscribe to our newsletter to get the latest news about homomorphic encryption and what we do at Zama. Stay tuned until next time!
