HiFive 1 Rev B board from SiFive with RISC-V CPU

How to Design your own RISC-V CPU Core

Shirish Bahirat Ph.D.
Published in Programmatic
17 min read, Dec 22, 2020

Welcome to the RISC-V revolution!

By 2025, there will be 62.4 billion RISC-V processors connecting over 30 billion devices. Most of these devices will include application-specific and custom RISC-V cores. The personal and cloud computing era was fueled by Intel, and Arm empowered mobile devices; the Internet of Things (IoT) will now be driven by RISC-V [1]. IoT-based smart devices will be connected to the internet, powered by AI, and will understand their environment through sensors.

Why RISC-V?

RISC-V is an open source Instruction Set Architecture (ISA). Building the tools and infrastructure for silicon chips and turning them into products is a daunting task. Software startups commonly launch with approximately 500K dollars of capital, whereas hardware startups require tens to hundreds of millions of dollars. This is because developing Application Specific Integrated Circuits (ASICs) around a proprietary ISA leaves few components that can be reused. A software startup like Instagram, on the other hand, can use 20 or more open source frameworks, significantly reducing development cost and effort [2].

AI is driving customized silicon topologies such as Vision Processing Units (VPU), Tensor Processing Units (TPU), Neural Processing Units (NPU) and Data Processing Units (DPU) that can be used on edge devices such as self-driving cars, automated department stores, automated manufacturing, smart homes and even automated agriculture. Taking this further, imagine small RISC-V chips embedded in the walls, windows, doors, roads, every container in your house, chairs, desks, clothes, shoes and even the trees in your backyard, enabling many new possibilities.

An open source ISA can not only lower hardware costs but also enable scalability by allowing the community to add custom instructions that simplify software. While RISC-V has huge potential, that potential can only be realized through community growth and contributions.

The central processing unit (CPU), the brain of a computational device, is truly a technological marvel. It can seem like fascinating magic that only elite experts comprehend; we will unravel this magic by designing our own RISC-V core. Alan Turing introduced the idea of the Universal Turing Machine (UTM) in 1936-1937. A UTM can read input and produce output based on the data and a description of the machine itself, thus formalizing the idea of a computer program. In 1945, John von Neumann proposed an architecture for an “Electronic Computing Instrument” that can execute programs defined using the UTM; it is famously known as the von Neumann architecture.

Von Neumann architecture includes:

  1. Arithmetic and Logic Unit (ALU) that performs mathematical operations like add and subtract, as well as logical operations that check whether two numbers are equal, less than or greater than each other, and so on.
  2. Memory unit that stores the instructions and data on which the program operates, along with input and output (IO) devices that read instructions and store updates back to memory.
  3. Control unit (CU) that directs and orchestrates execution of instructions and programs, including data flows. The control unit governs the single-cycle or multi-cycle pipeline and is normally associated with the performance of the CPU.
Von Neumann architecture

We will develop a single-cycle RISC-V CPU from scratch as an academic exercise using MyHDL, a Python-based hardware description and verification language (HDL). MyHDL is an open source, simple yet powerful HDL that integrates seamlessly with traditional HDLs like Verilog and VHDL [3]. Code written in MyHDL can also be converted to Verilog or VHDL and integrated with SystemVerilog or SystemC verification environments. Our CPU will perform 12 basic operations defined in the RISC-V instruction set and can easily be extended to include more advanced as well as custom instructions. While an understanding of VHDL, Verilog, SystemC or SystemVerilog can help, it is not required for this exercise. However, we will assume readers have a basic understanding of MyHDL.

Software Vs CPU parallelism

Software executes in the form of threads. While each CPU can execute multiple threads, in reality CPU cycles are shared by multiple applications, and only one application executes at a time in a time-shared fashion. If you are running 10 applications on a single CPU, each application executes for a fraction of the time, providing an illusion that all applications are running in parallel. Multi-core CPUs, such as the 4 to 8 cores in general desktop computers, can provide 4 to 8x parallelism. Hardware, on the other hand, executes by moving electrons from one transistor to another.

A modern CPU includes around 7 billion transistors. While obeying the laws of physics, each electron moves independently, allowing hardware that implements an ISA to perform many complex tasks with massive parallelism. The simplicity of the RISC-V ISA enables a CPU implementation with a gate count of approximately 8K to 15K, around 47% lower than ARM processors.

ISAs come in two broad categories: Complex Instruction Set Computers (CISC) and Reduced Instruction Set Computers (RISC). As the names suggest, a single CISC instruction can perform many complex tasks, while a single RISC instruction performs only one simple task. CISC programs can be smaller at the cost of silicon complexity and high power consumption; RISC programs can be longer but run on simpler silicon that requires much less power. RISC-V is a much simpler instruction set, specifically designed to simplify CPU design. The RISC-V project was initiated at the University of California, Berkeley in 2010. Today it has more than 750 affiliates from over 50 countries, spanning academia and industry.

CPU Modules

We will review the building blocks of the CPU through fairly simple Python code, then integrate all the blocks to finish our implementation. After completing the integration, we will simulate the design and verify that our CPU executes the instruction set correctly.

To keep it simple, our CPU is single cycle and not pipelined, which means we fully execute one instruction before starting the next. Our single-cycle CPU will require two clock domains: one domain is used for a few internal steps such as write operations, and the second starts execution of the next instruction. In general, a CPU requires multiple steps to perform the task defined in a single instruction:

  1. Instruction fetch: read instruction from memory
  2. Instruction decode: understand what instruction means
  3. Execute: perform the operation defined in instruction
  4. Memory access: read required memory locations if needed
  5. Write back: write the data to memory if modified

These 5 steps will be executed over multiple clock cycles. The first clock domain tracks the five steps within an instruction, and the second clock domain is used for the instruction pointer increment. We will design the instruction pointer increment clock so that it can be adjusted dynamically, while the static clock toggles between high and low every 10 nanoseconds. The design presented here is based on [4] with a few modifications.

Clock

The clock block implements a fixed-cycle clock that transitions from high to low and vice versa every 10 nanoseconds. Since this is the first module we are reviewing, we will use it as a quick MyHDL primer. Every hardware module or piece of functionality can be defined as a @block, and each block can include one or more functions. In other words, a block defines the boundary of a specific functionality, like a class in software. All sub-functions within a block execute in parallel. One block communicates with another using signals, or wires, and can have multiple input and output signals as well as internal signals.

Sub-functions within a block execute based on the sensitivity specified by the @always decorator, that is, on a signal transitioning from low to high or vice versa. Sub-functions can also execute based on a delay of the reference clock that is externally attached to the CPU. Once a block is defined, it needs to be instantiated and integrated with other modules. Electronic circuits also come in two types: sequential logic and combinational logic. A combinational circuit relies only on its present inputs, while a sequential circuit relies on its latest inputs and earlier outputs. Combinational logic is stateless, whereas sequential logic can maintain state.

The following clock block is implemented as sequential logic. It executes the clck() function every 10 nanoseconds. The clock block has one signal called clk, and it flips the polarity of clk every 10 nanoseconds. Once instantiated and integrated, this happens continuously for as long as our CPU is running, and clk is used as a reference signal by all other modules. The clk signal therefore toggles 100 million times per second; since a full high-low cycle takes 20 nanoseconds, this corresponds to a 50 MHz clock.

Clock block implementation
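
The toggling behavior described above can be sketched in plain Python, without MyHDL. This is a minimal model under stated assumptions, not the MyHDL block itself: clock_levels and its parameters are hypothetical names, and only the 10 ns toggle interval of clk comes from the text.

```python
def clock_levels(half_period_ns=10, duration_ns=60):
    """Yield (time_ns, level) pairs for a signal that flips every half period.

    Hypothetical helper for illustration; it models clk as a plain integer.
    """
    level = 0
    for time_ns in range(0, duration_ns + 1, half_period_ns):
        yield time_ns, level
        level ^= 1  # flip polarity, as the clock block does to clk


trace = list(clock_levels())

# Two toggles (one high phase, one low phase) make one full cycle.
period_ns = 2 * 10
frequency_mhz = 1000 / period_ns
```

Printing trace shows clk alternating between 0 and 1 at 10 ns steps, and frequency_mhz gives the rate of full cycles implied by that toggle interval.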

Program Counter Selection Mux

The Program Counter (PC) maintains a pointer, or address index, into instruction memory. The CPU executes the instruction stored at the address held in the current PC register. A CPU register is a memory location that can be accessed extremely fast. Once the current instruction is executed, the CPU needs to decide which instruction to execute next. This decision is made by the control unit. Normally the next instruction in sequence is executed; in some cases, however, the instruction pointer needs to jump to a different place. A jump is required to execute conditional instructions or to call a different function. This jump is also called a branch. The branching decision is made by the control module and the branching ALU, which we will review later.

The following module selects the next PC based on an input signal called pc_sel. If pc_sel is set, that is, the signal is high, execution moves to the jump address; otherwise the PC is incremented to the next instruction. Generally PCs use byte (8-bit) addresses, so on a 32-bit CPU the PC is incremented by 4 (32 bits / 8 bits). For simplicity, our CPU will increment the PC by 1, and the memory module will translate it to the next 32-bit, or 4-byte, boundary. We will review reset functionality later.

Program counter or jump address selection multiplexer block implementation
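
As a rough illustration of the selection just described, here is a plain-Python sketch rather than MyHDL. Only pc_sel is named in the text; pc_next, jmp_addr and byte_address are hypothetical names chosen for this example.

```python
def pc_mux(pc_sel, pc_next, jmp_addr):
    """Return the jump address when pc_sel is high, else the incremented PC."""
    return jmp_addr if pc_sel else pc_next


def byte_address(word_index):
    """Map a PC that counts in steps of 1 onto a 4-byte (32-bit) boundary."""
    return word_index * 4


sequential = pc_mux(pc_sel=0, pc_next=5, jmp_addr=12)  # no branch
jumped = pc_mux(pc_sel=1, pc_next=5, jmp_addr=12)      # branch taken
```

The byte_address helper shows the relationship between a PC incremented by 1 and the byte address a conventional 32-bit CPU would use.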

Write Data Selection Mux

The ALU requires two inputs to perform an arithmetic (i.e. A + B) or logical (i.e. A > B) operation. The output of the ALU can be either an address in data memory or a new value produced by the operation. The control unit makes this decision using the mem_to_reg signal, based on the type of instruction being executed. If mem_to_reg is set, the ALU produces an address within data memory, and the contents of data memory at that address are transferred back to the register file. If mem_to_reg is not set, the ALU output itself is stored back in the register file.

Memory to register data transfer selection multiplexer block implementation
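
The two cases above can be modeled in a few lines of plain Python. This is a sketch only: mem_to_reg comes from the text, while write_data_mux, alu_result and the list standing in for data memory are assumptions made for illustration.

```python
def write_data_mux(mem_to_reg, alu_result, data_memory):
    """Pick the value that is written back to the register file."""
    if mem_to_reg:
        # The ALU result is an address; forward the memory contents.
        return data_memory[alu_result]
    # The ALU result is the value itself.
    return alu_result


data_memory = [0, 11, 22, 33]  # hypothetical data memory contents
loaded = write_data_mux(1, 2, data_memory)
computed = write_data_mux(0, 2, data_memory)
```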

ALU Second Input Selection Mux

This is the last mux we will look at; mux is the abbreviated form of multiplexer. The second ALU input is controlled using alu_mux. Arithmetic and logical operations require values stored in the register file, whereas jump and immediate instructions require a value that is defined inside the instruction itself. If alu_src is set, information from the instruction is passed to the ALU; otherwise register file contents are transferred to the ALU.

ALU source data selection multiplexer block implementation

Branch taken

Often the program execution pointer needs to move based on some logical decision. When a condition matches, execution transitions to a specific function call. To make this decision, an XOR operation is performed. If all bits match, the XOR output is zero and the branch is said to be taken. This module checks whether branching is required and whether the XOR result from the ALU is zero. An ALU can have a separate zero flag signal; to keep our implementation simple, however, we pass the ALU result to the taken module.

Branch taken module implementation
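
The branch decision can be illustrated with a short plain-Python sketch. The names taken and branch follow the text; alu_result and the sample register values are assumptions for this example.

```python
def taken(branch, alu_result):
    """A branch is taken when branching is requested and the XOR result is zero."""
    return bool(branch) and alu_result == 0


rs1_value, rs2_value = 0x2A, 0x2A     # equal operands
xor_result = rs1_value ^ rs2_value    # every bit matches, so the result is 0
```

With equal operands the XOR result is zero, so taken(1, xor_result) is true; if branch is low, or the operands differ in any bit, the branch is not taken.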

Register File or Register Bank

A CPU can often have multiple register banks, with a specific bank activated depending on the operating mode. A multi-threaded OS manages a separate register bank context for each thread. Our CPU will have just a single register bank of 32 registers, each 32 bits wide. The width of the data stored in registers depends on the RISC-V variant: RV32 uses 32-bit registers, RV64 uses 64-bit registers, and RV128 uses 128-bit registers. Specific registers are also reserved for specific uses: x0 is hardwired to zero, x1 holds the function return address, x2 is the stack pointer, x3 is the global pointer, x4 is the thread pointer, and so on. Our reg_file module initializes the register array with some default values so that instruction execution can be verified; otherwise all registers would need to be initialized with zeros. Reads are handled by combinational logic: when the register addresses change, data is read from the file and transferred to the output signals. Writes happen on the positive edge of the clock signal, and only when the write register address is non-zero.

Register file block implementation
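
A behavioral model of these rules in plain Python might look as follows. It is a sketch under assumptions: the class and method names are hypothetical, clock edges are not modeled, and the initial values are arbitrary; ra, rb, wa and reg_wr follow names used elsewhere in the article.

```python
class RegisterFile:
    """32 registers of 32 bits each; x0 always reads as zero."""

    def __init__(self, initial=None):
        self.regs = [0] * 32
        for index, value in (initial or {}).items():
            if index != 0:  # x0 stays hardwired to zero
                self.regs[index] = value & 0xFFFFFFFF

    def read(self, ra, rb):
        """Combinational read of two registers."""
        return self.regs[ra], self.regs[rb]

    def write(self, wa, data, reg_wr=1):
        """Write only when enabled and the target is not x0."""
        if reg_wr and wa != 0:
            self.regs[wa] = data & 0xFFFFFFFF


rf = RegisterFile({1: 5, 2: 7})
rf.write(3, 12)   # stored in x3
rf.write(0, 99)   # ignored: x0 cannot be written
```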

ALU

The base 32-bit integer instruction set of RISC-V is called RV32I and includes 47 instructions. We implement 10 of them, and our CPU can easily be scaled to include the rest. There are also several extensions to the base set:

M: Integer multiplication and division

A: Atomic

F: Single-precision floating point compliant with IEEE 754–2008

D: Double-precision floating point compliant with IEEE 754–2008

Q: Quad-precision floating point compliant with IEEE 754–2008

C: Compressed instructions (16-bit instructions), and so on

The ALU module performs the operation defined in alu_decode on the rda and rdx inputs and stores the output in result.

ALU implementation
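
For readers who want to experiment without MyHDL, the ALU behavior can be approximated in plain Python. The names alu_decode, rda and rdx come from the text; the string operation codes and the five operations shown are assumptions, since the actual encoding is not given here.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits, as on RV32

# Hypothetical operation codes; the real alu_decode encoding may differ.
ALU_OPS = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def alu(alu_decode, rda, rdx):
    """Apply the selected operation to rda and rdx and return the 32-bit result."""
    return ALU_OPS[alu_decode](rda, rdx) & MASK32
```

Masking with MASK32 makes a negative difference wrap around the way a 32-bit register would, so alu("SUB", 5, 7) yields the two's complement form of -2.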

ALU Control

The ALU requires a control unit that is separate from the main control unit. The main control unit defines alu_op as a function of the type of instruction being executed. Using further information from the instruction, the alu_control unit decides which operation needs to be executed. For example, a jump instruction needs an ADD operation, a branch instruction needs XOR, an arithmetic instruction needs its own operation, and so on. The ALU control unit keeps the architecture modular and makes it easy to include additional instruction types.

ALU control implementation
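
The two-level decode can be sketched in plain Python. The funct3 and funct7 values below are the standard RV32I encodings for these five R-type instructions; the numeric alu_op values (0, 1, 2) and the string operation names are assumptions made for this example.

```python
# Standard RV32I (funct3, funct7) pairs for a few R-type instructions.
R_TYPE_OPS = {
    (0x0, 0x00): "ADD",
    (0x0, 0x20): "SUB",
    (0x4, 0x00): "XOR",
    (0x6, 0x00): "OR",
    (0x7, 0x00): "AND",
}


def alu_control(alu_op, funct3=0, funct7=0):
    """Choose the ALU operation from alu_op plus instruction fields.

    Hypothetical alu_op values: 0 = address computation, 1 = branch,
    2 = R-type arithmetic or logic.
    """
    if alu_op == 0:
        return "ADD"   # base register plus offset
    if alu_op == 1:
        return "XOR"   # compare rs1 and rs2 for a branch
    return R_TYPE_OPS[(funct3, funct7)]
```

Supporting another R-type instruction then only requires a new entry in R_TYPE_OPS, which is the modularity the paragraph above refers to.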

Control

Instructions are categorized as I-type, R-type, SB-type and so on, mainly to facilitate control operations. The data path for a specific instruction type is mostly the same, which helps the scalability of the CPU architecture. For example, R-type instructions take two registers as input and one register as output. The ALU can perform any one of the many R-type operations defined above, but the data flow remains the same for a given instruction type.

Configuring control is one of the first steps after decoding the instruction. To keep CPU implementation simple, RISC-V reserves bits [6:0] of the instruction for use by the control unit. Bits [6:0] define the type of operation, and the control unit configures all multiplexers to set up the required data flow.

R-type: arithmetic and logic operations on registers

I-type: arithmetic and memory load operations based on data from instruction and register

S-type: Store data operation based on data from instruction and register

SB-type: Increment PC based on Jump or branch instruction

Data flow control implementation
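
One way to picture the control unit is as a lookup table keyed by bits [6:0]. The sketch below is plain Python, not the MyHDL block: the opcodes are the standard RV32I values for R-type, load, store and branch instructions, the signal names follow the article, and the table layout itself is an assumption.

```python
# Control signals per opcode (instruction bits [6:0]).
CONTROL_TABLE = {
    0x33: dict(branch=0, mem_to_reg=0, mem_wr=0, alu_src=0, reg_wr=1),  # R-type
    0x03: dict(branch=0, mem_to_reg=1, mem_wr=0, alu_src=1, reg_wr=1),  # I-type load
    0x23: dict(branch=0, mem_to_reg=0, mem_wr=1, alu_src=1, reg_wr=0),  # S-type
    0x63: dict(branch=1, mem_to_reg=0, mem_wr=0, alu_src=0, reg_wr=0),  # SB-type
}


def control(instruction):
    """Look up the mux and write-enable settings for a 32-bit instruction."""
    opcode = instruction & 0x7F  # bits [6:0]
    return CONTROL_TABLE[opcode]


r_type_signals = control(0x002081B3)  # add x3, x1, x2
```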

Immediate Generation

This module supports a special type of instruction decode that derives an immediate value from bits in the instruction itself. Within the framework of our implementation and the instructions it supports, immediate decode is only required for the I, S and SB types.

Immediate data decode block implementation
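
To make the bit shuffling concrete, here is a plain-Python sketch of immediate extraction using the standard RV32I instruction formats. The function names are hypothetical, and the article's own module may differ in detail, for instance because its PC counts in steps of 1 rather than in bytes.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def imm_i(instr):
    """I-type: immediate is instruction bits [31:20]."""
    return sign_extend((instr >> 20) & 0xFFF, 12)


def imm_s(instr):
    """S-type: bits [31:25] hold imm[11:5], bits [11:7] hold imm[4:0]."""
    return sign_extend((((instr >> 25) & 0x7F) << 5) | ((instr >> 7) & 0x1F), 12)


def imm_b(instr):
    """SB-type: imm[12|10:5] in bits [31:25], imm[4:1|11] in bits [11:7]."""
    imm = (
        (((instr >> 31) & 0x1) << 12)
        | (((instr >> 7) & 0x1) << 11)
        | (((instr >> 25) & 0x3F) << 5)
        | (((instr >> 8) & 0xF) << 1)
    )
    return sign_extend(imm, 13)


# lw x5, 8(x1): imm=8, rs1=1, funct3=2, rd=5, opcode=0x03
lw_word = (8 << 20) | (1 << 15) | (2 << 12) | (5 << 7) | 0x03
```

The SB-type case shows why the text calls this immediate a bit complex: its bits are scattered across the instruction and the lowest bit is implied to be zero.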

Data Memory

Memory operations are generally expensive and consume several clock cycles. To keep things simple, we implement a simplified version of the memory module. We read data in binary format and load it into data tightly coupled memory (DTCM), which can be accessed on the positive edge of the clock within a single cycle. The ALU result provides the address for the read or write operation, and the control unit sets the operation type.

Data memory block implementation

Instruction Memory

Our instruction memory module is defined as combinational logic, yet the PC increments only on the slower clock domain, once execution of the previous instruction is complete. Apart from reading data from memory, this module also decodes the instruction.

The instruction at the given read_addr is decoded into the register addresses for the first (ra), second (rb) and third (wa) registers, depending on the type of instruction. Data memory does not require this decoding, so it is normally kept separate from instruction memory. Some CPUs may support a configurable data and instruction memory space.

Instruction memory block and instruction decode block implementation
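
The field extraction performed during decode can be sketched in plain Python using the standard RV32I field positions. The keys ra, rb and wa follow the names in the text; the decode function and the dictionary layout are assumptions for illustration.

```python
def decode(instr):
    """Split a 32-bit instruction into its standard RV32I fields."""
    return {
        "opcode": instr & 0x7F,          # bits [6:0]
        "wa": (instr >> 7) & 0x1F,       # rd, bits [11:7]
        "funct3": (instr >> 12) & 0x7,   # bits [14:12]
        "ra": (instr >> 15) & 0x1F,      # rs1, bits [19:15]
        "rb": (instr >> 20) & 0x1F,      # rs2, bits [24:20]
        "funct7": (instr >> 25) & 0x7F,  # bits [31:25]
    }


fields = decode(0x002081B3)  # add x3, x1, x2
```

Not every field is meaningful for every instruction type; for example, S-type and SB-type instructions reuse the rd bit positions for part of the immediate.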

PC Adder

The program counter adder is a special form of ALU, specifically designed to increment the program counter. This module executes on the positive, or rising, edge of the step signal, which toggles after 6 clocks, making our CPU run at 16.66 Million Instructions Per Second (MIPS).

Program counter adder implementation

Jump Adder

The jump adder is another special form of ALU that moves the PC to the required position based on the ALU output when the conditions for a branch instruction are met.

Jump address adder implementation

PC Assign

Our last module assigns next PC to the instruction data memory, triggering start of the next instruction execution.

Module that assigns next PC implementation

After developing all these modules, next step is to bring them together and make them talk to each other.

CPU Top

The following schematic shows the CPU modules and their integration, along with input-output flows and the bit widths of the wires connecting the modules. Color signifies the execution grouping for the various instruction types. Clock and reset signals are not shown in the diagram, as every module has clock and reset signals as inputs.

Based on [4] chapter 4 Figure 4.17 — defines signal and block names per our implementation

MyHDL provides a hardware data type called intbv, short for integer bit vector. It can be defined with initial, min and max values as well as a number of bits. For example, opcode[6:0] is defined as intbv(0)[7:]. A specific bit of an intbv is accessed with a 0-based index, but a range of bits requires an upper bound of offset + 1. As a first step, we define all signals with their widths and initial values, then create instances of all the blocks. Python creates generator objects for all the modules within a block, which is why every instance returns a function object. The MyHDL simulation kernel executes all modules based on the sensitivity defined by either @always or @always_comb. In the cpu_top block we define two additional functions: one that releases reset, and another that triggers the first instruction execution after reset is released. Reset functionality can be used to initialize the blocks to default states, or reset can be invoked in an exception handling path if something goes wrong. We keep our reset functionality simple for now: all it does is enable execution of all the blocks when it is released.

Putting all the blocks together
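
The intbv indexing convention can be mimicked in plain Python to see where the offset + 1 comes from. The helpers bit and bits are hypothetical stand-ins written for this sketch; they are not part of MyHDL.

```python
def bit(value, index):
    """Single bit at a 0-based index, like value[index] on an intbv."""
    return (value >> index) & 1


def bits(value, upper, lower=0):
    """Bits upper-1 down to lower, like the intbv slice value[upper:lower].

    The upper bound is exclusive, so a 7-bit field covering bits 6..0
    is requested with upper=7.
    """
    return (value >> lower) & ((1 << (upper - lower)) - 1)


word = 0b1010_1100011          # arbitrary value whose low 7 bits are 0x63
opcode = bits(word, 7)         # bits 6..0
upper_nibble = bits(word, 11, 7)  # bits 10..7
```

Because the upper bound is exclusive, the width of a slice is simply upper minus lower, which matches how intbv(0)[7:] describes a 7-bit value.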

Running the CPU using stimulus

The top block provides the clock and reset signals. The CPU could pull the reset signal down whenever needed; however, we have not implemented this function. The top module acts like a test bench: it releases reset and starts CPU execution for a defined number of clock cycles (1400), which is enough to execute around 12 instructions.

Creating and running CPU instance

Machine Code

The only language any CPU understands is 0s and 1s. The ISA gives meaning to these bits, and assembly language expresses them in human-readable form. Modern compilers for high-level languages hide the complexity behind all the machine-specific details. C or C++ can be compiled to run on different types of CPUs, and the programmer does not need to worry about CPU-specific details.

To test our CPU, we will run the following 12 instructions. The first seven and the last two are R-type instructions. The eighth, ninth and tenth are I, S and SB-type instructions respectively. Each instruction is 32 bits long. The following data flow sections describe the meaning of each bit and the execution steps; the best way to review them is by walking through the data flow schematic shown above.

Compiled binary code for example — can be extended to any number of lines
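
To show how one such 32-bit word is put together, the sketch below encodes a single R-type instruction in plain Python using the standard RV32I field layout. The helper encode_r is hypothetical, and the instruction chosen (add x3, x1, x2) is an illustrative example, not necessarily one of the 12 in the test program.

```python
def encode_r(funct7, rs2, rs1, funct3, rd, opcode=0x33):
    """Pack R-type fields into a 32-bit instruction word."""
    return (
        (funct7 << 25)
        | (rs2 << 20)
        | (rs1 << 15)
        | (funct3 << 12)
        | (rd << 7)
        | opcode
    )


# add x3, x1, x2: funct7=0, rs2=2, rs1=1, funct3=0, rd=3
word = encode_r(0, 2, 1, 0, 3)
binary = format(word, "032b")  # the 0s and 1s the CPU actually sees
```

The resulting word is 0x002081B3, and binary holds the same value as a 32-character string of 0s and 1s, one line of the kind of binary program the CPU reads.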

R-Type Instruction Data Flow

The decode process identifies the rs1 and rs2 registers whose contents are read; rd is the destination register. After the ALU opcode is defined, data from rs1 and rs2 is sent to the ALU. All mux selects are set to 0 by control, reg_wr is enabled, and the ALU result is written back to rd.

The following table shows the implemented instructions: the definition of each bit in the opcode, rd, funct3, rs1, rs2 and funct7 columns, the assembly instruction in the Op column, and lastly the operation Description.

Bit definition, assembly instruction and operation description for implemented R-type instructions

I-Type Instruction Data Flow

Immediate decode is used to define an offset from the memory address pointer stored in rs1. The ALU uses an add operation to find the data memory address. The ALU mux is set high, and reg_wr and mem_to_reg are set true as well. The ALU computes the memory address, and data from memory is transferred to rd. For the load word operation, funct3 defines the data width.

Bit definition, assembly instruction and operation description for implemented I-type instructions

S-Type Instruction Data Flow

A store instruction stores the data contained in the rs2 register into the memory location computed by the ALU. To execute an S-type instruction, the alu_src and mem_wr signals are set high; the ALU then computes the data memory offset, and the process completes after writing the data from rs2 to memory.

Bit definition, assembly instruction and operation description for implemented S-type instructions

SB-Type Instruction

A branch instruction needs control to set the branch signal high; all other control signals are set low. Immediate computation is a bit complex for this instruction. The ALU performs an XOR operation on the contents of the rs1 and rs2 registers. The branch is taken if the XOR result is zero.

Bit definition, assembly instruction and operation description for implemented SB-type instructions

More instructions can be added to our CPU, provided the data path and control are configured correctly.

Results

GTKWave and Scansion are open source trace analysis tools that can be used to review the inner workings of our CPU. Even though it is single cycle, our CPU includes hundreds of signals that can be observed to verify every step of execution for all blocks, along with their inputs and outputs.

The following snip shows the control signals of our CPU as different operation types are executed. For example, the branch signal is set high for the SB type, where the opcode is 0x63, or 99 decimal.

VCD trace for control signals

Red data in the following snip shows register writes, and yellow data traces show memory writes. Because writes execute on the rising edge of the clock, they are delayed relative to the point where the write-back register address wa[4:0] is decoded.

VCD trace for register and data memory writes

Here is a detailed review of the key inputs and outputs for all 12 instructions, along with the clock and reset signals. The data and control flow for each step can be traced through this information.

VCD trace for key signals to review CPU operations

To verify the details in the above VCD trace, we can create a CPU code execution trace tool as shown below. The trace lists the operation code, register numbers, data contained in registers, memory operations, immediate values and the type of instruction being executed.

CPU instruction execution and data trace

RISC-V ISA Innovations

RISC-V instructions are grouped into extensions and types (i.e., I-type, R-type etc.) based on the data flow and control paths. Adding a new instruction within an existing instruction type, or adding a new type, is simplified by the modularity of the base architecture.

The base RISC-V ISA includes about 47 instructions, a very small number compared to other RISC instruction sets (for example, over 1000 instructions in ARM), and complex or advanced functionality can be emulated through base RISC-V instructions.

The RISC-V ISA also scales easily across 32-bit, 64-bit and 128-bit variants without significant modifications to the instruction decode process, because 32-bit instructions can be implemented to handle different data widths. The data width definition is part of the instruction itself.

Verilog Conversion

Subject to some restrictions on the code, MyHDL supports automatic conversion of Python blocks to Verilog or VHDL modules. This functionality provides a direct path from MyHDL into a standard Verilog or VHDL based design environment. The following Python code snip shows the conversion process.

MyHDL python to Verilog conversion

Auto generated Verilog code as an example for ALU control module:

Verilog code for ALU Control

Summary

We developed and verified a single-cycle RISC-V processor that executes 12 of 47 instructions. The complete source code of this CPU is available for reference on GitHub. Once MyHDL is installed, our CPU can be simulated using the following command line, and the auto-generated Verilog code can be used within development kits for actual fabrication.

rm -f *.vcd && rm -f *.v && python riscv-cpu.py

Designing and implementing a processor is a relatively complex task, but it becomes easy to comprehend with the RISC-V ISA. With open source tool chains and the advent of AI and IoT, the RISC-V architecture will become ubiquitous. The benefits of application-specific hardware are exponential and cannot be overlooked in advancing the state of technology. Linux already supports RISC-V, many world-changing startups are focused on leveraging this infrastructure, and we are on the verge of an exciting revolution that will be driven by RISC-V.

References

[1] RISC-V Foundation, https://riscv.org

[2] Instagram Engineering, https://instagram-engineering.com

[3] MyHDL, http://www.myhdl.org

[4] Computer Organization and Design

Shirish Bahirat Ph.D.
Programmatic

Engineer with passion for learning and sharing knowledge, worked with world's leading organizations like Google, Intel, and Nvidia. Opinions are my own.