Improved pipeline processing for self-made CPU

4 min readMar 14, 2024

Intriduction

I am currently creating a virtual CPU using chisel.
It was originally created by copying the code from the book and I’m currently in the process of improving it to my own taste.
This time, I will summarize the improvements about pipeline processing for the upcoming superscalar execution.

1 ID, EX stage unitization

Originally, all pipeline processing was done in one class. I have created separate classes for the ID stage and EX stage and made them into units.

2 Distribute commands to appropriate arithmetic units

The EX stage has multiple arithmetic units (currently ALU, multiplyer, and divider). Originally, all commands were connected to all arithmetic units, and the arithmetic units decided whether to receive the commands and execute the calculations.
That is, if the command is MUL, the ALU and the divider ignore it, the multiplier performs the multiplication.

However, it is wasteful to run arithmetic units for commands that they do not support, and later on when performing superscalar execution, the arithmetic units will not be available and parallelization will not be possible.

Therefore, I changed the system so that the commands are distributed to the appropriate arithmetic unit based on the type of command.

val multiplyer = Module(new Multiplyer())
multiplyer.io.exe_fun := BUBBLE
switch(exe_fun){
  is(ALU_MUL, ALU_MULH, ALU_MULHSU, ALU_MULHU){
   multiplyer.io.exe_fun := exe_fun   
  }
}
multiplyer.io.op1_data := op1_data
multiplyer.io.op2_data := op2_data
val mul_out = multiplyer.io.calced

exe_fun is the function types executed, op1_data, and op2_data is data stored in the operands. The command is input into the multiplier only if the command is supported by the multiplier, otherwise BUBBLE is input.

3 Addition of performance counters

I want to understand how much performance will improve as I add a certain feature to my CPU.
So I’ve made it possible to measure and display performance.

The following indicators were added.

Number of processing cycles: Number of cycles required to finish executing the program
Number of commands processed: Number of commands processed
Number of stalls: Number of stalls during the process
CPI: Number of processing cycles/number of processing commands

They are calculated as follows.

Number of processing cycles
For this I just added a register and increments it every cycle.

val cycleCount = RegInit(0.U(32.W))
cycleCount := cycleCount + 1.U

Number of stalls
A flag (stall_flag) that determines whether to stall has already been implemented.
This is incremented only when this flag is not set.

val stallCount = RegInit(0.U(32.W))
when(stall_flag){
  stallCount := stallCount + 1.U
}

Number of commands processed
It can be calculated using the following formula.
Number of processing commands = number of processing cycles — number of stalls — (number of pipeline stages — 1)

The final term on the right-hand side is the number of extra cycles caused by pipelining (the last command does not benefit from parallelization and takes an additional number of stages in the pipeline).

val instCount = cycleCount - stallCount - 4.U

CPI
Just do the division.
Chisel does not have a floating point type, so the value multiplied by 10 is stored internally and the integer part and decimal part are separately displayed.

val cpi10 = Mux(instCount === 0.U, 0.U, cycleCount * 10.U / instCount)

At the end, display these indicators using the printf function.

printf(p"cycle count: ${cycleCount}\n")
printf(p"stall count: ${stallCount}\n")
printf(p"instruction count: ${instCount}\n")
printf(p"CPI: ${cpi10/10.U}.${cpi10%10.U}\n"

Execution result

Let’s run the program below.

int main(){

  asm volatile("li a0, 5");
  asm volatile("li a1, 9");
  asm volatile("divu a2, a1, a0");
  asm volatile("unimp");

  return 0;
}

Each indicators were successfully measured.

Changing process termination condition

When the “unimp” command is detected, the program is terminated.
Originally, the type of command decoded was determined at the ID stage, and if it was ”unimp”, the process is terminated.
However, when the ID stage is processing “unimp”, the previous line is in the EX stage.
If it ends here, the last line (excluding “unimp”) will not pass through the MEM and WB stages.
To prevent this, I put three extra “nop” instructions before “unimp”, but this is also useless.

//Original program
int main(){

  asm volatile("li a0, 5");
  asm volatile("li a1, 9");
  asm volatile("divu a2, a1, a0");
  //↓↓These lines are inserted
  asm volatile("nop");
  asm volatile("nop");
  asm volatile("nop");
  //↑↑
  asm volatile("unimp");

  return 0;
}

So, I solved it as follows.

When “unimp” is detected at the ID stage, the 4 points before the pc at that time (pc increases by 4) is stored in the register as pc_before_end.
The process is terminated when the pc processed in the WB stage reaches pc_before_end.
Instructions after “unimp” are treated as BUBBLE (do nothing)
I also added end_flag and set it to 1 on “unimp” detection.
If end_flag is set, pass BUBBLE after EX.

The code looks like this:

val end_flag = RegInit(false.B)
val pc_before_end = RegInit(0.U)

//id_reg_inst is the original instructions fetched from memory
//exe_br_flg is a flag that indicates that a branch has occurred.
//Other flags indicate conditions that make a command BUBBLE.
val id_inst = Mux((exe_br_flg || exe_jmp_flg || stall_flag || end_flag), BUBBLE, id_reg_inst)

//pc processed by ID stage
id_reg_pc := Mux(stall_flag, id_reg_pc, if_reg_pc)
  
when((id_inst === UNIMP) && (end_flag === false.B)){
  end_flag := true.B
  pc_before_end := id_reg_pc - 4.U //the previous line of UNIMP is the last
}

・・・

io.exit := false.B  //process ends when this becomes true
//wb_reg_pc is pc processed by WB stage
when((pc_before_end > 0.U) && (wb_reg_pc > pc_before_end)){
  io.exit := true.B
}

Reference

I wrote the detailed topics below.

[Chisel/Scala] Introduce decimal point data

What I want to doI want to introduce decimal point data (float in other languages).The background is that I'm trying to…

en.rm48.net

Future prospects

First of all, my immediate big goal is to execute superscalar.
However, before that, we need to be able to reproduce the following characteristics, so I would like to do this first.

Make one operation take multiple cycles

For this purpose, I think the following is necessary.

Specify the number of cycles required for each type of operation
Adding a flag to indicate whether the operation is complete
Make the next process wait until the calculation is finished