Improving ZipCPU instruction set, part 1
Disclaimer first — I’m a software architect and know very little about hardware design. I’ve just found myself applying my everyday skills to a new problem. I could have made lots of mistakes below.
If you know nothing about ZipCPU, I refer you to the official description at http://zipcpu.com/zipcpu/2018/01/01/zipcpu-isa.html
ZipCPU already has a very reduced instruction set. Still, when I first opened https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/idecode.v and started to work out how it operates, I was surprised by how complex the actual instruction decoding is.
It is amazing how many non-standard decisions ZipCPU creator Dan Gisselquist made while designing the ISA. The most notable are: interrupt handling without interrupt vectors, by simply switching to supervisor mode; a modest set of 16 registers instead of the more common 32; 3 conditional execution bits that encode common combinations of the 4 flag bits; and making the processor flags and PC general-purpose registers. I read the docs again and again, thinking a lot about the design and looking for ways to simplify it.
The ZipCPU 32-bit instruction format has 4 large classes: Standard with an 18-bit immediate, Standard with a register and a 14-bit immediate, MOV with a 13-bit immediate, and LDI with a 23-bit immediate. Condition codes are part of all instructions except LDI.

Why does MOV have a special format? Because it is the only instruction that can move between the user and supervisor register sets. All other instructions work with the current set (the user’s in user mode and the supervisor’s in supervisor mode).
Why do we need LDI? Because MOV cannot move an immediate to a register; it always moves the sum of a register and a 13-bit immediate.
So let’s look at MOV from a different angle. Most of the time we move between registers in the current set. We obviously also need to move between the user and supervisor sets. But do we often need to move between two user registers from supervisor mode? Can we simplify things if we drop this behaviour? In the rare cases when the supervisor needs to move between user registers, it can do so using one of its own registers as an intermediate.
So let’s split MOV into an ordinary MOV instruction, which moves between registers in the current set, and a MOVSU instruction, which moves between sets. Then we can unify MOV with the standard instructions. After unification, MOV will move either an 18-bit immediate or a register plus a 14-bit immediate. Now we suddenly notice that we can get rid of the LDI instruction. Yes, it supports a longer 23-bit immediate, but do we need it that much? Values that fit into 18 bits can now be loaded with MOV, and values that do not fit into 23 bits need 2 instructions anyway. Only values that need between 19 and 23 bits become one instruction more expensive. So let’s drop the LDI instruction altogether. As a bonus, all instructions are now conditional.
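To put a number on that trade-off, here is a minimal sketch in Python that counts the instructions needed to load a constant with and without LDI. The function names are mine and purely illustrative. The only facts taken from the text are the immediate widths (23 bits for LDI, 18 bits for the unified MOV) and the assumption that anything wider costs two instructions.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if value is representable as a two's-complement number of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def load_cost(value: int, single_insn_bits: int) -> int:
    """Instructions needed to load `value`: 1 if the widest single-instruction
    immediate can hold it, otherwise 2 (the count assumed in the text)."""
    return 1 if fits_signed(value, single_insn_bits) else 2


def cost_with_ldi(value: int) -> int:
    # Current ISA: LDI carries a 23-bit immediate.
    return load_cost(value, 23)


def cost_without_ldi(value: int) -> int:
    # Proposal: the unified MOV carries an 18-bit immediate.
    return load_cost(value, 18)
```

Comparing `cost_with_ldi(v)` and `cost_without_ldi(v)` over a range of constants shows which values pay the extra instruction; how often such constants occur in real code is something I have not measured.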
Now let’s look at the MOVSU instruction. The first reaction is that we cannot unify it with the standard instructions, because it needs a direction bit to select a move from or to the user register set. But as we’ve just removed the LDI instruction, let’s just assign two opcodes to it and name them MOVSU and MOVUS.
Now all instructions (except special instructions) are unified.

Arguably, it is very hard to simplify the instruction format any further.
The MOVSU/MOVUS pair occupies the former LDI opcode, with bit 22 as the direction selector.
assign w_movsu = (w_cis_op[4:1] == 4'hc); // former LDI code
// In supervisor mode, bit 22 picks which operand (destination R or source B) uses the user register set
assign w_dcdR[4] = ((!iword[`CISBIT])&&(w_movsu)&&(!OPT_NO_USERMODE)&&(!i_gie))?!iword[22]:i_gie;
assign w_dcdB[4] = ((!iword[`CISBIT])&&(w_movsu)&&(!OPT_NO_USERMODE)&&(!i_gie))?iword[22]:i_gie;
Surprisingly, as 18 + 14 = 32, a long jump, and in general loading a 32-bit immediate value into a register, can now be written as
BREV.cc rev(adjusted 18-bit hi part), r6
MOV.cc r6 + (14-bit lo-part), pc
The only twist is that, as both parts are sign-extended, they must be adjusted if the 18th and/or 14th bit of the desired value is set.
PC is special because it cannot be partially loaded. For other registers, no intermediate is required.
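The low-part adjustment can be sketched in a few lines of Python. This is a simplified model, not a real encoder: it assumes the BREV step leaves exactly `hi18 << 14` in the register with the low 14 bits clear, and it does not model the effect of sign-extending BREV's own 18-bit immediate, so the "18th bit" adjustment would still need handling on top of this. The helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def split_imm32(value: int):
    """Split a 32-bit value into (hi18, lo14) such that
    (hi18 << 14) + sign_extend(lo14, 14) == value (mod 2**32).

    Assumption: the register holds hi18 << 14 with zero low bits
    before the 14-bit immediate is added.
    """
    value &= MASK32
    lo14 = value & 0x3FFF
    # If bit 13 of lo14 is set, the addend is negative, so the high part
    # must be one larger to compensate; the subtraction below does that.
    hi18 = ((value - sign_extend(lo14, 14)) & MASK32) >> 14
    return hi18, lo14


def join_imm32(hi18: int, lo14: int) -> int:
    """Recombine the two parts the way 'register + 14-bit immediate' would."""
    return ((hi18 << 14) + sign_extend(lo14, 14)) & MASK32
```

For example, `split_imm32(0x2000)` yields a high part of 1 rather than 0, because the low part `0x2000` sign-extends to a negative number.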
Similarly, loading a 32-bit immediate into a user register from supervisor mode can be written as
BREV.cc rev(hi part), r6
MOVSU.cc r6 + (14-bit low-part), rx // rx is user's
It seems that, with this new idiom for loading a 32-bit immediate, it is even possible to replace the LDILO instruction with something more valuable or common. The amazing BREV instruction is now even more powerful, and in perfect symbiosis with MOV of a register plus a 14-bit immediate.
Can the special instructions (four are defined: BREAK, LOCK, NOOP, SIM) be streamlined to match the standard instructions? Now that all standard instructions are conditional, I would make the special instructions conditional too, saving an if statement or two in the decoder. Unconditional execution can always be requested by setting the condition codes to “always”. The simplification also brings more flexibility: executing BREAK, or printing to the simulator terminal, on a condition seems occasionally useful. For LOCK and NOOP this does not sound as useful, but why not. The instruction data space lost to the condition codes does not matter much: it is unused in LOCK and NOOP, arguably not vital to BREAK (a debugger can keep additional breakpoint information in its own memory, in a dictionary {address: data}), and makes no difference to SIM, as 2 characters still fit.
Can the Compressed Instruction Set be simplified? It is already very simple and terse, with a single trick: LW/SW with a 7-bit immediate implicitly reference SP, since accessing the 128 bytes at the very start and very end of the 32-bit address space is not too common.

If we cannot simplify it, maybe we can improve it a bit?
We see that MOV with a 7-bit immediate is redundant, because LDI can load all those values, so we have space for “half an instruction”.
IMHO the most valuable use of that space is LSL with a 7-bit immediate, because a register is often shifted to index into an array. So a compiler might wish to generate something like
// a = p[index].field; p is in r1, index is in r2
MOV r2 + off1, r6 | LSL 3, r6
ADD r1 + off2, r6 | LW (r6 + off3), r6
// off1 * 8 + off2 + off3 = offset(field), to increase reach
It seems no logic would change except
begin : GEN_CIS_OP
...
case(iword[26:23]) // Now switch on 4 bits
...
4'hf: w_cis_op = 5'h0d; // MOV
4'he: w_cis_op = 5'h06; // LSL
...
Instead of adding LSL, we might add another trick and extend the LDI range to 9 bits. This might be valuable because it would allow loading an unsigned byte value into a register. I do not think this is such a good idea, but maybe describing it will lead to some valuable thought.
We again see that MOV with a 7-bit immediate is redundant, so we already have half of the space we need to extend LDI. If only we could find the other half… But actually, we almost have one. It is hidden in the pair of ADD and SUB with a 7-bit immediate. Since the immediate is sign-extended and covers [-64..63], having both instructions lets us change a register by [-64..64]; if we drop one of them, the range shrinks by 1, to [-63..64] with SUB alone or [-64..63] with ADD alone. To keep symmetry with CMP, and also because adding 64 is arguably more valuable than subtracting 64, let’s remove ADD with a 7-bit immediate and reuse that opcode for MOV.
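The range argument is easy to check by brute force. The sketch below enumerates every value a register can be changed by, given a sign-extended immediate of a chosen width and a choice of which of ADD and SUB are available. The width is a parameter and the function names are mine; the only assumption is that the immediate is sign-extended, as the decoder's immediate expression suggests.

```python
def imm_range(bits: int) -> range:
    """All values of a sign-extended immediate that is `bits` bits wide."""
    return range(-(1 << (bits - 1)), 1 << (bits - 1))


def reachable_deltas(bits: int, have_add: bool, have_sub: bool) -> set:
    """Set of amounts a register can change by in one instruction."""
    deltas = set()
    for imm in imm_range(bits):
        if have_add:
            deltas.add(imm)    # ADD imm, rx : rx += imm
        if have_sub:
            deltas.add(-imm)   # SUB imm, rx : rx -= imm
    return deltas


def span(deltas: set) -> tuple:
    """Smallest and largest reachable change."""
    return min(deltas), max(deltas)
```

Calling `span(reachable_deltas(bits, True, True))` and then the two single-instruction variants for the width of interest shows that dropping either instruction costs exactly one endpoint: SUB alone loses the most negative change, ADD alone loses the most positive one.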

I have a very simplistic understanding of how FPGAs work and do not know how much this will increase the LUT count. Instead of iword[23], we will now have to use a more complex formula, because ADD is now special
assign using_imm3 = iword[23] && (iword[26:24] != 3'h2); // ADD is not like the others now
What if we want both LSL and a 9-bit LDI? There are 16 quite useless instructions, SUB rx + imm3, rx, which effectively set the flags according to the value of zero minus the 3-bit immediate. I have no idea how many LUTs are required to compare two 4-bit values (I would naïvely think of xoring them plus a 4-bit LUT), but if it is cheap, then issuing LSL with a 3-bit immediate instead sounds valuable. Keeping the sign-extending behaviour of the 3-bit immediate consistent with the rest of the instructions gives a decent shift range of [0..3], effectively multiplying by 2, 4 or 8, while zero-extending increases the range to [0..7], or even to a respectable [1..8] if adding one to the 3-bit value is cheap. All register-selecting logic would probably work without modification.
wire [8:0] w_halfI;
assign shl_trick = (iword[26:23] == 4'h1) & (iword[22:19] == iword[30:27]); // SUB rx + imm3, rx
assign w_halfI =
(iword[26:25]==2'h3) ? w_halfbits[8:0] // 9'b for LDI
: shl_trick ? {6'b0, w_halfbits[2:0]} // +1, if cheap
: using_imm3 ? { {(7){w_halfbits[2]}}, w_halfbits[1:0]}
: { {(2){w_halfbits[6]}}, w_halfbits[6:0]};
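The three ways of interpreting the 3-bit shift amount discussed above can be compared with a small Python model. The mode names are mine. I also assume that a negative shift amount has no useful meaning, which is why the sign-extended variant keeps only its non-negative half.

```python
def shift_amounts(mode: str) -> list:
    """Usable shift amounts for each interpretation of a 3-bit field.

    mode is one of:
      "signed"            - sign-extend, keep non-negative values only
      "unsigned"          - zero-extend
      "unsigned_plus_one" - zero-extend, then add one
    """
    amounts = set()
    for field in range(8):                      # all 3-bit encodings
        if mode == "signed":
            value = field - 8 if field & 0b100 else field
            if value < 0:
                continue                        # assumed not useful as a shift
        elif mode == "unsigned":
            value = field
        elif mode == "unsigned_plus_one":
            value = field + 1
        else:
            raise ValueError("unknown mode: " + mode)
        amounts.add(value)
    return sorted(amounts)


def multipliers(mode: str) -> list:
    """Scale factors the shift amounts correspond to (1 << amount)."""
    return [1 << amount for amount in shift_amounts(mode)]
```

The sign-extended variant wastes half of the encodings, while the plus-one variant trades the no-op shift by zero for a shift by 8; whether the extra adder is worth it is the LUT question I cannot answer.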
If you are still here, thanks for reading!
