How I started with arithmetic cores and built a better selector (multiplexer).

Apr 22, 2023

This is my new Medium account (replacing the one under Pushkarev Valeriy Andreevich). It looks like I'm losing my Gmail account and can't recover my mail, even by phone :(

Anyway, this is the story of a better memory selector for DDR/SSD and other types of memory.

The beginning

I read an article about positional encoding (one-hot encoding), which makes state machines faster.

So I implemented a set of arithmetic instructions on top of it (add/mul, logical operators, and comparators). But after testing, I figured out that all the tricks are in the transistors: their fan-out vs. gate capacitance vs. switching power characteristics.

Still, I got the fastest way to implement any 2–4 bit function (for example, a sequential adder that works as fast as a carry-lookahead adder). So I'll build a selector :)

DRAM/SSD address decoders and what you can read about them

Well, it's disgusting. All the information about memory address decoders looks like this:

https://www.ques10.com/p/36431/what-are-various-decoders-used-in-memory-structure/#:~:text=The%20row%20and%20column%20decoders,raising%20its%20voltage%20to%20VOH. (NOR)
https://people.engr.tamu.edu/rgutier/lectures/mbsd/mbsd_l16.pdf (NAND)
https://www.ece.ucdavis.edu/~bbaas/281/notes/Handout.verilog6.pdf (even worse: “Each long wire has N/4=64 gate loads”)

And so on.

Basically, more than 4–10 gates on a single line is bad practice. Also, the capacitive load of DDR cell gates is smaller than the capacitive load of gates that are connected to ground or supply voltage. For example, 64 gates per line costs about 16 FO-4 delays, and that is for 256 lines (not 65536). There is also a limit on the capacitive load, defined by the peak thermal power of the transistor. And there are schemes that can decode addresses of any width with two transistors per row.

How?

It is simple: a tree with a fanout of 4 or 16, where every selector turns on only one of the selectors after it and passes the address on to that selector.

But first of all, we need to convert our binary address to the one-hot (unary positional) encoding.
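A minimal Python sketch of that conversion, just to make the idea concrete. It assumes the address is split into groups of 2 bits (fanout 4) or 4 bits (fanout 16) and that every group is converted to one-hot separately. The grouping and the function names are my illustration here, not a description of the actual circuit.

```python
def one_hot(value: int, width: int) -> list[int]:
    """Return 2**width bits where only the bit at index `value` is 1."""
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the given width")
    return [1 if i == value else 0 for i in range(1 << width)]


def address_to_one_hot_groups(address: int, address_bits: int,
                              bits_per_level: int = 2) -> list[list[int]]:
    """Split a binary address into digits (most significant first)
    and encode every digit as a one-hot vector.

    bits_per_level=2 gives fanout 4, bits_per_level=4 gives fanout 16.
    """
    if address_bits % bits_per_level != 0:
        raise ValueError("address_bits must be a multiple of bits_per_level")
    mask = (1 << bits_per_level) - 1
    groups = []
    for shift in range(address_bits - bits_per_level, -1, -bits_per_level):
        digit = (address >> shift) & mask
        groups.append(one_hot(digit, bits_per_level))
    return groups


if __name__ == "__main__":
    # 4-bit address 0b1001 -> digits 0b10 and 0b01
    for group in address_to_one_hot_groups(0b1001, 4):
        print(group)
```

Each group has exactly one active wire, so one group can drive one level of the tree directly.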

After that, we simply write a tree of selectors.
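Here is a rough behavioural model of such a tree in Python. It is not the netlist, only the logic: at every level exactly one selector is enabled, it looks at the one-hot digit for its level, and it enables exactly one of its children. The name `select_row` and the bit grouping are assumptions made for this sketch.

```python
def select_row(address: int, address_bits: int,
               bits_per_level: int = 2) -> tuple[int, int]:
    """Walk the selector tree and return (selected_row, levels).

    Only one selector per level is ever enabled, so every enabled
    selector drives just `fanout` children, whatever the address width.
    """
    if address_bits % bits_per_level != 0:
        raise ValueError("address_bits must be a multiple of bits_per_level")
    fanout = 1 << bits_per_level
    levels = address_bits // bits_per_level
    enabled = 0  # index of the single enabled selector; the root is 0
    for level in range(levels):
        shift = address_bits - bits_per_level * (level + 1)
        digit = (address >> shift) & (fanout - 1)
        hot = [1 if i == digit else 0 for i in range(fanout)]
        child = hot.index(1)  # the only child that gets turned on
        enabled = enabled * fanout + child
    return enabled, levels


if __name__ == "__main__":
    for bits in (2, 4):
        row, depth = select_row(0xBEEF, 16, bits)
        print(f"fanout {1 << bits}: row {row:#06x}, {depth} levels")
```

For a 16-bit address this gives 8 levels with a fanout of 4 and 4 levels with a fanout of 16, and the selected row index is the original address.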

Why does it work?

Instead of NOR and NAND row address decoders, we get a constant branching factor (of 4 or 16) and a low capacitive load.

Also, we get a low logical depth in the selector.

And as a result, we get a ridiculous speed (comparable with the speed of a carry-lookahead adder of the same width).

Also, we get only 1–2 transistor gates per row for any address width (!).

How does that compare with existing ones?

A 13 FO-4 delay for a 16-bit selector (65536 lines), which is less than the 16 FO-4 delay of the UC Davis design (256 lines).

Oh, and that is about 20 ps on 7 nm, or about 40 ps for the row and column selectors together (32-bit address space). That is more than 5 GHz. Plus line charging, and so on.

Anyway, that is much faster than NOR/NAND selectors and consumes less power (about 40 transistors switching; to get 1 W you must switch about 5 million transistors at 1 GHz on 32 nm).

Here is a comparison with the fastest selector from UC Davis. I get 20% less area with the same speed characteristics.
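To put the power claim into numbers, here is a back-of-the-envelope calculation in Python. It uses only the figures from the paragraph above (1 W for about 5 million transistors switching at 1 GHz on 32 nm, and about 40 transistors switching in the selector). The 1 GHz access rate for the selector and the linear scaling are assumptions of this sketch, not measured results.

```python
def energy_per_switch_joules(power_watts: float, transistors: float,
                             frequency_hz: float) -> float:
    """Average energy of one transistor switching event."""
    return power_watts / (transistors * frequency_hz)


def selector_power_watts(switching_transistors: int, access_rate_hz: float,
                         energy_joules: float) -> float:
    """Power of a block where `switching_transistors` toggle on every access."""
    return switching_transistors * access_rate_hz * energy_joules


# Rule of thumb from the text: 1 W <-> 5 million transistors at 1 GHz (32 nm).
ENERGY = energy_per_switch_joules(1.0, 5e6, 1e9)

# Assumption: the selector is accessed once per nanosecond (1 GHz).
POWER = selector_power_watts(40, 1e9, ENERGY)

if __name__ == "__main__":
    print(f"energy per switch: {ENERGY * 1e15:.2f} fJ")
    print(f"selector power:    {POWER * 1e6:.2f} uW")
```

Under these assumptions one switch costs about 0.2 fJ, so 40 switching transistors at 1 GHz come to roughly 8 microwatts.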

A funny example: storing the result of an addition can consume more energy than performing the addition. That's why every top-50 institute has its own research project on Topological Sorting and Operation Mapping on FPGA :) (with “ground-breaking results that are better than any TOP HARDWARE NOWADAYS”, of course). And there are zero “AddMulJmpe” arithmetic blocks at 1 GHz or less, built on statistical processing of all typical command sequences/datapaths. (The exceptions are MantiCore and some of the ARM commands, but according to the publications that is more the Mysterious Power of Belief (also known as a tree of 8 Tbps buses) and fewer cycles than the Compute More Store Less paradigm.)

It is the same kind of mistake as “your own global optimisation method that is superior to the ones in Matlab”.

No, really, you can even check the synthesis results here.

Used materials:

Image for publication — https://www.elprocus.com/different-types-of-demultiplexers/
Example 1 — https://www.ques10.com/
Example 2 — https://people.engr.tamu.edu/
Example 3 — https://www.ece.ucdavis.edu/
Synthesis and timing analysis — OpenLane (https://github.com/The-OpenROAD-Project/OpenLane)