What is JuxtaPiton?
JuxtaPiton is (to our knowledge) the world’s first open-source, general-purpose, heterogeneous-ISA processor. It is being developed at Princeton’s Parallel Group; it was originally designed by Katie Lim (now at the University of Washington) and is now under the supervision of Jonathan Balkind and Prof. David Wentzlaff. It serves as a platform for research on heterogeneous ISAs. Its original publication presented the case of an OpenSPARC T1 core (SPARC V9 ISA) coupled with a PicoRV32 core (RISC-V ISA). Both cores shared a fully cache-coherent memory subsystem enabled by OpenPiton’s P-Mesh cache coherence system. The JuxtaPiton infrastructure allows cores with different ISAs to be coupled at OpenPiton’s L1.5 cache, which keeps the newly added core coherent with the existing modified OpenSPARC T1 core. This gives us a unique opportunity to evaluate these cores without worrying too much about data coherence, and it is one of the easiest ways of adding an L2-L3 cache structure with a tightly coherent interconnect (our beloved P-Mesh).
The case for x86 ISA
In this iteration of JuxtaPiton, we are augmenting the infrastructure with the open-source ao486 core, which is i486 compatible and provides x86 ISA support, connected alongside the OpenSPARC T1’s SPARC V9. Such a system lets us reuse a lot of legacy x86 code (our first priority is obviously Doom) and is capable of booting Windows 95 or the Linux kernel (up to version 3.13).
To understand how the whole system should be tied together, we have to take a close look at the structure of the ao486 core and OpenPiton’s cache subsystem.
Figure 1 describes the original JuxtaPiton architecture, in which the PicoRV32 core is tied to the L1.5 of the OpenPiton cache subsystem, which in turn connects to the L2 over the P-Mesh coherent interconnect. For this project, we are replacing the PicoRV32 with the ao486 core, giving it a coherent 3-level cache system for improved memory performance. This is accomplished by interfacing the L1 of ao486 with the L1.5 cache of OpenPiton. You might be wondering why it’s called L1.5 instead of L2. This is because the L1.5 is the interfacing point between the host (OpenPiton) and guest (ao486) memory interfaces. In our case, it is the same size as the L1, but it gives us another level of cache, the interface to P-Mesh, and the point of initiation for all transactions that maintain coherency with the write-back L1 of ao486. So basically, it does everything that a cache level usually does, plus extra functionality. What makes prototyping and evaluation much faster is how easy it is to interface with, either from a core (as with PicoRV32, which doesn’t have an L1 cache) or from another cache level (as with ao486).
Let’s tie ’em up!
The ao486 project has a single level of cache, which uses the Avalon memory interface (by Intel) to submit both memory and IO requests from the core. The use of Avalon makes sense because ao486 targets an FPGA, specifically an Altera one. We had two options for connecting the L1 of the core to OpenPiton’s L1.5: either remove Avalon or interface with it. After looking at the code structure of the ao486 memory subsystem, we found that Avalon is quite deeply embedded in the way the L1 structures a load or store request, so removing it was not attractive. Instead, we kept Avalon and wrote a transducer that adheres to the Avalon specification and converts those requests into a form the L1.5 can handle.
The first requirement was to handle the unaligned accesses created by the core, since the x86 ISA doesn’t guarantee natural alignment. Because the L1 adheres to the Avalon specification, every address it puts on the bus is 4B aligned, but this comes with a catch: that alignment tells the whole story for loads, but not for stores. For loads, the 4B-aligned address together with a signal called “burstcount” specifies the full range of words requested. OpenPiton’s L1.5, however, can only return 16B-aligned data, so the transducer must service the largest request (16B starting at any 4B-aligned address) from a fabric that only serves 16B-aligned accesses. One of the worst cases for access time is a 16B request from an address like 0x000FFFFC. Since our data bus is 4B wide, the burstcount here is 4 (binary 100). The request needs two accesses to the L1.5, for a total transfer of 32B, of which 16B are filtered out and returned to the L1. A two-access traversal can also happen for an 8B request when the address modulo 16 is 0xC.
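The read case above can be sketched in a few lines of Python. This is only an illustration of the address arithmetic, not the transducer’s RTL; the function name `plan_l15_reads` and the constants are our own hypothetical names, and the only facts taken from the text are the 4B bus width and the 16B-aligned L1.5 accesses.

```python
LINE_BYTES = 16  # the L1.5 returns 16B-aligned data (from the text)
WORD_BYTES = 4   # Avalon data bus width (from the text)


def plan_l15_reads(addr, burstcount, line_bytes=LINE_BYTES):
    """Return the line-aligned addresses needed to cover one Avalon burst read.

    Hypothetical helper: `addr` is the 4B-aligned Avalon address and
    `burstcount` is the number of 4B beats requested.
    """
    if addr % WORD_BYTES != 0:
        raise ValueError("addresses from the L1 are expected to be 4B aligned")
    if burstcount < 1:
        raise ValueError("burstcount must be at least 1")

    first_byte = addr
    last_byte = addr + burstcount * WORD_BYTES - 1

    # Round both ends down to the enclosing aligned line.
    first_line = first_byte - first_byte % line_bytes
    last_line = last_byte - last_byte % line_bytes

    return list(range(first_line, last_line + line_bytes, line_bytes))


# Worst case from the text: 16B starting at 0x000FFFFC needs two accesses.
worst_case = plan_l15_reads(0x000FFFFC, 4)
# An 8B request at offset 0xC within a line also straddles two lines.
eight_byte_case = plan_l15_reads(0x100C, 2)
# A 16B request that is already 16B aligned needs a single access.
aligned_case = plan_l15_reads(0x1000, 4)
```

Here `worst_case` comes out as the two lines at 0xFFFF0 and 0x100000, which matches the 32B total transfer described above.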
For writes, the L1 can issue an unaligned write on a 4B-aligned address with the help of the “byteenable” signal. For example, to write a byte at address 0x0023, the request carries address 0x0020 and byteenable 4'b1000. This lets the L1 issue writes to basically any address, and combined with “burstcount”, it can write multiple bytes spanning several contiguous 4B words. Here P-Mesh really helps us by allowing aligned 1B, 2B, 4B, and 8B stores to the L1.5. It is then the transducer’s job to buffer all the incoming writes and issue them to the L1.5 as a correct sequence of the aligned stores it can accept. The transducer also handles the opposite endianness of the L1.5 and ao486 by flipping requests going either way.
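To make the store path concrete, here is a minimal Python sketch of turning byteenables into aligned stores. It assumes a simple greedy policy (take the largest aligned store that fits at the lowest pending byte), which is our illustrative choice and not necessarily what the transducer does in RTL. The function name `split_into_aligned_stores` is hypothetical, and bit i of a byteenable is assumed to enable the byte at offset i, as in the 4'b1000 example above. Endianness flipping is left out.

```python
STORE_SIZES = (8, 4, 2, 1)  # aligned store sizes accepted by the L1.5 (from the text)


def split_into_aligned_stores(addr, byteenables):
    """Split one Avalon write burst into (address, size) aligned stores.

    Hypothetical helper: `addr` is the 4B-aligned burst address and
    `byteenables` holds one 4-bit mask per 4B beat of the burst.
    """
    # Collect the absolute address of every enabled byte in the burst.
    enabled = set()
    for beat, mask in enumerate(byteenables):
        for lane in range(4):
            if (mask >> lane) & 1:
                enabled.add(addr + 4 * beat + lane)

    stores = []
    covered = set()
    for byte_addr in sorted(enabled):
        if byte_addr in covered:
            continue
        # Greedy choice: largest naturally aligned store whose bytes are all enabled.
        for size in STORE_SIZES:
            fits = byte_addr % size == 0 and all(
                (byte_addr + k) in enabled for k in range(size)
            )
            if fits:
                stores.append((byte_addr, size))
                covered.update(range(byte_addr, byte_addr + size))
                break
    return stores


# Single byte at 0x23, as in the text: address 0x20 with byteenable 4'b1000.
single = split_into_aligned_stores(0x20, [0b1000])
# Nine bytes from 0x21 to 0x29 exclusive of 0x29..., spread over three beats.
ragged = split_into_aligned_stores(0x20, [0b1110, 0b1111, 0b0001])
```

For the three-beat example, the enabled bytes are 0x21 through 0x28, and the greedy split yields a 1B store at 0x21, a 2B store at 0x22, a 4B store at 0x24, and a 1B store at 0x28.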
We’ve also implemented the instruction fill (IFill) feature available in P-Mesh, which allows non-L1.5-cacheable instruction fills to go from the L2 directly into the L1 of the core, since we never need a modified copy of instructions. IFills from P-Mesh are 32B-aligned accesses to the L2, so a 4x4B request near the edge of a 32B line again needs to traverse two accesses and filter out the unwanted data. For this, IFills share a lot of infrastructure (buffers, arbitration signals, etc.) with the reads that traverse two 16B-aligned accesses. This lets us tightly control when the arbitration and control signals that govern reads/IFills and writes from Avalon to P-Mesh open and close.
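The filtering step for an edge IFill can be sketched as follows. This is an assumption-laden illustration: `extract_burst` is a hypothetical name, the fetched 32B lines are modelled as a plain dictionary keyed by line-aligned address, and byte ordering within a line is taken to be ascending with no endianness flip.

```python
IFILL_LINE_BYTES = 32  # IFills are 32B-aligned accesses (from the text)
WORD_BYTES = 4         # Avalon data bus width (from the text)


def extract_burst(addr, burstcount, lines, line_bytes=IFILL_LINE_BYTES):
    """Pick the requested 4B words out of already-fetched aligned lines.

    Hypothetical helper: `lines` maps a line-aligned address to the
    `line_bytes` bytes fetched for it. A 4B-aligned word never straddles
    a line, so each beat comes from exactly one line.
    """
    words = []
    for beat in range(burstcount):
        word_addr = addr + beat * WORD_BYTES
        line_addr = word_addr - word_addr % line_bytes
        offset = word_addr - line_addr
        words.append(lines[line_addr][offset:offset + WORD_BYTES])
    return words


# Two fetched lines whose contents simply count upward from 0.
fetched = {
    0x1000: bytes(range(0, 32)),
    0x1020: bytes(range(32, 64)),
}
# A 4x4B request starting 8 bytes before the end of the first line.
burst = extract_burst(0x1018, 4, fetched)
```

The first two words of `burst` come from the tail of the line at 0x1000 and the last two from the head of the line at 0x1020; the remaining 48 fetched bytes are discarded.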
Getting the transducer working has been time-consuming because of the many nitty-gritty details of the Avalon memory interface. One of the biggest hurdles was understanding which aspects of Avalon ao486 actually adheres to. The requests issued by the L1 don’t fully follow the Avalon standard. We learned this the hard way: each time the slave (OpenPiton’s L1.5) responded to the core according to the specification, the core would go haywire, either reading wrong data or issuing a wrong request. It was only through trial and error that we learned exactly how the core wanted its requests to be serviced by the Avalon slave. Reads are now working completely, and we are currently working on writes. Stores are tricky because all the incoming data from the L1 needs to be buffered, the correct aligned accesses need to be calculated from the completely unaligned byteenables, and those accesses need to be issued to the L1.5 before the fabric is ready for any other access requested by the L1.
In a small departure from our intended timeline, we first took a thorough look at the software side of things. We thought this was essential because, to boot Linux or Windows, we need to understand the complete compile chain for the BIOS, the bootloader, and the final OS image. Although the ao486 project provides a boot ROM image that supports all the features of the ao486 SoC, we still need to understand the complete process, because the existing BIOS needs many changes to support removing parts of the SoC such as PCI, DMA, USB, etc. The initial boot won’t need them, and they just add unnecessary debugging at the BIOS level. The ao486 project borrows its BIOS from the open-source Bochs project and its VGA functionality from the VGABIOS project. The first step was to understand how the rombios is compiled. The Makefile revealed that it uses the as86 assembler rather than the GNU Assembler (GAS), which we had initially been using with i386 flags to compare our rombios binaries with the ones provided by ao486. Knowing this let us obtain the correct disassembly of the BIOS binary image that comes with ao486 and find out what’s happening inside the BIOS. One might wonder why we were peeking inside the disassembly when the ao486 project provides the corresponding rombios.c and header files. This is because rombios.c contains a lot of C-style procedures that we needed to follow carefully, since our initial RTL won’t support any IO-based instructions, such as the variants of in and out.
Our initial plan is to boot the SoC without any peripherals or VGA. This is because the ao486 SoC uses the NIOS micro-controller on the Altera-based board to service the requests from all these peripherals. We don’t want our project to depend on a proprietary micro-controller, and writing RTL to support these SoC elements would take us too far from our primary task. So how are we going to debug? Serial comes to the rescue. We’ve made an initial version of the BIOS that removes all the code for these peripherals and VGA but keeps support for a COM port. In our Bochs-based simulation, we’re able to print “Hello World” to the outside environment, using compiled binaries that write to a serial port opened in Bochs and supported by the ao486 BIOS. We are in the process of recompiling an old Linux kernel (ao486 supports Linux kernel ≤ 3.13) with serial support so that it can output debug info on the COM port. We have to recompile the whole Linux image because we haven’t found a way of changing the Linux boot command line options when Bochs boots after jumping from the BIOS to the boot segment. Compiling Linux from scratch gets around this: we won’t have to pass serial support as a command line option, and Linux will output debug info to serial by default. This should be finished at around the same time as our addition of support for the in and out instructions.
Once both of these are complete, we’ll be able to test and debug our system for a Linux boot. There are a lot of nitty-gritty details involved in booting an OS like Linux, because the BIOS needs to create a suitable environment for the OS. This involves things like entering Protected Mode, initializing paging, setting up GDTs, etc. All of this will be discussed in further blog posts, where we’ll talk about the functioning of the transducer, the cache-coherent ao486 L1 with support for write-back, and the addition of atomic instructions to the ao486 core for proper hardware-based implementations of semaphores. Stay tuned!