JuxtaPiton: x86 support coming through!

Kunal Gulati
Aug 25, 2019

JuxtaPiton + x86 = Google Summer of Code 2019

In this series of posts, continued from JuxtaPiton: With some x86 goodness!, I describe the work done to extend JuxtaPiton’s heterogeneous-ISA research platform with the open-source ao486 core. The ao486 is i486-compatible, so in this iteration of JuxtaPiton it provides x86 ISA support alongside the OpenSPARC T1’s SPARC V9.

In my previous post, I described:

  1. How unaligned accesses issued by the x86 core are converted to the aligned accesses that P-Mesh allows.
  2. How instruction fills, which are non-L1.5-cacheable and directly access L2, are implemented, and how we overcame the partial implementation of the Avalon memory and I/O interface.
  3. The software infrastructure: we borrowed our BIOS from the Bochs project and stripped out unnecessary I/O support such as PCI and VGA, which would only add debugging effort at the RTL simulation layer.
  4. How the BIOS was prototyped on the Bochs platform until we confirmed that it could run on our system, which doesn’t support all I/O devices, and that it properly dumps boot information and bare-metal test info on the UART.
  5. How we would like I/O to behave in our system, and the changes needed in both software and RTL to convert port-mapped I/O to memory-mapped I/O.
  6. That reads and instruction fills were already working in the RTL simulation of the core and P-Mesh, and that the modified BIOS was working in Bochs.

Our next target was to get writes working correctly. Writes from an x86 core are already tricky because of their unaligned nature, and the partial Avalon support in the master doesn’t help. To get them working, we added buffers on the transducer’s input memory bus to capture all the byte-enables, the data and the burst-count, so that once the Avalon master has submitted the full request we can work out the addresses and the access pattern to submit to P-Mesh. Since the memory bus from the Avalon interface to the transducer is 32 bits wide, each beat carries 4 byte-enable bits. P-Mesh allows 1-, 2-, 4- and 8-byte writes, so we handle each write issued by the core as a combination of 1-, 2- or 4-byte writes to the fabric. These are submitted serially, and the core has to wait for all of them to complete before any other memory request is serviced.
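To make the splitting concrete, here is a minimal Python model of how one 32-bit beat and its 4 byte-enable bits could be broken into 1-, 2- or 4-byte writes. This is only a sketch, not the transducer RTL: the function name `split_write`, the little-endian byte-lane order (enable bit 0 covers data bits 7:0) and the requirement that every sub-write be naturally aligned are assumptions made for illustration.

```python
def split_write(word_addr, byte_enable, data):
    """Split one 32-bit beat into naturally aligned 1-, 2- or 4-byte writes.

    Assumptions (not taken from the RTL): byte-enable bit i covers data
    bits [8*i+7 : 8*i], and each sub-write must be aligned to its size.
    Returns a list of (address, size_in_bytes, payload) tuples.
    """
    if word_addr % 4 != 0:
        raise ValueError("word_addr must be 4-byte aligned")
    if not 0 <= byte_enable <= 0xF:
        raise ValueError("byte_enable must fit in 4 bits")

    writes = []
    offset = 0
    while offset < 4:
        if not (byte_enable >> offset) & 1:
            offset += 1  # this byte lane is disabled, skip it
            continue
        # Pick the largest aligned size whose bytes are all enabled.
        for size in (4, 2, 1):
            mask = ((1 << size) - 1) << offset
            fits = offset % size == 0 and offset + size <= 4
            if fits and (byte_enable & mask) == mask:
                payload = (data >> (8 * offset)) & ((1 << (8 * size)) - 1)
                writes.append((word_addr + offset, size, payload))
                offset += size
                break
    return writes


if __name__ == "__main__":
    for be in (0b1111, 0b1110, 0b0110):
        print(bin(be), [(hex(a), s, hex(p)) for a, s, p in split_write(0x1000, be, 0xAABBCCDD)])
```

With this model, a byte-enable of 0b1110 becomes a 1-byte write followed by a 2-byte write, while 0b0110 needs two 1-byte writes because the enabled pair is not 2-byte aligned.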

In the version of the transducer presented in the previous blog post, reads and writes were arbitrated to the mesh fabric independently, on the assumption that the Avalon master would adhere to the specification. But because of an issue with how the Avalon master responds to the ‘waitrequest’ signal driven by the slave, we needed a common state machine to arbitrate both reads and writes to the mesh fabric. This reduced the number of bugs and gave each memory transaction tighter closure, since every transaction now follows a single ‘master’ timing signal.
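The idea of one state machine owning every transaction can be sketched in a few lines of Python. This is a behavioural toy, not the actual RTL: the state names (`IDLE`, `ISSUE`, `WAIT_RESP`), the class `TransducerFsm` and the `busy` flag are all hypothetical, and the real design may well use more states.

```python
from collections import deque
from enum import Enum, auto


class State(Enum):
    IDLE = auto()       # nothing in flight
    ISSUE = auto()      # request is being sent to the fabric
    WAIT_RESP = auto()  # waiting for the fabric to answer


class TransducerFsm:
    """Toy model: reads and writes share one FSM, one request in flight."""

    def __init__(self):
        self.state = State.IDLE
        self.pending = deque()   # requests queued by the core
        self.current = None      # request currently owned by the FSM
        self.completed = []      # requests finished, in service order

    def submit(self, kind, addr):
        self.pending.append((kind, addr))

    @property
    def busy(self):
        # In hardware, something like this could drive the stall toward the core.
        return self.state is not State.IDLE

    def step(self, fabric_response=False):
        """Advance one clock cycle."""
        if self.state is State.IDLE:
            if self.pending:
                self.current = self.pending.popleft()
                self.state = State.ISSUE
        elif self.state is State.ISSUE:
            self.state = State.WAIT_RESP
        elif self.state is State.WAIT_RESP and fabric_response:
            self.completed.append(self.current)
            self.current = None
            self.state = State.IDLE
```

Because there is a single `current` request, a write can never overtake a read that is still waiting on the fabric, which is the property the shared state machine is meant to guarantee.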

In this process, we also came across a curious observation in P-Mesh. In the previous version of the transducer, we handled the acknowledgment back to P-Mesh for a serviced request in accordance with the burst-count signal. For example, if the core requested 16 bytes on a 4-byte bus, the acknowledgment to P-Mesh would be asserted for 4 cycles.

This wasn’t causing any problems in our initial bare-metal tests, simply because we weren’t making enough requests to the mesh fabric for it to notice the extra acknowledgments we were mistakenly sending. So when both reads and writes were working and we finally moved on to booting our BIOS, we observed that the NoC inside P-Mesh was overflowing after a particular number of requests. Only after a lot of debugging did we find that the acknowledgments are meant to be single-cycle handshakes, not the continuous burst-count-long assertion we had been driving. Fixing this also made our code a lot cleaner and more manageable overall.
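The arithmetic behind the bug is simple enough to write down. The sketch below only counts acknowledgment cycles; it does not model the NoC buffers or the point at which they overflow, and the helper names `ack_cycles` and `surplus_acks` are made up for illustration.

```python
def ack_cycles(request_bytes, bus_bytes, per_beat):
    """Cycles the acknowledgment is asserted for one request.

    per_beat=True  models the old behaviour (one ack cycle per bus beat).
    per_beat=False models the fix (a single-cycle handshake).
    """
    beats = -(-request_bytes // bus_bytes)  # ceiling division
    return beats if per_beat else 1


def surplus_acks(num_requests, request_bytes=16, bus_bytes=4):
    """Extra ack cycles sent by the old behaviour over num_requests requests."""
    old = ack_cycles(request_bytes, bus_bytes, per_beat=True)
    new = ack_cycles(request_bytes, bus_bytes, per_beat=False)
    return num_requests * (old - new)


if __name__ == "__main__":
    print(ack_cycles(16, 4, per_beat=True))   # old behaviour for a 16-byte request
    print(surplus_acks(1000))                 # extra ack cycles after 1000 such requests
```

Each 16-byte request produced three ack cycles too many, so the surplus grows linearly with the number of requests. That fits with short bare-metal tests passing while the much longer BIOS boot eventually failed.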

Our earlier decision to move to a single state machine for all requests going back and forth between the core and P-Mesh paid off, because it helps us convert I/O calls from the core into memory calls, turning a port-based I/O system into a memory-based one. In the previous blog post, we mentioned that only serial would be supported in our initial build, but we later found that certain I/O can be isolated: the PIC (Programmable Interrupt Controller), the PIT (Programmable Interval Timer) and the RTC (Real Time Clock). These modules can be attached to the core via another transducer that arbitrates the core’s I/O requests to the targeted peripheral and services them.

The next task in the pipeline is to enable full serial support on hardware. In its current state, the core is able to run the bare-metal tests that use software interrupts in Bochs to output on the UART. In the full RTL simulation of the core and P-Mesh, we’ve replaced the software interrupt with a memory transaction to emulate memory-mapped I/O. An “out” to an address that ultimately goes to the system’s memory is the same as “mov”ing to that address. This hack is only there for testing, because obviously we can’t replace every “in” and “out” instruction with a “mov” that targets memory instead of a port. To fully support port-based I/O in the software infrastructure, we can do one of two things:

  1. We can change how the software interrupts work by rewriting the procedures that the BIOS initializes them with. That’s not difficult for serial, but it is a little tricky for INT 13h, which provides disk services.
  2. Or we can convert all the I/O calls from the core on the Avalon I/O bus into memory transactions that share the same interface as the core’s memory transactions to P-Mesh. They’ll then share time on that memory bus, which needs careful prioritization to keep the core running optimally.

We’ve decided to go with the second option, considering that Linux doesn’t always use BIOS interrupts and there’s a limit to how much software we can keep changing. It’s best to make the modification in the layer we know best and make minimal changes to already stable software. That said, this approach comes with its own set of challenges. Guess there’s no free lunch!

The port addresses for all these peripherals clash with the memory space reserved for the BIOS, so we can’t let the transducer map them to just any memory range. We’ve chosen the range from 0x000C8000 to 0x000EFFFF, which is set aside for mapped hardware and miscellaneous uses. In short, although the core issues requests to port addresses like 0x3f8 (Serial Port 1) and its offsets, inside the transducer RTL layer they get mapped to non-cacheable memory transactions at the addresses we define in the OpenPiton SoC. In this initial version, we will support the PIC, PIT, RTC, UART and disk transfers emulated on an SD card.
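As a rough illustration of the port-to-memory translation, the Python sketch below maps a port number into the chosen window. The window bounds and the 0x3f8 serial port come from the text above; the other port ranges are the conventional PC assignments. The linear `base + port` mapping and the names `port_to_mmio` and `device_for_port` are assumptions for the example, not the mapping the transducer actually uses.

```python
MMIO_WINDOW_START = 0x000C8000  # from the post
MMIO_WINDOW_END = 0x000EFFFF    # from the post

# Conventional PC I/O port ranges (inclusive) for the peripherals in question.
PORT_RANGES = {
    "PIC (master)": (0x20, 0x21),
    "PIT": (0x40, 0x43),
    "RTC": (0x70, 0x71),
    "PIC (slave)": (0xA0, 0xA1),
    "UART (COM1)": (0x3F8, 0x3FF),
}


def port_to_mmio(port):
    """Translate a 16-bit I/O port into an address inside the MMIO window.

    Hypothetical scheme: window start plus port number. The full 64 KiB
    port space fits, since 0xC8000 + 0xFFFF = 0xD7FFF is below 0xEFFFF.
    """
    if not 0 <= port <= 0xFFFF:
        raise ValueError("x86 I/O ports are 16 bits wide")
    addr = MMIO_WINDOW_START + port
    if addr > MMIO_WINDOW_END:
        raise ValueError("mapped address falls outside the MMIO window")
    return addr


def device_for_port(port):
    """Return the peripheral that owns a port, or None if unclaimed."""
    for name, (lo, hi) in PORT_RANGES.items():
        if lo <= port <= hi:
            return name
    return None


if __name__ == "__main__":
    port = 0x3F8
    print(device_for_port(port), hex(port_to_mmio(port)))
```

Under this assumed scheme, an “out” to port 0x3f8 would appear on the fabric as a non-cacheable write to 0xC83F8.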

In our initial BIOS run, we only targeted correct functionality for reads and writes. We modified the BIOS to exclude any I/O calls that can halt the CPU if not serviced (or ignored) properly, and removed the INT 13h (Disk Services) routines that transfer data from a hard disk (emulated in simulation) to main memory. We then simply placed the BIOS code at the correct addresses in main memory and gave it a go.

It flawlessly switched from real mode into protected mode (some x86 jargon), and now we’re able to manipulate the 32-bit registers and reference a 32-bit address space in our memory transactions. From there, we wrote a bare-metal program that outputs “hello world!” on the UART using a memory-mapped interface.

Current State of the System:

We started our journey with a simple transducer that connected the Avalon memory interface to P-Mesh. In its current form:

  1. All the reads and writes issued from the core via the Avalon memory interface to P-Mesh are working.
  2. With some internal probing of the core, we can distinguish between an instruction fill request and a simple read. This lets us implement non-L1.5-cacheable IFills from P-Mesh to the core as requested by the Avalon interface.
  3. The essential platform I/O needed for full platform functionality (PIC, PIT and RTC) is connected directly to the Avalon I/O interface with the help of an I/O transducer.
  4. The modified Bochs BIOS can fully boot and enter protected mode, where we can manipulate 32-bit registers throughout the system.
  5. Inside protected mode, we can successfully write to the UART with bare-metal code using memory-mapped I/O.
  6. With the current state of the system, we could rewrite the procedures for the Disk Services and UART software interrupts to boot a real OS on top of the modified BIOS, which would require much less effort than the RTL approach discussed in the next section.

The Future!

Just wanted this cool Blade Runner reference here

We’re currently working on supporting both UART and Disk Services with a proper memory-mapped I/O interface in our transducer RTL, to replace the hacks we’ve deployed so far to get things running. After this, we plan to run more elaborate bare-metal tests in protected mode to properly stress-test all the read/write functions supported by the transducer.

Then the cool stuff comes. The next big thing would be support for atomics and cache coherency between both L1 caches of the ao486 and P-Mesh.

There is a lot that can be done from that point. We can go for a full Linux or Windows boot, or even build a multi-core ao486 SoC whose cores are fully coherent with each other with the help of our beloved P-Mesh.

Summary of the Commits:

The single PR that contains all of the working code is linked here. The comments in the PR contain a short summary of the work that was done, but please refer to this and the previous blog post for the technical details of the code. To achieve the overall functionality of this system, I also drew on other open-source work and made commits both to forks and to their principal repositories, where they were merged. These repositories include a snapshot of the latest Bochs SVN hosted at GitHub linked here, the fork of x86 bare-metal examples by cirosantilli linked here and the ao486 MiSTer fork linked here.
Commits were made to all of these repositories to support the work done in the main OpenPiton repo, of which my fork is linked here.
To summarize all the commits, I made a Google Sheet, which is linked here.

Special Mention:

I’m really grateful to my mentor Jonathan Balkind at Princeton University for providing this opportunity to me and continuously engaging with me over this period. It’s only with his help and his really patient attitude towards all my silly queries that I have gotten this far in this project. I’m also grateful to Prof. David Wentzlaff for letting us pursue this project and helping us throughout.
