GSoC’21 with FOSSi: End of the journey.

Nazerke Turtayeva
Aug 22, 2021


Good day, everyone!

Here, I am pleased to present my final report on the project Multi-Level TLB Support for the CVA6 CPU. CVA6, formerly known as Ariane, is an application-class, Linux-capable 64-bit RISC-V core developed by researchers at ETH Zurich and the University of Bologna. As part of the project, I also worked with the OpenPiton multicore processor, built by the Princeton Parallel Group, to run the Ariane core. It is an exciting open-source platform to explore for your own experiments, where you can scale up to half a billion cores ;)

For structure, here is an outline of what this report covers:

  • Existing Memory Management Unit (MMU) of the Ariane.
  • Achieved goals and work progress.
  • Implementation details.
  • Conclusion or some good wishes.

You might want to skip ahead if you are keener on the implementation details.

Existing Memory Management Unit (MMU) of the Ariane.

The existing Ariane Memory Management Unit (MMU) fully supports virtual memory address translation in S and U modes, with control over both instruction fetch and data access. It uses separate instruction and data L1 TLBs, realized as fully associative flip-flop register files and configurable in size [1]. The MMU utilizes a single hardware Page Table Walker (PTW) to handle TLB misses and load new translations from the page table in DRAM. By adding a new L2 TLB to the core, we can enhance the capabilities of the MMU and the overall performance of the core. If a translation misses in the size-restricted L1 TLB, the MMU does not start a page table walk in main memory right away; instead, it looks up the far larger L2 TLB. If the translation hits in the L2 TLB, this compensates for the L1 TLB miss. In fact, multilevel TLBs are an industry standard in many CPUs, alongside multilevel data caches, so this was an interesting project to explore!
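To make the intended miss-handling order concrete, here is a minimal sketch of the priority between the TLB levels and the PTW. This is not the actual CVA6 code; all module and signal names are hypothetical:

```systemverilog
// Hypothetical sketch: priority of translation sources in a two-level TLB MMU.
module mmu_miss_flow_sketch (
    input  logic translation_req_i,
    input  logic l1_tlb_hit_i,
    input  logic l2_tlb_hit_i,
    input  logic l2_lookup_done_i,   // the multi-cycle L2 lookup has finished
    output logic use_l1_o,
    output logic use_l2_o,
    output logic start_ptw_o
);
  always_comb begin
    use_l1_o    = translation_req_i & l1_tlb_hit_i;
    use_l2_o    = translation_req_i & ~l1_tlb_hit_i & l2_tlb_hit_i;
    // Walk the page table in memory only after both TLB levels have missed.
    start_ptw_o = translation_req_i & ~l1_tlb_hit_i & ~l2_tlb_hit_i & l2_lookup_done_i;
  end
endmodule
```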

Achieved goals and work progress.

The initial goals of the project were to build a private L2 TLB, connect it to the core's existing MMU structure, and improve the MMU as necessary. These milestones have been achieved, and the deliverables are the following:

  1. Private L2 TLB;
  2. Modifications to the MMU and other Ariane source files;
  3. Behavioral simulation results;
  4. Documentation at a separate GitHub repository;

Overall, the L2 TLB currently behaves correctly across the tested scenarios: it runs seamlessly with the core and executes a test “Hello, world” program on the OpenPiton platform error-free. Here is the link to the workspace repository:

A testbench was also designed as part of GSoC to test the hardware with extended case coverage, and it is partially ready. However, the data structure for the “golden model” used in the comparison still needs to be completed, which turned out to be a bit complicated due to the large virtual address space covered by the page table walk. It is worth mentioning that performance and area comparisons for each TLB were among the intended deliverables; however, as there are further important points of improvement, benchmarking will be explored outside the GSoC scope. The additional goals for future enhancements are:

  • FSM based realization of the L2 TLB flush logic;
  • SRAM implementation of the L2 TLB memory;
  • Testing virtual memory performance with targeted benchmarks;
  • Contributing the changes back to Ariane;

Progress during GSoC is detailed below on a week-by-week basis.

Weeks 1–7: Design part.

Community bonding and Week 1:

  • Installation of the latest versions of Ariane, OpenPiton, riscv-tools and Verilator;
  • Subsequent troubleshooting of version compatibility between the tools;

Week 2–3:

  • Initial implementation of the L2 TLB and changes to the MMU and PTW.
  • Simulations in Vivado 2017.4;

Week 4:

  • A second round of literature review to clear up confusion about the theory after feedback on the hardware.

Week 5–6:

  • Second implementation of the hardware. Updates for L2 TLB, MMU, PTW, LSU and LU structures.
  • Simulations in Vivado 2017.4;

Week 7:

  • Updating hardware after feedback and simulations in Vivado;

To recap my experience with this part of the project: constructing the hardware itself was not extraordinary or extremely complicated, but figuring out how everything is connected and how it functions was challenging. The reason is that this knowledge notably affects how the new L2 TLB unit and the MMU have to change. Every tiny detail mattered, or in other words, “the devil is in the details”. Accordingly, the project involved extensive reading of the core's code base.

Furthermore, keeping in mind the parallel nature of coding blocks in SystemVerilog was a decisive point during design, even though it was not new to me. After an intensive academic year of sequential programming, it was my first pitfall. Another significant lesson was to keep the hardware as simple as possible. As explained in the implementation section below, at the end of the program I learned that some of the multiplexers I was adding to the MMU might not even be necessary. Nevertheless, if you are careful about these points, coding hardware is great fun and very engaging!

Weeks 8–10: Testbench part.

Week 8:

  • Understanding general modular testbench theory and an existing example for the write-through data cache (wt_dcache) in the core.

Although I was not aiming to build a UVM-based testbench, extensive coverage of the MMU code and inputs required a more advanced testbench than I was used to. Accordingly, this acquaintance period took a bit more time than I expected. The MMU was chosen as the base structure for the tests, as it includes both L1 TLBs, the L2 TLB and the PTW. This way, I had an opportunity to test different hit-miss scenarios between the TLBs, as well as the possible improvements and exceptions. After understanding the example testbench, I moved on to experimenting with the tools.
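For context, here is a minimal, purely illustrative skeleton of such a testbench around the MMU; the module name, ports and widths are hypothetical and do not match the actual CVA6 interface:

```systemverilog
// Hypothetical, simplified testbench skeleton: drive translation requests
// into a DUT and compare its responses against a software "golden model".
module mmu_tb_sketch;
  logic clk = 0, rst_n = 0;
  logic        req_valid;
  logic [38:0] vaddr;        // Sv39 virtual address (illustrative width)
  logic        resp_valid;
  logic [55:0] paddr;

  // Instantiate the design under test here, e.g.:
  // mmu dut (.clk_i(clk), .rst_ni(rst_n), /* ... */);

  always #5 clk = ~clk;

  // Golden model: an associative array mapping virtual pages to physical pages.
  logic [55:0] golden_map [logic [38:0]];

  initial begin
    repeat (2) @(posedge clk);
    rst_n = 1;
    // Issue requests and check each response against golden_map here.
    $finish;
  end
endmodule
```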

Week 9:

  • Getting familiar with the tools: troubleshooting installation issues with the new Vivado, working with VCS on a remote machine, and choosing the right commands for the runs.

Week 10:

  • Building a testbench for the MMU based on the wt_dcache example.

Week 11:

  • Summing up results and final simulations in the Vivado Simulator;
  • Updating the hardware to handle simultaneous instruction and data address translation requests in a single cycle;
  • Some troubleshooting of core version differences between standalone Ariane and the Ariane used inside OpenPiton;
  • Successful compilation of the whole core and execution of the test program;

Implementation details.

The main changes include the new L2 TLB itself, an MMU updated for the multilevel TLB structure, and changes to the PTW, LSU and LU. As mentioned before, these hardware units were modified several times over the course of the project to ensure a reliable response for all possible input cases, so the following code snippets and descriptions refer to the final version.

The realized L2 TLB is n-way set-associative and based on the simple but reliable hash-rehash structure. Accordingly, it produces an output in a variable number of cycles, in contrast to a fully associative design. Set associativity was chosen because, given the anticipated large size of the L2 TLB, it leads to a cheaper and more power-efficient implementation. A translation result is thus available in the same cycle at the earliest, or in the third cycle at the latest, depending on the size of the requested page, which can be 4 KiB, 2 MiB or 1 GiB. The number of ways is configurable, as is the number of sets.
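To make the indexing concrete, here is a hypothetical sketch of how the set index can be derived from an Sv39 virtual page number for each page size; the module, parameter and signal names are mine, not the actual source:

```systemverilog
// Hypothetical index selection for a hash-rehash set-associative TLB (Sv39).
// Cycle 0 hashes with the 4 KiB granule, cycle 1 with 2 MiB, cycle 2 with 1 GiB.
// Assumes NUM_SETS <= 512 so the index fits into one 9-bit VPN level.
module l2_tlb_index_sketch #(
    parameter int unsigned NUM_SETS = 256
) (
    input  logic [26:0]                 vpn_i,        // {vpn2, vpn1, vpn0}, 9 bits each
    input  logic [1:0]                  hash_order_i, // 0: 4 KiB, 1: 2 MiB, 2: 1 GiB
    output logic [$clog2(NUM_SETS)-1:0] set_idx_o
);
  localparam int unsigned IDX_W = $clog2(NUM_SETS);

  logic [8:0] vpn0, vpn1, vpn2;
  assign {vpn2, vpn1, vpn0} = vpn_i;

  always_comb begin
    unique case (hash_order_i)
      2'd0:    set_idx_o = vpn0[IDX_W-1:0]; // first hash: 4 KiB pages, lowest VPN level
      2'd1:    set_idx_o = vpn1[IDX_W-1:0]; // re-hash: 2 MiB pages, next VPN level
      default: set_idx_o = vpn2[IDX_W-1:0]; // final re-hash: 1 GiB pages
    endcase
  end
endmodule
```

The bits of the VPN level not used for the index serve as the tag, as described below.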

Other n-way set-associative designs exist as well, based on skewing, prediction, speculation and coalescing (MIX TLBs). The reason for choosing the hash-rehash structure was that, despite its variable response time, it is far less energy-hungry than skewing and more reliable than speculation [2]. MIX TLBs could be an interesting option too, but for an initial implementation the hash-rehash design is more industry-proven and leaves room for future improvements, such as adding prediction hardware on top.

Another detail worth mentioning is that this L2 TLB is private to each core and is not meant to be shared like a last-level cache (LLC) in multicore environments.

L2 TLB interface:

Generally, the L2 TLB interface is similar to that of the L1 TLB and only includes one extra signal, all_hashes_checked_o, to let the other hardware units know when the multicycle lookup has finished. Hence, the changes to the PTW, LSU and LU include adding an all_tlbs_checked_i port and a corresponding constraint to stall while the TLBs have not finished checking.
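Since the original snippet is not reproduced here, the following is a hypothetical port sketch; only all_hashes_checked_o comes from the actual design, while the remaining ports are illustrative and modeled after the existing L1 TLB lookup, update and flush interface:

```systemverilog
// Hypothetical port sketch of the L2 TLB (body omitted).
module l2_tlb_if_sketch #(
    parameter int unsigned ASID_WIDTH = 1
) (
    input  logic                  clk_i,
    input  logic                  rst_ni,
    input  logic                  flush_i,              // flush on SFENCE.VMA
    // update port, filled after a completed page table walk
    input  logic                  update_valid_i,
    input  logic [26:0]           update_vpn_i,
    // lookup port
    input  logic                  lu_access_i,
    input  logic [ASID_WIDTH-1:0] lu_asid_i,
    input  logic [63:0]           lu_vaddr_i,
    output logic                  lu_hit_o,
    // extra signal compared to the L1 TLB: asserted once every hash
    // (4 KiB, 2 MiB, 1 GiB) has been probed for the current request
    output logic                  all_hashes_checked_o
);
  // lookup, update, flush and replacement logic would go here
endmodule
```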

L2 TLB translation:

The translation code implements the hash-rehash lookup. The hash_order_q counter stores which hash is being probed in each clock cycle. It advances only when a request is available and there is no pending request waiting for PTW completion. The realization is similar to the L1 TLB lookup; the difference is that the tag match is searched at different indices in different cycles. This variance comes from the multiple page sizes available, as discussed before: the lower bits of each virtual page number level are used to choose an index (the hash), while the remaining virtual page bits are used for the tag comparison. Along the way, it is also necessary to make sure that the upper-level virtual page numbers match the requested ones down to the smallest level. The default (first) hash of the current L2 TLB targets the 4 KiB page size; the subsequent hashes target 2 MiB and 1 GiB pages. Hence the name “hash-rehash” TLB :)
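As the original snippet is not reproduced here, below is a simplified, hypothetical sketch of the hash-rehash sequencing; hash_order_q is the counter mentioned above, everything else is illustrative:

```systemverilog
// Hypothetical sketch of the hash-rehash sequencing. hash_order_q walks
// through the three page-size hashes (4 KiB -> 2 MiB -> 1 GiB) until a
// tag matches or all hashes have been checked.
module l2_tlb_hash_seq_sketch (
    input  logic       clk_i,
    input  logic       rst_ni,
    input  logic       lu_access_i,          // a lookup request is present
    input  logic       ptw_pending_i,        // a PTW request is still outstanding
    input  logic       tag_match_i,          // tag comparison result for the current hash
    output logic [1:0] hash_order_o,
    output logic       lu_hit_o,
    output logic       all_hashes_checked_o
);
  logic [1:0] hash_order_q, hash_order_d;

  assign hash_order_o         = hash_order_q;
  assign lu_hit_o             = lu_access_i & tag_match_i;
  assign all_hashes_checked_o = lu_access_i & (tag_match_i | (hash_order_q == 2'd2));

  always_comb begin
    hash_order_d = hash_order_q;
    if (lu_access_i && !ptw_pending_i) begin
      if (lu_hit_o || hash_order_q == 2'd2) hash_order_d = 2'd0;              // done: restart at the 4 KiB hash
      else                                  hash_order_d = hash_order_q + 2'd1; // re-hash with the next page size
    end
  end

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) hash_order_q <= 2'd0;
    else         hash_order_q <= hash_order_d;
  end
endmodule
```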

The update and flush logic and the PLRU unit were adapted from the L1 TLBs and are nearly identical. For more details, you can look at the source files.

Implemented Multilevel TLB:

MMU with Multilevel TLBs, Version 1

In general, the L1 TLBs for instruction and data address requests are served in parallel, as before. However, with the additional L2 TLB they now update their entries from there rather than from the PTW, for simplicity. In turn, if both L1 TLBs miss in the same cycle for a new request, the D TLB request is fed into the L2 TLB first.

To implement the multilevel TLB described above, several multiplexers were added to the MMU.

MMU, Multiplexer 1:

If a pending data address translation request is already being served by the PTW, then the instruction translation is directed to the L2 TLB. Otherwise, a miss from the L1 D TLB has higher priority for L2 TLB access when both requests arrive in the same cycle.
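As a rough illustration (not the actual code, all names hypothetical), this arbitration can be sketched as:

```systemverilog
// Hypothetical sketch of Multiplexer 1: which L1 miss gets the L2 TLB port.
module mmu_l2_req_mux_sketch (
    input  logic        itlb_miss_i,        // L1 I TLB missed this cycle
    input  logic        dtlb_miss_i,        // L1 D TLB missed this cycle
    input  logic        ptw_serving_dtlb_i, // PTW is already walking for a data request
    input  logic [63:0] itlb_vaddr_i,
    input  logic [63:0] dtlb_vaddr_i,
    output logic        l2_req_valid_o,
    output logic        l2_req_is_instr_o,
    output logic [63:0] l2_req_vaddr_o
);
  always_comb begin
    // Data misses win when both requests are present, unless the PTW is
    // already busy with the data side, in which case the instruction
    // translation uses the L2 TLB.
    l2_req_is_instr_o = itlb_miss_i & (ptw_serving_dtlb_i | ~dtlb_miss_i);
    l2_req_valid_o    = itlb_miss_i | dtlb_miss_i;
    l2_req_vaddr_o    = l2_req_is_instr_o ? itlb_vaddr_i : dtlb_vaddr_i;
  end
endmodule
```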

MMU, Multiplexer 2:

Here, the L2 TLB chooses between the two PTW update ports.
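A hypothetical sketch of this selection (payload type and names are illustrative) could look like:

```systemverilog
// Hypothetical sketch of Multiplexer 2: the L2 TLB has a single update port,
// so the two PTW update channels (instruction and data walks) are arbitrated.
module l2_update_mux_sketch (
    input  logic        itlb_update_valid_i,
    input  logic [63:0] itlb_update_payload_i,
    input  logic        dtlb_update_valid_i,
    input  logic [63:0] dtlb_update_payload_i,
    output logic        l2_update_valid_o,
    output logic [63:0] l2_update_payload_o
);
  always_comb begin
    l2_update_valid_o   = itlb_update_valid_i | dtlb_update_valid_i;
    // Forward whichever walk completed; since there is a single PTW,
    // only one walk is in flight at a time.
    l2_update_payload_o = itlb_update_valid_i ? itlb_update_payload_i
                                              : dtlb_update_payload_i;
  end
endmodule
```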

MMU, Multiplexer 3:

Here, the MMU chooses between the L1 and L2 results. If the result is available in the L2 TLB, it is for a 4 KiB page and the access has the right privilege, then the hit result is available in the same cycle. 4 KiB pages are usually the common case for a lookup.
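A hypothetical sketch of this result selection (signal names are mine) is shown below; a 4 KiB L2 hit arrives in the same cycle as the request, while 2 MiB and 1 GiB hits arrive in the later cycles of the hash-rehash lookup:

```systemverilog
// Hypothetical sketch of Multiplexer 3: choosing between the L1 and L2
// results inside the MMU.
module mmu_result_mux_sketch (
    input  logic        l1_hit_i,
    input  logic [55:0] l1_paddr_i,
    input  logic        l2_hit_i,
    input  logic        l2_priv_ok_i,   // privilege check on the L2 entry passed
    input  logic [55:0] l2_paddr_i,
    output logic        hit_o,
    output logic [55:0] paddr_o
);
  always_comb begin
    // The L1 result has priority; otherwise take a privileged L2 hit.
    hit_o   = l1_hit_i | (l2_hit_i & l2_priv_ok_i);
    paddr_o = l1_hit_i ? l1_paddr_i : l2_paddr_i;
  end
endmodule
```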

On the other hand, it is also possible to avoid these additional multiplexers and forward L2 TLB results into the L1 TLBs directly, so that the MMU always gets its results from the L1 TLBs. This adds one cycle of latency to L2 hit results; however, for the general case this extra latency might be preferable to the extra multiplexers. So, in future versions, one can remove them and connect to the previous L1 TLB outputs. As a block diagram, it looks like this:

MMU with Multilevel TLBs, Version 2

Conclusion or some good wishes.

To conclude, I am super happy to have participated in GSoC’21 and to be part of this community! As someone who has always wondered how computers convert vague human input into commands executable at the electron level, I was glad to dive into the next computer organization enhancement with Ariane! Thanks to FOSSi for the amazing hardware projects, to Google for organizing this program, and to my mentors Jonathan and Nils for their constant advice and support! I would also like to congratulate all the GSoC’21 participants on reaching the finish line! I have been reading the reports of many of you, and I am so proud to have taken part in this program with you!

Best wishes

References

[1] CVA6 Documentation, https://cva6.readthedocs.io/en/latest/

[2] A. Bhattacharjee, D. Lustig, 2017. Architectural and Operating System Support for Virtual Memory, Synthesis Lectures on Computer Architecture, https://doi.org/10.2200/S00795ED1V01Y201708CAC042
