Many people are aware of how much of a powerhouse the University of Michigan is when it comes to Computer Architecture research. It came then as no surprise to me when the first project I started working on as I entered the first year of my PhD was solidly in the Computer Architecture camp.
To avoid going into too many details (as the seminal work is not yet published), I started assisting on a project aimed at redesigning some fundamental mechanisms used in out-of-order issue CPUs. As we were not interested in going to a fab with this idea, and were aiming for a proof-of-concept demonstration for publication (rather than industrial production), we turned to the tool most accessible to the team of not-really-computer-architects that we were: the gem5 microarchitectural simulator.
We used gem5 because it’s an accepted standard among the Computer Architecture community for rapidly prototyping and simulating new instructions or new microarchitectural features in a CPU without having to synthesize Verilog and run it on an FPGA, or pay tons of money to ship off a design and produce working silicon. gem5 also boasts a vibrant community and a rich feature set that allows users to simulate many natively compiled binaries. It’s not the easiest tool to set up or run, depending on how picky you are with your desired simulation environment, but once you do a bit of tutorial reading you should be set to go.
However, not everything was as golden as we hoped…
What was our Problem?
It makes sense that simulating a complex piece of hardware in software is quite slow, to say the least. In my personal experience, small Linux utilities that take maybe half a second on bare metal can take about an hour on gem5; a general ballpark of 1,000–10,000x slowdown would be a safe guess for many programs. This is not too much of a problem for small tests, which could just be simulated overnight. However, we were trying to do this for the entire SPEC CPU 2017 suite of benchmarks.
One of our advising professors told us it took him around six weeks to simulate all of the SPEC CPU suite during his own PhD…once. We didn’t have that kind of time, even just to generate the initial checkpoints.
Why do we have this Problem?
Our problem is fundamental to the idea of using a software simulation of hardware: it will just be slow. Specifically, our problem is not just that detailed simulation is slow, but that the setup simulations are also far too time-consuming.
What do I mean by this? For example, there was another project created for the sake of speeding up simulations called SimPoints. The idea behind SimPoints was to take undetailed simulations of program execution (using a simpler CPU model that was easier to simulate in software), profile execution of the code to find the interesting bits, and create simulator checkpoints specifically at those points. This way, when you want to simulate the program, you can just simulate a few billion instructions at a handful of checkpoints, which takes far less time (minutes to hours rather than weeks).
However, there’s a big catch to this: you still have to do an initial simulation of the whole program to find those interesting portions. Even using a simple CPU model, this setup time alone can take days or weeks. One may say that this initialization cost is only a one-time cost. However, we were trying to simulate an entire suite of benchmarks (SPEC CPU 2017), and when the initial setup alone is estimated to cost you a month or longer…it may be time to re-evaluate the approach.
The Key Insight
Very early in the project, when we were first examining gem5, we were curious about the checkpointing mechanism in gem5, as we knew we would have to use something like it at some point for repeatable simulations.
The structure of a checkpoint in gem5 is pretty straightforward. Each checkpoint consists of two files created by the simulator: a checkpoint file, which contains simulator configurations and current state, such as CPU register contents; and a pmem file, which is basically a memory dump structured as the process virtual memory space.
Other than a few gem5 specific pieces of information, all of this information is basically just the state of a process that you would have if you ran the process on bare metal. Based on our insight about what gem5 really needs to run, we realized that we can get all of this information through everyone’s favorite debugger: gdb. So that’s exactly what we did.
Our Solution: Lapidary
We decided to create a tool, dubbed Lapidary, to vastly accelerate the creation of gem5 checkpoints. Additionally, we built a small utility to spin up parallel instances of gem5 simulations so that simulations on separate checkpoints could be done in parallel. Here I’ll briefly discuss how the tool works and how it’s used.
- Start up the desired program and attach it to the debugger.
- Based on a specified interval (usually once a second), interrupt the program through the debugger and (a) take a core dump and (b) gather register information.
- Transform the information gathered from gdb into gem5-formatted checkpoints. Broadly, this involves (a) inserting the appropriate register information into a gem5 checkpoint file and (b) formatting the core dump from gdb to correspond to the existing virtual memory mappings.
After that, rinse and repeat until the program is done and you have all the checkpoints you need! After this, we move on to the actual simulation.
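To make the first two steps a bit more concrete, here is a minimal sketch of what one sampling step could look like when driven from Python. This is not Lapidary's actual code: the helper names and the checkpoint file naming are made up for illustration, and the sketch assumes gdb is attached fresh for each sample in batch mode rather than kept running. The only gdb features it relies on are the `--batch`, `-p` and `-ex` options and the `info registers` and `gcore` commands.

```python
from __future__ import annotations

import re
from pathlib import Path

# Matches lines of `info registers` output such as:
#   rax            0x1c                28
REG_LINE = re.compile(r"^(\w+)\s+(0x[0-9a-fA-F]+)\b")


def gdb_checkpoint_argv(pid: int, out_dir: str, index: int) -> list[str]:
    """Build a gdb batch invocation that prints registers and writes a core dump.

    The file naming scheme is a hypothetical choice for this sketch.
    """
    core_path = Path(out_dir) / f"checkpoint_{index}.core"
    return [
        "gdb", "--batch",           # run the commands below, then exit
        "-p", str(pid),             # attach to the running process
        "-ex", "info registers",    # (b) gather register information
        "-ex", f"gcore {core_path}",  # (a) take a core dump
    ]


def parse_info_registers(text: str) -> dict[str, int]:
    """Turn the textual output of `info registers` into {name: value}."""
    regs: dict[str, int] = {}
    for line in text.splitlines():
        match = REG_LINE.match(line)
        if match:
            regs[match.group(1)] = int(match.group(2), 16)
    return regs
```

A driver loop would call `subprocess.run(gdb_checkpoint_argv(...), capture_output=True, text=True)` once per interval and feed the captured output to `parse_info_registers`. Translating the result into gem5's checkpoint and pmem files is the third step and is deliberately left out here.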
From this point on, gem5 usage would continue as normal. However, wanting to make the best use of our resources, we created some small scripts to manage concurrent instances of gem5 running simultaneously.
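As a rough idea of what such a management script can look like (a sketch, not the scripts we actually ship), the snippet below runs a list of independent commands concurrently with a bounded number of workers. The function names are hypothetical, and the commands in the example are placeholders standing in for one gem5 invocation per checkpoint.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(argv):
    """Run one simulator process and return (exit code, captured stdout)."""
    done = subprocess.run(argv, capture_output=True, text=True)
    return done.returncode, done.stdout


def run_all(commands, max_workers=4):
    """Run independent commands concurrently; results keep the input order."""
    # Threads are enough here: each worker just waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Placeholder commands; a real run would launch gem5 on each checkpoint.
    commands = [
        [sys.executable, "-c", f"print('checkpoint {i} done')"]
        for i in range(3)
    ]
    for code, out in run_all(commands, max_workers=2):
        print(code, out.strip())
```

Capping `max_workers` at roughly the number of cores (or whatever memory allows) keeps the machine from thrashing when there are hundreds of checkpoints queued up.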
Being able to do massively parallel simulation is in part enabled by our sampling methodology. We use the SMARTS approach to sampling: we gather a few hundred checkpoints and simulate each one for a few million instructions. Based on these simulations, we can generate tight confidence intervals on various performance characteristics, such as CPI. This is in contrast to the SimPoints method, which would gather fewer checkpoints (in the interesting regions of code), but would need to run longer simulations on them (billions of instructions). The other advantage of the SMARTS methodology is that the simulations are computationally independent, which lends itself naturally to running them in parallel.
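To illustrate the statistics side, here is a small sketch of how a confidence interval on CPI could be computed from per-checkpoint results. It assumes the samples are independent and uses a normal approximation (z = 1.96 for roughly 95% confidence), which is reasonable with a few hundred samples. The function name and the sample values are invented for the example and are not measurements from our experiments.

```python
import math
import statistics


def mean_confidence_interval(samples, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate variance")
    mean = statistics.fmean(samples)
    # Standard error of the mean, from the sample standard deviation.
    stderr = statistics.stdev(samples) / math.sqrt(n)
    return mean, z * stderr


if __name__ == "__main__":
    # Made-up per-checkpoint CPI values, only to show the call.
    cpi_samples = [1.0, 1.2, 0.8, 1.1, 0.9]
    mean, half_width = mean_confidence_interval(cpi_samples)
    print(f"CPI = {mean:.3f} +/- {half_width:.3f}")
```

Since the half-width shrinks with the square root of the number of samples, adding more short, independent checkpoints is a cheap way to tighten the interval.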
Overall, this all boils down to running the following two commands.
If you’d like more information on how to use Lapidary, you can check out some more technical information in the README of our repository.
Some Fun Hacks
Just to give a single example of some of the complexity behind the simple steps presented above…
So it turns out that not all of the program information is easily accessible to gdb. Specifically, some of the MSRs, such as the FS base register (used by libc), have to be read using a syscall which gdb does not perform for you when you type info registers. Also, you can’t invoke this syscall directly in gdb, because then you’ll get the FS base register value of the gdb process.
Our solution lies in the fact that you can compile and inject arbitrary C code into gdb debuggees. Basically, we had to write a small, minimally-linked program (gdb has some restrictions on the code it compiles on the fly, so we couldn’t use arbitrary libraries) to extract and report the FS base register back to our scripts. The program is shown below.
This isn’t too important to any potential end-users of Lapidary, but I did find it a fun challenge in exploiting gdb!
An Ongoing Process…
Currently, our tool is more-or-less “research quality,” which means we implemented enough of the features to get us through to deadlines. However, as we want to release this tool to the community, we’ve been patching things up and compiling a list of things we should add or improve about Lapidary. We have a pretty big TODO list of features we’d like to add, for example:
- Native integration with cloud services, so that our parallel simulations can occur across a lot of cheap instances rather than on a single massive server.
- Compression of checkpoints. As said earlier, we use the SMARTS approach to program sampling, which can generate several hundred checkpoints for a single benchmark. However, this can take up a lot of disk space, as each checkpoint can be up to 1GB in size. We plan on creating checkpoint “key frames”, which are full checkpoints, and then create other checkpoints just as deltas off of those key checkpoints.
- Support for custom instructions. Many people who use gem5 use it to prototype new instructions in the ISA. This poses a challenge for our method of creating checkpoints, as we run benchmarks on bare metal: binaries which have been compiled with custom instructions for the sake of running them on gem5 will, obviously, not work so well. Our plan for handling these cases involves (a) changing our gdb processing script to trap properly on these unknown instructions and (b) providing programmable hooks for developers to inject software to emulate these instructions on the running process state. For example, if I implement an instruction foo which performs some arithmetic operations on registers such as r2, then when I encounter foo during gdb execution, I will trap and perform the same operation using combinations of existing instructions. Obviously this may not work for all cases, but we imagine this would be a suitable solution for many cases.
…and many others, like better supporting SMP and system-call traces. I’m sure everyone who might be interested in using Lapidary has their own set of suggestions as well!
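To give a flavor of the checkpoint compression item above, here is a minimal sketch of the key-frame idea. Nothing like this exists in Lapidary yet; the 4 KiB page granularity, the function names, and the in-memory dictionary format are all assumptions made purely for illustration.

```python
PAGE_SIZE = 4096  # assumed granularity for comparing memory images


def make_delta(key_frame: bytes, memory: bytes) -> dict:
    """Return {page_index: page_bytes} for pages that differ from the key frame."""
    if len(key_frame) != len(memory):
        raise ValueError("this sketch only handles equally sized images")
    delta = {}
    for offset in range(0, len(memory), PAGE_SIZE):
        page = memory[offset:offset + PAGE_SIZE]
        if page != key_frame[offset:offset + PAGE_SIZE]:
            delta[offset // PAGE_SIZE] = page
    return delta


def apply_delta(key_frame: bytes, delta: dict) -> bytes:
    """Rebuild a full memory image from a key frame plus a delta."""
    image = bytearray(key_frame)
    for index, page in delta.items():
        start = index * PAGE_SIZE
        image[start:start + len(page)] = page
    return bytes(image)
```

If consecutive checkpoints of a benchmark only touch a small fraction of their pages, storing `make_delta` output instead of a full image could cut disk usage substantially, at the cost of needing the key frame around to restore any checkpoint derived from it.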
We saw a problem with the current workflow of using gem5; not in the simulation itself, but in how programs were prepared to be simulated, i.e. the generation of reusable checkpoints. If you need to use more than one or two small programs, the setup cost alone becomes intractable. To this end, we created Lapidary, a set of tools to accelerate checkpoint creation and massively parallel simulation by exploiting bare metal execution.
We initially created this tool just for a single project. Now that Lapidary is being used to help gem5 development in its third project, we have decided that we should try to share this tool with the rest of the gem5 community, as we imagine that others would find this tool as useful as we have.
Our source code is available at https://github.com/efeslab/lapidary. Please feel free to poke around, ask questions, make comments, create issues, or file pull requests!