Testing Batch Sizes for Concurrency in Go (Parameter Searching)

Durant Schoon
Dec 14, 2018 · 8 min read


tl;dr — A concurrency pattern for batches of goroutines is presented (Go code included), along with tools (a standard Python Jupyter notebook using Seaborn) for visualizing a grid search over multiple batch-size parameters.

This is my story of experimenting with concurrency and the lessons I learned as I increased the scale of a data structure. If you want to read the failure story that led to this, see An Early Attempt at Concurrency in Go.

Improving concurrency performance through data analysis

The README in the linked repo describes the organization of the test environment I continued to evolve as needed. Some of that evolution is discussed here, but this section will mostly describe how I implemented batches of goroutines to improve the performance of my code.

In my program, I had two types of main loops in the code for tree creation that I wanted to speed up: one for creating leaves and one for creating inner nodes¹. The loops were different, but I refactored them to operate the same way with this intentionally very generic code:
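A minimal sketch of such a generic loop, built around the four operations init(), cond(), next() and process(); the exact types and signatures here are my assumption, not the original code:

```go
package main

import "fmt"

// loopFuncs bundles the four operations each loop type supplies:
// init, cond, next, and process. Leaves and inner nodes each get
// their own implementations of these.
type loopFuncs struct {
	init    func() int       // starting index
	cond    func(i int) bool // true while the loop should continue
	next    func(i int) int  // advances the index
	process func(i int)      // does the work for one step
}

// runLoop drives any loop expressed in this generic form.
func runLoop(f loopFuncs) {
	for i := f.init(); f.cond(i); i = f.next(i) {
		f.process(i)
	}
}

func main() {
	// A "leaf"-style loop: one item per step.
	n := 5
	leaves := loopFuncs{
		init:    func() int { return 0 },
		cond:    func(i int) bool { return i < n },
		next:    func(i int) int { return i + 1 },
		process: func(i int) { fmt.Println("processing leaf", i) },
	}
	runLoop(leaves)

	// An "inner node"-style loop: two items per step.
	inner := loopFuncs{
		init:    func() int { return 0 },
		cond:    func(i int) bool { return i+1 < n },
		next:    func(i int) int { return i + 2 },
		process: func(i int) { fmt.Println("combining nodes", i, "and", i+1) },
	}
	runLoop(inner)
}
```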

The code above gives you the idea: for each type of loop (in my case, leaves and inner nodes), I just need to define init(), cond(), next() and process(). The two loops are actually quite different. For leaves, one leaf at a time is added, while for inner nodes, the loop takes two nodes at a time and produces a new one. Both patterns can be captured by the generic loop above, which then gets batched.

As an example, suppose I want to take 110 jobs and break them up into full batches of 20, leaving one smaller batch of 10. Here’s the code to do that (I’m pretty sure I poached the code for divmod off of Stack Exchange; I think my Python background is showing).
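A sketch of that batching code, using the variable names from the explanation below (batchSize is the total number of jobs, limit is the size of each full batch); the processBatch signature is my assumption:

```go
package main

import (
	"fmt"
	"sync"
)

// divmod returns quotient and remainder, like Python's divmod().
func divmod(a, b int) (int, int) {
	return a / b, a % b
}

// processBatch stands in for the per-batch work (the generic loop);
// its real signature in the original code is not shown here.
func processBatch(start, size int, wg *sync.WaitGroup) {
	defer wg.Done()
	fmt.Printf("batch: jobs %d..%d\n", start, start+size-1)
}

// runInBatches splits batchSize jobs (110 in the example) into full
// batches of limit jobs (20), plus one smaller batch of the leftovers.
func runInBatches(batchSize, limit int) {
	numBatches, remaining := divmod(batchSize, limit)

	var wg sync.WaitGroup
	for b := 0; b < numBatches; b++ {
		wg.Add(1)
		go processBatch(b*limit, limit, &wg)
	}
	if remaining > 0 {
		wg.Add(1)
		go processBatch(numBatches*limit, remaining, &wg)
	}
	wg.Wait() // wait until the whole group of batches is done
}

func main() {
	runInBatches(110, 20) // 5 full batches of 20 plus one batch of 10
}
```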

In the example above, batchSize would be 110 and limit would be 20. numBatches gets set to 5 (since 20 goes into 110 five times) and remaining gets set to the leftover 10. So the code loops 5 times for the full-size batches (20 each, which is limit) and loops once with a smaller batch (remaining is 10). The WaitGroup is there to do what it says: wait until all 6 batches in the group are done. Notice that the processBatch called in the loops is the same processBatch from the first code example.

This refactor worked: my new code produced the same result on the 1M-leaf example in 3 to 4 seconds (where previously it had taken 4 to 5 seconds). Success!

Batches are faster, but what size is best?

In retrospect, it was a bit lucky that I improved my time on the first try, because I was poking in the dark for my batch sizes. I had one batch size for the leaves and one for the inner nodes, in case they turned out to be different.

Can you answer the question: what is the optimal batch size here? What numbers would you pick? If there are 1M leaves and another 1M internal nodes (processed one row at a time, the way I implemented it), what numbers will give the fastest performance? Maybe the square root of 1M, i.e. 1,000? And are the batch sizes correlated, i.e. if you change one, will it change the performance of the other? Can I search this systematically? Obviously there’s work to do to test it!

Inspired by GridSearch

I’ve taken a couple of machine learning courses online, and I remember seeing grid search, which sweeps a fixed set of values for each parameter (in ML the search is usually over “hyperparameters”, but those are still parameters) and tries every combination to look for the optimal one. I decided to do the same for my limit (batch size) parameters.

In my “MerkleTree with a twist” code I had the following definitions:

const writeBatchLimit int = 2 * 500
const leafBatchLimit int = 1000

My writeBatchLimit needs to be an even number, so I wrote it that way, partly to remind myself. I wanted to control this from a configuration file. This is from an actual config file I created (named “config”, in a test directory named “ConcurrentBroad”: concurrent for obvious reasons, and “Broad” because I will do a “Narrow” search next).

# exported variables are used in python scripts
export VAR1=writeBatchLimit
VAR1_START=20
VAR1_END=200000
VAR1_MULT=10
export VAR2=leafBatchLimit
VAR2_START=10
VAR2_END=100000
VAR2_MULT=10

You can hopefully guess from the names of these settings that the first variable is named writeBatchLimit in my code, and that I want to start testing at the value 20, multiplying by 10 each time until I reach 200,000. The second variable I want to test is named leafBatchLimit; I’ll start with a value of 10, multiplying by 10 each test until I reach 100,000. I used the range 20 to 200,000 partly to remind myself that this was the writeBatchLimit test; obviously, starting from 10 would have given even numbers as well.
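The expansion of those START/END/MULT settings into concrete test values, and the double loop over the two variables, can be sketched in Go (the author drives this with bash and Python; this sketch is purely illustrative):

```go
package main

import "fmt"

// sweepValues expands one config entry (START, END, MULT) into the
// concrete values a parameter takes during the search.
func sweepValues(start, end, mult int) []int {
	var vals []int
	for v := start; v <= end; v *= mult {
		vals = append(vals, v)
	}
	return vals
}

func main() {
	// VAR1: writeBatchLimit, 20 -> 200000, multiplying by 10.
	write := sweepValues(20, 200000, 10)
	// VAR2: leafBatchLimit, 10 -> 100000, multiplying by 10.
	leaf := sweepValues(10, 100000, 10)

	// Each combination of the two parameters gets its own test run.
	for _, w := range write {
		for _, l := range leaf {
			fmt.Printf("writeBatchLimit=%d leafBatchLimit=%d\n", w, l)
		}
	}
}
```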

I created a git tag for the version of the repo I want to test. I have a bash script that copies over my repo and checks out that tag, so that I can keep working on the repo without affecting future trials I run against this specific tagged version of the code. Then, on each pass through the loop, a Python script uses a regular expression to replace the variable definitions in my Go program. In the current example, the Go variable definitions above would become this the first time:

const writeBatchLimit int = 20
const leafBatchLimit int = 10

and this the second time:

const writeBatchLimit int = 20
const leafBatchLimit int = 100

And the third time, leafBatchLimit would be 1000. After leafBatchLimit reaches 100,000, writeBatchLimit is set to 200 and the loop continues. There is a top-level script to run all tests, but I found it more convenient to run one at a time by changing to my test directory in the terminal and running ./paramSearch.sh. That starts a new trial, runs through a double loop over the variables using the settings in the ./config file, and launches the Jupyter notebook when the parameter sweep is done.
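The author’s substitution script is Python; the same regular-expression rewrite, sketched in Go for illustration (the helper name and exact pattern are my assumptions):

```go
package main

import (
	"fmt"
	"regexp"
)

// setConstValue rewrites a `const <name> int = <value>` line in the
// source text. name must not contain regexp metacharacters.
func setConstValue(src, name string, value int) string {
	re := regexp.MustCompile(`const ` + name + ` int = .*`)
	return re.ReplaceAllString(src, fmt.Sprintf("const %s int = %d", name, value))
}

func main() {
	src := "const writeBatchLimit int = 2 * 500\nconst leafBatchLimit int = 1000\n"
	src = setConstValue(src, "writeBatchLimit", 20)
	src = setConstValue(src, "leafBatchLimit", 10)
	fmt.Print(src)
}
```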

I use a single _test.go file for all tests². I merely extract the run time reported by “go test”, collect the results in comma-separated value (CSV) files, load them into pandas in a Jupyter notebook, do some data manipulation, and create a heatmap in Seaborn. Voilà!
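Extracting the run time from a “go test” summary line can be sketched as follows (the line format assumed here, roughly "ok <tab> package <tab> 3.456s", and the helper name are my assumptions, not the author’s script):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// elapsedSeconds pulls the elapsed time out of a `go test` summary
// line such as "ok  \tmytree\t3.456s".
func elapsedSeconds(line string) (float64, bool) {
	re := regexp.MustCompile(`\s([0-9.]+)s$`)
	m := re.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	secs, err := strconv.ParseFloat(m[1], 64)
	return secs, err == nil
}

func main() {
	if secs, ok := elapsedSeconds("ok  \tmytree\t3.456s"); ok {
		fmt.Println("run time:", secs)
	}
}
```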

The heatmap shows that the fastest values are clustered toward the middle. We can see that pushing the writeBatchLimit to 20,000 is definitely going too far.

There’s one little problem that I haven’t mentioned so far. When I ran “go test” multiple times, I didn’t always get the same run time back. The results were usually close, but there’s still room for error. One solution to this problem is to run the whole matrix multiple times and average the results. More work to do!
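The author does the averaging in pandas; for illustration, the cell-by-cell average over trials might look like this in Go:

```go
package main

import "fmt"

// averageTrials averages several equally-sized run-time matrices
// cell by cell: one matrix per trial, one cell per parameter pair.
func averageTrials(trials [][][]float64) [][]float64 {
	rows, cols := len(trials[0]), len(trials[0][0])
	avg := make([][]float64, rows)
	for r := range avg {
		avg[r] = make([]float64, cols)
		for c := range avg[r] {
			for _, t := range trials {
				avg[r][c] += t[r][c]
			}
			avg[r][c] /= float64(len(trials))
		}
	}
	return avg
}

func main() {
	trials := [][][]float64{
		{{3.0, 4.0}, {5.0, 6.0}}, // trial 1 run times (seconds)
		{{5.0, 2.0}, {3.0, 4.0}}, // trial 2 run times (seconds)
	}
	fmt.Println(averageTrials(trials)) // [[4 3] [4 5]]
}
```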

Averaging run times and making an animation, tracking tests vs. trials

To average my results matrix I needed to upgrade the testing harness I was developing. You’ll notice in the image above that the title says “trial001”. This is from my newer setup. With my new harness, when I run ./paramSearch.sh it creates a new trial, separating it from all the others, builds an animation of all the trials

and then creates a new heatmap using an average of all the trials. Actually, the user steps through the notebook to verify the data (good practice) and evaluates the cells to update the heatmaps and animations, which also displays them in the notebook.

Running the tests multiple times will show whether the lowest times converge to a single grid cell. That’s not guaranteed, but in my case they did. I won’t go into the details, but I was then able to copy my ConcurrentBroad test directory to a ConcurrentNarrow test directory. I updated the config file to run on a smaller range of values around the hotspot found in the broad search. Then, after cleaning out the copied trials, I ran ./paramSearch.sh in the new directory and it just worked! All the file and subdirectory names were the same, so I could just update the cells of the Jupyter notebook manually and get images and animations for my new test. Yay! Developing an organized testing framework begins to pay off.

If you’re curious, I ran three trials of my ConcurrentNarrow search. The averaged matrix showed that values of 4505 for writeBatchLimit and 8800 for leafBatchLimit gave an (averaged) run time of 2.53 seconds.

You can see the notebook for ConcurrentBroad on GitHub here, although the animated GIF of the heatmap sequence (the same animated heatmap shown above) doesn’t render there the way it does running locally on my laptop.

Sharing what I have

I hope the lesson I shared is interesting to other people learning about concurrency for the first time. When in doubt, test it!

I’m not sure if my testing harness will benefit anyone else, but I kept improving it for myself to answer my basic questions. In theory, you should be able to follow these steps to test your own project:

  1. Download the testing harness repo (and install requirements).
  2. Copy your Go package directory so it sits next to this repo.
  3. Decide whether you have a single-variable case (look at Tests/Serial) or a two-variable case (look at Tests/ConcurrentBroad), and copy or modify one of them for your own purposes.
  4. Update the config file in your new Tests/X directory (the file is Tests/X/config; look at all the values!). This includes setting the variable(s) you want to sweep over.
  5. Add a top-level (next to Tests/) _test.go file that would run successfully if you placed it in your cloned Go package repo.
  6. Delete all files from trials in your Tests/X directory.
  7. Run ./paramSearch.sh in your Tests/X directory.

Conclusion

There’s always the lingering feeling of doubt that maybe I missed something obvious, so I’d love to hear about it in the comments.

Maybe I should have spent more time looking up concurrency design patterns before forging ahead to possibly re-invent the wheel, but I’m glad I dove in and experimented on my own.

Final thoughts

One thing I enjoyed about this exploration is thinking about theory versus practice. Is there a theory that will predict the optimal batch sizes without experimenting? What are the causal parameters: the number of logical threads and the size of the data structure? Only more tests would tell, but I’ll leave that for another day.

Footnotes

  1. Actually, there were three loops for construction, but the third one dealt with many fewer nodes (knitting together a handful of subtrees), so I didn’t bother applying this pattern.
  2. It’s important to use a single _test.go file for all tests. First, it ensures that the test runs with correct output; second, it creates consistency across all tests (concurrent vs. serial, etc.), so one is comparing apples to apples. A safeguard could be added to hash the test file before running it, compare that hash with the one recorded before the earlier trials ran, and halt execution if they don’t match.
