GSoC Final Report
--
So, we have reached the final week of GSoC. It has been an amazing journey! Let’s summarize what was done, the current state of the project and what’s next.
The Project
Whenever you run the command git fetch (or similar), git server first decides which objects it has to send. The git client and daemon compares their branch tips and decides which ones to send. For this, Git traverses from the tips to all the objects connected directly or indirectly to it. This traversal goes till a known tip (i.e. the tip which the client already has) has found. As one could imagine this could be a costly operation. More so in the case of clone where Git has to identify and send _all_ the reachable objects.
To tackle this problem, reachability bitmaps were introduced. The idea is to store a bitmap for every selected commits in a .bitmap file that helps us identify the set of commits reachable from that selected commit. This bitmap could be referred to whenever a set of reachable objects are needed. To know more about reachability bitmaps please read the following links — Counting Objects, Scaling monorepo maintenance.
My project focuses on improving the performance of these reachability bitmaps. It has two parts (actually four parts but the last two parts are more of research type; so me and my mentors decided to omit them for now). These are -
- Individual bitmaps are stored compressed (with EWAH), but we have some sense that it can be slow to decompress individual bitmaps (which we have to do in order to read them, but also to do things like negate them, OR and AND them together, etc).
One possible project could be to explore using an alternative compression scheme (like the more modern Roaring+Run technique) to see if we can improve overall bitmap performance by reducing the amount of time it takes to read an individual bitmap. - Loading a
.bitmap
file can be slow for large bitmaps. This is because we have to read the file sequentially in order to discover the offset of each bitmap within the file.
It should be possible to shave off some time from this step by including a small “table of contents” that indicates which commits have bitmaps, and where to find them in the.bitmap
file. In the past some efforts have been made to do this. But we should undertake more performance testing to prove whether this is or isn’t a good idea before submitting a patch series in this area.
The Patches
I started my work on implementing lookup-table for .bitmap
file. The lookup table is added at the end of the .bitmap
file. The table has N
rows (where N
is the number of bitmaps present in that file), each containing three information — (1) Offset position (2) commit position (in pack-index or multi-pack-index) and (3) xor position. The patch links are given below -
- [PATCH v6 0/6] [GSoC] bitmap: integrate a lookup table extension to the bitmap format
- [PATCH v6 1/6] Documentation/technical: describe bitmap lookup table extension
- [PATCH v6 2/6] bitmap: move `get commit positions` code to `bitmap_writer_finish`
- [PATCH v6 3/6] pack-bitmap-write.c: write lookup table extension
- [PATCH v6 4/6] pack-bitmap-write: learn pack.writeBitmapLookupTable and add tests
- [PATCH v6 5/6] pack-bitmap: prepare to read lookup table extension
- [PATCH v6 6/6] bitmap-lookup-table: add performance tests for lookup table
After a lot of discussions and modifications, this series has finally been merged into the master
branch :)
My next job was to integrate roaring bitmaps. This is a large project which needs backward compatibility. At first I thought I will build the Roaring library all by myself but mentors insisted that it would be better to use the existing CRoaring library. This library has almost all the facilities/features we want except that it doesn’t support big-endian systems. The license also was an issue. Because it was not compatible with Git’s GPLv2 license.
I reached out to Daniel Lemire and requested him to make the license Git-compatible. The good news is he agreed 😀😀. Go to the following link to see the discussion thread — https://public-inbox.org/git/CAPOJW5yJDq046nhq0V-syAg4ttoy++rBtq_RHSXPAKhtDDw6jQ@mail.gmail.com/
As I mentioned before that the existing CRoaring library has some short comings, I had to make some changes on top of it so that Git can use it properly. I made a new function for enabling network write (although Derrick wrote in his review that instead of making new functions I should modify the existing functions). After that I made sure that the change is backward compatible i.e. the roaring bitmap integretion do not affect the existing ewah bitmap implementation.
The patch series is currently sent in the mailing list as RFC. It needs some more work to complete. But it is near to ready. The patch links are given below -
- [PATCH 0/5] [RFC] introduce Roaring bitmaps to Git
- [PATCH 2/5] roaring.[ch]: apply Git specific changes to the roaring API
- [PATCH 3/5] roaring: teach Git to write roaring bitmaps
- [PATCH 4/5] roaring: introduce a new config option for roaring bitmaps
- [PATCH 5/5] roaring: teach Git to read roaring bitmaps
Here is the link of my Github branch where you can find my current progress — Abhra303/roaring-bitmap-exp. Note that I used chunk-format API for .bitmap
version 2 file. It brings uniformity in the structure of bitmap related files.
Challenges
I didn’t have much experience in developing such projects in C before GSoC. So I faced a lot of challenges. It took time to digest all the code and logics behind the existing bitmap implementation.
My main challenge was to solve errors! As I mentioned previously, I have never used C in such a huge amount in any of my previous projects. So obviously I didn’t know how to debug and solve unknown runtime errors. But my mentors helped me here tremendously. They taught me how to debug the codebase and I learned about various techniques to solve these errors.
I can remember one such error which proved to be hard to solve for me. This was an error generating in bitmap’s test script. It had a mysterious characteristic — the test was not failing in all of the invocations of the test. Sometimes it was passing and sometimes not. I tried to figure out what went wrong. I understood the entire midx codebase and looked for potential flaws. Johannes and Derrick also helped me in this process. But sadly I didn’t succeed. Later Taylor was able to find out the solution.
What did I learn?
I had an amazing journey this summer. I learned a lot of new things. The most amazing part of my GSoC journey is that I got the opportunity to learn from the most experienced and knowledgeable persons. I learned how to write efficient C code, how to use various debugging tools. I also learned about bash scripting. Overall it was a very good experience :)
What next?
My main priority is to finish the remaining tasks for roaring bitmaps. Once that is finished, I would like to take on some other issues. Besides I also have a plan to contribute to Kubernetes and other open source projects.
Thanks :)