Git Merge Brussels 2019, a summary

Floor Drees
11 min read · Feb 6, 2019


Courtesy of GitHub (or: your friendly DevRel Don Goodman-Wilson), 3/4 of the Amsterdam Ruby meetup organizer team was able to attend the annual Git Merge conference, which took place at The Egg in Brussels on February 1st. Carrying a serious amount of snacks (and coffee), we headed for Belgium well before sunrise, as the weather predictions weren’t necessarily in our favor.

Below is me trying to summarize my highlights of the event.

Scaling repositories

More than one talk touched on the topic of mono-repos (consolidating multiple projects into a single repository). Johan Abildskov (@randomsort) is a trainer & consultant at Praqma, helping companies in Scandinavia move towards Git, CD, and a DevOps mindset. Johan introduced available tools and considerations for deciding on the repository structure (Mono vs Many) that’s just right for you.

Bridging the gap: transitioning Git to SHA-256

Brian M. Carlson (@bk2204) is a Git Ecosystem Engineer at GitHub. Git has long used SHA-1 to identify objects in its datastore, but SHA-1 is considered vulnerable to collisions, where a malicious actor could create two different objects with the same hash. Such an attack still requires lots of resources, but Git decided to make the jump to SHA-256.

SHA-256 is presently considered secure, its hash fits on an 80-column screen, it offers 128-bit security, and it’s supported in all major crypto libraries with good performance, as well as in hardware on AMD and ARM (soon Intel as well).

Regarding the transition plan, Git wants to minimize the disruption for its users, and preserve interoperability whenever possible. Users get to decide when to transition (independent of others), and the new signed objects will make use of a stronger hash algorithm.*

I kinda want to share the Implementation Design Guidelines here:
- Everything in the .git directory uses the same algorithm
- Everything producing Git data uses the .git directory algorithm
- Fetch and push can negotiate an algo to use
- All other output uses the same algo (even output to other Git processes)
- Conversions between algos use git fast-export and git fast-import (sketched after this list)
- Binary data formats use a fixed-length 4-byte algo identifier
- Text data formats and command line use SHA-256 or SHA-1
- Works for future algos as well
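To make the conversion guideline concrete, here’s a minimal sketch. The --object-format option comes from the transition work and isn’t in upstream Git yet, so treat this as illustrative rather than something you can run today:

```bash
# Round-trip a SHA-1 repository into a SHA-256 one via the
# fast-export / fast-import pipeline the design describes.
git -C old-sha1-repo fast-export --all > repo.stream
git init --object-format=sha256 new-sha256-repo   # option from the transition branch
git -C new-sha256-repo fast-import < repo.stream
```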

SHA-256 support is not upstream yet, but it’s available for tinkering in a branch. For more information, as well as the staged transition plan, see the hash-function-transition document in Git’s technical documentation.

Git protocols: still tinkering after all these years?

Brandon Williams started his career at Google, where he made many contributions to Git. He’s now a Software Engineer at Facebook, working in Rust. In his talk, he examined recent changes to the Git protocol, focusing on the introduction of protocol v2.

The issue with the original protocol is that the server’s ref advertisement includes all refs. The Chromium repository has over 1 million branches and tags, which means that potentially tens of MBs are sent per request… and subsequently ignored.

However, transitioning is no walk in the park. Backward compatibility (using existing URLs) is complicated, as older servers will ignore any additional information sent (the stuff that comes after \0).


Cloning the Chromium source using the original protocol (v0) took 10.08s. Using protocol v2, it took 2.33s — roughly 4x faster! With well over a million references, a 4x speedup is non-trivial.

Other improvements include:
- The mechanism to switch protocols
- Server-side filtering of references (see the sketch after this list)
- A more extensible design that allows future improvements:
— CDN offloading + resumable clones
— Rebase on push
— Remote grep/log for partial clones
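Protocol v2 shipped as an opt-in, so you can try it without waiting for it to become the default. A minimal sketch, assuming Git 2.18 or later (where v2 landed) and a server that speaks it:

```bash
# Opt a single command into protocol v2 and watch the wire traffic;
# with v2, the client asks the server to filter refs instead of
# receiving the full advertisement up front.
GIT_TRACE_PACKET=1 git -c protocol.version=2 ls-remote origin HEAD

# Or enable v2 for all commands:
git config --global protocol.version 2
```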

Native Git support for large objects

Terry Parker is a manager of the git-core and Git server teams at Google and contributes to the JGit open source project. Google hired him 11 years ago to work on Eclipse. Terry knows all too well that large binary objects pose a special challenge for Git. But Git’s new partial clone feature and a new proposal to use content distribution networks can help.

So how large are we talking here? .bin type files are definitely large. And VM images. And APK files. Certainly files over 1MiB are large. But really, anything that’s “painful” is too large. Native support comes in the form of “partial clones” (i.e. lazy downloading), and it’s adaptable so that clients and servers can both make the tradeoffs they need, while the Git protocols understand and participate in the management. Where large clones previously tied up server resources, limiting the ability to serve new customers, partial clones take care of the initial filtering and download on demand (lazily), using CDNs (Content Distribution Networks).
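Partial clone was still experimental at the time, but the filter syntax gives a feel for those tradeoffs. A minimal sketch, assuming a server that supports the then-new --filter option (the URL is a placeholder):

```bash
# Skip all blobs at clone time; Git fetches them lazily when a
# checkout (or a diff, or git log -p) actually needs their contents.
git clone --filter=blob:none https://example.com/big-repo.git

# Or only defer blobs above a size threshold, e.g. 1 MiB:
git clone --filter=blob:limit=1m https://example.com/big-repo.git
```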

Mono-repo with plenty of binary files? The Git Project has you covered (“in a release near you”).

Git for games: current problems and solutions

John Austin (@kleptine) has been making games for nearly 12 years and has previously worked at Google and Microsoft. He founded and currently leads the studio A Stranger Gravity.

Git is the source control system of the modern era. Yet, the vast majority of AAA game studios still use Perforce, SVN, and other more traditional systems. It’s not for lack of desire; rather, game developers have a unique set of constraints and workflows that make Git unsuitable for the task.

The game development workflow is infamous for its terabytes of files at the head revision, and loads and loads of binary files. In fact, most changes made are on the assets side (predominantly .bin files), while Git kinda assumes all your files are text.

Repository bloat is a pain, and binary file conflicts are lost work. Git can’t properly diff or compress binary files, and it can’t merge binary files either.

A spreadsheet detailing who ultimately owns which files and when changes were last made (i.e. what’s the latest version) doesn’t scale to more than a handful of developers or files.

Also, isn’t this why we built source control?

Even with the most advanced neural network in the world you can’t merge art. So if you can’t merge, the best you can do is prevent binary conflicts before they occur. An early warning system of some sorts.

Error: Commit does not descend from existing work done on file.bin

But such a warning system can only add constraints at the Git level (i.e. once you commit changes to a local repo), even though John would love to warn earlier, locally. The Git Global Graph project is designed to solve the above challenges without compromising the fundamental structures and benefits of Git, with file “locking” encoded in the branching model. A pre-commit hook checks the Global Graph for conflicts. As for managing un-mergeable files: in Git Global Graph, binary files are paths, and a commit is valid if it descends from every commit touching the file.
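To make that concrete, here is a hypothetical sketch of such a pre-commit hook. The ggg command and its flags are invented for illustration; the actual Git Global Graph interface may look quite different:

```bash
#!/bin/sh
# Hypothetical pre-commit hook: for every staged binary file, ask a
# global-graph service whether our tip descends from every commit
# that touched that file. "ggg" is a made-up placeholder command.
for path in $(git diff --cached --name-only -- '*.bin' '*.png' '*.fbx'); do
  if ! ggg check --path "$path" --commit "$(git rev-parse HEAD)"; then
    echo "Error: commit does not descend from existing work on $path" >&2
    exit 1
  fi
done
```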

Git Global Graph is open source and written in Rust. The project still needs work on divergent branches and file system layers, but it already offers an HTTP API for queries, as well as standard Git authentication.

The art of patience: why you should bother teaching Git to designers

Belén Barros Pena (@belenpena) is an Interaction Designer for software engineering tools and free/open source software. While working as the only designer on the Yocto Project, her Linux-engineer coworkers “mustered the patience” to teach her Git. As a designer, learning Git made her more independent and more useful to her development team.

Explaining what she does, Belén says it’s her responsibility to build software that makes sense to the people who use it.

“A mental model is what the user believes about the system at hand (…) A mental model is based on belief, not facts: that is, it’s a model of what users know (or think they know) about a system.”

Jakob Nielsen, Mental Models

Belén: “A GUI (like GitHub) should reflect the way the people think, not how the system works. People that have never used Git, don’t have a mental model of Git.”

Belén was taught Git on a need-to-know basis by her coworkers, who avoided Git jargon (or: “incomprehensible language like branches”) and didn’t bother too much with concepts like the difference between git push and git fetch. She encourages aspiring mentors to do things with — never for — their mentees, so they build muscle memory. Make sure they take notes, keep a cheat sheet, and use the command line instead of a GUI (because of the aforementioned mental-model disconnect).

Belén: “The wall between developers and designers is not made of bricks, but of attitudes. Designers should learn Git as a design material. And they need to know just enough to make sound design decisions. If we want more designers participating in FOSS, we need to help them to do so.”

Version control for law: Posey Rule in the U.S. Congress

Ari Hershowitz (@arihersh) is the Director of Open Government for Xcential Corporation.


A new rule in the U.S. House of Representatives (the ‘Posey Rule’), requires redlined prints comparing documents before and after changes in Committee.

Unfortunately, the solution isn’t as straight-forward as just using Git:
- Changes are made by amendments, not versions.
- Changes are acts (over centuries, thousands of pages), not repositories.
- Standard diff doesn’t work, because:
— Matching ‘the same’ section requires semantic judgement.
— We’re talking moves and changes in 1000-page bills.
— Meaningful grouping of changes is challenging.

As a lawyer who codes, Ari works with the U.S. House of Representatives to build document comparison software that works for law, can track changes in law, and will ultimately be able to show what the law was at any point in time.

His team designed an open XML standard for legislation, the United States Legislative Markup (USLM), converted the U.S. Code to USLM, and continues to build tools for machine-readable amendments and references, machine-executable instructions, and legally-relevant diffs.

Git, the annotated notepad

Aniket Subhash Kadam (@aniketsmk) is an independent consultant who’s been around the startup block “a few times”.

Aniket: “Ever forgotten what you were working on the night before and needed a few minutes to get back context?” YES. Aniket uses Git-Cola to keep track of a handful of files, and checks those diffs to get back to where he left off. But that is only as useful as your Git discipline.

For quick context reloading, more working memory, and enhanced focus, always be committing (see the sketch after this list):
- Commit in small ‘logical units’ (think: a refactoring, a set of functions used together in a feature, one complex function) / any set of changes that make sense to view together.
- Read every diff line-by-line before committing, and check if it passes personal code review.
- Use descriptive titles to help with code review.
- Take the time to refactor the best you can before committing.
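A minimal sketch of that discipline with plain Git (Git-Cola offers the same flow in a GUI):

```bash
# Stage only the hunks that belong to one logical unit...
git add -p
# ...read the staged diff line by line as a personal code review...
git diff --cached
# ...and commit with a descriptive title.
git commit -m "Extract retry logic into a backoff helper"
```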

Git & version control in the enterprise: a panel conversation with Atlassian, GitHub and GitLab

Having the three big Git hosting vendors on stage was definitely a first, and one I enjoyed very much. Quick intros before I share some of my take-aways:

Erik van Zijst (@erikvanzijst) is a Principal Engineer at Atlassian, focused on Bitbucket. With 8.5 years on the product, he currently holds the title of Bitbucket’s longest-serving contributor. James Ramsay (@jamesramsay) is a Product Manager at GitLab. And Briana Swift (@brianamarie132) is a Trainer at GitHub. CB Bailey (@hashpling), Software Developer at Bloomberg, led the panel.


CB: How did Git get such a strong foothold in OSS?
Erik: I reckon GitHub had something to do with it.
James: Forking is a great way to collaborate, which is not a Git feature per se, but all hosts use this model now.

CB: What are the significant differences between OSS and enterprise workflows?
James: Branching vs Forking (respectively) is one, as well as access-control.
Briana: Restrictions and permissions in terms of OS, language used, and developer tools are common in enterprise environments.
Erik: Protocol for PRs is as well.

On scaling and the mono-repo topic, Erik says he doesn’t see a lot of sudden demand for the latter. James says GitLab does, especially from projects coming from other VCSs. Erik: “Companies might be looking at the vendors for guidance regarding their workflow.” GitLab itself is a monolithic Rails repo, and they built a way to coordinate things like merge requests. Erik: “Refactoring a lib that’s used across your monolith: there’s no reason fundamentally why we can’t make that work, but the tools remain unsupported. A code host would just need to put enough smarts on things like code indexes.” James agrees that a tool integrated in your Git host would be “paramount”: “that’s where the daily work takes place, you don’t want yet another tool”.

CB: What’s next for Git?
Briana: “Git is accessible, but it’s not easy. We can always improve how we teach it.”
James: “We need to get better in our visual representation, so that new users can feel confident and powerful using Git.”

Technical contributions towards scaling for Windows

John Briggs (@jrbriggs) is the Engineering Manager for the Git ecosystem team in Azure DevOps. His team is tasked with helping Git scale to support some of the largest repositories in the world with contributions to core Git, Git for Windows, and VFS for Git.

Microsoft runs on Git. The Windows OS repo, which moved to Git in 2017, keeps growing massively over time. A chart of a single week in early 2018 showed that only 23% of the activity in the repo was actual development.

Available in Git 2.20 is the multi-pack-index feature John’s team has been working on. The multi-pack-index is the not-so-secret sauce of Azure DevOps: it accelerates the time it takes to find an object (object lookup). That’s really useful in an environment with many pack-files, like VFS (Virtual File System) for Git with its prefetch model.
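A minimal sketch, assuming Git 2.20 or later where the feature landed:

```bash
# Build one index covering all pack-files, so object lookup
# doesn't have to search every .idx file separately.
git config core.multiPackIndex true   # let read paths use the midx
git multi-pack-index write
```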

John also introduced the serialized commit-graph, first workshopped in the 2018 Git Merge talk “Making Git for Windows” by Derrick Stolee and Johannes Schindelin.
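The commit-graph file can be generated in stock Git too; a small sketch, assuming Git 2.19 or later:

```bash
# Serialize commit metadata (parents, dates, generation numbers)
# so history walks don't have to parse every commit object.
git config core.commitGraph true
git commit-graph write --reachable
```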

For the visual learners, John showed diagrams contrasting the approaches: the current commit walk (red commits uninteresting, blue interesting), the current object walk over commits, trees, and blobs (about 3.1 million interesting objects), and the new “sparse” object walk (roughly 100 interesting objects).

A performance comparison (using a small topic branch) measured trees parsed (443,787 for the old algorithm vs. 37 for the new sparse walk), git pack-objects time (10.1 seconds vs. 3.8 seconds), and git push time (14.9 vs. 8.5 seconds). The sparse push algorithm is currently under review in core Git.

One last thing John mentioned is improving performance by caching git status results.
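VFS for Git ships its own status cache, but stock Git has related knobs worth knowing; a small sketch, assuming a filesystem that supports the untracked cache:

```bash
# Check filesystem support first, then let Git skip re-scanning
# directories that haven't changed between status runs.
git update-index --test-untracked-cache
git config core.untrackedCache true
```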

How a Git based education cultivates more resilient developers

Ben Greenberg (@rabbigreenberg) is a Developer Advocate at Nexmo. Ben is a second career developer who previously spent a decade in the fields of adult education, community organizing and non-profit management. He’s also a rabbi.

A coding education that not only incorporates Git but forms its entire program around Git produces more resilient and more capable developers, or so Ben says. As a former technical coach at the Flatiron School, he helped numerous emerging programmers learn the fundamentals of software development and grapple with breaking a problem down into its smallest parts through a “Git approach”. This not only teaches good version control but cultivates a genuine troubleshooting mindset, a paradigm shift in seeking solutions. Coding bootcamps would do well to include Git in their curriculum (and use OSS to host their materials).

The end. And then dancing might or might not have been initiated by yours truly.

*The relevant mailing list item, courtesy of CB Bailey, for reference: https://git.github.io/rev_news/2018/08/22/edition-42/
