Working with Git/GitHub When Contributing to an Open Source Project

Benjamin Rouillé-d'Orfeuil
Published in Singularity
Dec 6, 2023

Every open source project has its own guidelines, whether for improving the documentation, submitting bug reports, writing a feature request or contributing to the source code. Usually, the project has a contribution guide listing the practices to follow. Let’s take the pandas Python package as an example. Among other things, its contribution guide explains the procedure for submitting a pull request. Before getting there, you have several technical hurdles to clear and concepts to become familiar with.

Workflow

Let’s summarize the workflow here. You will work with Git and GitHub. Git is a version control system that allows developers to track changes in their code. It is installed and maintained on your local machine. GitHub is a cloud-based hosting service for Git repositories. There, you can share your code and allow contributors to make revisions and edits.

You will use forks to propose changes. GitHub has a good tutorial on how to use them. Briefly, a fork is a new repository on your GitHub account that shares code and visibility settings with the original “upstream” repository. The entire workflow is illustrated below.

Workflow diagram from the Syllabus for Peer Production (Open Source Software, Wikipedia, and Beyond) course by J. Howison.

Get the Repository

Go to the repository you wish to fork on GitHub. Then, click on Fork in the top right corner of the page to fork your own copy of the repository to your account. Finally, create a local clone of your fork with:

git clone https://github.com/YOUR_USERNAME/pandas.git

The command above assumes that you forked the pandas repository. Replace YOUR_USERNAME with your GitHub username.

Sync Your Copy

Configure Git to sync your fork with the “upstream” repository:

git remote add upstream https://github.com/pandas-dev/pandas.git

Your local Git client can keep track of many different remote versions of the same repository. By default, when you clone a repository from GitHub, the first remote is named “origin”. The command above adds another remote and names it “upstream”. This becomes useful when the upstream repository receives new commits and you want your local branch to include them, so that the only difference between your branch and the original repository is the code for your feature. To summarize, “origin” points to the fork located on your GitHub account and “upstream” to the original repository, as shown in the workflow diagram above.
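To check the setup, you can list the remotes. The sketch below recreates the two-remote configuration in a throwaway directory so that it runs without network access. The temporary directory is only an assumption made for illustration. In your real clone, the last command alone is enough, and it should list both origin and upstream with their fetch and push URLs.

```shell
#!/bin/sh
# Sketch: reproduce the two-remote setup in a throwaway repository and list it.
# Nothing is fetched, so no network access is needed.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# "origin" is normally created by git clone; it is added by hand here.
git remote add origin https://github.com/YOUR_USERNAME/pandas.git
git remote add upstream https://github.com/pandas-dev/pandas.git

# List every remote with its fetch and push URLs.
git remote -v
```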

Branching

The Git feature that really sets it apart from nearly every other source code management system is its branching model. It allows and encourages you to have multiple local branches that can be entirely independent of each other. I like working as follows:

  • Fetch the upstream repository (git fetch upstream) so that upstream/main exists locally, then branch off main (it could be master or any other name depending on the project you work on) to create a feature branch:
git checkout -b YOUR_USERNAME/FEATURE_NAME upstream/main

I recommend using YOUR_USERNAME/FEATURE_NAME for the name of the branch to make it clear you are the main developer on this branch.

  • Keep it up-to-date by moving your branch to the newest HEAD of main via:
git pull --rebase upstream main

Note that the longer you wait to rebase, the greater the risk of having to deal with merge conflicts, especially if the project has a large number of contributors. I recommend that you rebase onto main frequently.

Don’t Git Pull

…unless you understand how this command behaves and you’re sure that’s what you want.

By default, git pull will perform two distinct actions:

  • Making your local Git client aware of the latest commits in the default remote (running git fetch).
  • If there are any differences between the commit history of the currently checked-out branch in your local Git client and the latest commits of the corresponding branch in the default remote, they will be merged (running git merge).

If the commit histories of the two branches have diverged (i.e. each branch has at least one commit that’s not present in the other), then Git will automatically create a merge commit. Such merge commits clutter the history of your branch and make integrating your code back into the codebase more difficult. If there are no commits in your local branch that aren’t present in the remote, then the git merge command will result in a ‘fast-forward’ merge, where the commit history of your local branch ends up identical to the remote (this is good).

If you do want to run git pull, I encourage you to run it in a non-default mode with different behavior:

  • git pull --ff-only: this will run git fetch as normal but only execute the git merge step if it can be completed with a fast-forward merge (i.e. without creating a merge commit). This will only work if there are no new commits in your local branch.
  • git pull --rebase: this will run git fetch as normal and then attempt to rebase any new commits in your local branch (any commits made since the history diverged from the remote) on top of the new commits of the remote branch. This will only complete automatically if the distinct commits in the two versions of the branch don’t edit the same part of the same file.

If neither of these steps can be completed automatically, then your local branch’s commit history will need to be reconciled in a more manual way, e.g. rebasing and manually resolving conflicts.
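The difference between the two modes is easiest to see on a branch that has actually diverged. The sketch below builds two throwaway local repositories, with "remote" standing in for the repository on GitHub. The directory names, file names and the commit helper are all assumptions made for the demo, and no network access is needed.

```shell
#!/bin/sh
# Sketch: compare --ff-only and --rebase on a diverged branch, using two
# throwaway local repositories (names are illustrative only).
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical helper: append a line to a file and commit it.
commit() {   # usage: commit <file> <message>
    echo "$2" >> "$1"
    git add "$1"
    git commit -q -m "$2"
}

# "remote" stands in for the repository hosted on GitHub.
git init -q remote
cd remote
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com
commit base.txt "chore: initial commit"
cd ..

git clone -q remote local
cd local
git config user.name demo
git config user.email demo@example.com

# Make the histories diverge: one new commit on each side.
(cd ../remote && commit remote.txt "fix: change made on the remote")
commit local.txt "feat: change made locally"

# --ff-only refuses, because a fast-forward is impossible here.
if git pull -q --ff-only origin main 2>/dev/null; then
    echo "fast-forward succeeded"
else
    echo "fast-forward refused"
fi

# --rebase replays the local commit on top of the remote's new commit.
git pull -q --rebase origin main
git log --format=%s
```

After the rebase, the log should be linear, with the local commit sitting on top of the remote one and no merge commit in between.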

For more information, see the git pull documentation.

Git can be configured to use either of these behaviors by default when git pull is called. To make git pull fast-forward-only by default, run git config pull.ff only. To make git pull rebase by default instead, run git config pull.rebase true. By default, git config changes the configuration of the current repository only, but the --global flag applies a setting across all repositories, e.g. git config --global pull.ff only or git config --global pull.rebase true.
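If you want to confirm which default is active, you can read the value back. This is a small sketch run in a throwaway repository (an assumption, so that it does not touch your real configuration). In practice you would run the git config lines inside your clone.

```shell
#!/bin/sh
# Sketch: set a per-repository default for git pull and read it back.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# Fast-forward-only by default; the value is stored in this repository's .git/config.
git config pull.ff only
git config --local --get pull.ff

# Switch to rebase-by-default instead.
git config --unset pull.ff
git config pull.rebase true
git config --local --get pull.rebase
```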

For more information, see the git config documentation.

Commit Message

Some projects require commit messages to be structured in a certain way. On the open source project I am currently working on, commit messages follow this convention:

feat: add hat wobble
^--^  ^------------^
|     |
|     +-> Summary in present tense.
|
+-------> Type: chore, ci, docs, feat, fix, perf, refactor, style, or test.

  • chore: (updating grunt tasks etc; no production code change)
  • ci: (changes to the CI configuration files and scripts)
  • docs: (changes to the documentation)
  • feat: (new feature for the user, not a new feature for build script)
  • fix: (bug fix for the user, not a fix to a build script)
  • perf: (code change that improves performance)
  • refactor: (refactoring production code, e.g. renaming a variable)
  • style: (formatting, missing semicolons, etc; no production code change)
  • test: (adding missing tests, refactoring tests; no production code change)
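
A convention like this is easy to check mechanically. Here is a minimal sketch of such a check. check_commit_msg is a hypothetical helper name, and the regular expression is my reading of the convention above, not the project's official tooling.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the first line of a commit message
# has the form "<type>: <summary>" with one of the types listed above.
check_commit_msg() {
    printf '%s\n' "$1" | head -n 1 |
        grep -Eq '^(chore|ci|docs|feat|fix|perf|refactor|style|test): .+'
}

# Usage: the first message follows the convention, the second does not.
check_commit_msg "feat: add hat wobble" && echo "ok"
check_commit_msg "added hat wobble" || echo "rejected"
```

A function like this could be called from a commit-msg hook, but whether a project enforces the convention that way depends on the project.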

This is a good way to keep the commit history of the project clean as shown below:

*   1614d2e Merge pull request #652 from Breakthrough-Energy/ben/import
|\
| * 3d17f0f fix: add geographical coordinates to branch and plant data frames and fix bus assignment/naming (#703)
| * 0c7d0d5 Merge pull request #682 from Breakthrough-Energy/ben/profile
| |\
| | * adf9309 refactor: add inflow to column name of carriers with inflow profiles
| | * 0444b31 refactor: simplify logic
| | * 9a4d290 feat: normalize inflow profiles by max
| | * 9c7f629 test: write tests for profile extraction
| | * 7b2de36 feat: extract profiles from pypsa network
| |/
| * 04f2d4d feat: support hydro inflow functionality (#691)
| * b805325 feat: extract substation from arbitrary pypsa networks (#674)
| * 16b190d Merge pull request #675 from Breakthrough-Energy/ben/grideq
| |\
| | * 9707a82 fix: enable grid equality for back converted pypsa networks (#689)
| | * c134725 docs: format docstring and remove note
| | * 835e48e test: add test for storage
| | * 9ff3873 fix: enable roundtrip conversion (#678 and #685)
| | * bcf3bed feat: make FromPyPSA object a Grid object
| |/
| * 3e103fc refactor: create library of constants for grid object, casemat file and pypsa translators (#667)
| * 308d946 ci: update gitignore
| * f7d3899 feat: convert PyPSA storage_units/stores to Grid storage_data (#657)
| * e99c87c feat: convert pypsa Network object to Grid object and profiles
|/
* 4a46483 Merge pull request #701 from Breakthrough-Energy/jen/linearize
|\
| * 152c5b5 feat: port ramp_30 modifications from REISE.jl
| * 1a510d2 refactor: remove overload of linearize_gencost
| * 28d6424 fix: loading grid in analyze state
| * 066deb2 fix: don't scale coal pmin
| * 36f95b1 fix: fillna to prevent downstream errors
| * 68b1f1e fix: invocation of linearize_gencost
| * a06826d feat: wip port grid modifications from reise.jl
| * df4fc45 test: port test case for pmin = pmax
| * 3fa8ca4 test: port linearize_gencost tests from julia
| * a89b0a3 feat: move pmin overrides and cost curve linearization to client side
|/
* 35bd7d4 refactor: generalize area type in check function (#702)
* d83ac6c refactor: generalize generator type in the MockProfileInput class (#699)
* 5a17941 chore: update dockerignore (#700)
* f11f64c Merge pull request #698 from Breakthrough-Energy/ben/dependencies
|\
| * 92ddfbd fix: handle FutureWarning raised by pandas
| * b77b3ed fix: use list instead of set to create column names in data frame
| * b4373b4 ci: generate pipenv lockfile
|/
* b2df44a Merge pull request #697 from Breakthrough-Energy/ben/zenodo
|\
| * 763b1ac ci: remove zenodo_get package
| * 6d49050 feat: allow user to download any version of pypsa-eur
| * 70492f0 ci: update gitignore
| * 5448fd9 feat: create zenodo download manager
|/
* c5ee9e5 refactor: combine hydro and PHS into inflow in model immutables (#695)
* e989af9 style: add slack badge to README (#696)

Clean Up Personal Commit History

If you did not follow the commit message convention or your commit history is messy, use Git's interactive rebase (see the git rebase documentation for more details) to revise your commit history. You will be able to reorder, reword, drop and squash commits. In short:

git rebase -i upstream/BRANCH

where BRANCH is the name of the branch you branched off from, e.g., main.
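
To give an idea of what happens during an interactive rebase, here is a sketch that melds a follow-up commit into the previous one. Normally Git opens the todo list in your editor and you change it by hand. Here GIT_SEQUENCE_EDITOR and sed stand in for the editor so that the demo runs unattended. The repository, file names and commit messages are all made up for the example.

```shell
#!/bin/sh
# Sketch: fold a follow-up commit into the previous one with an interactive rebase.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name demo
git config user.email demo@example.com

echo base > README.txt
git add README.txt
git commit -q -m "chore: initial commit"

echo "hat wobble" > feature.txt
git add feature.txt
git commit -q -m "feat: add hat wobble"

echo "hat wobble, faster" > feature.txt
git commit -q -am "oops, forgot a change"

# The todo list for the last two commits looks roughly like:
#   pick <hash> feat: add hat wobble
#   pick <hash> oops, forgot a change
# Changing the second "pick" to "fixup" melds that commit into the first
# one and discards its message.
GIT_SEQUENCE_EDITOR="sed -i.bak '2s/^pick/fixup/'" git rebase -i HEAD~2

git log --format=%s
```

The final log should contain only the feat commit and the initial commit, with the follow-up change folded into the former.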

Pushing Changes

When you want your changes to appear publicly on GitHub, push your feature branch to your fork:

git push origin YOUR_USERNAME/FEATURE_NAME

Now your code is on GitHub, but it is not yet a part of the project you are contributing to. For that to happen, a pull request needs to be submitted on GitHub.
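
One thing to be aware of: if you rebase or otherwise rewrite commits after the branch has already been pushed, a plain git push is rejected because the local and remote histories no longer match. A common way out is git push --force-with-lease. The sketch below shows this offline, with a local bare repository standing in for your fork (an assumption made for the demo). Check the project's guidelines before force-pushing to a branch that others work on.

```shell
#!/bin/sh
# Sketch: push a feature branch, rewrite it, and push again.
# A local bare repository stands in for your fork on GitHub.
set -e
work=$(mktemp -d)
git init -q --bare "$work/fork.git"
git init -q "$work/local"
cd "$work/local"
git config user.name demo
git config user.email demo@example.com
git remote add origin "$work/fork.git"

echo base > file.txt
git add file.txt
git commit -q -m "chore: initial commit"

git checkout -q -b YOUR_USERNAME/FEATURE_NAME
echo "hat wobble" >> file.txt
git commit -q -am "feat: add hat wobble"
git push -q origin YOUR_USERNAME/FEATURE_NAME

# Rewrite the pushed commit, as rewording it in an interactive rebase would.
git commit -q --amend -m "feat: add hat wobble animation"

# A plain push is now rejected because the histories have diverged.
if git push -q origin YOUR_USERNAME/FEATURE_NAME 2>/dev/null; then
    echo "plain push accepted"
else
    echo "plain push rejected"
fi

# --force-with-lease overwrites the remote branch only if it still
# matches what your local Git last saw there.
git push -q --force-with-lease origin YOUR_USERNAME/FEATURE_NAME
```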

Pull Request

Pull requests (PRs) contribute to good software development by:

  • Reducing code defects
  • Keeping the team up to date with new code in the code base
  • Teaching each other how to get better at coding

Open a PR as follows:

  • Navigate to your repository on GitHub
  • Hit the “Compare & Pull Request” button

Here are the tasks you usually have to go through for a PR:

  • Keep your PRs small (< 400 lines): short PRs get reviewed faster, receive better feedback, and have more bugs caught
  • Make sure your commit history is clean
  • Ensure you have appropriate tests
  • Ensure that checks (e.g. linting and testing) are in a green state
  • Fill out the form when creating the PR if the project set up a PR template
  • Keep your branch up to date during the entire process
  • Perform a merge commit once your PR is approved

Other things to keep in mind

This article only covers Git and GitHub. When contributing to an open source project, you will also have to document your code (e.g. using docstrings), format and lint it (e.g. using black, flake8, isort) and write unit tests. Each project will have guidelines for these too.
