The Opensource Workflow…
A standard workflow that you should definitely know about.
Whenever you want to work on an opensource code base, the first thing you need to know is what to contribute, and soon after that you will run into the question of how to contribute. This is one of the most important technical phases, yet most people who talk about opensource leave it out. I can assure you that if you understand it well enough, you will go a long way in helping the community.
So what do you need to follow along? Well, you need:
- A Github/Bitbucket/Gitlab account.
- An opensource project to experiment with. (You may create a dummy project to do the same, but then you will need one more account too.)
To save you the trouble, I will provide enough examples so that you can understand it right away.
Note: Before using any command mentioned here, please go through its syntax. Due to the nature of this article, I have avoided going into the syntactic detail of each command.
Clone, Add, Commit, Push?
If this is what came to your mind when you saw the word contribution, then keep reading. There is more to it than these four commands. The first thing you need to understand is that this four-command workflow only works if you have permission to make direct changes to the Github repository. That might be the case in small or medium-sized projects, or in projects where the number of active developers is roughly up to 15 people. Beyond that, organisations will rarely give you permission to change their repository directly.
The easiest way to wrap your head around this is to clone one of your own repositories on your local machine, make changes to that code locally, add them, commit them, and push them. You will see that there is no problem in that. But what if your friend does the same with your repo? You will see that he won’t be able to push, as he is not an owner or collaborator of the repository.
Well, you can quickly make him a collaborator of the project by going to the settings section of your repository and adding him there as shown in the image below.
As soon as you do this, he will be able to push his code directly into your repository and make changes to it as if he were an owner too.
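Here is a minimal sketch of the four-command workflow that you can run without touching a real account. It assumes a bare repository in a temporary folder standing in for the Github repo; the names hub.git and mine, the file contents and the commit identity are all made up for the example.

```shell
set -eu
work="$(mktemp -d)"
cd "$work"
# Throwaway identity for the example commits (hypothetical values).
export GIT_AUTHOR_NAME="You" GIT_AUTHOR_EMAIL="you@example.com"
export GIT_COMMITTER_NAME="You" GIT_COMMITTER_EMAIL="you@example.com"

# A bare repository plays the role of the Github repo.
git init -q --bare "$work/hub.git"
git -C "$work/hub.git" symbolic-ref HEAD refs/heads/master

# Clone: get a working copy that knows the hosted repo as "origin".
git clone -q "$work/hub.git" mine 2>/dev/null
cd mine
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"

# Add, commit, push.
echo "hello" > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q origin master

# The hosted repo now has the same commit as your machine.
git log --oneline origin/master
```

This only works because you own hub.git here. Against someone else's repository the last push is the step that would be refused.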
Never used Pull before?
Before we jump into the main theme of this article, a quick word on pull. Pull comes into play when more than one person is involved in the development of your code base, be it a direct collaborator or a pull request contributor. I will help you visualise this quickly with the help of a diagram.
Firstly, there are three lanes in this diagram: one for your local machine, one for your buddy’s machine and one that represents the Github repo. The Github repo is the ultimate source of truth, which basically means that whatever your Github repo says is true for all clones of your code on any machine in the universe. If it says it has 10 commits and your machine says 15, then your machine is lying, and it is your responsibility to keep the Github repo updated with the truth that you believe is correct.
Let’s hear a story, and by the end of it you will know what pull is for. Say you plan to create a new project to make a name for yourself. Your lane starts with the first circle, which represents a commit: “Initial commit”. You push that code to ‘origin’, hence the arrow towards the Github lane. You make some changes to the code, add, commit and push again, hence the second arrow. Everything so far is in sync, and the Github repo still represents the truth.
Now your mum asks you to visit your grandma, so you ask your friend to help you finish the code. You make him a collaborator and he clones the project (hence the arrow from the Github lane towards his machine), so his machine is now in sync with the Github repo. If he runs git log, he will see all the commits you have made so far. He starts adding code to your project while you are away. He makes two commits locally and pushes them to the Github repo. So far so good: the third and fourth commits are indeed from your buddy, and the Github repo still represents the truth.
Then you get back home and start working on your code again, unaware of the changes your buddy has made. You make some changes, add and commit. According to your local repo, you have made the 3rd commit, but according to the Github repo, your buddy made the 3rd commit. As soon as you try to push your code, Github will reject it, as your version of the story does not match the truth.
So like I said, you are responsible for making sure the Github repo represents the truth, so you pull from the Github repo. The pull command is basically fetch + merge. It brings whatever is in the Github repo to your local machine and merges it with your commits.
You can break it down like this:
git pull origin master
This is equivalent to this (long version for power users):
git fetch origin
git merge origin/master
Also note that git pull has traditionally merged by default, which is the same as passing this option explicitly:
git pull --no-rebase   # merge, the traditional default
Keep this default in mind; you will see why in a later section of the article. Recent versions of git may ask you to choose between merging and rebasing when your branch and the remote have diverged.
If there is an obvious way to merge the changes, git will do an auto merge for you. But if the changes clash (for example, the same file has two different pieces of code in the same place), it will leave it up to you to fix these conflicts. You can quickly run git status to find the files that need a conflict fix. Once they are fixed, do another add, then a commit saying ‘merge conflict resolved’, and finally push this code to the Github repo. This version of the code will again represent the truth, as we wanted it to.
Rule of thumb: If more than one person is involved in your repo, always do a git pull and resolve any conflicts before pushing your code.
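The story above can be replayed on one machine. What follows is a shortened sketch, assuming a default git configuration: a bare repository (hub.git, a made-up name) stands in for the Github repo, the folders you and buddy are the two machines, and both of you edit the same line so that a conflict shows up.

```shell
set -eu
work="$(mktemp -d)"
cd "$work"
# Throwaway identity for the example commits (hypothetical values).
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# hub.git plays the Github repo; "you" and "buddy" are two clones of it.
git init -q --bare "$work/hub.git"
git -C "$work/hub.git" symbolic-ref HEAD refs/heads/master
git clone -q "$work/hub.git" you 2>/dev/null
git -C you symbolic-ref HEAD refs/heads/master
echo "version 1" > you/app.txt
git -C you add app.txt
git -C you commit -q -m "Initial commit"
git -C you push -q origin master

# Your buddy clones, commits and pushes while you are away.
git clone -q "$work/hub.git" buddy
echo "buddy's version" > buddy/app.txt
git -C buddy commit -q -am "Buddy's change"
git -C buddy push -q origin master

# You commit locally, unaware of your buddy's push.
echo "your version" > you/app.txt
git -C you commit -q -am "Your change"

# The push is refused: the hosted repo has a commit you do not have.
if ! git -C you push -q origin master 2>/dev/null; then
  echo "push rejected: pull first"
fi

# Pull (fetch + merge). Both sides changed the same line, so git stops.
cd you
git pull -q --no-rebase origin master >/dev/null 2>&1 || true
git status --short        # the conflicted file is listed as "UU app.txt"

# Fix the file by hand, then add, commit and push.
echo "your version plus buddy's version" > app.txt
git add app.txt
git commit -q -m "merge conflict resolved"
git push -q origin master
```

In real life you would open the conflicted file and edit the marked region instead of overwriting it with echo; the rest of the sequence stays the same.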
The Big picture
Things start to escalate when you work with bigger organisations, and obviously they won’t make you a collaborator out of nowhere. So how exactly do you submit your changes to their code base?
We will work with a few new commands in this part of the article. I will explain each one and add a warning where you need to be careful while using it.
git rebase -i
git push --force
Before we explore these two commands, let us understand the concept of a remote. What might that be? Well, if you have ever used the command
git push origin master
you must have wondered what this origin here stands for.
Or maybe why we do this while creating a new repo?
git remote add origin email@example.com
Well, the answer is quite simple. Origin is an alias for the url that git uses to link your local repo with the online one. The above command basically adds this information to your local repo, without which it would have no clue which online repo to push into or pull from.
Hence when you do git push origin master, what you are actually doing is telling git to push the code to your repo using the url that you have aliased as origin. Here is a Stack Overflow thread talking about this in a slightly more detailed fashion.
But can we add more than one remote for a single local repo?
Of course you can. What you add as a remote has nothing to do with what your local branch is about. As long as you are planning to relate your local repo with that particular online one, there is nothing to stop you from adding it.
So let’s talk about how everything will fall into place here…
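To see that a remote really is just a named url, here is a small sketch that you can run locally. The two bare repositories and the second alias, mirror, are made-up names for the example; any name would do.

```shell
set -eu
work="$(mktemp -d)"
cd "$work"

# Two empty bare repositories stand in for two online repos.
git init -q --bare "$work/first.git"
git init -q --bare "$work/second.git"

# One local repo, two aliases.
git init -q local
cd local
git remote add origin "$work/first.git"
git remote add mirror "$work/second.git"

# List every alias with the url it points to.
git remote -v

# Ask for the url behind a single alias.
git remote get-url mirror
```

Each alias can then be used wherever git expects a remote, for example as git push mirror master.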
First you need to fork from the repo that you want to contribute to. By convention we call it upstream rather than origin.
As soon as you do that, you will see that a copy of this repo has been made in your Github account with the same name like this. This copy is by default the origin.
Now you can do all sorts of cloning, changing, adding, committing, pushing and pulling with this repo; only your repo will be changed and not the one you forked from. Once you have updated origin, you can submit a pull request to upstream from Github itself as given here in this link. But there’s a catch.
Let’s look at another diagram to understand this situation.
So, you have the repo you forked and now own as origin, you have the clone of this repo on your local machine, and you have the original repo you forked from as upstream.
You can see that origin is in sync with our local repo, so there is no problem there. But if you compare origin with upstream, you will notice that some changes have been made to upstream by its owner (maybe he merged some code from other contributors). This is similar to the story we discussed above, so we need to do the same thing here. But which one is the source of truth? It is upstream, as your final code will end up getting merged there. So here is the list of steps, and we will look at it one step at a time.
- You need to bring your local branch in sync with the upstream.
- You need to force push this synced local branch into origin.
- Finally you need to drop a pull request to upstream from origin and wait for the owner to review your pull request and merge the code.
The thing that should pop into your mind by now is why not do the same git pull and merge the changes from upstream into our local repo? You are right, we will pull from upstream, but we will not merge; instead, we will rebase.
Both merge and rebase will end up syncing your code, but the difference is in how they achieve it. Here is an article talking in detail about the difference between the two.
In short: merge resolves the differences between your local and remote branches and ties the two histories together with one extra merge commit, so your commits and the remote's commits end up side by side in the history. Rebase, on the other hand, takes your local changes (commits) and replays them on top of the remote changes (commits), which gives a straight line of history.
Hence, when you deal with a large code base with many contributors, you want to be as explicit as possible about your changes and how they fit with the commits of the remote. A ‘merge conflict resolved’ commit is rather confusing there; thus we prefer rebase, which keeps the remote's commits exactly as they are and places your commits on top of them, so the history stays consistent and explicit.
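If you want to see the difference with your own eyes, here is a small local sketch. It assumes two made-up branches, remote-work and local-work, that play the remote's commits and your commits, and it tries both approaches on separate copies of your branch so you can compare the resulting graphs.

```shell
set -eu
work="$(mktemp -d)"
cd "$work"
# Throwaway identity for the example commits (hypothetical values).
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
echo base > base.txt
git add base.txt
git commit -q -m "base"

# Two branches that diverge from the same starting commit.
git branch remote-work
git branch local-work
git checkout -q remote-work
echo theirs > theirs.txt
git add theirs.txt
git commit -q -m "remote change"
git checkout -q local-work
echo mine > mine.txt
git add mine.txt
git commit -q -m "local change"

# Option 1: merge. Both lines of history are kept and joined by a merge commit.
git checkout -q -b merged local-work
git merge -q --no-edit remote-work
git log --oneline --graph merged

# Option 2: rebase. Your commit is replayed on top of the remote's commit.
git checkout -q -b rebased local-work
git rebase -q remote-work
git log --oneline --graph rebased
```

The first graph forks and joins again, with an extra merge commit on top. The second is a single straight line with your commit last.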
But before we try anything, we need our local repo to know what upstream is. So we will first add upstream remote to our repo by issuing the command
git remote add upstream <git-link-of-upstream>
In my case, here is the link from which I forked.
Now our repo knows about upstream and what we mean when we issue commands using this alias.
Now you can do
git pull --rebase upstream master
which is equivalent to
git fetch upstream
git rebase upstream/master
Other than keeping your git history explicit, there are several other advantages of rebase.
Note: In case you get a conflict while rebasing, do the same thing you did with git merge: resolve the conflict, git add the fixed files and then use
git rebase --continue
Whenever you think things are getting out of hand, you can use
git rebase --abort
This will stop the rebase.
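Both escape routes can be tried safely in a throwaway repo. The sketch below assumes a made-up branch called feature that clashes with master on the same line; it first backs out with --abort and then goes through with --continue.

```shell
set -eu
work="$(mktemp -d)"
cd "$work"
# Throwaway identity for the example commits (hypothetical values).
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
export GIT_EDITOR=true    # accept commit messages without opening an editor

git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
echo "line" > file.txt
git add file.txt
git commit -q -m "base"

# feature and master change the same line in different ways.
git checkout -q -b feature
echo "feature line" > file.txt
git commit -q -am "feature change"
git checkout -q master
echo "master line" > file.txt
git commit -q -am "master change"
git checkout -q feature

# Attempt 1: the rebase stops on the conflict, and we back out.
git rebase master >/dev/null 2>&1 || echo "conflict while rebasing"
git rebase --abort
git log --oneline -1      # the branch is back where it started

# Attempt 2: resolve the conflict, add the file and carry on.
git rebase master >/dev/null 2>&1 || true
echo "master line + feature line" > file.txt
git add file.txt
git rebase --continue >/dev/null
git log --oneline
```

After the abort nothing has changed on feature. After the continue, the feature commit sits on top of the master commit.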
The interactive mode
The interactive mode of rebase is used for editing commits. This feature is too powerful for beginners, so I recommend experimenting with it somewhere safe before actually using it on a live project, or you might end up losing your commits.
This mode can be used directly while rebasing, like this:
git rebase -i <branchName>
or it can be used later, just to squash commits after you have done a non-interactive rebase, like this:
git rebase <branch>
git rebase -i <commit name to consider for squashing>
A detailed article on rebase is given here.
Why squash commits?
When dealing with a large code base, your work as a contributor is usually about one very specific feature. While working on your local repo, you might need many commits to finish that feature, but with respect to upstream, it is still one single feature. So you might want to squash all the commits you made while completing your work into one single commit that sums it up. All your commits like ‘added new file’, ‘added new dependency’, ‘typo fix in x y z files’… are squashed into one single commit, ‘implemented feature abc’. This makes your commit more ‘to the point’ and avoids confusion. Squashing can be done using the interactive mode as discussed above.
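Here is a sketch of such a squash that runs without opening an editor. Normally git rebase -i shows you a todo list and you change pick to squash or fixup by hand; in this example a sed one-liner passed through GIT_SEQUENCE_EDITOR makes that edit for you, which is only a stand-in so the example can run unattended. The repo, file names and commit messages are made up.

```shell
set -eu
work="$(mktemp -d)"
cd "$work"
# Throwaway identity for the example commits (hypothetical values).
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/master
echo base > base.txt
git add base.txt
git commit -q -m "base"

# Three small commits that together make up one feature.
for msg in "added new file" "added new dependency" "typo fix"; do
  echo "$msg" >> feature.txt
  git add feature.txt
  git commit -q -m "$msg"
done

# Keep the first of the three as "pick" and fold the other two into it.
# sed edits the todo list that you would otherwise edit by hand.
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2,\$s/^pick/fixup/'" \
  git rebase -i HEAD~3 >/dev/null 2>&1

# Give the combined commit a message that sums up the work.
git commit -q --amend -m "implemented feature abc"
git log --oneline
```

The log now shows two commits, the base and the squashed feature, while feature.txt still contains all three changes.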
Here is a small blog talking about the details of this.
This article will walk you through the process of squashing.
Keep in mind that merge and rebase can be used interchangeably, but both have their pros and cons. Merge is easier to handle, but it leaves extra merge commits that make the history harder to follow, while rebase keeps the history linear but is a bit harder to handle, especially in case of conflicts.
So the rule of thumb is try rebase first, and in case things don’t work out, fallback to merge.
Now that we have rebased our code and everything is in sync with the only source of truth, ie: upstream, we are ready to drop a pull request. But wait, we just changed the history by rebasing it, and it is now out of sync with origin. What do we do now? Shall we pull from origin again? Well, no. If you do that, your code will go out of sync with upstream, which we surely don’t want. What we can do instead is force the local repo to push its code (the work which is in sync with upstream) into origin, irrespective of what we have in origin.
git push origin master
But git won’t let you do this, as the two are not in sync. So we need to force git to do this.
git push --force origin master
Now git will overwrite origin. Your origin is now also in sync with upstream, and you are ready to drop a pull request. Follow the link mentioned above and drop your pull request…
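To tie the steps together, here is a sketch of the whole round trip on one machine. It assumes two bare repositories standing in for the Github side (upstream.git for the original project and origin.git for your fork), a folder owner for the project owner's machine and a folder local for yours. All of these names, files and commit messages are made up, and the pull request itself cannot be simulated here since it happens on Github.

```shell
set -eu
work="$(mktemp -d)"
cd "$work"
# Throwaway identity for the example commits (hypothetical values).
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

# The original project, with one commit pushed by its owner.
git init -q --bare "$work/upstream.git"
git -C "$work/upstream.git" symbolic-ref HEAD refs/heads/master
git clone -q "$work/upstream.git" owner 2>/dev/null
git -C owner symbolic-ref HEAD refs/heads/master
echo v1 > owner/core.txt
git -C owner add core.txt
git -C owner commit -q -m "Initial commit"
git -C owner push -q origin master

# Your fork, and your local clone of the fork.
git clone -q --bare "$work/upstream.git" "$work/origin.git"
git clone -q "$work/origin.git" local
git -C local remote add upstream "$work/upstream.git"

# You finish your feature and push it to your fork.
echo feature > local/feature.txt
git -C local add feature.txt
git -C local commit -q -m "implemented feature abc"
git -C local push -q origin master

# Meanwhile the owner adds a commit to upstream.
echo v2 > owner/core.txt
git -C owner commit -q -am "Owner's change"
git -C owner push -q origin master

# Step 1: bring your local branch in sync with upstream by rebasing.
git -C local pull -q --rebase upstream master

# Step 2: a plain push to your fork is refused, so force it.
if ! git -C local push -q origin master 2>/dev/null; then
  echo "plain push rejected"
fi
git -C local push -q --force origin master

# Your fork now has upstream's history with your commit on top.
git -C local log --oneline origin/master
```

A force push overwrites whatever is on the other side, so use it only on your own fork. Git also offers git push --force-with-lease, which refuses to overwrite if the remote branch has moved since you last fetched it, and is a gentler choice when someone else might push to the same branch.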