DSX, GitHub and forking

Victor Terpstra
6 min read · Jun 11, 2018


Forking a DSX project in GitHub creates a greater separation between users and allows for more flexibility in merging assets. This separation allows DSX users to commit to their linked GitHub repository frequently without immediately impacting other users of the project. Assets are merged between users explicitly using GitHub pull-requests, and a merge only happens when and if the owner(s) of the repository agree. Merge conflicts can be resolved outside of DSX, either on GitHub or using any of the available git tools on your local workstation.

This is a follow-up to my post on DSX and GitHub, where I discussed how to use GitHub as a code repository for DSX. In this post I’ll be discussing how you can set up and use GitHub forking with your DSX projects.

Why forking?

Forking creates a level of separation between projects, while still maintaining a link to allow for updates back and forth.

Suppose there is a DSX demo, maintained by your colleague, that you would like to extend for your own DSX presentation. However, while you are working on your extensions, your colleague may need to present the original demo, so you would like to avoid pushing your day-to-day changes to the original repository. This is a common situation for GitHub repositories and the reason for the concept of forking a repository. Instead of linking the original GitHub repository to your DSX project, you first fork the repository on GitHub and then link the forked repository to your DSX project.

In GitHub, create a forked repository by selecting the Fork button.

How to fork

  1. In GitHub, navigate to the ‘original’ project you would like to fork.
  2. Select the ‘Fork’ button.
  3. Change the name of the repository: Settings -> Repository name. Make sure it is different from the original repository, otherwise in DSX you will not be able to have a project based on both the fork and the original repository in your DSX user account.
  4. Copy the URL.
  5. In DSX, create a project from the GitHub fork using the URL.

Now you can start developing your own features in the project and push these changes to your forked repository on GitHub, just like with any other GitHub-based DSX project.

If this is all you want to do, then you are all set. However, forking becomes interesting if:

  1. Your colleague has updated the original repository, perhaps by refining a notebook or dataset, and you would like to update your project with these changes.
  2. You have added a feature to the demo that would be valuable as an improvement to the original demo.

Use-case 1: Update your forked repository from the original

There are 2 ways to do this: the easy way (via GitHub) or the hard way (via your local workstation).

The easy way

The ‘easy’ way is via a pull-request in GitHub:

  1. Go to the original repository.
  2. Create a pull-request: Pull requests -> New Pull Request.
  3. Use as base fork: your forked repository, head fork: the original repository
    (For both, leave the branch at master, assuming you haven’t introduced other branches in the project.)
  4. GitHub will display the commits that were made to the original repository since you last synced. Most importantly, it will test whether the changes can be merged automatically. If so, you can create the pull-request.
  5. From your forked repository, as the owner, accept the pull-request.
    If this worked, you are done with use-case 1: you have updated your fork with the latest changes from the original repository.
  6. However, if there are merge conflicts, i.e. changes that git cannot merge automatically, you will need to resolve those conflicts, and that cannot be done from within GitHub.
In GitHub, create a pull request to merge changes from the original repository into the forked repository. The compare screen will show the changes, whether these can be merged automatically and thus whether this can be completed from GitHub. If this merges without conflict, you’re done.
In GitHub, if the pull request compare shows a merge conflict, cancel the pull request and resolve the merge conflict using your local workstation.

The hard way

The ‘hard’ way is via your local workstation.

Before you can start resolving merge conflicts, you need to make sure that:

  1. You have cloned the forked repository to your local workstation.
  2. You have added the original GitHub repository as a remote with name ‘upstream’ (see also configuring-a-remote-for-a-fork, are-git-forks-actually-git-clones and GitHub Forking).
    In GitGui:
    — ‘Remote -> Add..’
    — Use ‘upstream’ as the name
    — Use the SSH (.git) URL as the Location
In the forked GitHub work-flow, your local repository relates to 2 remote repositories: the regular forked repository and the original ‘upstream’ repository.
In GitGui, add the remote original repository as the ‘upstream’.

(I like using GitGui on Windows, therefore this tutorial shows screen shots of how to do the steps using the GitGui. But the same things can be done using the Git command line, or other Git clients.)
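For readers who prefer the command line, the two preparation steps above can be sketched as a small Python helper that assembles the equivalent git commands. This is only an illustration: the URLs and the directory name are hypothetical placeholders, and the helper prints the commands rather than running them unless you pass `run=True` on a workstation with git installed.

```python
import subprocess


def setup_commands(fork_url, upstream_url, workdir):
    """Return the git commands that clone the fork and add the original as 'upstream'."""
    return [
        # Step 1: clone the forked repository to the local workstation.
        ["git", "clone", fork_url, workdir],
        # Step 2: register the original repository as a remote named 'upstream'.
        ["git", "-C", workdir, "remote", "add", "upstream", upstream_url],
        # List the remotes to check that both 'origin' and 'upstream' are present.
        ["git", "-C", workdir, "remote", "-v"],
    ]


def execute(commands, run=False):
    """Print each command; run it as well only when run=True."""
    for cmd in commands:
        print(" ".join(cmd))
        if run:
            subprocess.run(cmd, check=True)


# Hypothetical URLs and directory name, for illustration only.
commands = setup_commands(
    "git@github.com:your-user/dsx-demo-fork.git",
    "git@github.com:colleague/dsx-demo.git",
    "dsx-demo-fork",
)
execute(commands)  # dry run: prints the commands without running them
```

After these commands, `origin` refers to your fork and `upstream` to the original repository, which matches the GitGui set-up described above.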

In order to do the merge:

  1. Update the forked repository via the usual fetch/merge process. (This simply updates the copy of the fork on your local workstation and should not cause conflicts.)
  2. Fetch the updates from the original (i.e. ‘upstream’) repository. In GitGui: Remote -> Fetch From -> upstream.
  3. Initiate a local merge. In GitGui: Merge -> Local Merge -> Tracking Branch -> upstream/master.
  4. GitGui will now show a list of merge conflicts that you have to resolve.
  5. The easiest way to resolve merge conflicts in GitGui is to do a right-click in the top-right pane with the text of the file and select which version you want to keep: either the local (i.e. the version in your fork) or the remote (i.e. the version in the original repository).
  6. If this doesn’t work, you would need to manually edit the file and remove the conflicts via your favorite text editor.
  7. (For notebooks, there exist dedicated tools that can help resolve conflicts, for instance see nbdime, but that is a topic for another blog post. For a deeper discussion about the challenges of using git with notebooks see also making-git-and-jupyter-notebooks-play-nice.)
  8. After you have resolved the conflicts, you’ll need to commit the changes into your local workstation version of the forked repository.
  9. Push the changes from your local workstation into the GitHub forked repository.
  10. From your forked DSX project, pull the changes from the GitHub fork.
In GitGui, fetch the upstream remote.
In GitGui do a local merge from the remote upstream repository.
In GitGui, the merge will fail due to the expected merge conflict.
In GitGui, if possible, resolve the conflict by either selecting the remote or local version of the file. If not, then use a text editor to resolve the conflicts in the file.
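The merge steps above have direct git command-line counterparts. The sketch below only builds and prints those commands; it assumes the remotes are named `origin` (your fork) and `upstream` (the original), that the branch is `master`, and it uses a hypothetical directory and notebook name. Note that `git merge` exits with a non-zero status when it stops on conflicts, so a script that actually runs these commands should not treat that as a fatal error.

```python
def merge_upstream_commands(workdir, branch="master"):
    """Commands for steps 1 to 3: update the local fork, fetch upstream, start the merge."""
    git = ["git", "-C", workdir]
    return [
        git + ["checkout", branch],
        git + ["pull", "origin", branch],        # step 1: update the local copy of the fork
        git + ["fetch", "upstream"],             # step 2: fetch the original repository
        git + ["merge", f"upstream/{branch}"],   # step 3: merge; may stop on conflicts
    ]


def resolve_commands(workdir, path, keep="local"):
    """Commands for step 5: keep either the local (fork) or remote (original) version of a file."""
    # During a merge, --ours is the current branch (the fork) and --theirs is upstream.
    side = {"local": "--ours", "remote": "--theirs"}[keep]
    git = ["git", "-C", workdir]
    return [
        git + ["checkout", side, "--", path],
        git + ["add", "--", path],               # mark the conflict as resolved
    ]


def finish_commands(workdir, branch="master"):
    """Commands for steps 8 and 9: commit the merge and push it to the GitHub fork."""
    git = ["git", "-C", workdir]
    return [
        git + ["commit", "--no-edit"],           # conclude the merge with the default message
        git + ["push", "origin", branch],
    ]


# Hypothetical directory and file names, for illustration only.
plan = (
    merge_upstream_commands("dsx-demo-fork")
    + resolve_commands("dsx-demo-fork", "notebooks/demo.ipynb", keep="remote")
    + finish_commands("dsx-demo-fork")
)
for cmd in plan:
    print(" ".join(cmd))
```

Choosing a whole file with `--ours` or `--theirs` corresponds to the right-click option in GitGui; conflicts that need a mix of both versions still have to be edited by hand before `git add`.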

Use-case 2: update the original repository from your forked repository

  1. First, always do the above use-case: update your fork from the original repository.
  2. In GitHub, from your forked repository, create a pull-request:
    — Pull requests -> New Pull Request.
    — Base fork: the upstream/original repository, head fork: your forked repository.
  3. If you have properly done step 1, you should not see conflicts and you should be able to create the pull request.
    If not, cancel the pull request, complete the previous use-case and resolve all conflicts.
  4. The owner of the original repository will receive a notification that a pull request was created.
  5. The owner then reviews the pull request in GitHub and can accept the pull request to merge the changes.
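The pull request in step 2 can also be created through the GitHub REST API instead of the web pages. The sketch below only assembles the endpoint and the JSON body for the "create a pull request" call (`POST /repos/{owner}/{repo}/pulls`). The owner, repository, and title values are hypothetical, and actually sending the request (with an access token) is left out.

```python
import json


def pulls_endpoint(upstream_owner, upstream_repo):
    """URL of the pull-request collection of the original (base) repository."""
    return f"https://api.github.com/repos/{upstream_owner}/{upstream_repo}/pulls"


def pull_request_payload(title, fork_owner, head_branch="master", base_branch="master", body=""):
    """JSON body for a pull request from a fork into the original repository."""
    return {
        "title": title,
        # For a cross-repository pull request the head is written as 'owner:branch'.
        "head": f"{fork_owner}:{head_branch}",
        # The base is the branch of the original repository that receives the changes.
        "base": base_branch,
        "body": body,
    }


# Hypothetical names, for illustration only.
url = pulls_endpoint("colleague", "dsx-demo")
payload = pull_request_payload("Add new notebook to the demo", "your-user")
print(url)
print(json.dumps(payload, indent=2))
```

The same call with the roles swapped (base: your fork, head: the original owner's branch) corresponds to the pull request used in use-case 1.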

Summary

In summary, forking with DSX projects makes sense if:

  1. You would prefer a certain separation between the original and forked repositories, such that the individual DSX projects can sync with their remote GitHub repositories without immediately interfering with other collaborators on the same original project.
  2. You require fine-grained control over resolving merge conflicts, allowing you to use a variety of tools on your local workstation like the Git command line, GitGui, text editors and/or other 3rd party tools.


Victor Terpstra

Senior Data Scientist — Prescriptive Analytics, IBM Data Science Elite Team. The opinions expressed are my own and don’t necessarily represent those of IBM.