GitHub for DSX projects

Victor Terpstra
Apr 14, 2018

Linking a DSX project to a GitHub repository has a number of advantages: backup of your code and assets, easy migration to new DSX clusters, working on multiple clusters simultaneously, working off-line on your local workstation, sharing your code, collaboration, etc.
This post contains step-by-step instructions for the setup and workflow of the GitHub integration in DSX.

A DSX project can be linked to an enterprise GitHub repository. That is a great option for many reasons:

  1. Backup. By syncing with GitHub, all your project files and their version history are stored, so you can always go back and retrieve your code.
  2. Easy migration between DSX clusters. DSX clusters tend to come and go. Increased hardware needs and DSX version upgrades tend to work best when installing new clusters. Users are then asked to back up their projects and migrate to the new cluster. GitHub makes this easy: you simply paste the GitHub URL when creating the project on the new cluster.
  3. Work on multiple DSX clusters simultaneously. GitHub allows for a quick code sync, so you can easily switch between clusters.
  4. Work off-line on notebooks with Anaconda on your local workstation. Download and sync DSX notebooks with your local workstation and work with a local Anaconda/Jupyter installation. This lets you work off-line and quickly browse through your notebooks.
  5. Share your code. Allow others to view and re-use your projects. Share the GitHub project URL.
  6. Allow others to make improvements to your projects, while still keeping control over what modifications are made (i.e. through ‘pull requests’).
  7. Collaborate with multiple data scientists on the same project. Work together on different parts of the project.
  8. Share re-usable code.

In this post, I’ll show you how to set up GitHub and DSX, configure a project, push and pull changes, fork a project and create pull requests. In addition, I’ll show how you can access your notebooks off-line using Anaconda/Jupyter on your local workstation.

One-time setup of GitHub access token

If you are using Enterprise GitHub, DSX needs an access token so it can securely access your repositories on GitHub. This is a one-time setup per GitHub account.

In GitHub Enterprise, create an access token:
1. From the user/account-button, select ‘Settings’
2. Select ‘Developer Settings’
3. Select ‘Personal access tokens’
4. Press button ‘Generate new token’
5. Give it a name
6. Enable ‘repo’ scope
7. Generate
8. Copy the token
9. Store the token somewhere safe, because GitHub will not show it to you again!

Create access token in GitHub.

In DSX, add the token:
1. From the user/account button, select ‘Settings’
2. Select the ‘Integrations’ tab
3. ‘Add token’:
- Set ‘Company URL’ to your enterprise git, e.g. ‘https://github.ibm.com’
- Paste your access token into ‘Access Token’
- Give it a name
- Create token

Add GitHub token in DSX.
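
Before pasting the token into DSX, it can help to confirm that the token itself works. The following is a minimal sketch, not part of DSX. It assumes your enterprise GitHub exposes the REST API under ‘/api/v3’ on the same host as the ‘Company URL’, and the function names are my own.

```python
import json
import urllib.request


def build_token_check(company_url, token):
    """Build (but do not send) a request for the token owner's profile."""
    # Assumption: the enterprise REST API lives under <company_url>/api/v3.
    api_url = company_url.rstrip("/") + "/api/v3/user"
    return urllib.request.Request(
        api_url,
        headers={
            "Authorization": "token " + token,
            "Accept": "application/vnd.github.v3+json",
        },
    )


def token_login(company_url, token):
    """Send the request and return the login name the token belongs to."""
    request = build_token_check(company_url, token)
    # Raises urllib.error.HTTPError (e.g. 401) when the token is rejected.
    with urllib.request.urlopen(request) as response:
        return json.load(response)["login"]
```

Calling token_login with your own company URL and token should return your GitHub user name. An HTTP 401 error suggests the token was copied incorrectly or has been revoked.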

Setting up a GitHub-linked DSX project

In your (enterprise) GitHub account, create a new (empty) project. Copy the project’s HTTPS link.

Copy the GitHub project URL.

In DSX, create a new project ‘From GitHub’

Paste the GitHub repository URL in the DSX project.

DSX will now download the project from GitHub, create the project and maintain a link to GitHub. Since we created an empty project in GitHub, at this time, DSX will create the default project structure.

Making changes to the project in DSX

You can now use your project in DSX, add notebooks, models, etc.

DSX will detect if there are changes to your DSX project that have not been uploaded to your GitHub project. If so, it will show a bar with “Changes made — You have local changes that you can commit”

DSX detects there are changes to commit to GitHub

You can select the commit and push link, or use the Git button and select push.

In DSX 1.2, the commit dialog lets you review the list of files that have been modified, added or removed, and type a commit message. You can select which files are included in or excluded from this commit. The commit message should be a short description of the changes in this commit. This helps you keep track of the commits made to the project when you look at it in GitHub. (The ‘tag for release’ is a tag that can be used to mark a deployment-ready version of your project.)
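
If you work in a local clone instead of DSX, a similar overview of modified, added and removed files is available from Git itself. The sketch below groups the output of ‘git status --porcelain’ into those three lists. It illustrates the idea and is not how DSX builds its dialog. The function names are my own, and renamed files are ignored to keep it short.

```python
import subprocess


def read_status(repo_dir):
    """Return the raw output of `git status --porcelain` for a local clone."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


def summarize_status(porcelain_output):
    """Group paths by kind of change: modified, added or removed."""
    summary = {"modified": [], "added": [], "removed": []}
    for line in porcelain_output.splitlines():
        if len(line) < 4:
            continue
        # Each line is a two-character status code, a space, then the path.
        code, path = line[:2], line[3:]
        if code == "??" or "A" in code:
            summary["added"].append(path)      # untracked or newly added
        elif "D" in code:
            summary["removed"].append(path)
        elif "M" in code:
            summary["modified"].append(path)
    return summary
```

For a clone in the current directory, summarize_status(read_status(".")) returns a dictionary with the three lists.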

Pull changes

If you or someone else has been making modifications to the project and committed them into GitHub, DSX will show that there are updates to pull by enabling the Pull menu option in the Git menu.

The Pull project menu is enabled if there are updates in the GitHub repository.

The main challenge with any version control system is how it manages conflicts. The potential for a conflict arises when you have local changes to a file, those changes are not yet pushed into the GitHub repository and, before you are able to push these updates, someone else commits changes into the GitHub repository from a different source.

DSX will do a pull of the updates and apply them to the DSX project. Note that a pull is somewhat risky because it overwrites your existing file if the master version has been updated. DSX will warn you beforehand so you can decide to refuse the updates and make a backup.

DSX pulls updates from GitHub. Conflicts are ‘resolved’ by overwriting the file.

Note that DSX v1.2 does not do a Git merge. A Git merge tries to combine changes from two sources into the same file. This can work well if the changes do not overlap, for instance when your changes concern a different and independent block of code from the changes in the update from GitHub. A Git merge can resolve such changes automatically. Only when it can’t, i.e. when the changes involve the same block of code, does a regular Git merge fail and require the user to resolve the conflict.
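
To make the difference between overlapping and non-overlapping changes concrete, here is a simplified sketch. It compares two edited versions of a file against their common base and reports whether they touch the same lines. This is an illustration only and not Git’s actual merge algorithm. The function names are my own.

```python
import difflib


def changed_base_lines(base, modified):
    """Return the indices of base lines that `modified` changes or removes.

    An insertion is counted as touching the base line it is inserted before.
    """
    changed = set()
    matcher = difflib.SequenceMatcher(a=base, b=modified, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "insert":
            changed.add(i1)
        else:  # "replace" or "delete"
            changed.update(range(i1, i2))
    return changed


def changes_overlap(base, mine, theirs):
    """True if both versions changed at least one common line of the base."""
    return bool(changed_base_lines(base, mine) & changed_base_lines(base, theirs))


# Hypothetical three-line-plus-one file, split into lines.
base = ["load data", "build model", "solve", "plot results"]
mine = ["load data", "build bigger model", "solve", "plot results"]
theirs_elsewhere = ["load data", "build model", "solve", "plot nicer results"]
theirs_same_line = ["load data", "build smaller model", "solve", "plot results"]
```

Here changes_overlap(base, mine, theirs_elsewhere) is False, which is the case a Git merge could combine automatically. With theirs_same_line it is True, which is the case where a merge would ask you to resolve a conflict and where a DSX pull overwrites your file.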

One advantage of using GitHub is that if you accidentally overwrite your code, you can go back into GitHub and retrieve a previously saved version. This works best if you have regularly committed and pushed your code into GitHub.

Working on your project outside of DSX

You can clone your GitHub project to your local workstation. For instance on Windows, you can use Git GUI, which comes with a regular Git for Windows download. This creates a copy of the DSX project. Inside, there is a directory ‘jupyter’ which contains all notebooks. Using a local installation of Anaconda/Jupyter, you can open the notebooks and edit and run them.

One advantage of this is that you can quickly open multiple notebooks. This is great, even for simply reviewing or looking up code that you have used before.

DSX project structure synced to Windows from GitHub.
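
Once the project is cloned, a few lines of Python give a quick overview of the notebooks it contains. This is a small sketch under the assumption that the notebooks sit in a ‘jupyter’ folder at the top of the clone. The function name and the default folder name are my own choices, so adjust them to your project.

```python
from pathlib import Path


def list_notebooks(project_dir, notebook_dir="jupyter"):
    """Return the notebooks in a cloned project, sorted by path.

    Copies kept by Jupyter in '.ipynb_checkpoints' folders are skipped.
    """
    root = Path(project_dir) / notebook_dir
    return sorted(
        path for path in root.rglob("*.ipynb")
        if ".ipynb_checkpoints" not in path.parts
    )
```

Pass the folder you cloned the project into, and print the result to see which notebooks are available before starting Jupyter.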

Local Jupyter environment

If you decide to make changes to your notebook on your local workstation and push them to GitHub, you will notice a challenge if the version of your local Anaconda/Jupyter Python doesn't match the version used in DSX. In that case, opening the updated notebook in DSX will ask you to select and switch kernels. You can avoid this by making sure your local Anaconda/Jupyter uses the same Python version.

You will need to have an understanding of the Anaconda environments and how to manage Python versions.

a. Use the following in a notebook to find the Python version used in DSX. Assume for instance that the Python version is 3.5.2.

import platform
print(platform.python_version())

b. In the Anaconda prompt, create a new named environment (e.g. python3) and make that environment a kernel for Jupyter:

conda create -n python3 python=3.5.2 anaconda
activate python3
python --version
python -m ipykernel install --user --name python3 --display-name "Python 3 (DSX)"
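
To check a notebook before pushing it, you can read the kernel name and Python version that Jupyter stores in the notebook file. The sketch below relies on the ‘kernelspec’ and ‘language_info’ entries of the standard notebook metadata. Which of these DSX looks at when it asks you to switch kernels is not something I have verified, so treat this as a diagnostic aid. The function names are my own.

```python
import json
import platform


def notebook_kernel_info(notebook_path):
    """Return the kernel name and Python version recorded in a notebook."""
    with open(notebook_path, encoding="utf-8") as handle:
        metadata = json.load(handle).get("metadata", {})
    return {
        "kernel_name": metadata.get("kernelspec", {}).get("name"),
        "python_version": metadata.get("language_info", {}).get("version"),
    }


def same_minor_version(recorded, local=None):
    """Compare major.minor only, so '3.5.2' and '3.5.4' count as a match."""
    local = local or platform.python_version()
    return recorded.split(".")[:2] == local.split(".")[:2]
```

Run notebook_kernel_info on a notebook saved in DSX and on one saved locally. If the kernel names or versions differ, that is a likely cause of the kernel prompt.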

Multiple people working closely on the same project

One of the great benefits of Git comes when multiple people work on the same project. Depending on how you want to collaborate, you would choose a slightly different approach.

Working closely on the same deliverable. For instance, an optimization expert works on the modeling of the optimization problem in a notebook. A UI specialist is configuring the dashboard on the model. Both can be making updates in their own DSX project and sync changes via GitHub. Since they are making changes to different files, there is a low probability of conflicts and thus low risk of overwriting files. It is good practice to agree on who is primarily responsible for each file in the project.

Once a project is getting closer to completion, or is in maintenance mode, it becomes more important to ensure the code is stable and tested before it is shared. As a developer, you would like to finish fixing a bug or completing a feature before sharing it with your colleague. On the other hand, you would like the ability to back up your code and roll back when a code experiment doesn’t work out. The full-blown Git has capabilities like branches and pull requests to handle these use cases. DSX does not (yet) implement these features. However, we can work around this limitation by choosing a different approach to sharing projects in GitHub, namely forking a project. That is the approach for the next level of collaboration.

Working more loosely on the same project

One user has developed a demo to a stable version. A second user would like to tweak the demo for a particular customer presentation. A simple clone of the GitHub project runs the risk that pushing any modification will change the stable demo version. The owner of the demo repository can make sure that no one else can make changes to the main demo repository (i.e. no one else has write access).

You can change the project collaboration settings in your GitHub project via Settings -> Collaborators & Teams -> Collaborators. You can add, remove and change permission levels.

If the second user now clones the original project repository, it is not possible to push changes into GitHub, and this user loses all other advantages of synchronizing with GitHub.

Alternatively, perhaps the second user has made some improvements to the demo that both developers agree need to be merged back into a new version of the original repository.

For these use-cases, GitHub has the feature of ‘forking’.

Forking a project

From the top-right button on the main project page in GitHub, you can select Fork.

Fork a repository.

This will create a new repository, owned by the user who performs the fork, but with a link to the original project in GitHub.

A forked project maintains a link to the source project.

The second user now has all the necessary write privileges to this forked project and can make any changes as necessary without directly impacting the original project.

Pull Request

After finishing the upgrades of the demo, the second user can now create a pull request. The pull request compares (a branch of) the original repository with (a branch of) the forked repository and collects the differences. It sends a request to the owner of the original repository (the ‘base’). That user can now review the changes in the GitHub UI and decide whether to accept or reject them. If accepted, the changes are merged into the original repository.

Vice versa, while the second user has been working, user one has also made some upgrades to the demo. In the same fashion, either user can create a pull request to update the fork from the source repository.
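
As an alternative to a pull request in the GitHub UI, the owner of the fork can also bring a local clone up to date with plain Git and push the result back to the fork. The sketch below only builds the list of commands, with a separate helper to run them. The remote name ‘upstream’, the branch ‘master’ and the function names are my assumptions, and this path goes around DSX, which does not offer these operations.

```python
import subprocess


def sync_fork_commands(upstream_url, branch="master", remote="upstream"):
    """Return the Git commands that merge the source repository into a fork."""
    return [
        # Register the original repository once; fails if the remote exists.
        ["git", "remote", "add", remote, upstream_url],
        ["git", "fetch", remote],
        ["git", "checkout", branch],
        # A real Git merge: overlapping changes must be resolved by hand.
        ["git", "merge", remote + "/" + branch],
        # Publish the merged result to the fork on GitHub.
        ["git", "push", "origin", branch],
    ]


def run_commands(commands, repo_dir):
    """Run the commands in order inside the local clone, stopping on error."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)
```

After the push, a pull in DSX picks up the merged changes like any other update in the fork.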

Conclusion

Git and GitHub are not the easiest technologies to use, and DSX so far has implemented only a limited set of Git capabilities. For both reasons, incorporating GitHub in your DSX project can seem somewhat challenging. In this post, I addressed the main challenges and how to work with them. The advantages of using GitHub with DSX are significant and can make collaborating on a project a lot easier. I would advise always starting a project by linking it with GitHub, even when there is no immediate need for collaboration.

Victor Terpstra

Senior Data Scientist — Prescriptive Analytics, IBM Data Science Elite Team. The opinions expressed are my own and don’t necessarily represent those of IBM.