Using GitHub for project control in DSX

Try as we might, sometimes our projects don’t fit nicely into notebooks. Notebooks are nice for R&D, but at some point you may need to build a more integrated folder structure for your project. One approach in Data Science Experience (DSX) is to use a source control system to store and persist code, and use DSX as the Spark engine for running it. Here I use GitHub, pulling and pushing code changes to try them on DSX. I use DSX notebooks as a command-line interface to Spark for PySpark, and as a shell interface for git through the “!” and “%” magic commands in Jupyter notebooks.

Create a GitHub repo on github.com

The first thing you need to do is create a repository for your code. Here I am using github.com, but this process should work fine with any git system that you can reach from DSX.

After you have created the repo, you can load it up with code, either from your local environment or with the text editor in GitHub. Here is a nice hello world on git.

Clone repo to DSX

To clone the repo over to DSX, use the “!” shell escape and the “%” magic commands in Jupyter notebooks to call the underlying Linux and git commands that you need to manage files and the kernel's working directory.
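The reason the listings in this post change directory with %cd rather than !cd is that a "!" command runs in a child shell that exits as soon as the command finishes, while %cd changes the working directory of the notebook process itself. The following is a minimal plain-Python sketch of that difference, not DSX-specific code; the scratch directory is a hypothetical stand-in for a real project path.

```python
import os
import subprocess
import tempfile

start = os.getcwd()
target = tempfile.mkdtemp()  # hypothetical scratch directory for the demo

# Roughly what "!cd <target>" does: the cd happens inside a child shell,
# which exits immediately, so this process stays where it was.
subprocess.run(["sh", "-c", 'cd "$1"', "sh", target], check=True)
after_shell_cd = os.getcwd()

# Roughly what "%cd <target>" does: the notebook process itself moves.
os.chdir(target)
after_chdir = os.getcwd()

# Go back so the rest of the session is unaffected.
os.chdir(start)

print("after shell cd:", after_shell_cd)
print("after chdir:   ", after_chdir)
```

In practice this means any git command that depends on the current directory should come after a %cd, not a !cd.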

Below is a screenshot of how I cloned a GitHub repo over to my GPFS storage in DSX. In case you are interested, here is a link to an overview of distributed data in GPFS storage [pdf format], which is the file system used. One nice attribute of GPFS as opposed to HDFS is that you can use the file system to hold and access kernel data without having to resort to hdfs program calls; here we are just using GPFS as a local data store for our code.
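Because the clone lives on what looks like an ordinary file system, ordinary file calls are enough to inspect it. As a small sketch of that idea, the helper below lists the files in a cloned repo using only the Python standard library; the function name list_repo_files and the directory you pass to it are illustrative assumptions, not part of DSX.

```python
import os

def list_repo_files(repo_dir):
    """Walk repo_dir with ordinary file calls, skipping git's own metadata."""
    found = []
    for root, dirs, files in os.walk(repo_dir):
        # Prune the .git directory in place so os.walk does not descend into it.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, repo_dir))
    return sorted(found)
```

You would call it with the path of the clone, for example list_repo_files("testdsx") from the directory that contains the repo.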

Here’s the code for cloning a GitHub repo from Python. Notice that you are going to need the path to your DSX home directory, which you can find by running the !pwd command in a notebook. The part of the path before “/notebook/work” is your home directory.
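If you would rather not copy the path by hand, the same rule can be applied in code. This is a minimal sketch that assumes the working directory contains the “/notebook/work” segment described above; the helper name dsx_home is made up for illustration, and the path is returned unchanged when the segment is missing.

```python
import os

def dsx_home(cwd, marker="/notebook/work"):
    """Return the part of cwd before marker, or cwd unchanged if marker is absent."""
    head, sep, _tail = cwd.partition(marker)
    return head if sep else cwd

# In a notebook, os.getcwd() reports the same directory that !pwd prints.
print(dsx_home(os.getcwd()))
```

The result is the value to substitute for DSX_HOME in the listings that follow.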

Python for cloning the repo in DSX:

# if you don't care about persistence, just save to notebook/work with:
!git clone https://github.com/jimcrozier/testdsx.git

# if you do care about persistence
# move up gpfs out of notebooks
# get your DSX_HOME with !pwd
%cd /gpfs/global_fs01/sym_shared/YPProdSpark/user/DSX_HOME
!git clone https://github.com/GITHUBNAME/testdsx.git

# move to the project's home directory (the cloned repo)
# notice that you will need your DSX_HOME, just use !pwd
%cd /gpfs/global_fs01/sym_shared/YPProdSpark/user/DSX_HOME/testdsx

# pull the repo, from the home directory of the project
!git pull

# you will rarely need to push a repo, but here is how
# from the home directory of the project
!git add .
!git commit -m 'some useful message'
!git push
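The "!" calls above print whatever git prints, but they do not stop the notebook when a step fails, and git commit reports an error when there is nothing to commit. One possible way to make the push step more defensive is sketched below using the standard subprocess module. The helper names run_git, has_changes, and commit_and_push are hypothetical; the only git features assumed are the add, commit, and push commands already shown and git status --porcelain.

```python
import subprocess

def run_git(args, cwd="."):
    """Run `git <args>` in cwd; return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(["git", *args], cwd=cwd,
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          universal_newlines=True)
    return proc.returncode, proc.stdout

def has_changes(porcelain_output):
    """True if `git status --porcelain` listed at least one changed path."""
    return any(line.strip() for line in porcelain_output.splitlines())

def commit_and_push(message, cwd="."):
    """Stage, commit, and push only when the working tree has changes."""
    code, status = run_git(["status", "--porcelain"], cwd)
    if code != 0 or not has_changes(status):
        return False  # not a repo, or nothing to commit
    for args in (["add", "."], ["commit", "-m", message], ["push"]):
        code, output = run_git(args, cwd)
        if code != 0:
            raise RuntimeError("git %s failed:\n%s" % (args[0], output))
    return True
```

From a notebook you would call commit_and_push("some useful message", cwd=...) with the path of the cloned repo; it returns False when there is nothing to do and raises if any git step fails.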

You can also do this in R instead of Python:

# if you don't care about persistence 
system("git clone https://github.com/jimcrozier/testdsx.git", intern=TRUE)

# if you do care about persistence
# move up gpfs out of notebooks
setwd("./../..")
system("git clone https://github.com/jimcrozier/testdsx.git", intern=TRUE)

# move to the project's home directory
# notice that you will need your DSX_HOME, just use getwd()
setwd("/gpfs/global_fs01/sym_shared/YPProdSpark/user/DSX_HOME/testdsx")

# pull the repo, from the home directory of the project
system("git pull", intern=TRUE)

# you will rarely need to push a repo, but here is how
# from the home directory of the project
system("git add .", intern=TRUE)
system("git commit -m 'some useful message'", intern=TRUE)
system("git push", intern=TRUE)

Happy hacking, and feel free to reach out if you have any problems or suggestions for better managing this type of workflow.


Originally published at datascience.ibm.com on January 31, 2017 by Jim Crozier.