Can I use GitHub?

April C
Jan 2, 2020

Introduction

GitHub is well known as an open-source code collaboration environment for software development. Clearly, the creators of GitHub want to market this platform to non-coders. Banners on GitHub’s site proclaim:

“Learn Git and GitHub without any code!” (marketing banner on GitHub’s website)

But can GitHub be of much use to non-coders? Many data science bloggers reference their personal GitHub projects and advocate for its utility. However, learning GitHub can be intimidating for those of us most comfortable in a GUI environment. In fact, I’ve started to learn GitHub three times before writing this. Getting started on the platform looks easy enough but becomes intimidating quickly. Most of the tutorials about GitHub start out too simple to be useful and quickly become too complex to follow.

In this paper, I’ll break down the key benefits and drawbacks of GitHub from the perspective of an aspiring data scientist. Then, I’ll step through a plain-English instructional demonstration of the initial steps you should take in order to get rolling on this platform, using a couple of relevant examples.

Summary of Findings

In a word, no — you do not need to be a coder to find GitHub useful. However, there certainly are a lot more opportunities on the platform (today) for people who are interested in coding. And, many of us will have to learn not to jump backward at the mere sight of raw code because it is totally unavoidable.

Code is unavoidable because one of GitHub’s jobs is comparing differences across unformatted raw text files (code, or perhaps prose). It can also display web-ready formats like HTML, JavaScript, and Markdown. It will not compare differences across these formatted files and will not display application file types (like Excel or Tableau) at all. However, the file repository can accept any file type (as far as I could tell), which means that its other big jobs, sharing and tracking versions of any file, can be accomplished just as well for my projects!

Of course, there are other ways to do this. For example, Microsoft 365 now keeps better track of file versions than it ever has before, so you can mostly trust that you’ll be able to roll back a file when needed. You can also work asynchronously with collaborators on their cloud. Many similar capabilities exist across other platforms.

GitHub, though, is much better at visually displaying who, when, how, and why changes have been made with each version. You can roll back to a prior version without losing the changes you made later. It is also better at sharing with people, even those you don’t know (yet), which is very useful to data scientists who need to hear about potential errors or biases that may be lurking in their models. It also encourages you to ask for help and easily find ways to help others.
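The idea of rolling back without losing later work can be pictured with a small sketch. This is a toy model in Python, not how Git actually stores data; the `History` class and its method names are made up for illustration. The point it shows is that a rollback is recorded as a new version, so everything that came after the old version stays in the history.

```python
class History:
    """Toy version history: a rollback is recorded as a new version."""

    def __init__(self):
        self.versions = []  # list of (message, content) tuples, oldest first

    def commit(self, message, content):
        self.versions.append((message, content))
        return len(self.versions) - 1  # index of the new version

    def rollback_to(self, index):
        # Copy the old content forward instead of deleting later versions.
        _, old_content = self.versions[index]
        return self.commit(f"Roll back to version {index}", old_content)

    def latest(self):
        return self.versions[-1][1]


history = History()
first = history.commit("Initial model", "profit = sales - cost")
history.commit("Experiment", "profit = sales - cost - tax")
history.rollback_to(first)

print(history.latest())
print(len(history.versions), "versions kept")
```

After the rollback, the latest content matches the first version, and the experiment is still there in the middle of the history if it is ever needed again.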

In short, GitHub is a very useful platform for data scientists, even those of us who are just getting our coding fingers. It is intuitive (once you learn the basics), and a robust mechanism for project control and collaboration.

Demonstration

If you want to watch a video tutorial, the most approachable GitHub tutorial I found is “Git and GitHub for Poets” published by The Coding Train on YouTube. It fulfils its promise to be a useful amount of instruction that requires absolutely no programming background.

https://www.youtube.com/playlist?list=PLRqwX-V7Uu6ZF9C0YMKuns9sLDzK6zoiV

Below are the steps that will get you comfortable using GitHub for your projects, and for expanding your data science experience.

1) Get Started — initiate your account, create a repository and upload a file

This is very easy and described well in all of the tutorials.

Go to https://github.com/ and enter a username, email address, and strong password.

I selected the free version but noted two important limitations — only 3 collaborators are allowed on private repositories, and only 500MB of storage is allowed. I’m not sure what 2,000 “action minutes” equates to, but it’s worth noting that this is apparently only 200 actual minutes on Mac OS. More info here — https://help.github.com/en/github/setting-up-and-managing-billing-and-payments-on-github/about-billing-for-github-actions.
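The arithmetic behind that note can be sketched as follows. This assumes the per-operating-system minute multipliers described in GitHub's billing documentation at the time of writing (1 for Linux, 2 for Windows, 10 for macOS). The multiplier table and the function name are assumptions for illustration, so check the billing page linked above for current values.

```python
# Assumed billing multipliers per runner operating system.
MINUTE_MULTIPLIERS = {"linux": 1, "windows": 2, "macos": 10}


def usable_minutes(included_minutes, runner_os):
    """Return how many real minutes the included quota buys on one OS."""
    return included_minutes // MINUTE_MULTIPLIERS[runner_os]


for runner_os in MINUTE_MULTIPLIERS:
    print(runner_os, usable_minutes(2000, runner_os))
```

Under these assumptions, the 2,000 included minutes stretch to 2,000 real minutes on Linux but only 200 on macOS, which matches the figure noted above.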

Then, create your first repository. This is basically a project folder for storing all of your files on a related topic.

The repository name will be combined with your username as a searchable reference.

The description will be published in the README, which can be expanded later.

You can decide whether to open it as private or public (this can be updated later); either can be shared with specific collaborators.

Be sure to initialize with a README. This is published on the first page of the repository, so it is the first thing people see when they visit. It should tell people what the project is about, and it’s helpful to update it (later) in order to describe the contents of the repository or how to use the various files.

If your work requires a license, this can be included.

GitHub Create a new repository

Now upload a file! I uploaded a CSV file, a Jupyter Notebook (.ipynb), an ebook[i], and a Tableau file. These all worked pretty well. CSV files display in row and column format. Jupyter Notebooks display as formatted web pages, but they are not interactive in GitHub. The ebook and Tableau files are coded in XML, so the actual file won’t display, but the code and any changes will. Microsoft files and PDFs don’t work as well because GitHub won’t even display the raw code, but you can still track the versions in the repository and share the files.

Just click Upload files, navigate to your file, and click Commit. Commit saves that version of the file to your repository.

The default is to save to the master branch, which is also called the base branch. A branch is a set of versions, with the most recent version displayed.

New Repository created
File uploaded

Now, you can open the file.

Open the file

2) Editing, Branching and Merging

It’s time to edit this file and create a new version. For this type of file, it’s easy enough to just use the little edit pencil.

Now it opens in that unavoidable raw code format. This file is pretty easy to read since it’s just a CSV. I can see the update I want to make to the Order ID.

display of the open file

I will create a new branch for demonstration instead of committing directly to the master branch.

Branching has a couple of advantages. First, if a team is working asynchronously, additional changes can be made to the original file while this particular version of the file is being simultaneously updated. Second, if a totally new idea pops up that wouldn’t be appropriate for the master, this branch can become its own separate sub-project with its own branches and commits.
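A small sketch may help picture what a branch is. This is a toy model in Python, not Git's real data structure: each branch is just a name that points at a list of commit messages. The `Repo` class, its methods, and the branch name `update-order-id` are all hypothetical, and the merge shown covers only the simplest case, where master has not changed since the branch was created.

```python
class Repo:
    """Toy model: each branch name points at a list of commit messages."""

    def __init__(self):
        self.branches = {"master": ["Add sales CSV"]}

    def create_branch(self, name, source="master"):
        # A new branch starts with the same history as its source.
        self.branches[name] = list(self.branches[source])

    def commit(self, branch, message):
        # A commit only extends the branch it is made on.
        self.branches[branch].append(message)

    def merge(self, source, target="master"):
        # Simplest case only: the target has not changed since the branch
        # was created, so it can adopt the source's extra commits as-is.
        target_history = self.branches[target]
        source_history = self.branches[source]
        if source_history[: len(target_history)] != target_history:
            raise ValueError("Histories diverged; a real merge is needed")
        self.branches[target] = list(source_history)


repo = Repo()
repo.create_branch("update-order-id")
repo.commit("update-order-id", "Fix Order ID")
master_before_merge = list(repo.branches["master"])  # still one commit
repo.merge("update-order-id")

print(master_before_merge)
print(repo.branches["master"])
```

The master branch is untouched until the merge, which is what lets teammates keep working on the original while the branch is being edited.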

Commit (save into GitHub) changes

Now I can see the difference between my new branch and the master. Red indicates delete, green indicates add, blue indicates hidden rows with no changes.

For CSV files, there is a darker shade on the row/column intersection where the change occurred.

Comparison of original file to updated file
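For readers curious about what sits behind that comparison, here is a rough sketch using Python's standard `difflib` and `csv` modules. It is not GitHub's own implementation. The sample rows are hypothetical, and the `changed_cells` helper is a made-up function that assumes both versions have the same rows and columns in the same order.

```python
import csv
import difflib
import io

# Hypothetical before/after versions of a small CSV file.
original = "Order ID,Region,Profit\n1001,East,250\n1002,West,310\n"
updated = "Order ID,Region,Profit\n1001,East,250\n1009,West,310\n"

# Line-level diff: "-" lines would show in red, "+" lines in green.
diff_lines = list(
    difflib.unified_diff(
        original.splitlines(),
        updated.splitlines(),
        fromfile="master",
        tofile="new-branch",
        lineterm="",
    )
)
removed = [line[1:] for line in diff_lines
           if line.startswith("-") and not line.startswith("---")]
added = [line[1:] for line in diff_lines
         if line.startswith("+") and not line.startswith("+++")]


def changed_cells(old_text, new_text):
    """Return (row number, column name, old value, new value) per changed cell."""
    old_rows = list(csv.reader(io.StringIO(old_text)))
    new_rows = list(csv.reader(io.StringIO(new_text)))
    header = old_rows[0]
    changes = []
    data_pairs = zip(old_rows[1:], new_rows[1:])
    for row_number, (old_row, new_row) in enumerate(data_pairs, start=1):
        for column, old_value, new_value in zip(header, old_row, new_row):
            if old_value != new_value:
                changes.append((row_number, column, old_value, new_value))
    return changes


print("\n".join(diff_lines))
print(changed_cells(original, updated))
```

The line-level diff reports one changed row as a deletion plus an addition, while the cell-level helper pinpoints the row and column intersection, which is the spot that gets the darker shade in the CSV view.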

GitHub shows no merge conflicts and offers a pull request so I could bring this update back into the master now, but I want to explore the branches first.

Open a pull request
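To see what "no merge conflicts" means, consider this simplified sketch of a three-way merge. It is a teaching model in Python, not Git's actual merge algorithm: it assumes all three versions have the same number of lines, and the function name and sample rows are hypothetical. A conflict arises only when both sides changed the same line in different ways.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of `base`, line by line.

    Simplifying assumption: all three versions have the same number of lines.
    Returns (merged_lines, conflicts), where conflicts lists the line numbers
    that both sides changed in different ways.
    """
    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:      # both sides agree (same change, or no change)
            merged.append(o)
        elif o == b:    # only "theirs" changed this line
            merged.append(t)
        elif t == b:    # only "ours" changed this line
            merged.append(o)
        else:           # both changed it differently: a person must decide
            merged.append(b)
            conflicts.append(number)
    return merged, conflicts


base = ["Order ID,Profit", "1001,250", "1002,310"]
ours = ["Order ID,Profit", "1001,250", "1009,310"]    # edited on the branch
theirs = ["Order ID,Profit", "1001,275", "1002,310"]  # edited on master meanwhile

clean_merge, clean_conflicts = three_way_merge(base, ours, theirs)

# Now both sides edit the same line in different ways.
clashing_theirs = ["Order ID,Profit", "1001,250", "1005,310"]
_, clashing = three_way_merge(base, ours, clashing_theirs)

print(clean_merge, clean_conflicts)
print(clashing)
```

In the first case the two edits touch different lines, so both are kept automatically. In the second case line 3 is flagged, which is the situation where GitHub would ask someone to resolve the conflict by hand.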

There is a visualization of all the branches and versions under the Insights tab; navigate to Network on the left menu.

The black line shows the master branch. Each dot represents a file update (commit).

The colorful lines are branches. Where the blue lines point back to the black line, this shows an update that was merged back in with the master. The purple line shows the commit that was just made — hovering over the dot gives some details, including the identification number referred to as the hash.

From the Code tab, you can get to any branch from the ‘Branch:’ dropdown. Then, you can select the file and review the history of that branch.

This version can always be reviewed and can be identified by its hash. In the screenshot, the hash code is 9b870c1. The commits to the left are the master branch commits that preceded it. The two to the right on the green line are commits that haven’t been made to the master (they are on their own branch). Note, if we were looking at the master branch, we would not see the last update because it has not been merged in yet.
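Where does a hash like that come from? The sketch below shows how Git derives an ID from content, using the simplest kind of Git object (a file's contents, called a blob) and assuming Git's default SHA-1 hashing. A commit hash is produced by hashing a commit record rather than a single file, so this is an illustration of the technique, not a way to reproduce the hash in the screenshot. The function name is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID Git assigns to a file's contents (a "blob" object)."""
    # Git hashes a short header ("blob <size>" plus a zero byte) followed
    # by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


full_id = git_blob_id(b"hello world\n")
short_id = full_id[:7]  # abbreviated form, like the 7 characters shown above

print(full_id)
print(short_id)
```

Because the ID is calculated from the content itself, the same content always gets the same ID, and any change produces a different one. The 7-character form is just the start of the full 40-character value.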

Select a branch and review the history
History of

Now we can create a pull request from any of the prompts. The pull request will be indicated on the Pull requests tab.

I can choose to merge it, or close with a comment.

Merge

3) Get GitHub Desktop and Clone your repository

If you want to use GitHub for files that require applications (non-code files), be sure to get GitHub Desktop right away. This will allow you to download (clone) the files in a repository onto your computer’s hard drive so that you can view and update them using your own software.

You can either select the ‘Clone or download’ button on the main repository page, or the computer icon on the file page. Either will clone the full repository and prompt you to download the desktop version of GitHub at https://desktop.github.com/.

Clone repository in GitHub Desktop

Now, you will have a folder under Documents called GitHub that has all of the files from the repository inside.

GitHub repository on my Mac in Finder

All of these files can be accessed and edited by the applications that run them.

4) Fork

So far, we’ve reviewed controlling versions in your own repository. A fork is a full copy of a repository from someone else’s account. This allows people to build on each other’s work. If the work is improved, a pull request can be made. Collaboration!

In this example, my friend aprilcrompy wants to review my Tableau file. In order to do this, she needs to fork it.

This screenshot shows that she found my account in GitHub and opened this project, then selected Fork.

Now that she has her own repository with these files, she can create a clone on her desktop.

Screenshot of another person’s fork of my repository

From her local GitHub folder, she can open this file in her desktop with the Tableau application.

Reviewing the Tableau file on her desktop
Updating the Tableau file — changed line to bar chart and changed title of ‘Profit’ measure to ‘Money’

She makes some changes and saves the file as usual. This updates the copy of the file in the local repository on her computer. She can review the changes to the code in the GitHub Desktop app and commit them there.

View of Tableau file in the GitHub desktop App under History

She’s not an XML developer, but she can see that a caption was added to change the display of the Profit data element to ‘Money,’ and she can see that the chart class was updated from ‘Line’ to ‘Bar,’ verifying the changes she made!

Now, she can push this commit to her online GitHub repository (where it was forked from mine).

Push from desktop Git repository to online Git repository

Once the file is committed back to her online Git repository, if she wants me to incorporate these changes into my original file, she can create a pull request.

I’ll review the pull request and accept the merge (just like before).

Now, selecting this file from my repository shows 2 contributors, and the history identifies that my friend, aprilcrompy, created the last update.

5) Collaborators

When you’re working with a team, you may trust your collaborators and may not want to be bothered by all of these pull requests.

This can be accommodated by adding collaborators to your repository. Under the Settings tab, select Collaborators from the left nav. Then, type in their GitHub username, select the right one when it pops up, and click Add collaborator.

This will send an email invitation to your friends, which they can accept. Now, they can hack away at all of the files, commit and merge as they see fit.

6) Issues, Exploring and Helping

Collaboration is a terrific way to learn. On GitHub, anybody can add an issue to a repository (if they have access to it), including the originator. These issues can be labeled to help people find them so they can offer to help out.

Aprilcrompy starts by entering my repository. She selects the issues tab and creates a new issue. She enters a title and a longer comment.

In my repository, I see that the issue notification has incremented, and click on the tab.

I can click on the title of the issue to open it.

This issue is now available for others to find, fork, fix, and submit a pull request.

You can find files to work on by clicking the Explore link at the top, or using the search field. There is an advanced search that allows you to filter on labels. Then, additional select-filters by type and language help narrow the issues to those you most want to review.

For example, since I want to learn more about writing Python 3 in Jupyter Notebooks, this issue labeled “good first issue” looks like it may be an opportunity to update code from Python 2 to Python 3. Since it is also labeled “help wanted,” the author will probably be kind if I have questions (or make mistakes) since I will be taking a load off of them by trying.
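The same kind of search can also be written as a query string. The sketch below builds one in Python using GitHub's search qualifiers (`is:`, `language:`, and `label:`). The function name is made up, and the exact URL parameters (`q` and `type=issues`) are an assumption about how the search page is addressed, so treat the printed link as a starting point rather than a guaranteed address.

```python
from urllib.parse import urlencode


def issue_search_url(labels, language):
    """Build a GitHub search URL for open issues with the given labels."""
    terms = ["is:issue", "is:open", f"language:{language}"]
    # Labels containing spaces must be wrapped in double quotes.
    terms += [f'label:"{label}"' for label in labels]
    query = " ".join(terms)
    return "https://github.com/search?" + urlencode({"q": query, "type": "issues"})


url = issue_search_url(["good first issue", "help wanted"], "python")
print(url)
```

Typing the unencoded query (everything from `is:issue` through the last label) into the search field should narrow the results in the same way as the dropdown filters.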

Conclusion

After exploring GitHub for many hours, I now agree with all of the data science bloggers who have recommended learning to use it early on, and who advocate that it’s a core data science platform. Although it’s most useful for code updates, it can be used as a collaborative repository and traceable version control system for analytics of many types. Also, I think this is a great place for aspiring data scientists to practice and discover new methods by exploring and contributing to others’ analytic tools.

GitHub functionality gets far more complex from here. There are many external plugins that build upon what you’ve seen, as well as native utilities like varying workflows and using GitHub from the command line. I hope this paper helps you get productive in the environment, and we can learn the rest together. Explore what you’ve seen here at https://github.com/alcrompton/DS730Final and check out my ongoing GitHub trials, tribulations, and contributions at https://github.com/alcrompton.

Glossary

@mention — request feedback on pull requests from specific users

Base Branch — The main branch, versus Feature branches that will be merged back into the Base eventually.

Branching — Creating a new line of development by making updates to a copy of the original work. Allows for version control when those changes are merged with the original work as an update.

Clone — download a repository from GitHub online as a local copy on your computer

Commit — a defined point at which an update is made. Typically, commits are made after changes to multiple files have first been added to a staging environment. Commits allow you to review changes over time or roll back to a particular point.

Diff — difference between the content on one branch or commit from another. Git highlights the diff in green if it’s an add, and in red if it’s a deletion. A change is seen on two lines, as both an add and a deletion.

Daff — data difference, displays the differences in rows and columns of a .csv file (not working)[ii].

Fork — create a full copy of a repository of another user under your authority, including all of its history, in order to work forward on the project.

Git — The version control application (originally invented by Linus Torvalds for controlling Linux open source updates[iii])

GitHub — The collaborative environment that allows people to copy, comment on, and merge updates to Git (not invented by Linus Torvalds[iii])

Hash — the unique identifier for each commit, which enables tracking.

Issue — a comment that can be posted on a repository to identify a problem in the file.

Markdown — the syntax used to format text in the README file. A Markdown editor is available so you don’t have to remember how to write the syntax.

Merge Conflicts — when multiple people are working simultaneously on a fork or branch that they want to push back to the master branch, conflicts often arise if the same line of code is updated. This is managed in the workflow as a merge conflict.

Pull — Owners will pull your changes into the Base Branch once they approve.

Pull request — After making updates on a fork, request the original owner to bring those changes in to their original branch.

Push — Send commits from your local repository to a remote repository

README — High level documentation about what the project is and does

Repository — A location to store all files associated with a project. These can be different file types. Referred to as “repo” or “repro.”

Staging environment — also called the index, cache, or staged files. This is where changes are collected before they are committed to your local repository. In order to update an online repository, local commits must then be pushed to the online environment.

Wiki — the wiki is for hosting long form content, such as user documentation. Direct updates can be limited only to authors of the repository, or can be opened to the public.

[i] https://standardebooks.org/

[ii] https://paulfitz.github.io/2014/07/09/diff-merge-csv.html daff: CSV-aware merge driver

https://github.com/paulfitz/daff daff: data diff, http://paulfitz.github.io/daff/ live demo

[iii] https://www.wired.com/2012/05/torvalds-github/

http://eforexcel.com/wp/downloads-18-sample-csv-files-data-sets-for-testing-sales/ sample sales data file

https://towardsdatascience.com/introduction-to-github-for-data-scientists-2cf8b9b25fba

https://towardsdatascience.com/why-git-and-how-to-use-git-as-a-data-scientist-4fa2d3bdc197, Admond Lee 2/23/19

https://towardsdatascience.com/lessons-learned-using-google-cloud-bigquery-ml-dfd4763463c

https://product-hubspot-com.cdn.ampproject.org/c/s/product.hubspot.com/blog/git-and-github-tutorial-for-beginners?hs_amp=true

https://guides.github.com/activities/hello-world/
