The Herpetologist Social Scientist guide to Git for Data Science — Part 1

Giulio Gabrieli
The Herpetologist Social Scientist
5 min read · Dec 30, 2021

I always get weird looks whenever I tell people that I use Python as a core part of my Ph.D. in Psychology. The stereotype revolves around an old man with a beard and thick glasses questioning someone lying down on a bed, not a nerdy guy with a laptop and a dark-mode IDE writing one line of code after another. In my previous posts I introduced some of the reasons why I decided to “adopt a python”, and I listed some of the tools and packages I use daily for my research.

A Python. Photo by rushil shrivastava on Unsplash

However, a core part of research work and coding involves organizing datasets, code, scripts, documents, images, and notes. I have seen countless projects fail because of poor organization, lack of structure, the impossibility of sharing data, and the absence of backups. While many solutions can address the problem, my go-to is a combination of Git, GitHub, and GitKraken. But what are they? And how can they help you?

Git

Git is a version control system, created by Linus Torvalds in 2005. While in the beginning it was aimed at supporting the developers of the Linux kernel, it has evolved and is without doubt the most widely used version control software. To put it in simple terms, Git is a tool that allows you to manage the history of a project (called a repository), providing version tracking capabilities and support for non-linear development. Let’s look at an example. You have the folder of a project (your repository) in which you keep code, documents, and data. Every time you make an edit, Git helps you keep track of the changes, so that if you make an error you can easily go back to a previous version of the whole project or of a single file. Moreover, it allows you to create a copy of the project (a branch), work on it, and later merge the copy back into the original, helping you find conflicts in your code or data. As such, Git can support your research work by giving you a framework to track changes and return to previous stages of your project in a smart way. However, Git has some limitations:

  • It’s a command line tool: as such, it may be difficult to use for non-expert users.
  • It works on your computer, so you still need to make backups on external media.
  • It has a steep learning curve: it uses a unique set of terms that may be difficult to grasp, especially if you don’t use them daily.
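To make the example above a little more concrete, here is a minimal sketch of that workflow driven from Python. It assumes Git is installed and on your PATH; the file names, commit messages, branch name, and the small `git` helper are all made up for illustration, and the throwaway identity passed with `-c` is only there so the commits work on any machine. It is one possible way of doing it, not the only one.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(repo, *args):
    """Run one git command inside `repo` and return its trimmed output."""
    # Hypothetical helper: the -c options set a throwaway author identity.
    cmd = ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def track_and_revert_demo():
    """Track edits, restore an older version, then branch and merge."""
    if shutil.which("git") is None:
        return None  # Git is not installed, nothing to demonstrate

    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        script = repo / "analysis.py"  # hypothetical project file

        # Turn the folder into a repository and remember the default branch.
        git(repo, "init")
        main_branch = git(repo, "symbolic-ref", "--short", "HEAD")

        # First version of the file.
        script.write_text("threshold = 0.5\n")
        git(repo, "add", "analysis.py")
        git(repo, "commit", "-m", "First version of the analysis")

        # A later edit that turns out to be a mistake.
        script.write_text("threshold = 5.0\n")
        git(repo, "commit", "-am", "Change threshold")

        # Bring the file back as it was one commit earlier and record that.
        git(repo, "checkout", "HEAD~1", "--", "analysis.py")
        git(repo, "commit", "-m", "Restore previous threshold")
        restored = script.read_text()

        # Non-linear development: work on a branch, then merge it back.
        git(repo, "checkout", "-b", "experiment")
        (repo / "notes.txt").write_text("Try a stricter filter\n")
        git(repo, "add", "notes.txt")
        git(repo, "commit", "-m", "Add notes")
        git(repo, "checkout", main_branch)
        git(repo, "merge", "experiment")

        history = git(repo, "log", "--format=%s").splitlines()
        return {
            "restored": restored,
            "history": history,  # newest commit first
            "merged": (repo / "notes.txt").exists(),
        }


if __name__ == "__main__":
    print(track_and_revert_demo())
```

Everything happens in a temporary folder, so you can run it without touching your real projects. In day-to-day use you would type the same commands (`git init`, `git add`, `git commit`, `git checkout`, `git merge`) directly in a terminal.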

For these reasons, I use Git in combination with two other tools: GitHub, an online hosting service with Git support, and GitKraken, a program that lets you manage Git repositories through a graphical interface and that supports GitHub. Let’s dig into this a little more.

A Git repository visualized. Photo by Yancy Min on Unsplash

GitHub

GitHub is a cloud hosting service aimed at developers. It allows you to create Git repositories online, host web pages, host software releases, and track issues. There are three main reasons I use GitHub: cloud backups, code and data sharing, and automation. While the last is a more complex topic that will be covered in a later part of this tutorial series, being able to store my projects in the cloud and share them is crucial. For example, this is the repository of one of the Python packages I have created: https://github.com/Gabrock94/Pysiology. It lets me store online the code to run the analysis, sample data (but you can store whole datasets too), examples, an introduction to the package, keywords for discoverability, and much more. To manage repositories on GitHub you will need to create an account, and to “push” (upload) or “pull” (download) a repository or part of it you will need Git on your machine.
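The push and pull steps can be sketched without an account or an internet connection. In the example below a local “bare” repository stands in for GitHub, which is an assumption made purely so the code can run anywhere; with the real service the remote would be the URL of your repository instead of a folder. The folder names (`laptop`, `office`), the file contents, and the `git` helper are hypothetical, and Git is assumed to be installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(cwd, *args):
    """Run one git command inside `cwd` and return its trimmed output."""
    # Hypothetical helper: the -c options set a throwaway author identity.
    cmd = ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.com", *args]
    done = subprocess.run(cmd, cwd=cwd, check=True, capture_output=True, text=True)
    return done.stdout.strip()


def push_pull_demo():
    """Share a project between two working copies through a remote."""
    if shutil.which("git") is None:
        return None  # Git is not installed, nothing to demonstrate

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        remote = root / "remote.git"  # local stand-in for a GitHub repository
        laptop = root / "laptop"      # first working copy
        office = root / "office"      # second working copy

        # Create the empty remote and get a first working copy of it.
        git(root, "init", "--bare", str(remote))
        git(root, "clone", str(remote), str(laptop))

        # Commit on the laptop, then push (upload) to the remote.
        readme = laptop / "README.md"
        readme.write_text("My project\n")
        git(laptop, "add", "README.md")
        git(laptop, "commit", "-m", "Add README")
        git(laptop, "push", "origin", "HEAD")

        # A second machine obtains the project by cloning the remote.
        git(root, "clone", str(remote), str(office))

        # More work on the laptop, pushed again.
        readme.write_text("My project\nNow with sample data\n")
        git(laptop, "commit", "-am", "Describe sample data")
        git(laptop, "push", "origin", "HEAD")

        # The office copy pulls (downloads) the new commit.
        git(office, "pull", "--ff-only")
        return (office / "README.md").read_text()


if __name__ == "__main__":
    print(push_pull_demo())
```

The remote copy is what gives you the backup: even if the `laptop` folder were lost, the project and its history could be cloned again from the remote.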

GitHub solves the problem of creating copies of your projects that you can use as backups, and of sharing your materials. In the second part of this tutorial you will learn how to exploit GitHub for your research. However, GitHub doesn’t make it easy to manage your project’s versions. And here’s where GitKraken comes into play.

The GitHub mascot. Photo by Roman Synkevych on Unsplash

GitKraken

GitKraken is one of the tools that are always present on my Linux distro. It’s a powerful tool that lets you manage Git repositories visually, works with GitHub (and other online Git hosting services), and can also create boards and timelines for your projects. The graphical interface of GitKraken is clear and very intuitive, and it helps you keep all the details you need on a single screen: current version offline, current version online, last changed files, tracked changes, and a timeline of the latest versions. A full list of the available features can be found on GitKraken’s website.

A screenshot of the projects I am currently working on, opened in GitKraken. In the left column I can see where the project is hosted, along with a summary of open issues. The central column shows the latest edits and versions in chronological order, while the right column shows the files that have changed and are not yet synchronized with either the online or the offline repository. The upper bar is used to pull and push code to the online repository, open boards, and visualize the timeline.

What you need to know

To sum up, here I introduced Git, GitHub, and GitKraken, three tools that help you organize and maintain your research work. In the next part of this tutorial you will learn how to set up a GitHub and a GitKraken account, how to store your projects effectively on GitHub, and how to use GitKraken to manage your projects visually.
