Git. A Data Engineer’s must-have tool. Part 1 of 2.

Israel Phiri
Published in Analytics Vidhya
11 min read · Jan 4, 2021

Peer reviewed by @ Eduardo Lomonaco

A simple explanation of Git and its use.

If two or more participants were to collaborate on updating a file stored on a computer within a project, how would they do that? One could say it can be done through emails, phone calls, face-to-face meetings, virtual chats and so on. While these methods may work for small projects with few contributors, they become very inefficient as the number of contributors grows or as the project scope increases. Keeping track of changes, and being able to roll them back when necessary, becomes a daunting task. An open-source (free) tool called Git helps mitigate most of these challenges.

So one may ask, what is Git?

Well, Git is a Version Control System for tracking changes in computer files. It is an open-source tool, freely available for anyone to use as long as they adhere to the terms of its license.[1]

Why would we use Git and how?

Git is mainly used for coordinating work between multiple contributors. It tracks who made changes to a file, what the changes were and when they were made. Git can roll back changes to one or more files to any “commit” point in the past. The files are kept in a directory called a repository. The repository can be stored on a remote server (the cloud) or on a local machine/server. You do not need an internet connection to work on a repository that’s stored locally, but you will need one to access a remote repository.
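To make the “who, what, when” and the roll back idea a bit more concrete, here is a minimal sketch. It is not part of the walkthrough that follows: the throwaway directory, the file name notes.txt and the user details are all made up for illustration, and the “git checkout <commit> -- <file>” form used to restore the old version is just one of several ways Git offers to do this.

```shell
set -e
# Work in a throwaway directory so nothing on the machine is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical identity, stored only in this throwaway repository.
git config user.name "Example User"
git config user.email "user@example.com"

# First version of a file, committed.
echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "first version of notes"
first=$(git rev-parse HEAD)   # remember this commit point

# Second version of the same file, committed.
echo "version 2" > notes.txt
git commit -q -a -m "second version of notes"

# Who made each change, when, and with what message.
git --no-pager log --format='%an | %ad | %s'

# Bring notes.txt back to how it looked at the first commit.
git checkout -q "$first" -- notes.txt
cat notes.txt
```

The last command should print the first version of the file, while the history still contains both commits.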

Git vs GitHub.

Git is the actual software that lets you do version control on a file. GitHub, on the other hand, is a website that hosts Git repositories in the cloud and provides a User Interface (UI) for working with them. It is worth noting that Git can be used without GitHub. You might come across the two terms (Git and GitHub) being used interchangeably, so it is important to understand the distinction between them.

So, for aspiring Data Engineers, Data Scientists and Developers out there, knowing the basics of this tool is a good indicator to a potential employer that you could easily collaborate within a project! Git does get complicated as one dives deeper into it, but for beginners and intermediates, the scope covered here is a good start.

A typical Git use case.

Let’s have a look at a scenario where Git can be used. Say there is a team working on a project, building an end-to-end ETL pipeline or any other project that lives in files on a computer. Each team member is tasked with coding a different part of the project deliverable. The project lead would create a master repository with Git and then give access to each team member. This allows members to download or ‘clone’ the master repository to their local machines, make changes in the cloned repository, put the changes in the staging area, commit these changes and then upload or ‘push’ them back to the master repository. These changes can then be reviewed by peers before they are ‘merged’ into the master repository. I will admit that this part may be a bit confusing. Hang in there, more details will be discussed in part two of the series. For now, we need to understand that changes made to the repository will be tracked by Git. Should there be a need to roll back any changes or commits that were made in the past, Git will allow us to do that, hence the term ‘version control’.
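For the curious, the clone, stage, commit and push cycle described above can be sketched on a single machine. This is only an illustration under some assumptions: a “bare” repository in a throwaway directory stands in for the shared master repository, and the names central.git, alice and etl.txt, as well as the user details, are made up. Pushing to a real remote such as GitHub is left for part two.

```shell
set -e
work=$(mktemp -d)

# The project lead creates the shared repository (bare = no working files).
git init -q --bare "$work/central.git"

# A team member clones it. Git may warn that the clone is empty,
# which is expected here.
git clone -q "$work/central.git" "$work/alice"
cd "$work/alice"
git config user.name "Alice Example"     # hypothetical team member
git config user.email "alice@example.com"

# Make a change, stage it, commit it.
echo "extract step" > etl.txt
git add etl.txt
git commit -q -m "add extract step"

# Upload the commit to the shared repository.
git push -q origin HEAD

# The shared repository now contains the commit.
git --git-dir="$work/central.git" --no-pager log --all --oneline
```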

Enough theory, let’s walk through an illustration:

Git Install on Ubuntu / Debian.

If you are using Ubuntu / Debian, you may have Git installed already. You can confirm this by bringing up your terminal and running:

$ git --version

You should get something like this:

git version 2.25.1

This tells you what Git version your machine has. In my case, at the time of writing, I was on version 2.25.1.

If you don’t have Git, to install it on Ubuntu / Debian, run:

$ sudo apt-get update

Then run:

$ sudo apt-get install git

You might need to wait a few minutes for the installation to finish.

When it’s done run:

$ git --version

…and you should get the version.
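If you would rather have the terminal tell you in one go whether Git is present, a small check like the one below can help. It is only a sketch: it assumes a POSIX-style shell, and the wording of the messages is my own.

```shell
# "command -v" succeeds only if a git executable can be found on the PATH.
if command -v git >/dev/null 2>&1; then
    msg="Git found: $(git --version)"
else
    msg="Git is not installed or not on the PATH"
fi
echo "$msg"
```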

Git Install on Mac OS.

For Mac OS users, you most probably have Git installed already because Apple actually maintain and ship their own fork of Git, but it tends to lag behind mainstream Git by several major versions.[2]

Should you need to install Git on a Mac, head over to Atlassian tutorials and follow instructions for Mac installs.

Git Install on Windows.

For Windows users, the official build is available for download at the Git website. Choose Git for Windows setup, then download. Go to the downloads folder on your machine and double-click the Git .exe file to start the installation. Accept the default settings by clicking “next” on all the windows, then click “install” on the last window. Unless you specifically chose a non-default option during your installation, you will use a different terminal than the regular Windows “cmd” terminal to run Git commands. There should be a Git-Bash icon on your desktop after the installation. Use it to launch the Git-Bash terminal for your Git commands. If you can’t find it on your desktop, you might find it under your apps or installed programs.

Try to avoid using the GUI when working with Git as this does not give you an inner understanding of what you are doing. Stick to the terminal or what’s called Command Line Interface (CLI), when interacting with Git.

Creating a project repository and a README file.

To get started, let’s create an empty folder/directory anywhere convenient on our local machine and call it myFirstGit. I will create mine on the desktop using the terminal. Type:

$ cd ~/Desktop

This command takes me from wherever I am into my Desktop directory. Then I create the “myFirstGit” directory by running:

$ mkdir myFirstGit

Now this directory, myFirstGit, is where I will put files related to our project. Next, let’s go into our created directory by running:

$ cd myFirstGit

First we are going to check if Git knows anything about this directory, so let’s check the “git status” of this directory. We run:

$ git status

We should get an error like this:

fatal: not a git repository (or any of the parent directories): .git

This means our directory has not been initialized as a Git repository. Let’s do that:

$ git init

We get a response similar to this:

Initialized empty Git repository in /home/isr/Desktop/myFirstGit/.git/

This tells us that Git is now ready to track changes in this directory, though the directory is empty for now. At this point the directory is officially a repository, albeit a local one. Let’s create a file called README inside our repository. I will use my preferred text editor, Nano. You can use whatever editor you like, as long as you launch it from the CLI. So in my case I will run:
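If you are wondering what “git init” actually did, the sketch below shows the before and after in a throwaway directory (an assumption, so that your myFirstGit folder is left alone). It uses “git rev-parse --is-inside-work-tree”, a command not covered in this tutorial, to ask Git whether the present directory belongs to a repository.

```shell
dir=$(mktemp -d)
cd "$dir"

# Before "git init" the question fails, just like "git status" did above.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    before=yes
else
    before=no
fi

git init -q

# After "git init" Git answers "true".
after=$(git rev-parse --is-inside-work-tree)

echo "inside a repository before init: $before"
echo "inside a repository after init:  $after"

# The hidden .git directory is where Git keeps its tracking data.
ls -a
```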

$ nano README

Nano will create a file named README and open it at the same time. Inside the file just type:

Image by owner

For those using nano for the first time, after entering your message, press Control plus O, then press Enter to save it, then Control plus X to go back to your terminal.

A few notes about README file.

It’s best practice to include a README file when creating a new repository, and to keep it up to date as the repository changes. It serves as a brief description of what the repository is all about, including any relevant instructions associated with the project, and it can also be used to describe changes that have been made and why. So, take note here: never miss the README file, and make it as brief but as clear as possible.

Let’s check the git status of our repository:

$ git status

And we should get:

Image by owner.

“Git Add” a file to the staging area.

As we can see from the previous screenshot, Git is now aware of this new change in the repository and it is telling us to “git add” the file so that it is moved to the staging area and is ready to “commit”. Let’s do that:

$ git add README

Let’s check git status again:

$ git status

We get:

Image by owner.

Git is now tracking our README file. It has placed the file in the staging area, ready to “commit” to our repository.
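The move from “untracked” to “staged” can also be seen in a compact form with “git status --short”, which prints one line per file with a two-character code in front. The sketch below runs in a throwaway directory, and the text written into README is only a placeholder of mine.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

echo "My first repository" > README   # placeholder content

# "??" in front of a file means Git sees it but is not tracking it yet.
before=$(git status --short)
echo "$before"

git add README

# "A" in the first column means the file is staged as a new addition.
after=$(git status --short)
echo "$after"
```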

Commit a file

We might as well commit the README file and see what things will look like, but first we need to tell Git our “user name” and our “user email” if we haven’t done so already.

Let’s do that using the “git config” command:

$ git config --global user.name "your-user-name-here"

$ git config --global user.email "your-email-here"

Git uses only these two values to label our commits, so there is no need to store a password in the Git configuration.

You can check your Git configuration list at any time by running:

$ git config --list
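To read back a single setting instead of the whole list, “git config --get” can be used. The sketch below is a variation on the commands above: it leaves out “--global”, so the made-up name and email are stored only inside a throwaway repository and your real global settings stay untouched.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Without --global, the values are saved for this repository only.
git config user.name "Example User"        # hypothetical values
git config user.email "user@example.com"

# Read each value back on its own.
name=$(git config --get user.name)
email=$(git config --get user.email)
echo "Commits here will be labelled: $name <$email>"

# List only the settings stored in this repository.
git config --local --list
```

A value set inside a repository takes precedence over the global one, which is handy when one project needs a different email address.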

Let’s commit the README file:

$ git commit -m "my first commit to this repository"

We get a confirmation:

Image by owner.

This states that we have created the master branch of our repository with the short note “my first commit to this repository”. It also shows the changes that were made by this commit, i.e. only 1 file was created, README.
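Once a commit exists, we can ask Git what it recorded. The sketch below repeats the commit in a throwaway repository (with a made-up identity and placeholder README text) and then uses two commands not covered in this tutorial, “git log” and “git ls-tree”, to show the author, the message and the files held by the commit.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "My first repository" > README        # placeholder content
git add README
git commit -q -m "my first commit to this repository"

# Short commit id, author and message of the latest commit.
git --no-pager log -1 --format='%h | %an <%ae> | %s'

# The message on its own, and the files the commit contains.
subject=$(git log -1 --format=%s)
files=$(git ls-tree -r --name-only HEAD)
echo "message: $subject"
echo "files:   $files"
```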

For the sake of housekeeping, let’s step back and see if we can answer one or two “what if” questions.

Dealing with multiple files and unstaging files

Suppose we had more than one file to commit, how would this work? Let’s go back to our terminal and create two files (we could choose any file type we need, be it Excel, doc, csv, .py, jpg, etc.).

To keep it simple, we will create two text files again using nano. First we create “secondFile”.

$ nano secondFile

We enter a short text in secondFile:

Image by owner.

Save the file like we did before, then create thirdFile and write anything inside, I wrote “a third file for our repository”. Save it and then run:

$ ls

This gives us a list of files we now have in our present directory:

Image by owner.

Let’s check “git status”:

$ git status

We get:

Image by owner.

Git is now aware of two new untracked files, as shown above. Let’s “git add” them to the staging area:

$ git add .

Notice that this time I did not type any file name; I just used a “ .” (a single space and a period, or full stop) at the end instead. The period stands for the present directory, so it tells Git to stage every new or changed file in it, including files in its subdirectories. I could have added the files one by one, using their names like we did before.
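A quick way to see what “git add .” picks up is to try it on a mix of new and changed files. The sketch below does this in a throwaway repository. The identity, the file contents and the extra line appended to README are my own additions for illustration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Start from a repository that already has README committed.
echo "My first repository" > README
git add README
git commit -q -m "my first commit to this repository"

# Two new files, plus a change to the file Git already tracks.
echo "a second file" > secondFile
echo "a third file for our repository" > thirdFile
echo "one more line" >> README

# One command stages all three.
git add .
git status --short

staged=$(git diff --cached --name-only | wc -l)
echo "files staged: $staged"
```

In the short status, “A” marks the two newly added files and “M” marks the modified README.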

Run “git status” command again and we get:

Image by owner.

Our two files are now staged and ready to “commit”, but what if we want to unstage any or all of them?

As seen in the above image, Git already gives us a hint inside the brackets “( )” on how to do it.

Let’s unstage secondFile.

$ git restore --staged secondFile

Run “git status” and we get:

Image by owner.

As we can see, secondFile is now untracked while thirdFile remains staged. At this point we can even delete secondFile if we want, or make any changes to it, without affecting the master branch. If we wanted, we could go ahead and “commit” thirdFile by itself, no problem.
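The same unstaging step can be replayed in a throwaway repository to watch the status codes change. The sketch assumes Git 2.23 or newer (the “git restore” command does not exist in older versions) and uses a made-up identity and placeholder file contents.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# "git restore --staged" needs at least one existing commit to compare with.
echo "My first repository" > README
git add README
git commit -q -m "my first commit to this repository"

echo "a second file" > secondFile
echo "a third file for our repository" > thirdFile
git add .

# Take secondFile back out of the staging area. The file itself is untouched.
git restore --staged secondFile

second=$(git status --short secondFile)
third=$(git status --short thirdFile)
echo "$second"   # "??" = untracked again
echo "$third"    # "A"  = still staged
```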

Moving on, I won’t be showing commands we have covered above. It’s a good time to see if you remember them.

Let’s “git add” secondFile again and then check “git status” before we “git commit” both files to our repository, including the message “added two new files” in our commit. We get:

Image by owner.

The two new files have been added to our local repository master branch and if we run “git status” we get a “nothing to commit” message.
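If you want to compare notes on the exercise above, here is one possible answer, replayed from scratch in a throwaway repository so it can be run on its own. The identity and the file contents are placeholders. The last two commands, “git status --porcelain” and “git rev-list --count”, are not covered in this tutorial and are only there to double-check the result.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Recreate the state we were in: README committed, thirdFile staged,
# secondFile untracked.
echo "My first repository" > README
git add README
git commit -q -m "my first commit to this repository"
echo "a second file" > secondFile
echo "a third file for our repository" > thirdFile
git add thirdFile

# The exercise: stage secondFile again, check, then commit both files.
git add secondFile
git status --short
git commit -q -m "added two new files"

# Empty output here is the script-friendly form of "nothing to commit".
pending=$(git status --porcelain)
commits=$(git rev-list --count HEAD)
echo "pending changes: '$pending'"
echo "commits so far:  $commits"
```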

That’s it, folks! Part 1 of a 2 part series ends here.

Quick wrap up.

So let’s quickly go through the Git commands we have learnt so far.

$ git init

Initializes the present directory as a Git repository so that we can start tracking changes in it.

$ git status

Checks the git status of our directory/repository. It tells us whether the directory has been initialized as a Git repository, which files are being tracked in the repository, and whether there are files ready to “commit”.

$ git add <file name> or git add .

Adds the named file, or all new and changed files in the present directory (in the case of the single space and period, “ .”), to the staging area of Git. They will now be tracked and they are ready to “commit” to the repository.

$ git restore --staged <file name>

Unstages a file, removing it from the staging area.

$ git config --global user.name "your username here"

Registers a user name with Git. We need this before we can commit.

$ git config --global user.email "your preferred email here"

Registers an email address with Git. Again, we need this before we can commit.

$ git config --list

Shows your configuration list at any time.

$ git commit -m "short description message here" <file name>

Commits changes on one file into our repository.

$ git commit -m "short description message here" .

Commits changes on multiple files into our repository. Note the single space and “.” at the end of this command.

Right now, if our computer were on a local network, the repository could be shared with anyone on that network who has the rights to access its location. They could “clone” the repository to their local machine, make changes and then push their commits back to the master branch. Simulating the local network scenario will not be possible for us. However, in Part 2 of the series, I will go through how we can sign up with GitHub and then push our repository to the cloud so that anyone with internet access and the right permissions can access our repository from anywhere.

Thank you for the read, stay tuned for part 2…

Disclaimer.

This tutorial is shared with the sole intention of helping others understand how to use Git and GitHub. Be advised that it does not override information found in the official documentation of Git, GitHub, Mac OS, Linux Ubuntu/Debian, Microsoft and their affiliates. Use it at your own discretion.

References:

[1] Traversy Media, February 5, 2017. https://www.youtube.com/watch?v=albr1o7Z1nw

[2] Atlassian Bitbucket tutorials. https://www.atlassian.com/git/tutorials/install-git

Israel Phiri

Data Engineer, with expertise in Apache PySpark , PySparkSQL, Spark Structured Streaming, Azure Data Bricks, Apache Kafka, Airflow and Snowflake.