Achieving Success as a Data Scientist Part 2: Setting up your Data Science workspace

Byte Brilliance
8 min read · Jan 5, 2024

Note: This article is part of a series designed to guide aspiring Data Scientists. The series starts with Part 1 for complete beginners and increases in complexity from there. Feel free to explore different parts of the series according to your experience level.

Welcome to the second part of the series. In this article we will explore how to set up your workspace so that you are well prepared as you continue your journey into Data Science. Specifically, the topics we will cover are:

  • Git
  • Installing Anaconda and using Jupyter Notebooks
  • Exploring some of the popular Python libraries for Data Analysis and Machine Learning

Git

Git is a version control system that helps track changes in your code, allowing multiple contributors to collaborate seamlessly. GitHub is a web-based platform that utilizes Git, providing a central hub for hosting and managing Git repositories, fostering collaboration and offering additional features like issue tracking and pull requests. If you are familiar with the Command Line (Terminal), you can interact with Git/GitHub using commands. GitHub Desktop is a user-friendly application that simplifies Git commands, providing a graphical interface for beginners to manage their Git repositories without the need for extensive command-line knowledge. Personally, I prefer GitHub Desktop as it makes it a lot easier (especially for beginners) to build a GitHub repository.

One of the most important skills to learn early on in your journey is working with Git. As someone who only started using Git later on in his career, I can attest that building repositories early and familiarising yourself with version control principles will help position you as an above-average Data Scientist. Although it might not seem necessary now, using Git will prepare you for:

  • Building a portfolio for potential employers. Imagine building your experience by undertaking a few Data Science projects, but keeping all that code on your personal computer. You might know what you’re capable of, but future employers do not. By creating Git repositories on GitHub, you are building a public record of everything you can do.
  • Collaborating on larger Data Science projects once you land your first job. Right now, you may be the only one working on a particular project, but in a company it is commonplace for multiple professionals to be working on different parts of the same project. Without Git, you and your colleagues would have to send emails with the updated code. This opens the door for human error, which Git will help you avoid.

Following this tutorial on the Git website will help you install Git on your computer, whether you are running Linux, Windows or macOS. Once you’ve got Git installed, you can create an account on the GitHub website to start building repositories for all your awesome Data Science projects. Finally, you can download and install GitHub Desktop on your Windows or macOS machine. To install GitHub Desktop on Linux, you can follow this tutorial.

You can create a folder on your computer and initialise the repository by opening GitHub Desktop and clicking File -> New Repository… to open this screen:

The name should be short but descriptive. For example, “project1” does not tell you (or potential employers) anything about the project, whereas “xente-fraud-predictor” makes it clear that the repository is about a Fraud Predictor for Xente.
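If you want to derive a name in that style from a longer project title, here is a minimal sketch. The lowercase, hyphen-separated convention is my assumption, based on the “xente-fraud-predictor” example above, and `to_repo_name` is a hypothetical helper, not part of Git or GitHub Desktop.

```python
import re


def to_repo_name(title: str) -> str:
    """Turn a free-form project title into a short, lowercase, hyphenated name."""
    # Lowercase the title, then collapse every run of characters that are
    # not letters or digits into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    # Remove any hyphens left over at the start or end.
    return slug.strip("-")


print(to_repo_name("Xente Fraud Predictor"))  # xente-fraud-predictor
```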

The description is not required, but it is a good idea to add something short like “This is a repository to build a Fraud Predictor for Xente transactions”. This will help you remember what this repository is for in a few months when you’ve added (hopefully) many more projects to your GitHub.

The Local Path is the location of your project folder on your computer. Click the Choose… button and navigate to the folder. Note here that I’ve got a folder called “Zindi” which is where I keep all the Zindi-related projects I’ve worked on, but the folder I want to create my repository in is the specific project folder. In simple terms, you want a separate repository for each project you work on.

The Git Ignore file tells Git which types of files you want to exclude from your repository. For example, suppose you are working on a Computer Vision project where you have thousands of .jpg images taking up a few GB of storage space. You would not want to store all those images in your repository because uploading them from your computer to GitHub’s servers would take a very long time. Furthermore, anyone who wants to clone your repository would have to download those images. Therefore, in your Git Ignore file you would add the following line:

*.jpg

This line of code tells Git to ignore all .jpg files when copying your project to your repository. In the ReadMe (which we will discuss a little later) you can add a link to wherever you downloaded your dataset from so that anyone who wants to reproduce your results can access the dataset used.
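To get a feel for which files a pattern like this catches, here is a rough Python sketch. It only approximates Git’s behaviour for simple patterns without slashes (Git’s real matching rules cover more cases, such as negation and directory patterns), and both the `is_ignored` helper and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable


def is_ignored(path: str, patterns: Iterable[str]) -> bool:
    """Approximate check: does the file's name match any simple pattern?"""
    # A pattern without a slash, such as *.jpg, is compared against the
    # file name alone, so it applies in every folder of the project.
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


patterns = ["*.jpg"]
# Hypothetical project files, for illustration only.
files = ["images/cat_001.jpg", "notebooks/train.ipynb", "README.md", "cover.jpg"]
for file in files:
    print(file, "->", "ignored" if is_ignored(file, patterns) else "tracked")
```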

Once you click Create Repository, your project folder will be initialised as a Git repository. You will now see this screen:

This screen shows any changes made to your local copy of the repository that are not yet on GitHub. To commit the changes, choose a name for the commit. It should be short and informative, so that you know what the change was. For example, “Updating data pre-processing code” is sufficient. The description is optional. After clicking Commit to main, you must click Publish repository (or Push origin, if the repository already exists on GitHub) in the top bar so that the changes are reflected on GitHub. If you’ve successfully created and pushed your repository, your GitHub homepage should look something like this:

On the left-hand side, you will be able to see all your repositories.
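If you are curious what the commit step looks like on the command line, the sketch below runs roughly the equivalent Git commands from Python, inside a throwaway temporary folder. The file name, user name and email are placeholders I made up, and the push step is left out because it needs a real GitHub repository to send the commit to.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_git(args, cwd):
    """Run one git command inside `cwd` and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def demo_commit():
    """Create a throwaway repository, commit one file and return the log."""
    if shutil.which("git") is None:
        return None  # Git is not installed, so there is nothing to demonstrate
    with tempfile.TemporaryDirectory() as folder:
        run_git(["init"], folder)
        # Placeholder file standing in for real project code.
        Path(folder, "preprocess.py").write_text("# data pre-processing code\n")
        # Stage the file, then record it in a commit with a short message.
        run_git(["add", "preprocess.py"], folder)
        run_git(
            [
                "-c", "user.name=Example User",        # placeholder identity
                "-c", "user.email=user@example.com",   # placeholder identity
                "-c", "commit.gpgsign=false",
                "commit", "-m", "Updating data pre-processing code",
            ],
            folder,
        )
        # In a real project, `git push` would now send the commit to GitHub.
        return run_git(["log", "--oneline"], folder)


print(demo_commit())
```

The printed log should contain a single line with the commit message, which is the same name you would type into GitHub Desktop.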

If there is an existing repository that you want to work on, you’ll have to clone it: open the repository on GitHub, copy its link, then paste the link into GitHub Desktop by clicking File -> Clone Repository… and following the on-screen prompts.

A ReadMe is meant to be an informative summary of your project, written in the Markdown format. I will not go into Markdown in detail, but this is a great tutorial to get you up to speed in no time and here is a template to follow once you know the basics. The ReadMe should contain a background of the project, highlight the methodology used and also provide any links to datasets, blog posts etc. that are relevant to the project. You can choose how much or how little detail to add, but remember that potential future employers are likely to read these, so it’s worth putting some care into it. Here is an example of one of mine that needs a bit more work:

And that’s it. You are now familiar enough with Git to start building an online record of all the amazing Data Science projects you will be working on over the coming months and years! For more information, feel free to follow this tutorial.
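As a starting point for the ReadMe described above, here is a minimal sketch that generates a Markdown skeleton. The section titles are my own assumption, drawn from the list of things a ReadMe should contain, and `readme_skeleton` is a hypothetical helper rather than an official template.

```python
def readme_skeleton(name: str, description: str) -> str:
    """Return Markdown text with placeholder sections for a project ReadMe."""
    # Assumed section titles: background, methodology, and relevant links.
    sections = ["Background", "Methodology", "Data and links"]
    lines = [f"# {name}", "", description, ""]
    for title in sections:
        lines += [f"## {title}", "", "_To be written._", ""]
    return "\n".join(lines)


text = readme_skeleton(
    "xente-fraud-predictor",
    "This is a repository to build a Fraud Predictor for Xente transactions.",
)
print(text)
```

You could save the printed text as README.md in your project folder and fill in each section as the project develops.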

Anaconda

Anaconda is an open-source platform that facilitates the management and deployment of data science environments. It includes a distribution of the Python and R programming languages, along with a package manager and a collection of pre-installed libraries for data science and machine learning. Anaconda simplifies the process of creating isolated environments, managing dependencies, and deploying projects, which is why it is widely used in the data science and scientific computing communities.

Overall, Anaconda provides a beginner-friendly, all-in-one solution for setting up a robust data science environment, making it an excellent choice for those who are just starting in the field.
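Once Anaconda is installed, a quick way to see which Python you are actually running is a small check like the one below. It is only a sketch: it relies on the `CONDA_DEFAULT_ENV` environment variable, which conda sets when an environment is active, and `describe_environment` is a name I made up.

```python
import os
import platform
import sys


def describe_environment() -> dict:
    """Collect a few facts about the Python interpreter that is running."""
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        # Set by conda when an environment is active; None otherwise.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the executable path does not point inside your Anaconda installation, you are probably running a different Python than the one Anaconda installed.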

Jupyter Notebooks are a user-friendly and interactive computing environment perfect for beginners. Imagine a document that combines live code, equations, visualisations, and narrative text, all in one place. Jupyter Notebooks make learning to code more accessible by allowing you to run code in small, manageable chunks, seeing results immediately. They’re widely used in data science, research, and education, providing a powerful yet intuitive platform where beginners can experiment with code, visualize data, and document their analyses seamlessly. In the next section, I will show you what a Jupyter Notebook looks like.

Python Libraries

Python is a programming language, and to be able to do cool things like generate data visualisations, train machine learning models, or manipulate data, we need libraries. A library is a collection of pre-written code that provides useful functions and tools, allowing beginners to leverage existing code to perform tasks without having to write everything from scratch. Some of the most common libraries used in Data Science are:

  • Pandas (data manipulation)
  • NumPy (numerical computing with arrays)
  • Matplotlib (data visualisation)
  • scikit-learn (machine learning)
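To confirm that the libraries listed above are available in your current environment, you could run a small check like this one in a notebook cell. It is a sketch that only tests whether each library can be found, without importing it; the `check_libraries` helper is my own naming, while the import names (for example `sklearn` for scikit-learn) are the ones the libraries actually use.

```python
from importlib.util import find_spec

# Display name -> the name used in an import statement.
LIBRARIES = {
    "Pandas": "pandas",
    "NumPy": "numpy",
    "Matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
}


def check_libraries(libraries=LIBRARIES) -> dict:
    """Report, for each library, whether Python can find it."""
    return {
        label: find_spec(module) is not None
        for label, module in libraries.items()
    }


for label, installed in check_libraries().items():
    print(f"{label}: {'installed' if installed else 'missing'}")
```

Anything reported as missing can be installed from Anaconda before you move on.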

This is what a Jupyter notebook looks like in your browser:

Here is a visualisation showing the number of fraudulent transactions that occurred on an e-commerce store, broken down by Provider and Product:

For an in-depth tutorial on a Data Science project, please see Part 3 of this series. I’ve also made the code available on GitHub.

Thanks for reading and keep up with future parts of this series as we delve deeper into the fascinating realm of Data Science!

As always, remember to follow for more interesting Data Science related content!

Byte Brilliance

Data Science information, tutorials, and advice from an industry expert with multiple years of experience.