Checklist to Set Up a Data Science Project Repository

Published in

Semantix

4 min read · Dec 28, 2021

How do I set up a data science project? Where do I start? How do I structure my project? These are just a few of the questions we frequently have when starting a project.

In this post, you will find:

An introduction to data science projects,
What a Repository and Git are,
Good practices, and
A checklist to set up a data science project repository.

We often work on projects of all sizes that involve the collaboration of several developers, across different project types and roles:

Project types: Data Science, Machine Learning, Web Apps (back end, front end)
Roles: Data scientist, Data engineer, Software engineer, ML engineer, DevOps

Several activities also run at the same time:

Implementing new features,
Fixing bugs,
Performing updates, and
Keeping the code in production in complete working order.

So, to ensure the project’s success, it is vitally important to:

Have control over planning,
Coordinate tasks with the team, and
Be able to deliver.

Repository

A repository, in version control, is a data structure that stores metadata for a set of files or a directory structure.

There are two types of version control systems:

Distributed, such as Git or Mercurial, or
Centralized, such as Subversion, CVS, or Perforce.

Git

Git is a distributed version control system created by Linus Torvalds in 2005. Git is practical, simple, fast, efficient, and open source.

Git is command-line software that tracks changes made to a project over time. It:

Saves changes made to a project,
Stores these changes, and
Allows the developer to reference them as needed.
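
To make the save, store, and reference cycle concrete, here is a toy sketch in Python. It is not how Git is implemented; the class name `ToyRepository` and its methods are hypothetical and exist only for illustration.

```python
import hashlib


class ToyRepository:
    """Toy model of version tracking; NOT how Git is implemented."""

    def __init__(self):
        self.snapshots = {}  # snapshot id -> (message, files)
        self.history = []    # snapshot ids, oldest first

    def commit(self, files, message):
        """Save a snapshot of `files` (a dict of name -> text) and return its id."""
        payload = repr((message, sorted(files.items())))
        snapshot_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:8]
        self.snapshots[snapshot_id] = (message, dict(files))
        self.history.append(snapshot_id)
        return snapshot_id

    def checkout(self, snapshot_id):
        """Return a copy of the files as they were in a stored snapshot."""
        return dict(self.snapshots[snapshot_id][1])


repo = ToyRepository()
first = repo.commit({"train.py": "print('v1')"}, "Add training script")
second = repo.commit({"train.py": "print('v2')"}, "Tune training script")

# The first version is still available after the second change.
print(repo.checkout(first)["train.py"])
```

Each call to `commit` stores a full copy of the files under an identifier, so any earlier state can be retrieved later by that identifier.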

GitHub - mafda/git_101: Brief introduction to Git. List of some of the basic version control…

Brief introduction to Git. List of some of the basic version control commands. - GitHub - mafda/git_101: Brief…

github.com

Platforms

GitLab and GitHub are web-based hosting platforms for software development and version control using Git.

They help manage code and share changes to local files with a remote repository,
GitHub was launched in 2008, and
GitLab began as an open-source project in 2011 and was incorporated as a company in 2014.

Some topics to set up a data science project repository

1. Main README

Short and memorable repository names
Description and essential information
Technologies, libraries, and frameworks used
How to install and use content
Current status of the project or results
.gitignore, CONTRIBUTING.md, LICENSE
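
The file-related part of this checklist can be verified automatically. The sketch below is one possible way to do it; the function name `missing_files` is hypothetical, and `README.md` is an assumed file name for the main README.

```python
from pathlib import Path

# Files named in the checklist above; "README.md" is an assumed file name.
EXPECTED_FILES = ["README.md", ".gitignore", "CONTRIBUTING.md", "LICENSE"]


def missing_files(repo_root, expected=EXPECTED_FILES):
    """Return the expected files that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

Run from the repository root, the script lists whichever of the expected files are still absent.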

2. Project Management

Teamwork methodologies like Agile and workflows like Gitflow for repositories are practical tools in scenarios where we need to interact with more contributors.

Issue tracking system (Jira, Trello, Planner)
Task lifecycle (backlog, to do, doing, review, discarded, done)
Definition of done
Code review process
Commit message style and tracking tags
Branching management: Single branch or Multi-branch
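
A commit message style is easier to keep when it can be checked by a script. The sketch below assumes a hypothetical convention, `<type>: <summary> (#<issue>)`, where the issue number acts as the tracking tag; the allowed types and the 72-character limit are also assumptions, so adapt them to whatever your team agrees on.

```python
import re

# Hypothetical convention: "<type>: <summary> (#<issue number>)"
ALLOWED_TYPES = ("feat", "fix", "docs", "refactor", "test")
PATTERN = re.compile(r"^(?P<type>[a-z]+): (?P<summary>.+) \(#(?P<issue>\d+)\)$")


def check_commit_message(message):
    """Return a list of problems found in the first line of a commit message."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    match = PATTERN.match(first_line)
    if match is None:
        return ["expected '<type>: <summary> (#<issue>)'"]
    problems = []
    if match["type"] not in ALLOWED_TYPES:
        problems.append(f"unknown type '{match['type']}'")
    if len(first_line) > 72:
        problems.append("first line longer than 72 characters")
    return problems


print(check_commit_message("fix: handle empty input file (#42)"))
print(check_commit_message("updated stuff"))
```

An empty list means the message follows the convention; a check like this could be wired into a Git hook or a continuous integration job.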

How to Have More Productive Projects Using Agile and Git

We present a series of 6 steps to integrate some useful tools like Kanban and Git and have more productive projects.

medium.com

3. Project Folder Structure

Organization and directory structure (depending on the project type: data science, back-end, front-end)

Tool suggestion: cookiecutter
Another tool or standard: CRISP-DM
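
As a minimal illustration of a fixed directory structure, the sketch below creates a small folder skeleton with the standard library. The folder names are a hypothetical layout chosen for this example, not the output of cookiecutter or a structure prescribed by CRISP-DM.

```python
from pathlib import Path

# Hypothetical layout for illustration; adjust it to the project type.
FOLDERS = ["data/raw", "data/processed", "notebooks", "src", "models", "reports"]


def create_skeleton(project_root, folders=FOLDERS):
    """Create the folder layout under `project_root` and return the created paths."""
    root = Path(project_root)
    created = []
    for folder in folders:
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (path / ".gitkeep").touch()
        created.append(path)
    return created
```

Calling `create_skeleton("my_project")` (a hypothetical project name) creates the folders if they do not exist and leaves existing ones untouched.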

Organizando um projeto em Ciência de Dados (Organizing a Data Science Project)

How the CRISP-DM method can help you structure the development of your Data Science project.

medium.com

4. Environment

How do we manage virtual environments, packages, and dependencies in Python? When we start a data science project, we need to understand the importance of maintaining and managing the environment for each project.

Description and how to use
Reproducibility
Virtual environments (Local development)
Dockerization (Deployment)
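
For reproducibility, it helps to record exactly which package versions are installed in the current environment. The sketch below does this with the standard library only; the function name `snapshot_environment` is hypothetical. Dedicated tools such as Conda or Poetry have their own export and lock mechanisms, so treat this as a lightweight illustration rather than a replacement for them.

```python
from importlib import metadata


def snapshot_environment():
    """Return sorted 'name==version' lines for the installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for line in snapshot_environment():
        print(line)
```

Saving this output alongside the code gives collaborators a record of the environment in which results were produced.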

Getting Started with Conda or Poetry for Data Science Projects

How to manage virtual environments, packages, and dependencies in Python and start your data science project with Conda…

medium.com

5. Code Standardization

Style definition
Frameworks or tools to be used and their configuration
Adaptation to specific project context
Continuous integration pipelines
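
A continuous integration job for code style often boils down to running a few tools in check-only mode and failing if any of them complains. The sketch below assumes the tools named in the article linked in this section (isort, Black, Flake8, Pylint) and a hypothetical `src` folder; tools that are not installed are skipped rather than treated as failures.

```python
import shutil
import subprocess

# Check-only invocations; the tool selection is an assumption for this sketch.
CHECKS = {
    "isort": ["isort", "--check-only"],
    "black": ["black", "--check"],
    "flake8": ["flake8"],
    "pylint": ["pylint"],
}


def run_checks(paths, checks=CHECKS):
    """Run each available tool on `paths`; return tool name -> exit code or None."""
    results = {}
    for name, command in checks.items():
        if shutil.which(command[0]) is None:
            results[name] = None  # tool not installed, so it is skipped
            continue
        completed = subprocess.run(command + list(paths), check=False)
        results[name] = completed.returncode
    return results


if __name__ == "__main__":
    # "src" is a hypothetical source folder.
    for tool, code in run_checks(["src"]).items():
        status = "skipped" if code is None else ("ok" if code == 0 else "failed")
        print(f"{tool}: {status}")
```

Each tool's own configuration still decides which rules apply; the script only standardizes how the tools are invoked.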

How to Make your Code Shine with GitLab CI Pipelines

Introduction to some tools for a cleaner Python code by applying isort, Black, Flake8, and Pylint automatically using…

medium.com

6. Documentation

Documentation and communication language definition

Variable names as level 0 of documentation
In-code documentation (numpydoc)
Multiple/modular READMEs
Wiki
Other (office doc, slides, etc.)
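
As an illustration of the first two items, here is a small function with descriptive variable names and a docstring in the NumPy docstring convention (numpydoc), assuming that is the in-code style the list refers to. The function `normalize` is a hypothetical example.

```python
def normalize(values, lower=0.0, upper=1.0):
    """Rescale a sequence of numbers to the range [lower, upper].

    Parameters
    ----------
    values : list of float
        Numbers to rescale. Must contain at least two distinct values.
    lower : float, optional
        Smallest value of the output range. Default is 0.0.
    upper : float, optional
        Largest value of the output range. Default is 1.0.

    Returns
    -------
    list of float
        The rescaled numbers, in the same order as `values`.

    Raises
    ------
    ValueError
        If `values` is empty or all of its entries are equal.
    """
    if not values or min(values) == max(values):
        raise ValueError("values must contain at least two distinct numbers")
    smallest, largest = min(values), max(values)
    scale = (upper - lower) / (largest - smallest)
    return [lower + (value - smallest) * scale for value in values]


print(normalize([10, 20, 30]))
```

Names such as `smallest`, `largest`, and `scale` explain the code before any comment does, and the structured docstring can be picked up by documentation generators.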

Documentation

In general source code is read more often then written. This is why source code should be coded in a clear and readable…

ibm.github.io

Some links of interest: Good Practices

GitHub - kvarak/git-best-practices: Do's and Dont's when using Git

https://xkcd.com/1597/ Start here Working locally Working remote Workflows Links NOTE: In the text below, I use the…

github.com

GitHub - ck3g/git-best-practices: A set of my best practices working with git

Many of us are using git in our day-to-day work. I've noticed that some of you (or may be many of you) are using only a…

github.com

GitHub - frankcarey/git-best-practices: A repo to store best practices when using git (and github)

I've collected some best practices while using git for the years and I'd like to share them in a way that other teams…

github.com

How to set up a Data Science Project

More often than not, those of us who are new to the Data Science field wish to implement newly acquired Data Science…

medium.com

made with 💙 by mafda.

Written by fernanda rodríguez