Checklist to Set Up a Data Science Project Repository

fernanda rodríguez
Published in Semantix
Dec 28, 2021

How do I set up a data science project? Where do I start? How do I structure my project? These are just a few of the questions we often ask when starting a project.

Photo by J. Kelly Brito on Unsplash

In this post, you will find:

  • An introduction to data science projects,
  • What a Repository and Git are,
  • Good practices, and
  • A checklist to set up a data science project repository.

We frequently work on projects of all sizes that involve several developers collaborating:

  • Project types: Data Science, Machine Learning, Web Apps (back end, front end)
  • Roles: Data scientist, Data engineer, Software engineer, ML engineer, DevOps

At the same time, several activities are under way:

  • Implementing new features,
  • Fixing bugs,
  • Performing updates, and
  • Keeping the code in production in complete working order.

So, to ensure the project’s success, it is vitally important to:

  • Have the control needed to plan,
  • Coordinate tasks with the team, and
  • Be able to deliver.

Repository

In version control, a repository is a data structure that stores metadata for a set of files or a directory structure.

There are two types of repositories: centralized (for example, Subversion) and distributed (for example, Git).

Git

Git is a distributed version control system created by Linus Torvalds in 2005. Git is practical, simple, fast, efficient, and open source.

Git is command-line software that tracks changes made to a project over time. It:

  • Saves changes made to a project,
  • Stores these changes, and
  • Allows the developer to reference them as needed.
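
To make the save, store, and reference cycle concrete, here is a minimal sketch that drives the Git command line from Python. The function names (`snapshot_commands`, `run`, `demo`), the file `notes.txt`, and the author identity are made up for illustration; only the `git` subcommands themselves are real.

```python
from __future__ import annotations

import subprocess
import tempfile
from pathlib import Path


def snapshot_commands(message: str) -> list[list[str]]:
    """Return the Git commands that save and store one set of changes."""
    return [
        ["git", "add", "--all"],           # stage every change in the working tree
        ["git", "commit", "-m", message],  # store the staged changes as a commit
    ]


def run(cmd: list[str], cwd: Path) -> str:
    """Run one command inside the repository and return its standard output."""
    result = subprocess.run(cmd, cwd=cwd, check=True, capture_output=True, text=True)
    return result.stdout


def demo() -> list[str]:
    """Create a throwaway repository, record two changes, and list them."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        run(["git", "init"], repo)
        # Hypothetical local identity, so commits work without global Git config.
        run(["git", "config", "user.name", "Example Author"], repo)
        run(["git", "config", "user.email", "author@example.com"], repo)

        notes = repo / "notes.txt"
        changes = [("first draft\n", "Add notes"), ("second draft\n", "Revise notes")]
        for text, message in changes:
            notes.write_text(text)
            for cmd in snapshot_commands(message):
                run(cmd, repo)

        # Reference the stored changes: one subject line per commit.
        return run(["git", "log", "--format=%s"], repo).splitlines()
```

On a machine with Git installed, calling `demo()` should return the two commit messages, newest first. In day-to-day work you would type the same `git add`, `git commit`, and `git log` commands directly in a terminal.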

Platforms

GitLab and GitHub are web-based hosting platforms for software development and version control using Git.

  • Both help manage code and share changes to local files with a remote repository,
  • GitHub was launched in 2008, and
  • GitLab was launched in 2014.

Topics to cover when setting up a data science project repository

1. Main README

  • A short and memorable repository name
  • Description and essential information
  • Technologies, libraries, and frameworks used
  • How to install and use the content
  • Current status of the project or results
  • .gitignore, CONTRIBUTING.md, LICENSE
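
Part of this list can be checked automatically. The sketch below reports which of the top-level files are missing from a repository. The file name `README.md` is an assumption (the checklist only says "Main README"), and `EXPECTED_FILES` and `missing_files` are hypothetical names.

```python
from __future__ import annotations

from pathlib import Path

# Files named in the checklist above. "README.md" is an assumed file name;
# extend or rename the entries to match your own conventions.
EXPECTED_FILES = ("README.md", ".gitignore", "CONTRIBUTING.md", "LICENSE")


def missing_files(repo_root: str) -> list[str]:
    """Return the expected top-level files that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

A script like this only confirms that the files exist. Whether the README actually describes the project, its technologies, and its status still needs a human reader.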

2. Project Management

Teamwork methodologies like Agile and repository workflows like Gitflow are practical tools when several contributors need to work together.

  • Issue tracking system (Jira, Trello, Planner)
  • Task lifecycle (backlog, to do, doing, review, discarded, done)
  • Definition of done
  • Code review process
  • Commit message style and tracking tags
  • Branch management: single-branch or multi-branch
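
A commit message style is easier to follow when a script can check it. The sketch below assumes a made-up convention, `<type>: <summary> [<TRACKER-ID>]`, where the tracking tag points to a task in the issue tracker. The allowed types, the tag format, and the 60-character summary limit are illustrative choices, not a standard this post prescribes.

```python
import re

# Hypothetical convention: "<type>: <summary> [<TRACKER-ID>]",
# for example "fix: handle empty input [DS-42]".
ALLOWED_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")
PATTERN = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r"): "
    r"(?P<summary>\S.{0,59}?) "          # summary of at most 60 characters
    r"\[(?P<ticket>[A-Z]+-\d+)\]$"       # tracking tag such as [DS-42]
)


def parse_commit_subject(subject: str):
    """Return the parts of a valid subject line, or None if it breaks the convention."""
    match = PATTERN.match(subject)
    return match.groupdict() if match else None
```

A function like this could run in a Git hook or in a continuous integration job, so that messages without a type or a tracking tag are caught before review.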

3. Project Folder Structure

Organization and directory structure (depending on the project type: data science, back-end, front-end)
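
As an illustration, the sketch below creates one possible skeleton for a data science project. The directory names in `LAYOUT` are assumptions chosen for the example, not a layout this post prescribes, and the `.gitkeep` placeholder is a common convention rather than a Git feature.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical layout for a data science project; adapt it to the project type.
LAYOUT = (
    "data/raw",        # original, immutable input data
    "data/processed",  # cleaned data ready for modeling
    "notebooks",       # exploratory analysis
    "src",             # reusable project code
    "tests",           # automated tests
    "docs",            # project documentation
)


def scaffold(project_root: str) -> list[Path]:
    """Create the directories in LAYOUT under project_root and return their paths."""
    root = Path(project_root)
    created = []
    for relative in LAYOUT:
        directory = root / relative
        directory.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (directory / ".gitkeep").touch()
        created.append(directory)
    return created
```

Calling `scaffold("my-project")` is safe to repeat, because existing directories are left untouched.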

4. Environment

How do we manage virtual environments, packages, and dependencies in Python? When we start a data science project, we need to understand the importance of maintaining and managing the environment in each project.

  • Description and usage instructions
  • Reproducibility
  • Virtual environments (Local development)
  • Dockerization (Deployment)
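
One small step toward reproducibility is recording the exact package versions installed in the project's virtual environment. The sketch below uses only the standard library and is similar in spirit to `pip freeze`. The function name `pinned_requirements` is hypothetical.

```python
from __future__ import annotations

from importlib import metadata


def pinned_requirements() -> list[str]:
    """List installed distributions as 'name==version' lines, sorted by name."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    # Run this with the interpreter of the project's virtual environment.
    print("\n".join(pinned_requirements()))
```

Saving this output to a file in the repository lets a teammate, or a Docker image build, install the same versions later.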

5. Code Standardization

  • Style definition
  • Frameworks or tools to be used and their configuration
  • Adaptation to specific project context
  • Continuous integration pipelines
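
To show what an automated style rule can look like, here is a minimal sketch of a single check: function names must be snake_case. The rule and the name `badly_named_functions` are chosen for illustration only. In practice a team would more likely configure an established linter or formatter than write its own checks.

```python
from __future__ import annotations

import ast
import re

# Lowercase letters, digits, and underscores, with optional leading underscores.
SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")


def badly_named_functions(source: str) -> list[str]:
    """Return the names of functions in source that are not snake_case."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and not SNAKE_CASE.match(node.name)
    ]
```

A continuous integration pipeline could run a check like this on every changed file and fail the build when the returned list is not empty.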

6. Documentation

Define the language used for documentation and communication.

  • Variable names as level 0 of documentation
  • In-code documentation (numpydoc-style docstrings)
  • Multiple or modular READMEs
  • Wiki
  • Other (office doc, slides, etc.)
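
As an example of in-code documentation, here is a small function with a docstring in the numpydoc style, assuming that is the convention the list above refers to. The function `min_max_scale` is a made-up example. Note how descriptive variable names already carry part of the documentation.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale a sequence of numbers to the range [lower, upper].

    Parameters
    ----------
    values : sequence of float
        Numbers to rescale. Must contain at least two distinct values.
    lower : float, optional
        Smallest value of the output range. Default is 0.0.
    upper : float, optional
        Largest value of the output range. Default is 1.0.

    Returns
    -------
    list of float
        The rescaled values, in the same order as the input.

    Raises
    ------
    ValueError
        If all input values are equal, so the scale is undefined.
    """
    smallest, largest = min(values), max(values)
    if smallest == largest:
        raise ValueError("values must contain at least two distinct numbers")
    span = largest - smallest
    return [lower + (value - smallest) * (upper - lower) / span for value in values]
```

Documentation generators can read the sectioned docstring format, and `help(min_max_scale)` shows it in an interactive session.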

Some links of interest: good practices

made with 💙 by mafda.


hi, i’m maría fernanda rodríguez r. multimedia engineer. data scientist. front-end dev. phd candidate: augmented reality + machine learning.