Checklist to Set Up a Data Science Project Repository
How do I set up a data science project? Where do I start? How do I structure my project? These are just a few of the questions we often ask when starting a project.
In this post, you will find:
- An introduction to data science projects,
- What a Repository and Git are,
- Good practices, and
- A checklist to set up a data science project repository.
We often work on projects of any size that involve the collaboration of several developers:
- Project types: data science, machine learning, web apps (back end, front end)
- Roles: data scientist, data engineer, software engineer, ML engineer, DevOps
At the same time, several activities run in parallel:
- Implementing new features,
- Fixing bugs,
- Performing updates, and
- Keeping the code in production in complete working order.
So, to ensure the project’s success, it is vitally important to:
- Have the control to plan,
- Coordinate tasks with the team, and
- Be able to deliver.
Repository
A repository is the data structure that a version control system uses to store metadata for a set of files or a directory structure.
There are two types of version control systems:
- Distributed such as Git or Mercurial, or
- Centralized such as Subversion, CVS, or Perforce.
Git
Git is a distributed version control system created by Linus Torvalds in 2005. Git is practical, simple, fast, efficient, and open source.
Git is command-line software that tracks changes made to a project over time. It:
- Saves changes made to a project,
- Stores these changes, and
- Allows the developer to reference them as needed.
Platforms
GitLab and GitHub are web-based hosting platforms for software development and version control using Git.
- Both help manage code and share changes to local files with a remote repository,
- GitHub was launched in 2008, and
- GitLab started as an open-source project in 2011 and became a company in 2014.
Some topics to set up a data science project repository
1. Main README
- Short and memorable repository names
- Description and essential information
- Technologies, libraries, and frameworks used
- How to install and use content
- Current status of the project or results
- Other root files: .gitignore, CONTRIBUTING.md, LICENSE
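As a rough illustration, the root files listed above can be created with a small script. This is only a sketch: the function name `create_root_files`, the `ROOT_FILES` templates, and the `.gitignore` entries are assumptions made for this example, not a standard.

```python
from pathlib import Path
import tempfile

# Hypothetical starter content; adapt the sections and entries to your project.
ROOT_FILES = {
    "README.md": (
        "# {name}\n\n{description}\n\n"
        "## Technologies\n\n## Installation and usage\n\n## Status\n"
    ),
    ".gitignore": "__pycache__/\n.ipynb_checkpoints/\n.venv/\ndata/raw/\n",
    "CONTRIBUTING.md": "# Contributing\n\nDescribe how to open issues and pull requests.\n",
    "LICENSE": "Choose a license and paste its text here.\n",
}


def create_root_files(repo_dir, name, description):
    """Write the starter files into repo_dir without overwriting existing ones."""
    repo_dir = Path(repo_dir)
    repo_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for filename, template in ROOT_FILES.items():
        path = repo_dir / filename
        if path.exists():
            continue  # never overwrite a file the team already wrote
        content = template.format(name=name, description=description)
        path.write_text(content, encoding="utf-8")
        created.append(filename)
    return created


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        print(create_root_files(tmp, "churn-model", "Predict customer churn."))
```

Skipping files that already exist makes the script safe to run again on a repository that is partly set up.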
2. Project Management
Teamwork methodologies like Agile and workflows like Gitflow for repositories are practical tools in scenarios where we need to interact with more contributors.
- Issue tracking system (Jira, Trello, Planner)
- Task lifecycle (backlog, to do, doing, review, discarded, done)
- Definition of done
- Code review process
- Commit message style and tracking tags
- Branching management: Single branch or Multi-branch
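A commit message style with tracking tags is easier to keep when it can be checked automatically. The sketch below assumes a hypothetical team convention of the form `<type>(<scope>): <summary> [#<issue>]`. The allowed types, the 72-character limit, and the tag format are all assumptions for illustration, so replace them with whatever your team agrees on.

```python
import re

# Assumed convention: "<type>(<optional scope>): <summary> [#<issue>]"
ALLOWED_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")
COMMIT_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(\((?P<scope>[a-z0-9_-]+)\))?"
    r": (?P<summary>\S.*?)"
    r"( \[#(?P<issue>\d+)\])?$"
)
MAX_SUBJECT_LENGTH = 72  # assumed limit for the subject line


def check_commit_subject(subject):
    """Return a list of problems found in a commit subject line (empty if none)."""
    problems = []
    if len(subject) > MAX_SUBJECT_LENGTH:
        problems.append(f"subject longer than {MAX_SUBJECT_LENGTH} characters")
    match = COMMIT_PATTERN.match(subject)
    if match is None:
        problems.append("subject does not follow '<type>(<scope>): <summary> [#<issue>]'")
    elif match.group("issue") is None:
        problems.append("missing tracking tag such as [#123]")
    return problems


if __name__ == "__main__":
    for subject in ("feat(data): add loader for raw CSV files [#42]", "Updated stuff"):
        print(subject, "->", check_commit_subject(subject))
```

A function like this could be called from a Git hook or a continuous integration job, so the convention is enforced before code review rather than during it.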
3. Project Folder Structure
Organization and directory structure (depending on the project type: data science, back-end, front-end)
- Tool suggestion: cookiecutter
- Another tool or standard: CRISP-DM
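To make the idea of a directory structure concrete, here is a minimal sketch that creates one with the standard library. The folder names in `PROJECT_DIRS` are an assumption, loosely inspired by common data science templates. They are not the output of cookiecutter itself, which generates a project from a full template.

```python
from pathlib import Path
import tempfile

# Assumed layout; rename or extend the folders to match your project type.
PROJECT_DIRS = [
    "data/raw",
    "data/interim",
    "data/processed",
    "notebooks",
    "models",
    "reports/figures",
    "src",
    "tests",
]


def create_structure(root, dirs=PROJECT_DIRS):
    """Create each directory under root and return all directories found there."""
    root = Path(root)
    for relative in dirs:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories; a .gitkeep file is a common
        # convention (not a Git feature) to keep them in the repository.
        (folder / ".gitkeep").touch()
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_structure(tmp):
            print(folder)
```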
4. Environment
How do we manage virtual environments, packages, and dependencies in Python? When we start a data science project, we need to understand the importance of maintaining and managing the environment of each project.
- Description and how to use
- Reproducibility
- Virtual environments (Local development)
- Dockerization (Deployment)
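One small step toward reproducibility is recording the exact package versions of the environment you are working in. The sketch below uses only the standard library. The helper names `pinned_requirements` and `in_virtual_environment` are made up for this example, and the output is only a rough approximation of what a tool such as `pip freeze` reports (editable or URL-based installs are not handled).

```python
import sys
from importlib import metadata


def in_virtual_environment():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def pinned_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        if meta is None:
            continue  # skip installs without readable metadata
        name = meta.get("Name")
        version = meta.get("Version")
        if name and version:
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    if not in_virtual_environment():
        print("Warning: not running inside a virtual environment")
    print("\n".join(pinned_requirements()))
```

Saving this output next to the code gives collaborators, or a Docker image build, a fixed list of versions to install from.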
5. Code Standardization
- Style definition
- Frameworks or tools to be used and their configuration
- Adaptation to specific project context
- Continuous integration pipelines
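In practice, a team would normally pick established linters and formatters and run them in the continuous integration pipeline. Purely to illustrate what "a style definition that is checked automatically" means, here is a toy checker. The three rules (an 88-character line limit, no trailing whitespace, snake_case function names) are assumptions chosen for the example.

```python
import ast
import re

SNAKE_CASE = re.compile(r"^_*[a-z][a-z0-9_]*$")
MAX_LINE_LENGTH = 88  # assumed team limit


def check_style(source):
    """Return sorted (line_number, message) pairs for each rule violation."""
    violations = []
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            violations.append((number, f"line longer than {MAX_LINE_LENGTH} characters"))
        if line != line.rstrip():
            violations.append((number, "trailing whitespace"))
    # Walk the syntax tree to inspect every function definition.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not SNAKE_CASE.match(node.name):
                violations.append((node.lineno, f"function '{node.name}' is not snake_case"))
    return sorted(violations)


if __name__ == "__main__":
    sample = "def LoadData():\n    return 1 \n"
    for line_number, message in check_style(sample):
        print(f"line {line_number}: {message}")
```

In a pipeline, a script like this would exit with a non-zero status when the list of violations is not empty, which makes the build fail.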
6. Documentation
Documentation and communication language definition
- Variable names as the zeroth level of documentation
- In-code documentation (numpydoc)
- Multiple/modular READMEs
- Wiki
- Other (office doc, slides, etc.)
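The first two points can be seen together in a single function: descriptive names carry the basic meaning, and a docstring in the numpydoc style adds the details. The function `train_test_split_indices` below is a made-up example written for this post, not part of any library.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Split sample indices into train and test sets.

    Parameters
    ----------
    n_samples : int
        Total number of samples in the dataset.
    test_fraction : float, default 0.2
        Share of samples assigned to the test set, strictly between 0 and 1.
    seed : int, default 0
        Seed for the random shuffle, so the split is reproducible.

    Returns
    -------
    train_indices : list of int
        Indices assigned to the training set.
    test_indices : list of int
        Indices assigned to the test set.

    Raises
    ------
    ValueError
        If test_fraction is not strictly between 0 and 1.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


if __name__ == "__main__":
    train_indices, test_indices = train_test_split_indices(10)
    print(len(train_indices), len(test_indices))
```

Names such as `n_samples` and `test_fraction` already tell the reader most of what they need, and the docstring covers the types, defaults, and error cases.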
Some links of interest: Good Practices
made with 💙 by mafda.