Cookie Cutter: Organizing Data Science Projects

Taiwo Owoseni · Published in devcareers · Sep 12, 2019

A well-structured data science project is just like a neatly cut-out cookie: easy to recognize at a glance. A data science project consists of a lot of artifacts: raw data (e.g. data from websites), processed data, Python files, notebooks, images, reports, requirements files, Excel sheets, etc. To avoid a confusing pile of these files littered across your computer, it's best to sort them and group them into related subfolders of the project's directory.

Cookiecutter is a command-line utility that creates projects from project templates. For data science work, it can generate a project directory with the subfolders needed to structure the project properly. Enough of the grammar! The best way to understand what Cookiecutter really does is to use it. Let's dive in.

Setting Up Cookiecutter

To use Cookiecutter, install it via pip. pip is the standard package manager for Python; it allows you to install and manage packages that are not part of the Python standard library. Launch your command prompt and type the command in the snippet below to install Cookiecutter:

pip install cookiecutter
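
Once the install finishes, you can confirm that the cookiecutter command is available on your path (the version number you see will depend on when you install it):

cookiecutter --version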

To create a new project, pass the link to the data science template, https://github.com/drivendata/cookiecutter-data-science, to the cookiecutter command:

# after installing, make sure you are in the directory where you want to create the project
cookiecutter https://github.com/drivendata/cookiecutter-data-science

Cookiecutter will prompt you for details about the project. Enter all of them; you can skip aws_profile, repo_name and s3_bucket by pressing Enter if you don't have them. If this is your first time using Cookiecutter, you should not see the prompt "Is it okay to delete and re-download it?"; that one only appears when the template has already been downloaded before.
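
For reference, here is a rough sketch of what the prompt session can look like. The exact questions and defaults depend on the version of the template, so treat this as illustrative rather than exact:

project_name [project_name]: Titanic-Disaster
repo_name [titanic-disaster]:
author_name [Your name (or your organization/company/team)]: Taiwo Owoseni
description [A short description of the project.]: Hey! This is a nice intro tutorial to data science
open_source_license [1 - MIT, 2 - BSD-3-Clause, 3 - No license file]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
python_interpreter [python3]: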

Your new project directory (here named Titanic-Disaster) should now have a well-defined structure.


Open the README.md file. It contains a short guide to the organization of the project:

Titanic-Disaster
==============================

Hey! This is a nice intro tutorial to data science

Project Organization
------------

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
├── models             <- Trained and serialized models, model predictions, or model summaries
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

--------

Project based on the cookiecutter data science project template (https://drivendata.github.io/cookiecutter-data-science/). #cookiecutterdatascience

The Data Folder

It contains four subfolders: external, interim, processed and raw.

The external folder holds data files you got from external sources (third parties). This includes data scraped from a website. It is good practice to save your downloaded and scraped data to this folder.

The interim folder stores data that has been partially transformed but is not yet ready for modeling. For instance, a dataset that has been cleaned but still contains unencoded categorical fields like a person's name, user ID, home address, etc.

The processed folder holds data that will not go through any further processing because it has already been thoroughly transformed; this is the final, canonical data used for modeling.

The raw folder contains the original, immutable data that is yet to be processed. Save any data you consider raw into this folder.
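
As a rough sketch of how data might flow through these subfolders, here is one possible raw-to-interim-to-processed pipeline. The file and column names are made up for illustration; they are not part of the template:

import pandas as pd

# Load the original, immutable dump from data/raw
raw = pd.read_csv("data/raw/titanic.csv")

# Lightly cleaned, but categoricals are still unencoded: save to data/interim
interim = raw.drop_duplicates().dropna(subset=["Age"])
interim.to_csv("data/interim/titanic_cleaned.csv", index=False)

# Fully transformed and ready for modeling: save to data/processed
processed = pd.get_dummies(interim, columns=["Sex", "Embarked"])
processed.to_csv("data/processed/titanic_features.csv", index=False)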

The Models Folder

It contains the trained and serialized models built for the project, along with model predictions or summaries. An example of such a model is a Logistic Regression model. Save your models in this folder.
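
A minimal sketch of saving a model into this folder, assuming scikit-learn and joblib are installed and the processed Titanic file from the earlier sketch exists (the file name and the "Survived" target column are assumptions):

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("data/processed/titanic_features.csv")
X, y = df.drop(columns=["Survived"]), df["Survived"]

# Train a simple model and serialize it into the models folder
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "models/logistic_regression.joblib")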

The Notebooks Folder

It contains your Jupyter notebook files. It is good practice to save your notebooks to this folder, following the naming convention from the README: a number (for ordering), your initials, and a short `-` delimited description, e.g. `1.0-jqp-initial-data-exploration`.

The Docs Folder

It holds documents pertaining to the data science project, by default a Sphinx project skeleton. It aids proper documentation of the code.

The Reports Folder

It houses the figures folder, which will contain visualizations of the analysis of the dataset (data frame). These could be histograms, frequency distribution graphs, pie charts, scatter plots and the like.
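
For example, a notebook or script might save its figures straight into reports/figures, roughly like this (matplotlib is assumed to be installed; the column and file names are illustrative):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data/processed/titanic_features.csv")

fig, ax = plt.subplots()
df["Age"].plot.hist(ax=ax, bins=20)
ax.set_title("Passenger age distribution")
fig.savefig("reports/figures/age_histogram.png", dpi=150)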

The Src Folder

This is where the source code for the project lives. You can save .py files in this folder, for instance a Python module that scrapes data from a web page, or a Python module that drops rows with missing values.
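
As an illustration (the function names here are made up, not part of the template), a small module saved as src/data/make_dataset.py might look like this:

import pandas as pd

def load_raw(path: str = "data/raw/titanic.csv") -> pd.DataFrame:
    """Read the raw data dump into a DataFrame."""
    return pd.read_csv(path)

def drop_missing_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows that have no missing values."""
    return df.dropna()

Because the template ships a setup.py, running pip install -e . from the project root should let you import such modules in your notebooks, e.g. from src.data.make_dataset import load_raw.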

I hope you find these explanations useful in your next data science project!

Farewell!
