10 Best Practices for Data Science
Lessons from 100+ data science projects with organizations ranging from new startups to Fortune 50 companies
Do you find that data science and ML projects often lack structure and organization? Have you ever struggled to make sense of someone else’s spaghetti code or spent hours trying to reproduce results without clear documentation?
DrivenData Labs, the team behind the popular Cookiecutter Data Science template with over 7.8k stars on GitHub, knows these frustrations well.
After working on over 100 data science projects with a range of organizations, from new startups to large foundations and Fortune 50 companies, they’ve identified a major issue: the lack of standardization in project organization, collaboration, and reproducibility.
To tackle this, they released Cookiecutter V2, designed to embrace the latest data science tooling and MLOps changes.
Their paper, “10 Rules of Reliable Data Science,” lays out practical guidelines to keep projects on track. These rules are inspired by decades of hard-earned lessons from software engineering.
I spent an evening reading it, and I love this line from the introduction:
“the main bottlenecks in data science are no longer compute power or sophisticated algorithms, but craftsmanship, communication, and process”
The aim is not only to produce work that is accurate and correct, but work that can be understood, that others can collaborate on, and that can be improved and built upon in the future, even after the original contributors have left.
In this article, I distill the key takeaways and practical tips from the paper, along with the tools available today that can help you put them into practice.
Let’s dive right in.
Table of contents
navigate this article 👇
Rule 1: Start Organized, Stay Organized
Rule 2: Everything Comes from Somewhere, and the Raw Data is Immutable
Rule 3: Version Control is Basic Professionalism
Rule 4: Notebooks are for Exploration, Source Files are for Repetition
Rule 5: Tests and Sanity Checks Prevent Catastrophes
Rule 6: Fail Loudly, Fail Quickly
Rule 7: Project Runs are Fully Automated from Raw Data to Final Outputs
Rule 8: Important Parameters are Extracted and Centralized
Rule 9: Project Runs are Verbose by Default and Result in Tangible Artifacts
Rule 10: Start with the Simplest Possible End-to-End Pipeline
Lessons
Rule 1: Start Organized, Stay Organized
“Pipeline jungles often appear in data preparation. These can evolve organically, as new signals are identified and new information sources added. Without care, the resulting system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output. Managing these pipelines, detecting errors and recovering from failures are all difficult and costly. … All of this adds to the technical debt of a system and makes further innovation more costly.” — Sculley et al, “Machine Learning: The High Interest Credit Card of Technical Debt” (2014)
Starting a data science project with a clean, logical structure, and maintaining that organization over time, helps data scientists understand, extend, and reproduce the analysis.
Why the rule?
- Chaos Prevention: Without a clear structure, projects quickly become chaotic, with disorganized code and data that make results difficult to reproduce.
- Collaboration: A well-organized project makes it easier for others to understand and contribute, fostering better collaboration.
- Self-Documentation: Organized code is self-documenting, reducing the need for extensive documentation and making it easier to return to the project later.
How to achieve this rule:
- Use a Template: Start with a project template like Cookiecutter Data Science, which provides a sensible and self-documenting structure.
Below is the structure of the Cookiecutter Data Science template:
├── LICENSE <- Open-source license if one is chosen
├── Makefile <- Makefile with convenience commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- A default mkdocs project; see www.mkdocs.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml <- Project configuration file with package metadata for
│ {{ cookiecutter.module_name }} and configuration for tools like black
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.cfg <- Configuration file for flake8
│
└── {{ cookiecutter.module_name }} <- Source code for use in this project.
│
├── __init__.py <- Makes {{ cookiecutter.module_name }} a Python module
│
├── config.py <- Store useful variables and configuration
│
├── dataset.py <- Scripts to download or generate data
│
├── features.py <- Code to create features for modeling
│
├── modeling
│ ├── __init__.py
│ ├── predict.py <- Code to run model inference with trained models
│ └── train.py <- Code to train models
│
└── plots.py <- Code to create visualizations
With a well-defined, standardized project structure, other people, including your future self, will thank you.
Rule 2: Everything Comes from Somewhere, and the Raw Data is Immutable
“Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.” — Andy Hunt and Dave Thomas, The Pragmatic Programmer
This rule emphasizes the importance of ensuring that all data in a project is traceable back to its source. The raw data should remain unchanged, and any transformations or analyses should be reproducible from this original dataset.
Why the rule?
- Reproducibility: Ensures that every conclusion or result can be traced back through a clear, unbroken chain of transformations to the original raw data.
- Accountability: Helps in verifying the validity of the data and the results derived from it.
- Clarity: Reduces ambiguity by clarifying where each piece of data originated.
How to achieve this rule:
- Track Data Lineage: Use a directed acyclic graph (DAG) to track the dependencies and transformations applied to the data.
- Keep Raw Data Immutable: Store the raw data in a read-only format and never modify it. Any cleaning or transformation should create a new dataset.
- Document Data Acquisition: Document how data was obtained, including any preprocessing steps, in the README or another accessible file.
- Use Dependency Management Tools: Tools like Apache Airflow or Prefect can help manage and visualize the data pipeline, ensuring traceability.
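As a minimal sketch of what this looks like in practice, the snippet below reads a hypothetical `data/raw/survey.csv`, cleans a copy in memory, and writes the result to `data/interim/`, so the raw file is never modified:

```python
from pathlib import Path

import pandas as pd

RAW_DIR = Path("data/raw")          # the original, immutable data dump
INTERIM_DIR = Path("data/interim")  # derived data lives here, never in raw/


def clean_survey_data() -> Path:
    """Read the raw file, transform a copy, and write a *new* file."""
    df = pd.read_csv(RAW_DIR / "survey.csv")  # hypothetical raw file

    # All transformations happen on the in-memory copy only.
    df = df.dropna(subset=["respondent_id"]).drop_duplicates()

    out_path = INTERIM_DIR / "survey_clean.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)  # data/raw/survey.csv stays untouched
    return out_path
```

Downstream steps read the cleaned file, so the full chain from raw data to final results remains reproducible.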
Tools and Packages:
- Apache Airflow: For creating and managing data pipelines.
- Prefect: Another tool for data pipeline management focusing on simplicity and flexibility.
- DAGsHub: For versioning data and machine learning models along with code.
By ensuring that all data is traceable back to its source and keeping raw data immutable, data scientists can enhance the reproducibility and reliability of their analyses, making their work more trustworthy and easier to audit.
Rule 3: Version Control is Basic Professionalism
“If you don’t have source control, you’re going to stress out trying to get programmers to work together. Programmers have no way to know what other people did. Mistakes can’t be rolled back easily.” — Joel Spolsky, “The Joel Test: 12 Steps to Better Code”
This rule emphasizes the importance of using version control systems (VCS) like Git to manage changes in code and data. It ensures that all modifications are tracked, reversible, and reviewable.
Why the rule?
- Collaboration: Facilitates teamwork by allowing multiple people to work on the same project without conflicts.
- Accountability: Tracks changes and identifies who made which modifications, enhancing transparency.
- Reversibility: Makes it easy to revert to previous versions of the code or data if something goes wrong.
- Review and Quality Control: Enables code reviews and audits, helping maintain high-quality standards.
How to achieve this rule:
- Use Git for Code: Regularly commit code changes to a Git repository. Use branches to manage different features or stages of development.
- Avoid Storing Large Data in VCS: Store only small, rarely changing datasets in the VCS. For larger datasets, use tools like DVC (Data Version Control) or Git LFS (Large File Storage).
- Automate Versioning: Use scripts or tools to automatically version datasets and models, ensuring every change is tracked.
- Code Review Practices: Implement a robust code review process using pull requests. Ensure all changes are reviewed by at least one other team member.
- Document Changes: Maintain a changelog to document significant changes and updates in the project.
Tools and Packages:
- Git: The most widely used version control system for tracking code changes.
- Git LFS: For managing large files in Git.
- DVC: For versioning data, models, and pipelines alongside code.
- GitHub/GitLab/Bitbucket: Platforms that provide repositories, code review tools, and CI/CD integration.
Using version control is essential for any professional data science project. It enhances collaboration, accountability, and quality control, making managing and maintaining code and data easier over time.
Rule 4: Notebooks are for Exploration, Source Files are for Repetition
“The majority of the complaints I hear about notebooks I think come from a misunderstanding of what they’re supposed to be. … It’s decidedly not there for you to type all your code in like an editor and make a huge mess.” — Mali Akmanalp, @makmanalp
This rule highlights the different purposes of notebooks and source files in data science projects. Notebooks are great for exploratory analysis and visualization, while source files are better suited for reproducible and automated tasks.
Why the rule?
- Exploration: Notebooks offer an interactive environment ideal for experimentation and visualization.
- Reproducibility: Source files, when organized and managed properly, ensure that processes can be repeated reliably.
- Collaboration and Review: Source files are easier to manage in version control systems, facilitating code review and collaboration.
How to achieve this rule:
- Exploratory Analysis in Notebooks: Use Jupyter or R Notebooks for initial data exploration, visualization, and iterative analysis.
- Extract Common Functions: As you develop reusable functions and processes in notebooks, extract them into source files (e.g., Python scripts).
- Organize Source Code: Place these scripts in a well-organized directory structure, such as `/src` or `/scripts`.
- Version Control: Commit these source files to version control, enabling collaborative development and code reviews.
- Testing: Write tests for the functions in the source files to ensure they work as expected outside of the notebook environment.
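As a sketch of that hand-off, the function below is the kind of helper you might prototype in a notebook and then move into a source file (here a hypothetical `src/features.py`), so notebooks import it instead of re-defining it in cells:

```python
# src/features.py -- hypothetical module holding logic extracted from a notebook
import pandas as pd


def add_rolling_mean(df: pd.DataFrame, column: str, window: int = 7) -> pd.DataFrame:
    """Return a copy of df with a rolling-mean feature for `column`."""
    out = df.copy()
    out[f"{column}_rolling_{window}"] = out[column].rolling(window).mean()
    return out
```

In the notebook, a single `from src.features import add_rolling_mean` replaces the pasted cell, and the function can now be version-controlled and tested like any other code.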
Tools and Packages:
- Jupyter Notebooks: For interactive data analysis and visualization.
- VS Code or PyCharm: For developing and managing source files.
- nbconvert: Convert Jupyter notebooks to scripts.
- pytest: For testing Python code extracted from notebooks.
- Git: To manage version control for both notebooks and source files.
Notebooks are excellent for exploratory and iterative analysis, but key functions should be extracted into source files to ensure reproducibility and maintainability. This approach leverages the strengths of both environments and promotes cleaner, more organized workflows.
Rule 5: Tests and Sanity Checks Prevent Catastrophes
“Code without tests is bad code. It doesn’t matter how well written it is; it doesn’t matter how pretty or object-oriented or well-encapsulated it is. With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don’t know if our code is getting better or worse.” — Michael Feathers, Working Effectively with Legacy Code
This rule emphasizes the importance of writing tests and performing sanity checks on data science code to ensure correctness and reliability. Testing helps catch errors early and provides confidence that the code works as expected.
Why the rule?
- Error Prevention: Tests help catch errors before they become more significant issues.
- Confidence: Assures that the code performs correctly under various conditions.
- Maintenance: Makes it easier to modify and extend the codebase, knowing that tests will catch regressions.
- Reproducibility: Ensures that the results can be reproduced reliably over time.
How to achieve this rule:
- Write Unit Tests: Focus on writing tests for individual functions and components to verify their behavior in isolation.
- Use Sanity Checks: Implement sanity checks and smoke tests to validate data and basic functionality.
- Test with Sample Data: Create tests using small, representative datasets to verify that the code handles typical scenarios and edge cases.
- Automate Testing: Integrate tests into the development workflow using continuous integration tools to run tests automatically.
- Document Tests: Clearly document what each test is verifying to make it easier for others to understand and maintain.
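To make this concrete, here is a minimal pytest sketch that exercises the hypothetical `add_rolling_mean` helper from the previous rule against a tiny in-memory dataset:

```python
# tests/test_features.py -- minimal unit tests plus a sanity check
import pandas as pd
import pytest

from src.features import add_rolling_mean  # hypothetical module from Rule 4


def test_rolling_mean_adds_expected_column():
    df = pd.DataFrame({"sales": [1, 2, 3, 4, 5]})
    result = add_rolling_mean(df, "sales", window=2)
    assert "sales_rolling_2" in result.columns
    assert result["sales_rolling_2"].iloc[1] == pytest.approx(1.5)


def test_input_frame_is_not_mutated():
    df = pd.DataFrame({"sales": [1, 2, 3]})
    add_rolling_mean(df, "sales", window=2)
    # Sanity check: the caller's dataframe is left untouched.
    assert list(df.columns) == ["sales"]
```

Running `pytest` locally or in CI then catches regressions before they reach downstream results.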
Tools and Packages:
- pytest: A framework for writing and running tests in Python.
- unittest: A built-in Python module for testing.
- Hypothesis: For property-based testing in Python.
- tox: For automating testing across multiple environments.
- Continuous Integration (CI) Tools: Such as GitHub Actions, Travis CI, or Jenkins to automate the running of tests.
Testing and sanity checks are crucial for ensuring the correctness and reliability of data science code. They help catch errors early, provide confidence in the code’s behavior, and make the codebase easier to maintain and extend.
Rule 6: Fail Loudly, Fail Quickly
“This is a problem that occurs more for machine learning systems than for other kinds of systems. Suppose that a particular table that is being joined is no longer being updated. The machine learning system will adjust, and behavior will continue to be reasonably good, decaying gradually. Sometimes tables are found that were months out of date, and a simple refresh improved performance more than any other launch that quarter!” — Martin Zinkevich, “Rules of Machine Learning”
This rule emphasizes the importance of designing systems to fail visibly and promptly when encountering unexpected conditions. It advocates for defensive programming practices that make errors apparent and actionable.
Why the rule?
- Error Detection: Helps catch errors as soon as they occur, preventing them from propagating and causing bigger issues.
- Debugging: Makes it easier to identify and fix the root cause of problems.
- Reliability: Ensures that the system behaves predictably and fails in a controlled manner.
- Accountability: Provides clear error messages that help developers understand what went wrong and how to fix it.
How to achieve this rule:
- Validate Assumptions: Implement checks to ensure that inputs and intermediate results meet expected conditions.
- Use Assertions: Add assertions to enforce assumptions about data and code behavior.
- Log Errors: Implement comprehensive logging to capture detailed information about errors and their context.
- Raise Exceptions: Use exceptions to handle unexpected conditions and ensure they are handled appropriately.
- Fail Fast: Design the system to detect and respond to errors immediately, halting further execution if necessary.
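A minimal sketch of this defensive style, assuming a hypothetical input schema with `respondent_id` and `score` columns, might look like this:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

REQUIRED_COLUMNS = {"respondent_id", "score"}  # hypothetical expected schema


def validate_inputs(df: pd.DataFrame) -> pd.DataFrame:
    """Check assumptions up front and stop the run loudly if they fail."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        logger.error("Input is missing required columns: %s", sorted(missing))
        raise ValueError(f"Missing required columns: {sorted(missing)}")

    if df.empty:
        raise ValueError("Input dataframe is empty; refusing to continue.")

    # Assert an assumption about the data instead of silently carrying on.
    assert df["score"].between(0, 100).all(), "scores outside expected 0-100 range"
    return df
```

The specific checks matter less than the behavior: the pipeline halts immediately with a clear message rather than decaying quietly.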
Tools and Packages:
- Logging Libraries: Such as Python’s built-in `logging` module to capture detailed error information.
- assert: The `assert` statement in Python to enforce conditions.
- Error Handling Libraries: Packages like `bulwark` for Python to enforce data validation and assumptions.
- Testing Libraries: Tools like `pytest` to write tests that ensure the system fails correctly under invalid conditions.
Designing systems to fail loudly and quickly helps catch and address errors promptly, improving reliability and maintainability. By enforcing assumptions and providing clear error messages, developers can ensure that issues are detected and resolved efficiently.
Rule 7: Project Runs are Fully Automated from Raw Data to Final Outputs
“People can lull themselves into skipping steps even when they remember them. In complex processes, after all, certain steps don’t always matter. … ‘This has never been a problem before,’ people say. Until one day it is.” — Atul Gawande, The Checklist Manifesto
This rule emphasizes the importance of automating the entire data pipeline, from raw data to final outputs, ensuring that the process is reproducible, reliable, and can be executed by anyone with minimal effort.
Why the rule?
- Reproducibility: Ensures that the entire process can be repeated with the same results.
- Efficiency: Saves time by automating repetitive tasks.
- Error Reduction: Minimizes human error by reducing the number of manual steps.
- Consistency: Ensures that the same steps are followed every time, leading to consistent results.
How to achieve this rule:
- Use Build Tools: Implement tools like GNU Make or Apache Airflow to manage and automate the data pipeline.
- Write Scripts for Each Step: Create scripts for data extraction, cleaning, transformation, modeling, and reporting.
- Automate Environment Setup: Use tools like Docker or virtual environments to ensure the analysis environment can be reproduced.
- Document the Process: Clearly document the steps and commands needed to run the pipeline in a README or similar file.
- Continuous Integration: Integrate with CI/CD tools to automatically run the pipeline whenever changes are made to the codebase.
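As a sketch of what “one command runs everything” can look like, the entry point below strings together hypothetical step functions that mirror the template’s `dataset.py`, `features.py`, and `modeling/` modules; the exact names are placeholders:

```python
# run_pipeline.py -- single entry point from raw data to final outputs
import logging

from src import dataset, features          # hypothetical project modules
from src.modeling import predict, train

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def main() -> None:
    logger.info("Step 1/4: downloading and preparing raw data")
    dataset.make_dataset()

    logger.info("Step 2/4: building features")
    features.build_features()

    logger.info("Step 3/4: training the model")
    train.train_model()

    logger.info("Step 4/4: generating predictions and reports")
    predict.make_predictions()


if __name__ == "__main__":
    main()
```

Whether this is driven by `python run_pipeline.py`, a `make` target, or an Airflow DAG, the goal is the same: anyone can regenerate the final outputs from raw data with a single command.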
Tools and Packages:
- Apache Airflow: For orchestrating complex data pipelines.
- GNU Make: A simple and powerful tool for managing build processes.
- Docker: For containerizing the environment to ensure consistency across different setups.
- Vagrant: For creating and configuring lightweight, reproducible, and portable work environments.
- Jenkins/CircleCI/GitHub Actions: CI/CD tools to automate the running of pipelines.
Automating the entire data pipeline from raw data to final outputs ensures reproducibility, efficiency, and consistency. It reduces the likelihood of human error and makes it easy for anyone to execute the process, leading to more reliable and trustworthy results.
Rule 8: Important Parameters are Extracted and Centralized
“Explicit is better than implicit.” — Tim Peters, The Zen of Python
This rule focuses on centralizing and clearly defining important parameters in a project, rather than scattering them throughout the code. This practice enhances clarity, reproducibility, and ease of modification.
Why the rule?
- Clarity: Centralizing parameters makes it easier to understand how the project is configured.
- Ease of Change: Modifying parameters in one place reduces the risk of inconsistencies and errors.
- Documentation: A centralized configuration serves as documentation for the project’s settings and parameters.
- Reproducibility: Ensures that all parameters are explicitly set and can be tracked, making it easier to reproduce results.
How to achieve this rule:
- Use Configuration Files: Store parameters in a central configuration file (e.g., `config.yml`, `settings.json`).
- Environment Variables: Use environment variables for sensitive information or parameters that may change between environments.
- Parameter Management Tools: Use tools that facilitate parameter management and enforce consistency.
- Document Parameters: Clearly document what each parameter does and its possible values in the configuration file or a separate documentation file.
- Centralized Access: Ensure all parts of the code that need to access parameters read them from the centralized configuration.
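A minimal sketch of centralized parameters, assuming a hypothetical `params.yml` at the project root and the PyYAML package:

```python
# config.py -- load every tunable parameter from one place
from pathlib import Path

import yaml  # PyYAML

CONFIG_PATH = Path("params.yml")  # hypothetical central configuration file


def load_params(path: Path = CONFIG_PATH) -> dict:
    """Read all project parameters from a single YAML file."""
    with open(path) as f:
        return yaml.safe_load(f)


# Elsewhere in the project, parameters are read rather than hard-coded, e.g.:
# params = load_params()
# n_estimators = params["model"]["n_estimators"]
```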
Tools and Packages:
- YAML/JSON/TOML: Formats for configuration files.
- ConfigParser: A Python module for handling configuration files.
- dotenv: For managing environment variables in a `.env` file.
- Hydra: A framework for managing configuration files in Python projects.
- Cerberus: A lightweight and extensible data validation library for Python.
Centralizing and clearly defining important parameters enhances the clarity, maintainability, and reproducibility of a data science project. With all configuration kept in one place, changes are easier to manage and the project becomes more understandable and reliable.
Rule 9: Project Runs are Verbose by Default and Result in Tangible Artifacts
“Capturing useful output during data pipeline runs makes it easy to figure out where results came from, making it easy to look back and pick up from where it was left off.” — DrivenData
This rule emphasizes the importance of making data pipeline runs verbose and ensuring they produce tangible artifacts that document the process and results.
Why the rule?
- Transparency: Detailed logs and artifacts make it clear how results were obtained.
- Debugging: Verbose output helps identify where things went wrong if there are issues.
- Documentation: Automatically generated artifacts serve as a record of what was done, aiding future reproduction and understanding.
- Accountability: Ensures that every step of the pipeline is documented, making it easier to review and audit.
How to achieve this rule:
- Enable Detailed Logging: Use logging libraries to capture detailed information about each step of the pipeline.
- Generate Artifacts: Ensure that each run produces artifacts such as logs, configuration files, intermediate datasets, and final results.
- Timestamp and Version: Include timestamps and version information in the logs and artifacts to track changes over time.
- Store Artifacts: Save artifacts in a structured and accessible location, such as a version-controlled directory or a cloud storage bucket.
- Document Runs: Create a summary report for each run, detailing the steps taken, configurations used, and results obtained.
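As a small sketch, the helper below combines verbose logging with a timestamped artifact directory (here a hypothetical `reports/runs/` location) that records the configuration used for each run:

```python
import json
import logging
from datetime import datetime
from pathlib import Path

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger(__name__)


def start_run(run_config: dict) -> Path:
    """Create a timestamped artifact directory and record the run's settings."""
    run_dir = Path("reports/runs") / datetime.now().strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)

    with open(run_dir / "config.json", "w") as f:
        json.dump(run_config, f, indent=2)

    logger.info("Run started; artifacts will be written to %s", run_dir)
    return run_dir
```

Logs, intermediate datasets, and final outputs can then be written under the same directory, so every result is traceable to the run that produced it.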
Tools and Packages:
- logging: Python’s built-in logging module for capturing detailed logs.
- MLflow: For managing the ML lifecycle, including experiment tracking, model registry, and artifact storage.
- WandB: Weights & Biases for tracking experiments and visualizing results.
- TensorBoard: For visualizing TensorFlow logs.
- Structured Storage: Tools like S3, Google Cloud Storage, or Azure Blob Storage for storing artifacts.
Making project runs verbose and ensuring they result in tangible artifacts improves transparency, facilitates debugging, and provides comprehensive documentation. This practice makes it easier to understand, reproduce, and build upon previous work, enhancing the overall reliability and efficiency of data science projects.
Rule 10: Start with the Simplest Possible End-to-End Pipeline
“A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.” — John Gall, Systemantics
This rule emphasizes the importance of starting with a simple, functional end-to-end pipeline before gradually adding complexity. Begin with a minimal viable product (MVP) that processes data from start to finish, then iteratively enhance and optimize.
Why the rule?
- Foundation: Establishes a working baseline that ensures all parts of the process are connected and functional.
- Iterative Improvement: Allows for gradual refinement and optimization, reducing the risk of introducing bugs.
- Focus: Helps maintain focus on the primary objectives before getting bogged down in details and optimizations.
- Flexibility: Provides a flexible framework that can be adjusted and extended as needed.
How to achieve this rule:
- Define the Minimal Pipeline: Identify the essential steps needed to process raw data to a final output and implement them.
- Iterative Development: Start with the most straightforward implementation and iteratively add features, optimizations, and complexity.
- Validate Early: Ensure that each stage of the pipeline works correctly before moving on to the next.
- Simple Tools First: Use simple, well-understood tools and methods initially, and only introduce more advanced techniques when necessary.
- Document the Process: Keep documentation up-to-date with each iteration to ensure the evolving pipeline remains understandable.
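As a sketch, a first end-to-end pass can be as short as the baseline below, which assumes a hypothetical `data/raw/train.csv` with numeric features and a `target` column; every later improvement replaces one of these steps without breaking the chain:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load raw data and split off a holdout set.
df = pd.read_csv("data/raw/train.csv")
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The simplest reasonable model as a baseline.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print(f"Baseline accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```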
Tools and Packages:
- Make: For simple build automation.
- Pandas: For data manipulation and initial data processing.
- Scikit-learn: For basic modeling and machine learning tasks.
- Jupyter Notebooks: For prototyping and exploring initial implementations.
- Docker: For creating a reproducible environment.
Starting with the simplest possible end-to-end pipeline ensures a solid foundation to build on. It allows for iterative development and refinement, ensuring that each addition is built on a functional and validated base, reducing complexity and improving maintainability.
Lessons
In the appendix, they share the following hard-earned software engineering lessons that inspired the rules above.
- Version Control is a Must: Embrace tools like Git to manage code and changes efficiently. This facilitates collaboration and ensures that every modification is tracked and reversible.
- Keep it Simple, Stupid (KISS): Opt for simple solutions whenever possible. Simple code is easier to reason about, debug, and maintain. Complexity should only be introduced when essential and after careful consideration.
- Separation of Concerns: Divide your code into modules that handle specific tasks. This modularity makes your code more understandable and easier to test.
- Separate Configuration from Code: Centralize all settings and parameters in configuration files. This practice enhances clarity, makes it easier to adjust parameters, and ensures that changes are systematically documented.
- You Aren’t Gonna Need It (YAGNI): Avoid over-engineering. Start with concrete implementations and generalize only when a clear need arises. This principle helps prevent wasted effort on unnecessary abstractions.
- Premature Optimization is the Root of All Evil: Focus on making your code work correctly before trying to make it fast. Address performance issues only after the correctness of the code is assured and the need for optimization is clear.
- Don’t Repeat Yourself (DRY): Minimize duplication by refactoring reusable code into modules. This reduces the risk of inconsistencies and makes maintaining your codebase easier.
- Composability: Build your project with small, interoperable components. This approach respects the separation of concerns and DRY principles and enhances flexibility and reusability.
- Test the Critical Bits: Implement basic tests and sanity checks to catch common errors and ensure your code behaves as expected. This practice increases confidence in your code and helps prevent bugs from reappearing.
- Fail Fast and Loudly: Design your system to detect errors early and handle them explicitly. This defensive programming approach ensures that issues are caught and addressed promptly, reducing the risk of subtle bugs causing significant problems later.
Thanks for reading
Be sure to follow the bitgrit Data Science Publication to keep updated!
Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!
Follow Bitgrit below to stay updated on workshops and upcoming competitions!
Discord | Website | Twitter | LinkedIn | Instagram | Facebook | YouTube