Data Science Collaboration: Options for End-To-End Workflows

The Data Detective
3 min read · Jul 18, 2020

Data Science collaboration is changing. Gone are the days of sharing files over email, and it is no longer necessary for new team members to download dependencies before working on a project. This article covers some current options that can improve the efficiency of your Data Science team and the quality of your workflow.

Databricks

Databricks is an Apache Spark-based product that has evolved into an end-to-end platform, bringing parallel processing into the machine learning workflow. Environments can be loaded pre-configured and customized for a particular project. Using notebooks, data science team members can comment and contribute to a project in several languages, including Python, R, Scala, SQL and Java, all in the same notebook. Databricks even allows the team to test and productionize models from the UI, and it has partnered with Microsoft and AWS to integrate with their cloud platforms as well.
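As a rough sketch of the multi-language notebook support described above, a single Databricks notebook can mix languages by starting a cell with a magic command. The cells below (the `numbers` view name is a made-up example) only run inside a Databricks notebook, where `spark` is provided for you:

```
# Cell 1 — default notebook language (Python): register a temp view
df = spark.range(5)
df.createOrReplaceTempView("numbers")

%sql
-- Cell 2 — a SQL cell querying the view registered by the Python cell
SELECT id FROM numbers WHERE id > 2

%scala
// Cell 3 — a Scala cell reading the same shared view
val scalaDf = spark.table("numbers")
```

Because every cell runs against the same Spark session, teammates can each work in the language they prefer while sharing the same data.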

MLflow

MLflow is an open source project that focuses on four areas of the ML lifecycle: tracking, projects, models and model registry. Many of the products mentioned in this article use MLflow's packages and contribute back to the project. It is an option to consider if the Data Science team is looking to avoid operational costs at the expense of more upfront setup work. MLflow provides API packages for Python, Java and R, and has its own REST API.

Azure Machine Learning

Azure Machine Learning (Studio) is a data science workflow platform developed by Microsoft. It includes the option of creating ML workflows via a drag-and-drop interface or using Jupyter notebooks, JupyterLab and RStudio. Deployment options include container instances for test and development, or Kubernetes for real-time, scalable solutions. One standout feature is its ability to provide model explanations via feature importance and data disparity (bias) analysis (in preview). It can also monitor data drift.

AWS SageMaker

Amazon’s SageMaker is an IDE platform that, like the other products listed here, supports end-to-end workflows. SageMaker utilizes notebooks and can perform automatic debugging. It also offers Ground Truth, a labeling service designed to assist in creating high-quality training data sets. Deployment and model monitoring are also simple endeavors through SageMaker services. SageMaker even offers human review of model predictions through its Augmented AI service.

These are just a few of the options out there. Maybe your data science team has found a solution other than the ones listed. Feel free to share your team’s workflow solution.
