Data Science Mystery: How to Move from “Death in Dev” to “Prove in Prod”?

Deepak Sekar
Apr 13, 2020

According to Algorithmia's 2020 State of Enterprise Machine Learning study, machine learning maturity in the enterprise is generally increasing, yet half of companies (50%) spend between 8 and 90 days deploying a single machine learning model, and 18% take longer than 90 days. The most commonly cited obstacle is failure to scale (33%), followed by model reproducibility challenges (32%) and lack of executive buy-in (26%).

Much of the work done in data science dies in dev and never gets promoted to production, for the following reasons:

  1. Lack of Data Science Skills
  2. Lack of an environment that meets the needs of a Data Science project
  3. Lack of Model explainability
  4. Long turnaround time from the business requirement to model evaluation
  5. It’s a world of open source: who takes responsibility for maintaining, upgrading and fixing issues?

Gartner reported in January that AI implementation grew a whopping 270% in the past four years and 37% in the past year alone. According to the McKinsey Global Institute, the subsequent labor market shifts will result in a 1.2% increase in gross domestic product (GDP) growth for the next 10 years and help capture an additional 20% to 25% in net economic benefits ($13 trillion globally) in the next 12 years.

AI clearly has a future, so these problems need to be addressed, and data science needs to be done the right way.

Let us look at the Data Science Lifecycle.

The CRISP-DM model (Cross-Industry Standard Process for Data Mining) has traditionally defined six steps in the data mining life cycle. The data science life cycle incorporates all six of these steps and more.

The CRISP-DM steps are:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Figure: CRISP-DM Lifecycle

What are the two additional steps in a Data Science Life Cycle?

MLOps:

7. Monitoring — Drift/ Bias Detection
8. Feedback — Real-time De-biasing and Model Tuning
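As a rough illustration of what the monitoring step can involve, here is a minimal sketch of one common drift check, the Population Stability Index (PSI), which compares the distribution of a feature at training time with what the model sees in production. This is a generic technique chosen for illustration and is not tied to any particular platform; the function name, the bin count, and the synthetic data are all assumptions made for the example.

```python
import math
import random


def population_stability_index(expected, actual, bins=10):
    """Compare two samples of one numeric feature; larger PSI means more drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the training range fall in the edge buckets.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic data (hypothetical): one production sample matches training, one is shifted.
rng = random.Random(0)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same_dist = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_stable = population_stability_index(training, same_dist)
psi_drift = population_stability_index(training, shifted)
print(f"PSI, same distribution: {psi_stable:.3f}")
print(f"PSI, shifted mean:      {psi_drift:.3f}")
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as significant drift, but the right thresholds depend on the feature and the use case. A score that crosses the chosen threshold is the kind of signal that would feed the feedback step and trigger retraining or tuning.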

Do enterprises just look for a platform that covers all eight of these steps?

In terms of process, yes; in terms of capabilities, they want more.

What are these additional capabilities?

9. A platform that bridges the gap between citizen Data Scientists and Experts: AutoML, data prep recommendations, etc.
10. Explainable Models: not just locally but also globally (going beyond the local explanations that LIME/SHAP provide)
11. Native Big Data Execution Environment (Apache Spark is a good example)
12. Scalable and affordable Infrastructure (Money always matters in Data Science)
13. Model portability — Host anywhere, no vendor lock-in (Because of the Multi-Cloud world we deal with)
14. Governance and Access Control
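To make the local versus global distinction in item 10 concrete, here is a small sketch of permutation importance, a generic model-agnostic way to get a global view of which features a model relies on. It is offered only as an illustration: it is not LIME, SHAP, or any vendor's explainer, and the `black_box` model, feature names, and data below are hypothetical.

```python
import random


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Global importance: how much the error grows when one feature is shuffled."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(data)

    baseline = mse(rows)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Replace feature j with its shuffled copy, leaving other features intact.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(shuffled) - baseline)
    return importances


# Hypothetical "black box": depends strongly on x0, weakly on x1, not at all on x2.
def black_box(row):
    return 3.0 * row[0] + 0.5 * row[1]


rng = random.Random(42)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(500)]
y = [black_box(r) for r in X]

scores = permutation_importance(black_box, X, y, n_features=3)
for name, s in zip(["x0", "x1", "x2"], scores):
    print(f"{name}: {s:.3f}")
```

A local explanation answers "why did the model score this one row the way it did?", whereas a global measure like this one summarizes behavior across the whole dataset, which is usually what a business stakeholder asks for before trusting a model in production.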

Based on this, can we consider the following as KPIs for a Data Science Platform?

Data Science Platform’s KPIs

Oracle Cloud Infrastructure (OCI) Data Science

OCI Data Science is a collaborative, scalable and powerful Data Science platform that provides the following:

  1. Scalable Infrastructure
  2. Powerful and diverse compute (Intel Xeon, AMD, NVIDIA Tesla Pascal/Volta GPU)
  3. Easy Environment Setup
  4. Collaborative Workspace/Shared Environment
  5. JupyterLab IDE
  6. IAM-based Access Control + OCI Governance Capabilities
  7. Model Catalog
  8. Transparent Pricing: you are charged only for the compute and storage used, and can turn resources on or off as required

And most importantly, a homegrown SDK that is provided free of charge:

  9. Accelerated Data Science (ADS) SDK

Accelerated Data Science (ADS) SDK

The ADS SDK helps Data Science teams innovate faster. It provides capabilities for:

a. Data Connection (Oracle DB, Autonomous DB, MySQL, Object Storage, AWS S3, SQLite, etc.)

b. Data Manipulation (Profiling, Correlations, Feature Selection, Recommendations, etc.)

c. Native Dask support (if you are interested in Dask, see https://towardsdatascience.com/why-every-data-scientist-should-use-dask-81b2b850e15b)

d. ML Framework Support (TensorFlow, Keras, XGBoost, scikit-learn, etc.)

e. AutoML

f. Model Evaluation

g. Model Explanation (Oracle MLX — Global & Local)

Now we know what OCI Data Science is capable of. Let us look at the journey from development to production in OCI Data Science.

The journey from Dev to Prod — OCI Data Science

**OCI Functions: Oracle Functions is a fully managed, multi-tenant, highly scalable, on-demand Functions-as-a-Service platform. It is built on enterprise-grade Oracle Cloud Infrastructure and powered by the open-source Fn Project engine. This helps us achieve model portability, since the function artifacts can be ported to any other Functions-as-a-Service provider that is powered by the Fn Project.

**OCI API Gateway: The API Gateway service enables you to publish APIs with private endpoints that are accessible from within your network, and which you can expose with public IP addresses if you want them to accept internet traffic.
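To give a feel for what "a model behind a function" might look like, here is a minimal, standard-library-only sketch of a scoring function that takes a JSON request and returns a JSON prediction. The entry point is shaped like the `handler(ctx, data)` convention used by Python function handlers in the Fn ecosystem, but that shape, the linear model, its weights, and the request format are all assumptions made for illustration; a real deployment would load a trained model artifact and use the platform's own response type.

```python
import io
import json

# Hypothetical linear model; in practice the weights would be loaded from a
# serialized artifact packaged with the function.
WEIGHTS = [0.4, -1.2, 2.0]
BIAS = 0.5


def predict(features):
    """Score one feature vector with the hypothetical linear model."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


def handler(ctx, data: io.BytesIO = None):
    """Entry point shaped like a Python function handler: JSON bytes in, JSON text out.

    `ctx` is unused here. A real deployment would typically wrap the result in
    the platform's response type; that wiring is omitted to keep this stdlib-only.
    """
    try:
        body = json.loads(data.getvalue())
        return json.dumps({"prediction": predict(body["features"])})
    except (AttributeError, KeyError, TypeError, ValueError) as exc:
        # AttributeError covers a missing request body (data is None).
        return json.dumps({"error": str(exc)})


# Simulate one request locally, the way an API gateway would forward it.
request = io.BytesIO(json.dumps({"features": [1.0, 2.0, 3.0]}).encode("utf-8"))
print(handler(None, request))
```

Because the scoring logic lives in a plain function with JSON in and JSON out, the same code can be wrapped by a different serving layer with little change, which is the portability argument in a nutshell.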

How does OCI Data Science help in moving from “Death in Dev” to “Prove in Prod”?

First, let us see how OCI Data Science maps to the Data Science Platform KPIs.

OCI Data Science helps in

a. Reducing Cost

b. Providing Access to More Data

c. Reducing Time

d. Increasing Security

e. Increasing Flexibility

f. Increasing Trust

and thereby helps reduce “Death in Dev” and makes way for you to “Prove in Prod”.

Welcome to the world of Data Science done right!

The views expressed are those of the author and not necessarily those of Oracle. Contact Deepak Sekar

Additional Resources

https://www.oracle.com/a/ocom/docs/cloud/oracle-cloud-infrastructure-platform-overview-wp.pdf
