Accelerating Data Science

rorodata
Jan 11, 2018

Enterprises have numerous problems that can be solved cheaply, and at scale, using data science. However, as data science teams are formed and undertake this journey, they encounter several bottlenecks that can slow them down and significantly limit the impact they create for the organization. In this post, we discuss the significant technical challenges involved and why they must be addressed properly.

Source: Electric Autosport

Three Parts of End-to-End Data Science

A data science effort can be divided into three parts: (1) Idea to Model, (2) Production Deployment, and (3) Post-Production Management.

Idea to Model

This stage involves all the modelling activities in Data Science. The process is highly iterative, but typically involves pre-processing data, trying out various features and models, and through cross-validation, arriving at a good machine learning model / pipeline for the task at hand.

Here are some activities that consume effort, and become painful if not done well:

  • The main focus during this phase is the reproducibility and management of the different experiments conducted while developing the ML model at hand. This usually involves a lot of ad-hoc bookkeeping about the software libraries, the version of code, and the specific datasets and transforms used in each experiment / model
  • In the case of deep learning models, another important yet cumbersome activity is babysitting the process of training neural networks (especially with several models in training at the same time). Many times, this simply translates to the data scientist spending long durations (spanning days) watching graphs or numbers plotted on a screen
  • Equally important to this process is creating a data science environment (hardware + software + data) to run and reproduce experiments. Managing this activity on a local computer does not offer the flexibility to access more compute or storage on-demand, and hence there is a strong preference to perform these activities on the cloud
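The experiment bookkeeping described above can be sketched as a small run logger. This is a minimal illustration, not rorodata's tooling; the function name `log_run`, the JSON-lines log format, and the recorded fields are all assumptions:

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def log_run(params, metrics, dataset_path=None, log_file="runs.jsonl"):
    """Append one experiment record (environment, parameters, metrics,
    and an optional dataset fingerprint) to a JSON-lines log."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),  # part of the environment record
        "params": params,
        "metrics": metrics,
    }
    if dataset_path is not None:
        # Hash the raw data file so each run is tied to an exact dataset version.
        with open(dataset_path, "rb") as f:
            record["data_sha256"] = hashlib.sha256(f.read()).hexdigest()
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Example: record one cross-validation run
run = log_run({"model": "rf", "n_estimators": 200}, {"cv_auc": 0.91})
```

Appending one line per run keeps the log trivially diff-able and greppable, which is often enough to replace ad-hoc spreadsheets for small teams.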

Conducting the above tasks with the least amount of manual setup and monitoring is key to effective model building.

Production Deployment

Deploying a model in production is primarily a technical activity, not a data science activity. However, it is an extremely important part of the overall end-to-end data science process, for the following reasons:

  • Often, data science teams come up with error fixes, modifications, and better solutions for the same model. These need to be deployed to production without excessive delays
  • The process of handoffs from a data science team to a technology/DevOps team for production deployment can create reproducibility issues due to variations in environments, library versions, etc. At best, this leads to numerous iterations between data science and production teams; at worst, it leads to model failures in production
  • Deployment activities are technically complex because they often work with federated services / environments that are themselves changing. E.g. a cloud service introduces additional tasks in the setup process, changes API specifications, etc. Because deployment activities are often performed manually, tribal knowledge and glue code make their way into the process, making it hard to maintain and troubleshoot
  • A key part of a production release is pre-production alpha/beta testing with internal or limited sets of users. This process involves rapid iterations and fixes, which become very slow due to handoffs between the data science and DevOps teams, limiting the number of iterations that can be done
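One way to reduce the handoff reproducibility problems described above is to ship an environment manifest alongside every serialized model. The sketch below is a hypothetical illustration, not a prescribed deployment process; the name `package_model`, the manifest layout, and the default library list are assumptions:

```python
import json
import pickle
import platform
from importlib import metadata

def package_model(model, name, requirements=("numpy", "scikit-learn")):
    """Serialize a model together with a manifest of the Python and
    library versions it was built with, so the production environment
    can be pinned to match the data scientist's environment."""
    manifest = {"python": platform.python_version(), "libraries": {}}
    for lib in requirements:
        try:
            manifest["libraries"][lib] = metadata.version(lib)
        except metadata.PackageNotFoundError:
            manifest["libraries"][lib] = None  # not installed locally
    with open(f"{name}.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(f"{name}.manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

The DevOps side can then build the serving environment from the manifest (e.g. via a pinned requirements file or Docker image) instead of relying on tribal knowledge about which versions the model was trained against.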

Unless the data science team finds a repeatable, streamlined process to deploy models into production, this stage may become a significant bottleneck to serving the needs of the organization.

Post Production Management

Once a model is deployed into production, it must be monitored for both predictive performance and technical performance. Here are some of these aspects in more detail:

  • Models in production need to be monitored for predictive performance. Data science teams need to watch for performance drift (i.e. models slowly getting worse over time) as well as sudden degradation in model performance (e.g. as a result of problems in a periodic retraining process). When such events occur, data science teams should be able to quickly detect them and take corrective action, e.g. restoring a previous stable version of the model
  • Once in production, models need to be monitored for technical performance, e.g. throughput, latency, etc. Data scientists need a simple way to manage this themselves, instead of delegating it to DevOps, for two reasons: first, the handoff cycle is unnecessary and may result in slow turnaround times; second, understanding technical performance directly may push data scientists to try models that are more technically efficient, e.g. smaller, faster models
  • Data science teams may also wish to deploy multiple models simultaneously, e.g. to have new models run in shadow mode for some time before cutting over partially or completely to production. They may also want multiple models to serve requests, dynamically choosing which model serves each request
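The drift and degradation checks above can be sketched as a rolling-window monitor. `DriftMonitor`, its thresholds, and the binary match/mismatch outcome per prediction are all illustrative assumptions, not a production design:

```python
from collections import deque

class DriftMonitor:
    """Track a rolling window of prediction outcomes and flag when
    accuracy falls below a baseline by more than a tolerance, which
    covers both sudden degradation and slow drift as the window fills."""

    def __init__(self, baseline_accuracy, window=500, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        # deque with maxlen keeps only the most recent `window` outcomes
        self.outcomes = deque(maxlen=window)

    def record(self, prediction, actual):
        self.outcomes.append(1 if prediction == actual else 0)

    def current_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        acc = self.current_accuracy()
        return acc is not None and acc < self.baseline - self.tolerance
```

In practice, `degraded()` would be polled on a schedule and wired into alerting, or into an automatic rollback to a previous stable model version.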

Again, it is easy to view the above tasks as purely technical in nature. However, if an organization is to truly benefit from data science, a streamlined, simplified process must be put in place that lets the data science team assume full control of the above activities.

In addition to managing the above activities, data science teams need to be able to answer questions such as: when performance degrades, what is the reason? Is it due to infrastructure, e.g. network bandwidth, or due to certain types of inputs? Do models perform worse on certain types of queries or tasks than on others? To answer these, they need access to detailed performance logs that help them understand and improve their data science models and pipelines.
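As a minimal sketch of what such logs enable, assume each request record carries a segment label, a latency, and an error flag (all hypothetical field names). Per-segment aggregation then shows whether degradation is concentrated in particular kinds of inputs:

```python
import statistics
from collections import defaultdict

def summarize_logs(records):
    """Aggregate per-request logs by input segment so degradation can
    be traced to specific query types rather than the model as a whole."""
    by_segment = defaultdict(lambda: {"latencies": [], "errors": 0, "count": 0})
    for r in records:
        seg = by_segment[r["segment"]]
        seg["latencies"].append(r["latency_ms"])
        seg["count"] += 1
        if r["error"]:
            seg["errors"] += 1
    return {
        name: {
            "requests": s["count"],
            "error_rate": s["errors"] / s["count"],
            "p50_latency_ms": statistics.median(s["latencies"]),
        }
        for name, s in by_segment.items()
    }
```

A summary like this makes it immediately visible when, say, one input segment has a high error rate or latency while the overall averages still look healthy.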

The Elephant in the Room

As the challenges discussed in the above sections show, data science CAN get quite complex, even for competent technical teams. Here we list the issues most commonly observed.

  • While the majority of a data science team’s efforts and time should be focused on core data science tasks (i.e. feature engineering, model building, and analysing experiments), data science teams are forced to spend as much as 80% of their time on non-core technical tasks that require software engineering, data engineering, and devops skills e.g. setup of cloud services, managing docker images, deploying to production, monitoring infrastructure, etc.
  • Managing infrastructure and system components by hand generates a lot of glue code and ad-hoc procedures, making the entire data science process complex, error-prone, and time-consuming
  • Cloud services / tools to perform each of these individual tasks are too low level, and each comes with its own complexity and learning curve
  • The talent required to bridge this gap is expensive, hard to find, and creates inefficient, unbalanced data science teams with high coordination costs

In addition, there is a strong reluctance among technical teams to admit to the above issues. Many times, these issues are dismissed as "the nature of the job", "just not enough hours in the day to address these properly", "we have people to do these things", and so on. It is the organization and the business that eventually pay the price for these inefficiencies.

Need for Agile, Frictionless Data Science

How does an organization overcome the above problems? After all, data science is not merely a technical activity; it is about building data products that the organization can use to achieve greater efficiencies and new business capabilities.

We believe in the following four tenets for data science teams:

  1. Simplify and automate all repetitive, non-core, low-level engineering devops tasks
  2. Give data scientists simple APIs to manage their workflows at every stage
  3. Provide data science teams with the right abstractions to build and deploy data products fast
  4. Give data scientists the autonomy of controlling their products from conception to production and post-production

Author: Ananth Krishnamoorthy

Please do write to us with your views and comments. If you are a company/startup looking for help with machine learning, we’d be more than happy to help. Just drop us a line and we’ll get back.
