Introducing rorodata: Making data science work for you in production

rorodata
Oct 4, 2017

Introduction

If you are a data scientist building machine learning applications for cataloging, marketing or fraud detection, you are probably intimately familiar with Scikit-learn, Keras, TensorFlow (TF), Jupyter, etc. But there is a good chance that you are not familiar with a lot of server-side software engineering and infrastructure code. As a result, productizing your applications becomes challenging. Of course, you can figure these things out, but the time it takes to learn and build software systems is far from trivial.

One solution to this problem is creating an assembly line of data scientists and developers. Data scientists build the models and hand them over to developers, who then integrate them into “production systems.” But data science does not lend itself well to this approach. Why? Because data science is iterative, with far too many explore, transform, visualize and model cycles before a solution is selected. Getting this data scientist-developer hand-off right is tricky and expensive. Even if you do manage to get the hand-off to work, you will still have an unhappy bunch of developers who always need to fix someone else’s work. On the other side, data scientists are unhappy too, because their models do not reach production fast enough. We have experienced these issues firsthand and completely understand the pain of getting machine learning applications to work seamlessly.

A better solution is to build a data science platform that data scientists can use themselves, with minimal hand-offs to developers. This way, the data scientists control their machine learning applications all the way into production.

Under this approach, data scientists work on creating production-grade APIs and/or dashboards based on ML models, while developers work on building a platform that abstracts away all the complex, low-level wiring and configuration required to productionize the ML models.

A good data science platform needs to be able to:

  • Bring in structured and unstructured data of any size and any format, for exploratory analysis
  • Build and train machine learning models on the desired hardware
  • Manage and track experiments and models, along with all metadata for audits
  • Deploy models as consumable endpoints/APIs/Dashboards, for prediction tasks
  • Track model performance and roll out upgrades on the fly
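
To make the "versioned, consumable endpoint" idea a little more concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and for illustration only: `ModelRegistry`, `handle_request` and the toy `predict` function are names we made up here, not the API of rorodata or of any library. It assumes a model is just a callable and a request is just a JSON string; a real platform would put an HTTP server, authentication, logging and monitoring around the same idea.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple

Model = Callable[[List[float]], float]


def predict(features: List[float]) -> float:
    """Hypothetical stand-in for a trained model: a fixed linear scorer."""
    weights = [0.5, -0.25, 1.0]
    return sum(w * x for w, x in zip(weights, features))


class ModelRegistry:
    """Illustrative in-memory registry: each model name maps to numbered versions."""

    def __init__(self) -> None:
        self._models: Dict[str, Dict[int, Model]] = {}

    def register(self, name: str, fn: Model) -> int:
        """Store fn as the next version of `name` and return that version number."""
        versions = self._models.setdefault(name, {})
        version = len(versions) + 1
        versions[version] = fn
        return version

    def get(self, name: str, version: Optional[int] = None) -> Tuple[Model, int]:
        """Return (model, version); default to the latest version if none is given."""
        versions = self._models[name]
        if version is None:
            version = max(versions)
        return versions[version], version


def handle_request(registry: ModelRegistry, name: str, body: str) -> str:
    """Treat `body` as a JSON request and return a JSON response string."""
    payload = json.loads(body)
    fn, version = registry.get(name, payload.get("version"))
    result = fn(payload["features"])
    return json.dumps({"model": name, "version": version, "prediction": result})


if __name__ == "__main__":
    registry = ModelRegistry()
    registry.register("scorer", predict)
    print(handle_request(registry, "scorer", json.dumps({"features": [2.0, 4.0, 1.0]})))
```

Because callers only ever see the JSON contract, a new model version can be registered without changing any client code, and a client that needs reproducibility can pin an older version by passing `"version"` in the request.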

A platform-based approach like this can significantly improve the velocity of machine learning application development, allowing companies to capture more opportunities and giving them a greater competitive edge in the market.

…so as a business, no matter what your size, your agility — your ability to roll out a new product, change processes, manage your people, etc. — is equivalent to your ability to develop and change software. So your software development velocity determines your competitiveness. — James Lindenbaum

While working with startups and enterprises, we found critical gaps in the data science platforms available. There are a variety of tools in the market and in the open source data science ecosystem for beginners and for big businesses, but there is a large gulf in the middle, i.e., in the small and medium business (read: small and medium-sized teams) space.

Specifically, there is not much out there for small data science teams who are keen on building vertical applications fast and on a budget. Most of these companies and teams do not have the resources or the time to build such platforms in-house, and they (rightly) prefer not to.

Today, we’re excited to introduce rorodata, a data science platform that lets you explore, build, and deploy machine learning models in minutes. You focus on the science (e.g., feature engineering, models, etc.) and leave the non-science part (e.g., infrastructure, devops, experiment management, bookkeeping, etc.) to us. We are fans of Heroku, so we wanted something like a Heroku for data scientists. We want data scientists to have the same developer experience that web developers have on Heroku.

We will focus on simplifying and automating all the non-data science tasks that data scientists have to wrestle with, so that data scientists are self-sufficient all the way into production. We’re excited by the possibilities and look forward to innovative machine learning solutions.

Using rorodata, you can do the following:

  • Standardize your project workflows
  • Deploy your model instantly on AWS
  • Version your model artifacts and API endpoints
  • Run Jupyter Notebooks on demand, on specific hardware
  • Access all the execution logs
  • Provision additional storage using a simple API

We support all the popular Python frameworks, such as Scikit-learn, Keras, TensorFlow and PyTorch. If you don’t find something, just ask us and we’ll make it available.

We are still in the early stages and are actively looking for beta users who can try the platform, share feedback, report bugs and problems, or ask questions. With your help and support, we can make machine learning simpler. Join us on Slack.

If you are a startup and need help with machine learning, we’re more than happy to assist you. Just share some more details with us, and we will get back to you soon.

Lastly, we’re hiring as well. We are building a robust data science platform that gives data scientists wings by taking on challenging software engineering and design problems. We are looking for highly motivated individuals who can help us tame these difficult problems. Please see our careers page for more details.
