Data science taken to the extreme

Jonathan (Yonatan) Alexander
Published in BuiltOn · Jul 29, 2019 · 8 min read

Extreme data science is all about saving time and presenting complete solutions quickly. In other words: how can you extract value from data fast?

Some of the significant challenges with conventional data science are the time-consuming exploration, cleaning and feature engineering that are specific to each dataset. On top of that, there are almost always hours of manually setting values and human decision making, and every manipulation of the dataset needs to be logged and redone in the inference step, in slightly different ways depending on the transformation. The inference step itself has the same standard challenges as any other server application: maintenance, scaling, fault tolerance, availability and so on.

At BuiltOn, we deal with many different datasets and use cases, all of which end up with actionable APIs. We encounter all of the hurdles above, along with many others, on a regular basis.

To overcome these issues, we have developed our own extreme data science library which empowers our data scientists to solve the whole problem from exploration to production with minimal code and time. The four pillars of our technology are:

  1. “One-liners” using our BuiltonAI library.
  2. Auto pipeline for inference.
  3. Serverless deployment.
  4. Easy scaling for big data.

1. The BuiltonAI machine learning library

  1. We developed our own “XFrame” (Xtreme data science), with an API very similar to pandas and sklearn, for easy on-boarding of the technology.
  2. Out-of-core and lazy evaluation capabilities to explore, manipulate and train on hundreds of GBs with ease.
  3. Every manipulation is a transformer, which is saved in the background — for auto-pipeline generation.
  4. Every algorithm is wrapped to handle the most common issues and use cases.
  5. When the exploration is done, the pipeline is fully ready for production, either for inference or retraining with more data.

Let’s start with a simple example by using the Titanic dataset, where we can try to predict who survived the Titanic. The data is pretty dirty, with missing and heterogeneous values, and has a variety of input types. We will see how to deal with these issues later.

In a single line, we can explore the dataset, regardless of size, since we use out-of-core technology.
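The original code gist isn’t reproduced here, so below is a minimal sketch of what that one-liner could look like, assuming a hypothetical XFrame.read_csv entry point modelled on the pandas API (the import path and call names are assumptions, not the documented BuiltonAI API):

```python
# Hypothetical sketch -- the real BuiltonAI/XFrame call names are assumptions.
from builtonai import XFrame  # assumed import path

# Lazily load the Titanic CSV (out-of-core) and print the columns,
# their types and a sample of values.
xf = XFrame.read_csv("titanic.csv")
print(xf)
```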

Note that the Titanic dataset is small, but our library would run the exact same way on much larger datasets.

In the code above, we import the dataset and print it to the screen. There we can see the columns, their type, and a sample of the values. It is clear that we have missing values in Age, together with heterogeneous values in Ticket and in the structure of Name, along with other issues.

In just one line, we can use our variation of XGBoost to add a column of predictions.
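The exact library call isn’t shown here, but conceptually it behaves like the open-source equivalent below: fit a gradient-boosted model and attach its predictions as just another column (column names follow the standard Kaggle Titanic schema):

```python
import pandas as pd
from xgboost import XGBClassifier

df = pd.read_csv("titanic.csv")
# Numeric columns only, for brevity.
features = df[["Pclass", "SibSp", "Parch", "Fare"]].fillna(0)

model = XGBClassifier(n_estimators=100, max_depth=3)
model.fit(features, df["Survived"])

# Predictions become just another column.
df["prediction"] = model.predict(features)
```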

In general, we design each manipulation to add a column, which keeps simple feature engineering and model stacking straightforward.

Evaluation is also a one-liner, as you would expect:
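Continuing the pandas/sklearn sketch above, the equivalent evaluation is a single scoring call:

```python
from sklearn.metrics import accuracy_score, classification_report

# Training-set accuracy only -- a proper evaluation would hold out a test split.
print(accuracy_score(df["Survived"], df["prediction"]))
print(classification_report(df["Survived"], df["prediction"]))
```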

Now let’s take it to the extreme

The first challenge we set ourselves is to add a column of images to the tabular data. By doing so, we will demonstrate how easy it is to do substantial data wrangling and modelling, and eventually produce an easy-to-consume output with the final results, all through minimal code.

We add the new column of images to the Titanic dataset by searching for random pictures of men and women and matching them to the gender in the dataset. This is obviously nonsense that won’t help get better results, but we added this step to show how easily we can handle more complex cases. We can now model a dataset with tabular and image data combined!
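A hedged sketch of that step, assuming the random portraits have simply been downloaded into two local folders (images/male/ and images/female/ are made-up paths):

```python
import random
from pathlib import Path

male_images = list(Path("images/male").glob("*.jpg"))
female_images = list(Path("images/female").glob("*.jpg"))

# Attach a random portrait matching the passenger's gender -- deliberately
# uninformative, just to give the dataset an image column to model on.
df["image"] = df["Sex"].apply(
    lambda s: str(random.choice(male_images if s == "male" else female_images))
)
```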

Next, we are going to do some data wrangling to demonstrate how easily it can be done. Then we are going to pack a few columns for inference, and finally, we will prep it for API consumption.

Let’s look at some basic data wrangling.
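The original gist isn’t reproduced here; what follows is a pandas-flavoured sketch of the kind of column-per-line wrangling described (standardize_age and forename are the columns referred to again in the pipeline section below):

```python
# One new column per line -- each transformation is easy to inspect in isolation.
df["standardize_age"] = (df["Age"] - df["Age"].mean()) / df["Age"].std()
df["forename"] = df["Name"].str.split(",").str[1].str.split(".").str[1].str.strip()
df["family_size"] = df["SibSp"] + df["Parch"] + 1

# Filtering example: drop the handful of extreme SibSp outliers.
df = df[df["SibSp"] < 8]
```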

Every line adds another column, which makes it easy to debug and verify that the transformations make sense, unlike, for example, writing functions and generators that run over files. This is very similar to how you would do it with Pandas.

More advanced data wrangling.

As an example of more advanced data wrangling, we will use TF-IDF, a common text-mining technique that gives every word a weight based on how informative it is in a document versus the entire corpus. We apply it to the “Name” column, which again is not very informative, just to showcase how easy it is to use.
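An open-source equivalent of that step, using scikit-learn’s TfidfVectorizer on the Name column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=50)
tfidf = vectorizer.fit_transform(df["Name"])

# Each row becomes a vector of per-word weights, kept as a single column.
df["name_tfidf"] = list(tfidf.toarray())
```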

LabelEncoder and One-Hot-Encoder are classic methods for turning each string into an int, or a list of ints, so it can be consumed by machine learning algorithms that cannot handle string inputs, among other use cases. We demonstrate the LabelEncoder in this case, but have other options available.
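With scikit-learn, the same encoding looks like this:

```python
from sklearn.preprocessing import LabelEncoder

# Map each category to an integer, e.g. female -> 0, male -> 1.
df["sex_encoded"] = LabelEncoder().fit_transform(df["Sex"])
df["embarked_encoded"] = LabelEncoder().fit_transform(df["Embarked"].fillna("missing"))
```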

FeatureBinner puts numeric values into bins. For example, it can be used to categorise 0–18 year olds as children, 19–50 as adults, and 50+ as old (I know, 50 is not old…).
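The pandas analogue is a simple cut into labelled bins, using the age ranges from the example above:

```python
import pandas as pd

df["age_group"] = pd.cut(
    df["Age"],
    bins=[0, 18, 50, 120],
    labels=["child", "adult", "old"],
)
```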

QuadraticFeatures creates new features from combinations of existing ones. If you have a weather column and a day-of-the-week column, it will create a column for every combination of the two, e.g. Sunny+Monday.
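There is no public QuadraticFeatures implementation to show, but on categorical columns the effect is just a crossed feature, for example:

```python
# Cross two categorical columns into one combined feature, e.g. "female_1", "male_3".
df["sex_x_pclass"] = df["Sex"].astype(str) + "_" + df["Pclass"].astype(str)
```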

CountTransformer runs a group-by and count, then joins the result back, which is very helpful for event data.
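The pandas equivalent is a group-by count merged back onto the original frame:

```python
# Count how many passengers share each ticket, then join the count back as a feature.
ticket_counts = df.groupby("Ticket").size().rename("ticket_count").reset_index()
df = df.merge(ticket_counts, on="Ticket", how="left")
```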

Our ImageToFeaturesTransformer uses transfer learning, via the ResNet50 deep learning model, to turn each image into a vector of features which we can consume like any other column. This is very helpful for standard out-of-the-box image learning.
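A minimal open-source version of that transformer, using the Keras ResNet50 model with the classification head removed (the image column of file paths is the one added earlier; images are processed one at a time for clarity, batching is faster in practice):

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image as keras_image

# ImageNet-pretrained backbone; global average pooling yields a 2048-dim vector per image.
backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def image_to_features(path):
    img = keras_image.load_img(path, target_size=(224, 224))
    batch = preprocess_input(np.expand_dims(keras_image.img_to_array(img), axis=0))
    return backbone.predict(batch)[0]

df["image_features"] = df["image"].apply(image_to_features)
```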

RandomProjections is a dimensionality reduction method, one way to combat the curse of dimensionality.
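scikit-learn ships the same idea as ready-made random projection transformers; continuing the previous sketch:

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Project the 2048-dim image features down to 64 dimensions.
projector = GaussianRandomProjection(n_components=64, random_state=0)
reduced = projector.fit_transform(np.stack(df["image_features"].values))
df["image_features_reduced"] = list(reduced)
```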

The FeatureHasher takes every value and hashes it, which helps with very high-cardinality categorical features, a.k.a. the hashing trick.
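scikit-learn’s FeatureHasher does the same trick; the high-cardinality Ticket column is a natural target:

```python
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=32, input_type="string")
hashed = hasher.transform([[ticket] for ticket in df["Ticket"].astype(str)])
df["ticket_hashed"] = list(hashed.toarray())
```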

Finally, we use XGBoost twice, first to generate features by taking the leaves of the deepest level, and second for predictions, using all the features we created.
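With the open-source xgboost package, both steps look roughly like this: apply() returns the leaf index each row lands in per tree, and those indices are fed back in as extra features for the final model:

```python
import numpy as np
from xgboost import XGBClassifier

X = df[["Pclass", "Fare", "family_size", "sex_encoded", "ticket_count"]].fillna(0)
y = df["Survived"]

# First model: its per-tree leaf indices become new features.
leaf_model = XGBClassifier(n_estimators=50, max_depth=4).fit(X, y)
leaf_features = leaf_model.apply(X)  # shape: (n_rows, n_trees)

# Second model: train on the original features plus the leaf indices.
X_stacked = np.hstack([X.values, leaf_features])
final_model = XGBClassifier(n_estimators=100, max_depth=3).fit(X_stacked, y)
df["prediction"] = final_model.predict(X_stacked)
```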

We can continue processing the dataset after modelling and prepare it to be consumed as an API, the way a backend normally would, but since the data scientist knows the data best, it is easier to let them do it here. In SageMaker, by contrast, you would need another Lambda or backend service to adapt each algorithm’s output to the way you want to consume it.

Let’s have a look at some of the manipulation and modelling:

We can pack a few columns into a single response for the inference phase to make it easier to consume on the client side.
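A small sketch of what that packing could look like in practice, assuming the client wants one JSON-friendly object per row (the field names are made up):

```python
# Bundle several columns into a single response object per row, ready to serialise as JSON.
df["response"] = df.apply(
    lambda row: {
        "prediction": int(row["prediction"]),
        "name": row["Name"],
        "forename": row["forename"],
    },
    axis=1,
)
```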

For most machine learning projects, cleaning and feature engineering are where most of the time is spent. This is why having something like our Imputer that can figure out missing values, and transformers which can be run dynamically for testing and debugging, is pure magic.

When serving machine learning through APIs, on the other hand, most of the time is spent building pipelines: many of the transformations need to be redone correctly before inference, and training on new data means rerunning the cleaning procedures.

2. Auto-pipelines

How long would it take to create a pipeline that can provide you with predictions for the Titanic case above? … you guessed it, no time at all, as we only need one line of code!
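The original gist isn’t reproduced here; conceptually it is a single call that packages every transformer recorded during exploration into a deployable pipeline. A hypothetical sketch, continuing the XFrame example from the start of the post (to_pipeline, save, load and predict are assumed names, not the documented BuiltonAI API):

```python
# Hypothetical sketch -- method names are assumptions, not the documented API.
pipeline = xf.to_pipeline(target="Survived")  # collect every recorded transformer plus the model
pipeline.save("titanic_pipeline")             # persist for serving or retraining

# Later, in a server:
pipeline = Pipeline.load("titanic_pipeline")  # Pipeline stands in for the saved-pipeline class
pipeline.predict({"Pclass": 3, "Sex": "male", "Age": 22, "SibSp": 1, "Fare": 7.25})
```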

The resulting pipeline can be saved and loaded in a server, either to retrain with more data or to run inference on any reasonable input format such as JSON, NumPy, pandas and our own XFrame.

Note that the predict() function knows when to use values from “train”, like in the ‘standardize_age’ case, when to calculate on the fly, like in the ‘forename’ case, and how to avoid filtering and cleaning at prediction time, like in the ‘SibSp’ case.

This allows us to quickly train and deploy all kinds of pipelines for classification, regression, clustering and ranking with just a few lines. And we can deploy the pipelines on the same infrastructure. Smart, right?

Deployment:

3. Serverless

As a start-up that focuses on bootstrapping fellow startups, we don’t want to maintain an auto-scaling cluster for each company before they have a significant amount of data and volume. In addition, we can’t afford to provide servers for every company that tests our system for free. Our solution? Serverless!

Serverless is an infrastructure model where the servers are maintained by a cloud provider like AWS, and the software engineer just writes a function, not a server. Most importantly of all, you pay only for what you use, which makes small-volume inference practically free.

We set up our pipelines to be deployed automatically on AWS Lambda, saving us costs and time spent on operational management. It also scales easily with no action on our side, with strong fault tolerance and availability.
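A minimal sketch of what such a Lambda handler could look like, assuming the saved pipeline from the previous section is bundled with the function (the handler signature is the standard AWS Lambda Python entry point; the pipeline loading call is the hypothetical one from above):

```python
import json

# Load the saved pipeline once per container, outside the handler,
# so warm invocations skip the loading cost. (Pipeline.load is hypothetical.)
pipeline = Pipeline.load("titanic_pipeline")

def handler(event, context):
    """Standard AWS Lambda entry point: JSON body in, prediction out."""
    payload = json.loads(event["body"])
    prediction = pipeline.predict(payload)
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}
```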

Only when a client needs a very high volume of predictions do we move them to an auto-scaling server cluster using Docker, with no code changes between projects, datasets, business solutions or pipelines.

4. Scale

As already mentioned, the package is out-of-core, which means we can just run on a bigger single instance without the need for distributed computing (although we are very well parallelized on that single instance). Because the pipeline is easy to save, load and retrain, AWS Batch is a perfect fit for running our training. As a result, if you have more data, we just change the instance type, with no other changes needed.

With the high variety of instance options, we have yet to come across a challenging dataset size for our solution.

Summary

The challenges of extreme data science are similar to those of traditional data science. You always want to reduce the time and resources needed to explore, clean and manipulate data, and to train and deploy models at scale. The solutions must be extremely efficient and robust.

We addressed these issues by creating a unique machine learning library which puts the data scientist in the driver’s seat. Write less code, prevent common mistakes, run best practices and build pipelines behind the scenes to make complete solutions quickly.

The limitations associated with the traditional approach to data science led us to build our own AI platform. It allows us and our users to train and deploy dozens of different pipelines used in our APIs. Our goal at BuiltOn is to democratize AI and level the playing field, giving developers easy access to ready-made building blocks and powerful AI-ready APIs for e-commerce.

If you want to learn more, follow us on Twitter, LinkedIn and Github, have a look at our website, eat healthy and exercise. Maybe pick up a new hobby; after all, you might save a few hours by going extreme.
