Data science infrastructure at HL

Hargreaves Lansdown · Published in HLTechnologyBlog · Aug 12, 2019

Recently, the HL data science/engineering team have been working on freeing our machine learning models from the stuffy confines of laptop RAM. This is a quick overview of some of the tools we have found helpful as we move data science projects from proof-of-concept to production.

Current tech

The company has a number of operational databases from different vendors that support client-facing applications and the website. Some data is centralised in a data warehouse but increasing volumes are taking a toll on performance. Analysts around the business will obtain data from a mixture of the data warehouse and direct queries on (backups of) operational databases. The team have historically performed data science and analysis on local machines but this has become impractical with increasingly large datasets.

Hadoop/Hive/Sqoop

Over the last 12 months, we have been integrating a six-node Hadoop cluster into the data ecosystem, offering a very significant increase in compute power and storage. This benefits a number of areas:

Data pipeline

  1. We use the Sqoop/Hive/Impala/Hadoop stack for ELT (extract, load, transform) processes. Data is initially copied to Hive verbatim with no cleaning or validation applied. Adding a new table to the ingest process is as simple as adding an entry to a config database, and the data will be available the next day. Because no cleaning or validation has been applied, these “raw” imports are not accessible to most users. However, in the event that access is required immediately, we can grant it via a view; the user just has to do their own data cleaning.
  2. Once the data is ingested into Hive, we run each table through a cleaning process. The main issue here is the incompatibility between Hive’s timestamp datatype and source database date fields: when incompatible datatypes are found, Sqoop converts the field to a string, which then needs to be converted back to a Hive timestamp at the other end. This, along with trimming trailing whitespace, accounts for the majority of the data cleaning operations at this stage (a sketch of this step follows the list). We refer to these lightly-cleaned tables as “staged”. At this point the data is suitable for use by analysts, and with a few minor amendments SQL queries written against the staged tables will work as they did in the source systems. This allows analysts to move existing code to the cluster relatively painlessly, and to write queries using tables from multiple source systems (which was almost impossible previously). We also restrict access at this stage to any fields deemed sensitive, particularly those considered PII under the GDPR.
  3. The final layer of data processing combines this cleaned data into “processed” tables — the cleanest and most usable data format we have. Here we move away from the traditional star schema and start constructing very wide tables that are easier for analysts to use. We also feed processed data into our machine learning models and store the output here. The progression of data from raw > staged > processed is intended to improve usability and simplify analytical queries while keeping the pipeline simple and transparent.
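
As a rough illustration of the staging step in item 2, the sketch below issues the kind of HiveQL involved from R via DBI/odbc. The actual pipeline need not run through R, and the DSN, schema, table and column names are all hypothetical:

library(DBI)
library(odbc)

# Hypothetical connection to the cluster (DSN name is illustrative)
con <- dbConnect(odbc::odbc(), dsn = "HiveProd")

# Cast the string dates produced by Sqoop back to Hive timestamps and trim
# trailing whitespace while copying from the raw layer into an existing
# staged table (names are made up for the example)
dbExecute(con, "
  INSERT OVERWRITE TABLE staged.transfers
  SELECT
    transfer_id,
    CAST(transfer_date AS TIMESTAMP) AS transfer_date,
    TRIM(client_ref)                 AS client_ref
  FROM raw.transfers
")

dbDisconnect(con)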

R packages

R is our primary language for data science, although our use of Python is increasing. This is a brief overview of some of the packages we are currently using in production.

H2O

H2O (available for Python and R) is a package, similar to scikit-learn, that simplifies model building. It runs in a JVM, meaning that models trained in Python can be imported into R and vice versa. Models can also be exported as POJOs (Plain Old Java Objects) and MOJOs (Model Objects, Optimized), which (assuming you use Java) improves deployability by removing the need to run a separate R or Python process in production.

The model-building process is very straightforward with most of the heavy lifting done for you. You can tune models by hand or use a grid search function to explore the hyperparameter space automatically and return the best model.

An H2O multiclass classifier using the iris dataset
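
A minimal sketch of what that looks like with the h2o R package; the choice of a GBM and the MOJO export path are illustrative rather than prescriptive:

library(h2o)

# Start a local H2O instance and load iris into it
h2o.init()
iris_h2o <- as.h2o(iris)

# Split into training and test frames
splits <- h2o.splitFrame(iris_h2o, ratios = 0.8, seed = 42)
train <- splits[[1]]
test  <- splits[[2]]

# Train a multiclass GBM predicting Species from the four measurements
model <- h2o.gbm(
  x = setdiff(names(iris_h2o), "Species"),
  y = "Species",
  training_frame = train
)

# Evaluate on the held-out frame and export as a MOJO for use in a JVM
h2o.performance(model, newdata = test)
h2o.download_mojo(model, path = "models/")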

We have incorporated H2O models into parts of the data processing pipeline, with the results being written to Hive tables. Currently this is a synchronous process; something we hope to implement in future is a workflow with a dependency graph so that parts of the daily routine can be parallelised.
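
For flavour, a scoring job of this kind might look something like the sketch below, which loads a previously saved model, scores the latest processed data and writes the predictions back to Hive. The model path, DSN, and table and column names are all hypothetical:

library(h2o)
library(DBI)

h2o.init()

# Load a model previously saved with h2o.saveModel (path is illustrative)
model <- h2o.loadModel("models/churn_gbm")

# Pull the latest processed features from Hive (DSN and table are illustrative)
con <- dbConnect(odbc::odbc(), dsn = "HiveProd")
features <- dbGetQuery(con, "SELECT * FROM processed.client_features")

# Score and write the predictions back alongside the client identifier
preds <- as.data.frame(h2o.predict(model, as.h2o(features)))
dbWriteTable(con,
             Id(schema = "processed", table = "client_scores"),
             cbind(features["client_id"], preds),
             overwrite = TRUE)

dbDisconnect(con)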

Shiny

Shiny is a package that allows you to write full-stack web apps natively in R. We have found this particularly useful for rapidly prototyping internal tools, as data collection, modelling, UI design and back-end logic can all be written seamlessly in a language that all our data scientists are familiar with. To date we have used it for wall-mounted dashboards and for tools that surface information to the customer helpdesk.
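
A whole Shiny app fits in a single script. The sketch below is purely illustrative; the input, plot and simulated data are made up:

library(shiny)

# UI and server logic live together in one R script
ui <- fluidPage(
  titlePanel("Transfer times (illustrative)"),
  selectInput("provider", "Provider", choices = c("Provider A", "Provider B")),
  plotOutput("transfer_hist")
)

server <- function(input, output, session) {
  output$transfer_hist <- renderPlot({
    # A production app would query Hive/Impala here; we simulate some data
    days <- rexp(500, rate = 0.2)
    hist(days, main = paste("Days to transfer:", input$provider), xlab = "Days")
  })
}

shinyApp(ui, server)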

Example: a proof-of-concept Shiny dashboard (showing fake data) designed to provide information about asset transfer times to helpdesk staff, allowing them to better manage client expectations.

Plumber

Plumber is an R package for building REST APIs. It is beautifully simple, even more so (in my opinion) than Python’s FlaskAPI package. You can spin up a REST API with multiple routes, parameter parsing and input validation in a few lines of code. The major advantage over using Shiny is that we remove the requirement to use R — any client application capable of making REST calls can consume the data.

Below is an example of a plumber route definition. The #' comments both document the function and control its behaviour. In this case we have used a specific JSON serialiser and specified that this route only responds to GET requests to /.

Plumber syntax is extremely user-friendly
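
A minimal route file along those lines (the greeting itself is just a placeholder):

# plumber.R
library(plumber)

#' Return a simple greeting
#' @serializer unboxedJSON
#' @get /
function() {
  list(response = "hello, world")
}

Serving it locally is then a one-liner: plumber::plumb("plumber.R")$run(port = 8000).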

Because single values in R are represented as character vectors of length one, the “unboxedJSON” serialiser is used to prevent them being returned as single-item arrays, i.e.

{ 
"response": "hello, world"
}

rather than

{ 
"response": ["hello, world"]
}

Home-grown R packages

As the analytics ecosystem has evolved we have found it helpful to write a number of internal R packages to handle common tasks like database connectivity and analysing marketing campaign performance. Internal package development is made much easier by the drat and devtools packages. Drat handles repository management and devtools makes developing, testing, building and documenting packages extremely simple. The comprehensive R Packages by Hadley Wickham is an excellent guide to authoring your own.
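
As a rough sketch of that workflow (the package name, build artefact and repository path are hypothetical):

library(devtools)

# From the package's project directory: document, test, check and build it
document()      # regenerate roxygen documentation and NAMESPACE
test()          # run the testthat suite
check()         # R CMD check
pkg <- build()  # build the source .tar.gz and return its path

# Publish the built package to the internal drat repository
drat::insertPackage(pkg, repodir = "/srv/drat-repo")

Once the internal repository is added to their repos setting, analysts can install the package with a normal install.packages() call.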

Conclusion

The tools mentioned above have made life much easier for us but a number of challenges remain. On our to-do list at the moment are:

  • Build an automatic model monitoring, scoring and re-training framework.
  • As we build new features for the business we will need to put more effort into scalability, monitoring and fault-tolerance.

We are constantly evolving and aim to stay flexible so we can incorporate new tools as quickly as possible (sparklyr and Ludwig are two others that we have looked at, and we are experimenting with Python). Our toolbox will look very different in six months' time!

We take client privacy very seriously.
