Caspian: a Serverless, Self-Service Data Pipeline using AWS and ELK Stack

Published in Onfido Product and Tech · 11 min read · Apr 17, 2018

At Onfido, we are using data to improve our services, solve challenging problems using machine learning, and serve our customers efficiently. We have built a robust, secure and scalable pipeline to handle our data and help us to achieve the above goals.

Why do we need that?

Research: Machine learning is at the heart of our products. From extracting document information to identifying fraudulent documents, we are leveraging machine learning to increase the automation and accuracy of our services. The most crucial part of creating an ML model is supporting data labelling and being able to easily search, query and download the relevant data for training various models.

Privacy/Security: Privacy is crucial within the company. We are dealing with sensitive user data and we care about it. One of the purposes of having such a unified data pipeline is to make sure that all the data used across the engineering team is stored in a secure manner. Therefore, data has to be encrypted both at rest and in transit.

Internal Metrics: Each team develops new microservices or deploys a new version of a service every day. We need a way to easily expose different metrics/data and visualise them to be able to pinpoint any limitations and errors…

Written by Hamed Saljooghinejad