Machine Learning With Highly Sensitive Data

Viktor Bachratý
Jumio Engineering & Data Science
9 min read · Aug 21, 2020

Most machine learning models in academia are trained and evaluated on publicly available datasets. For companies working in the finance sector or in identity verification, this isn’t always possible. Identity verification transactions are highly sensitive, because each transaction contains one or more snapshots of the client’s passport, national ID or other photo ID, along with a selfie. This constant stream of images provides a great opportunity to develop computer vision algorithms by training machine learning models in several domains such as data extraction, biometrics and fraud detection. However, the opportunities provided by this data carry a cost: accountability to both government and industry regulators.

Companies working with sensitive data have to comply with a number of regulations. PCI DSS (Payment Card Industry Data Security Standard) certification governs environments where credit card data is processed. ISO 27001 defines best practices for managing sensitive data. Companies operating in the European region must comply with the GDPR (General Data Protection Regulation). So what are the consequences of being compliant, and how is it possible to train models in such environments?

Data Access

Machine learning needs exist in tension with the requirements that come from best practices for managing sensitive data. ML engineers and data scientists require maximum flexibility to carry out their work without compromising security or customer privacy. They need to access production images, view them and run experiments, but in a strictly controlled manner. The environment they work in has to enable exploratory data analysis, quick prototyping and visual inspection of images while maintaining strong access control, audit logging and physical security standards.

Access controls. Only those people who need access to sensitive data can obtain it. As a rule of thumb, engineers working on machine learning infrastructure and tooling for ML have no access to data. Multi-factor authentication is mandatory for ML engineers/data scientists to access any of the production data/infrastructure.

Audit logging. All access to images and metadata is logged, tracking who accessed a given file, when and from where. Data must not leave the environment where access is controlled and logged (meaning no copies on laptops or other external storage media).
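
To make this concrete, here is a minimal sketch of what audited access can look like in practice; the log sink, field names and the fetch callable are illustrative, not Jumio’s internal tooling.

```python
import getpass
import json
import logging
import socket
from datetime import datetime, timezone
from typing import Callable

audit_logger = logging.getLogger("audit")
audit_logger.addHandler(logging.FileHandler("audit.log"))  # illustrative log sink
audit_logger.setLevel(logging.INFO)

def audited_read(object_key: str, fetch: Callable[[str], bytes]) -> bytes:
    """Fetch an image while recording who accessed it, when and from where."""
    audit_logger.info(json.dumps({
        "event": "image_access",
        "object": object_key,
        "user": getpass.getuser(),
        "host": socket.gethostname(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }))
    return fetch(object_key)  # any storage getter provided by the platform
```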

Physical security. Computers connected to the network that permits direct access to images are in a special room with extra physical security. This network does not allow public internet access. Additional security measures include cameras and fingerprint readers.

Exploratory data analysis. ML engineers and data scientists require the ability to quickly infer basic statistical properties of the input data stream to understand data modality. This includes analysis of image properties (e.g. distribution of face sizes in selfie images) and metadata (e.g. structured data and distribution of document types).
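
As a rough sketch of this kind of analysis, assuming face bounding boxes and document types are already available as metadata columns (the file and column names below are made up):

```python
import pandas as pd

df = pd.read_parquet("selfie_metadata.parquet")  # hypothetical metadata export

# Distribution of face sizes relative to the whole selfie image
face_area_ratio = (df["face_w"] * df["face_h"]) / (df["img_w"] * df["img_h"])
print(face_area_ratio.describe(percentiles=[0.05, 0.5, 0.95]))

# Distribution of document types in the incoming stream
print(df["document_type"].value_counts(normalize=True).head(10))
```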

Computer vision. ML engineers and data scientists need to visually inspect images on their computer displays. Images are an information-rich data format that cannot be described in sufficient detail using only their statistical properties. Analyzing outliers is impossible without viewing the images.

Prototyping. ML engineers and data scientists often need to quickly build prototypes and evaluate them using production data. This allows them to compare approaches and to decide which one is more likely to work or better solves the business problem.

Dataset Management

For training and evaluating ML models, datasets have to be built from the unorganized data lake containing all data from the incoming data streams. The aforementioned data access restrictions, combined with further requirements that apply to sets of images (as opposed to single transaction instances), shape our dataset management policies.

Encryption. An indisputable requirement is to encrypt all data both in transit and at rest. Nobody wants to find out that decommissioned data volumes contain their personal data or that this data was sent unencrypted over the wire. This can be handled transparently so that ML engineers and data scientists don’t have to deal with additional complexity. It is not just a practical requirement but also a subject of regular audits for PCI DSS certification.
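
As an example of keeping this transparent, a thin wrapper around the storage client can enforce encryption at rest on every write. The sketch below assumes an S3-backed data lake with a KMS key, which may not match the actual stack; the bucket and key alias are illustrative.

```python
import boto3

s3 = boto3.client("s3")  # transfers go over TLS, covering encryption in transit

def put_encrypted(bucket: str, key: str, body: bytes) -> None:
    """Write an object with server-side encryption enforced."""
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ServerSideEncryption="aws:kms",   # encrypt at rest with a managed key
        SSEKMSKeyId="alias/ml-datasets",  # illustrative key alias
    )
```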

Dataset retention. In ML, we often work with derived data that potentially contains personally identifiable information (PII). Determining the presence of PII can be very difficult: random image crops, for example, may contain a portion of the ID holder’s address. In some cases, PII is not even directly visible. An embedding or weights of a generative model could be used to extract memorized training data and thus PII. To reduce the risk of accidentally storing invisible PII, derived data must have a defined retention period.
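
A minimal sketch of how such a retention period can be enforced, assuming derived artifacts are tracked in a simple registry with creation timestamps (the 90-day window and the registry format are assumptions):

```python
import json
import os
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # assumed policy, not an official figure

def purge_expired(registry_path: str) -> None:
    """Delete derived artifacts (crops, embeddings, ...) past their retention window."""
    now = datetime.now(timezone.utc)
    with open(registry_path) as f:
        entries = [json.loads(line) for line in f]  # one JSON record per artifact
    for entry in entries:
        created = datetime.fromisoformat(entry["created_at"])  # ISO 8601 with timezone
        if now - created > RETENTION and os.path.exists(entry["path"]):
            os.remove(entry["path"])
```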

Deletion requests. GDPR requires that all deletion requests coming from clients are processed in a timely manner (this is the right to erasure in GDPR parlance). This means it must be possible to trace all copies of an image in all datasets starting from the originating transaction.

Client consent. Only data with client consent may be used for ML. At Jumio, this rule is strictly adhered to: only data from customers who have given their consent enters the environment where training jobs are executed. This guarantees that we are not accidentally building and training models on user data without proper consent.

Data siloing. Customers may grant consent to use their data for training models that are used for them but not for other customers. This is another reason to maintain traceability from input transactions through datasets to trained models.
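
Taken together, consent and siloing boil down to a filter that every transaction has to pass before it can enter a training dataset. The field names below (consent_ml, training_scope, customer_id) are illustrative, not the real schema:

```python
from typing import Iterable, Iterator

def eligible_for_training(transactions: Iterable[dict], customer_id: str) -> Iterator[dict]:
    """Yield only transactions that may be used to train models for this customer."""
    for tx in transactions:
        if not tx.get("consent_ml"):
            continue  # never train on data without explicit consent
        scope = tx.get("training_scope", "own")  # "own" = usable only for this customer
        if scope == "shared" or tx["customer_id"] == customer_id:
            yield tx
```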

Model Productionalization

The ML life cycle does not end once a model is trained. Running models in production systems brings additional challenges that may not be visible when working with a single static dataset. Compliance puts very strong constraints on release management and access to production systems, and these have to be addressed when designing ML infrastructure.

Reproducibility. Key to providing a high-quality service is the ability to easily retrain a model with updated versions of the dataset, which means dataset versioning and automation for model retraining are needed. Reproducibility is also needed to exactly recreate, in the development environment, decisions made by production models, so we can understand why certain decisions were made and analyze cases where the model does not behave as expected.
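
One simple way to capture this is to pin everything a retraining run depends on in a single immutable record; the fields and values below are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingRun:
    model_name: str
    dataset_id: str       # an immutable dataset version, never "latest"
    dataset_version: int
    code_revision: str    # git commit of the training code
    random_seed: int

run = TrainingRun(
    model_name="doc-classifier",
    dataset_id="id-documents",
    dataset_version=17,
    code_revision="a1b2c3d",
    random_seed=42,
)
```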

Online monitoring. Governments regularly update the national ID and passport documents on which our ML models make decisions. The world is dynamic, while a dataset is only a static snapshot from the past. A marginal document type (i.e., a document type we do not verify in great numbers) might grow and become more important over time. On top of all this, identity verification operates in a world with adversaries, where fraudsters are tirelessly working on tricking the system. These factors require robust monitoring of our production models to detect concept drift.
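
One common building block for such monitoring is the population stability index (PSI), which compares the distribution of a feature or score at training time with what the model sees in production. The sketch below uses illustrative bin counts and the usual rule-of-thumb threshold; it is not the actual monitoring stack.

```python
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """Population stability index between a training snapshot and live traffic."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e, _ = np.histogram(expected, bins=edges)
    o, _ = np.histogram(observed, bins=edges)
    e = np.clip(e / e.sum(), 1e-6, None)  # avoid division by zero / log(0)
    o = np.clip(o / o.sum(), 1e-6, None)
    return float(np.sum((o - e) * np.log(o / e)))

# A PSI above ~0.2 is a common rule of thumb for "investigate this distribution".
```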

QA. All releases to production have to be quality assessed from both ML and engineering viewpoints. This limits how quickly experimental workflows can be iterated on and evaluated in the production environment, but it is important for data security and integrity.

Release documentation. Releases are documented in order to trace model versions, the serving application and underlying container versions, and the datasets used for model training and evaluation. This is a requirement for compliant release management and also helps when analyzing suspicious decisions made by models in production.

Environments

The guiding philosophy in this design is decoupling the software development life cycle from ML model development and creating secure environments with centrally managed data access for model training and evaluation. In the software development realm, a standardized three-stage pipeline is used with development, testing (QA) and production environments. The same environments can’t be shared for ML development for the following reasons:

  • ML depends on production data that is not available in software engineering staging and development environments. These environments are not considered secure enough to hold such sensitive data.
  • ML development is a very iterative process which requires even more prototyping than software development. It can’t be iterated fast enough given the complexity of rollouts in the PCI-compliant production environment, and writing prototypes in production-quality code is an inefficient use of an ML engineer’s time.
  • Hands-on access to production data is needed (e.g., exploratory analysis, computer vision). It is not possible to give this level of access to ML engineers in a production environment.
  • Even if a workaround was found for the above issues, it is too risky to run experimental code in the production environment.
  • ML depends on a stable infrastructure and tooling for running training, labeling and evaluation jobs (from an engineering perspective, these are production-level services).

Software development vs. ML development life cycle

The solution to this involves building separate staging and development environments for the machine learning teams. The ML training environment is the playground for ML engineers and data scientists, where they may test any crazy ideas they have on production data. Only data with customer consent enters this playground.

In the ML staging environment, silent workflows are executed — alternative variants of the production workflows that only output logs. This allows for multiple challengers (alternatives) to the champion (primary version) running in production. Unlike the engineering staging environment, the ML environment contains real production data, allowing outputs of the challenger workflows to be directly compared to the current production champion without corrupting production data. This is also the place where ML engineers analyze model decisions in case of suspicious behavior.
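
In pseudocode terms, the pattern looks roughly like this: only the champion’s decision reaches the live workflow, while challengers merely log theirs for later comparison (the interfaces here are illustrative):

```python
import logging
from typing import Callable, Dict

silent_log = logging.getLogger("silent_workflows")

def decide(tx: dict,
           champion: Callable[[dict], str],
           challengers: Dict[str, Callable[[dict], str]]) -> str:
    decision = champion(tx)
    for name, model in challengers.items():
        # Challengers run on the same real data but only produce logs.
        silent_log.info("tx=%s challenger=%s decision=%s champion=%s",
                        tx["id"], name, model(tx), decision)
    return decision  # only the champion's output affects production
```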

From a software and infrastructure engineering perspective, this ML environment is stable, containing production-grade infrastructure and services for managing experiments, training jobs, etc. This guarantees the safety of the data in the environment and the reliability of computational outputs.

Dataset Management Service

This service plays a central role in the overall architecture. It regulates data flow from the production environment to the ML environments. It provides strong guarantees that we never train on data without customer consent. The secondary (but equally important) role of the service is to process deletion requests.

Data querying. This service automates dataset building. The entire data lake can be queried and transactions filtered to select a well-balanced dataset for the given task.
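
As an illustration, a dataset-building query might filter the lake and cap each document type so the resulting dataset stays balanced; the transaction fields and filters below are assumptions:

```python
from collections import defaultdict
from typing import Iterable, List

def build_balanced(transactions: Iterable[dict], per_class_cap: int = 1000) -> List[dict]:
    """Select a balanced dataset for a document-classification task."""
    buckets = defaultdict(list)
    for tx in transactions:
        if tx.get("consent_ml") and tx.get("country") == "US":  # example filter
            buckets[tx["document_type"]].append(tx)
    return [tx for txs in buckets.values() for tx in txs[:per_class_cap]]
```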

Dataset versioning. Datasets are not just collections of images, but immutable first-class objects. Immutability aids reproducibility. Datasets support a set of well-defined operations (e.g., merging, joining, filtering), and the result of each operation is a new dataset version. This prevents the confusion that could arise from evaluations on modified datasets.
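
A simplified model of this behaviour, where every operation returns a new version rather than mutating the old one (a sketch, not the actual service API):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class DatasetVersion:
    name: str
    version: int
    items: FrozenSet[str]  # transaction IDs

    def filter(self, keep: Callable[[str], bool]) -> "DatasetVersion":
        kept = frozenset(i for i in self.items if keep(i))
        return DatasetVersion(self.name, self.version + 1, kept)

    def merge(self, other: "DatasetVersion") -> "DatasetVersion":
        return DatasetVersion(self.name, self.version + 1, self.items | other.items)
```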

Deletion requests. The dataset management service is responsible for processing customer deletion requests. Such a request deletes all images and PII metadata associated with the given transaction from all datasets. This is the only exception to dataset immutability. It is an unfortunate requirement from the data scientist’s viewpoint, and one they have to keep in mind when evaluating models.
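
Conceptually, handling such a request means walking from the originating transaction to every dataset that references it; the index and storage interfaces below are hypothetical:

```python
def process_deletion_request(transaction_id: str, dataset_index, storage) -> None:
    """Remove a transaction's images and PII metadata from all datasets."""
    for dataset in dataset_index.datasets_containing(transaction_id):
        for obj in dataset.objects_for(transaction_id):  # images and PII metadata
            storage.delete(obj)
        dataset.remove(transaction_id)  # the one exception to dataset immutability
```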

Data Viewing and Labeling

Most state-of-the-art computer vision models are trained in a supervised manner. Having an abundance of labeled data is key for training these models. For this purpose, Jumio maintains a large pool of labeling specialists to aid in building datasets. From the perspective of the data scientist, executing a labeling job is well automated: they create a dataset, define the labeling task, explain the pitfalls (e.g., what to look for in hard cases), then submit the task to a processing queue. Because labeling jobs are not outsourced to third parties, very high standards for privacy and security are maintained: images never leave the secured environments and are only viewed in special processing rooms.

Labeling task from the perspective of an ML engineer

These processing rooms maintain hardened security, which enables accessing and viewing images. They have biometric access controls (i.e., fingerprint readers) and meet other stringent requirements, such as no see-through windows. Processing rooms in OPS centers are used by labeling specialists to process data labeling tasks. In engineering locations, ML engineers can use them to safely work on exploratory analysis, rapidly prototype models, debug silent workflows, and so on.

ML engineer accessing PII data using a terminal in a processing room

Conclusion

In some sensitive domains it may seem very problematic to take advantage of otherwise extremely valuable data. ML engineers have to be very careful when tapping the opportunities that PII data provides for training machine learning models and tackling challenging computer vision problems. In this blog post, we have demonstrated that it is possible to design infrastructure and workflows that do not compromise security or put user data at risk, while enabling the flexibility machine learning engineers and data scientists need to carry out their daily work developing predictive models.
