Detecting and Redacting Identifiers in Datasets at Workday

Jeyabarani Seenivasagam
Workday Technology
Published Oct 28, 2022

Machine Learning is the backbone for building innovative features for the Workday Product Suite. Real-world data is the lifeblood for training and improving ML models and keeping them relevant. However, real-world data can often contain sensitive information.

Workday takes various approaches when applying pseudonymization and scrubbing methods to datasets. These approaches, which are use-case specific and evolve with advances in technologies, reduce the risk of data being associated with specific individuals. In this blog, we will describe one approach that Workday takes, along with the key technical components that comprise the service.

Datasets & Identifier Recognition

The datasets used by our Machine Learning technologies are diverse in structure. We typically have a mix of highly curated tabular data, where the content is generally known, and free-form text, where the content is generally unknown. The key to effectively scrubbing free-form text is to recognize identifiers reliably before applying any scrubbing process or technique.

Examples of identifying entities include personal names and addresses.

We built an identifier tagging framework that uses a combination of off-the-shelf solutions such as Stanford NER, a pre-trained uncased BERT-based NER transformer model, custom regexes, and other methods to effectively identify and tag entities in text. Once we identify relevant entities, we employ one of the following operations to effectively scrub the data.
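To illustrate the idea, here is a minimal sketch of how multiple pluggable taggers might be combined into a single entity list. The `Entity` type, the email regex, and the overlap-resolution rule (earlier taggers win) are illustrative assumptions, not Workday's actual framework.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    start: int
    end: int
    label: str

def regex_tagger(text):
    """Tag entities matchable by simple patterns (illustrative: emails)."""
    return [Entity(m.start(), m.end(), "EMAIL")
            for m in re.finditer(r"[\w.+-]+@[\w-]+\.[\w.]+", text)]

def combine_taggers(text, taggers):
    """Run every tagger and merge spans, preferring earlier taggers on overlap."""
    entities = []
    for tagger in taggers:
        for ent in tagger(text):
            # Keep a span only if it does not overlap an already-accepted one.
            if not any(ent.start < e.end and e.start < ent.end for e in entities):
                entities.append(ent)
    return sorted(entities, key=lambda e: e.start)

tags = combine_taggers("Contact jane@example.com today.", [regex_tagger])
```

In the real framework, model-based taggers such as Stanford NER or a BERT NER model would simply be additional callables in the `taggers` list.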

Note that as part of our approach to scrubbing, we employ a host of measures to reduce the likelihood of any data being linked to individuals, as discussed above. The controls we’ve described so far are simply some of the technical controls we employ. We employ additional techniques and contextual controls on the data and the processing environment to bolster the methods described in this blogpost.

Scalable Scrubbing

We need to run our scrubbing process on large datasets stored in AWS S3 buckets both on a daily and on-demand basis. The raw data in S3 is stored as Parquet files. We built a generic pluggable Apache Spark-based scrubbing framework that can be configured to tag and scrub identifiers in large datasets at scale. There are various kinds of datasets in our ecosystem each with their own schema. The framework allows for custom identifier tagging using pluggable tagger implementations, custom scrubbing and column specific scrubbing operations through a scrubbing specification file.

For the nightly runs, we use a cron-triggered scrubbing flow to spawn an AWS EMR cluster and run scrubbing jobs on the day's raw dataset. The output is stored as JSON files in a separate S3 location, as per the specification.

In the flow above, it's important to note that scrubbing requires a mix of technical and contextual controls to limit the risk of identification. Below is a sample model specification (spec) file corresponding to an input schema. As an example, the spec file below would cause the nested column competency_categories.display_id to be scrubbed and competency_categories.instance_id to be hashed.

{
  "spec": {
    "competency_categories": {
      "array": {
        "display_id":  { "value": "scrub" },
        "instance_id": { "value": "hash" }
      }
    },
    "competency_rating_behaviors": {
      "array": {
        "display_id":  { "value": "scrub" },
        "instance_id": { "value": "hash" }
      }
    },
    "competency_ratings": {
      "array": {
        "display_id":  { "value": "keep" },
        "instance_id": { "value": "tenant_hash" }
      }
    },
    "competency": {
      "array": {
        "display_id":  { "value": "scrub" },
        "instance_id": { "value": "tenant_hash" }
      }
    },
    "competency_name": { "value": "scrub" }
  }
}
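A simplified, non-Spark sketch of how such a spec might drive per-column operations is shown below. The operation implementations are assumptions for illustration: `hash` as SHA-256, `tenant_hash` as SHA-256 salted with a per-tenant key, and `scrub` as a placeholder replacement; the actual operations are internal to Workday's framework.

```python
import hashlib

def apply_op(op, value, tenant_key="demo-tenant"):
    """Hypothetical implementations of the spec's per-value operations."""
    if op == "keep":
        return value
    if op == "scrub":
        return "[SCRUBBED]"
    if op == "hash":
        return hashlib.sha256(value.encode()).hexdigest()
    if op == "tenant_hash":  # assumed: hash salted with a per-tenant key
        return hashlib.sha256((tenant_key + value).encode()).hexdigest()
    raise ValueError(f"unknown op: {op}")

def scrub_record(spec, record, tenant_key="demo-tenant"):
    """Walk a nested record and apply the operation configured for each field."""
    out = {}
    for field, node in spec.items():
        if field not in record:
            continue
        if "array" in node:  # array of structs: scrub each element
            out[field] = [scrub_record(node["array"], elem, tenant_key)
                          for elem in record[field]]
        elif "value" in node:
            out[field] = apply_op(node["value"], record[field], tenant_key)
    return out

spec = {"competency_categories": {"array": {
            "display_id": {"value": "scrub"},
            "instance_id": {"value": "hash"}}},
        "competency_name": {"value": "scrub"}}
record = {"competency_categories": [{"display_id": "Leadership", "instance_id": "42"}],
          "competency_name": "Coaching"}
scrubbed = scrub_record(spec, record)
```

In the production framework, the same spec-driven walk runs as Spark transformations over Parquet partitions rather than over single Python dictionaries.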

As part of automating the steps needed to onboard new datasets for scrubbing, we implemented a model spec generation tool that takes in a Parquet dataset and deduces a default scrubbing spec based on the types of the fields. As mentioned earlier, a dataset is typically a mix of curated columns and free-form text. Since we can easily deduce the semantic information for curated columns, we can decide to keep, drop, or hash the entire column at specification-creation time. For other columns, we employ runtime identifier tagging of the text and configure operations such as scrub, hash, and drop at both the column and entity-type (name, address, etc.) levels.
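The spec generation step can be sketched as a walk over the schema that maps field types to default operations. The type-to-operation mapping below is an assumption for illustration; the real tool's defaults, and its handling of actual Parquet schemas, are not described in this post.

```python
# Assumed defaults: numeric/boolean fields are low-risk, strings get scrubbed.
DEFAULT_OPS = {
    "long": "keep",
    "double": "keep",
    "boolean": "keep",
    "string": "scrub",
}

def generate_spec(schema):
    """Produce a default scrubbing spec from a simplified schema description."""
    spec = {}
    for field, ftype in schema.items():
        if isinstance(ftype, dict):  # nested array-of-struct field
            spec[field] = {"array": generate_spec(ftype)}
        else:
            # Unknown types default to "drop", the conservative choice.
            spec[field] = {"value": DEFAULT_OPS.get(ftype, "drop")}
    return spec

schema = {"competency_name": "string",
          "competency_categories": {"display_id": "string", "instance_id": "string"}}
default_spec = generate_spec(schema)
```

The generated spec is then hand-edited into a candidate specification, as described in the tuning flow below.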

Apart from specifying the operations, another common task is fine-tuning the scrubbing specification to improve the precision and recall of the process. In the following section, let's look at the development flow for tuning the scrubbing specification.

Tuning the Spec

Let us look at the regular development flow for tuning the scrubbing specification:

  1. Raw input data is present in S3.
  2. We run the specification generation tool “print-spec-for-schema” to generate the default scrubbing specification.
  3. We manually review the specification, edit it to customize for the input dataset and come up with the candidate specification.
  4. We run a script “prepare-evaluation-data” to generate a random sample of the identifier dataset, run scrubbing using the candidate specification and prepare a labeling task for human-in-the-loop annotation.
  5. We use an annotation service such as Amazon SageMaker Ground Truth to label the identifier tags in the input sample and evaluate the performance using precision and recall metrics.
  6. If we find that the metrics do not meet business-specific thresholds, and the performance needs to be improved, we go to step (3) and iterate.

Here are the quality metrics we used to evaluate the performance of the scrubbing model specification.

Scrubbing Beyond S3

By building a generic scrubbing process and implementing the data sources and sinks in a modular fashion, we were able to leverage the scrubbing methods for non-S3 sources such as DBMS tables. Apache Spark comes with JDBC integration, which we used to build a scrubbing pipeline that scrubs columns in a DBMS table and stores them in an auxiliary table.
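The read-scrub-write shape of that pipeline can be sketched with Python's built-in sqlite3 standing in for the Spark JDBC source and sink; the table names, and the trivial `scrub_text` placeholder for the identifier tagging step, are illustrative assumptions.

```python
import sqlite3

# In-memory stand-in for the DBMS reached over JDBC in the real pipeline.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feedback (id INTEGER, comment TEXT)")
conn.execute("INSERT INTO feedback VALUES (1, 'Great work by Jane Doe')")

def scrub_text(text):
    """Placeholder for the identifier tagging + scrubbing step."""
    return "[SCRUBBED]"

# Read the source column, scrub it, and write to an auxiliary table.
conn.execute("CREATE TABLE feedback_scrubbed (id INTEGER, comment TEXT)")
rows = conn.execute("SELECT id, comment FROM feedback").fetchall()
conn.executemany("INSERT INTO feedback_scrubbed VALUES (?, ?)",
                 [(rid, scrub_text(c)) for rid, c in rows])
result = conn.execute("SELECT comment FROM feedback_scrubbed").fetchone()[0]
```

In the Spark version, the read and write sides would instead use Spark's JDBC data source, with the scrubbing applied as a DataFrame transformation in between.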

Conclusion

As Workday is making strides innovating across multiple machine learning projects, the role of the scrubbing service is paramount.

The scrubbing service at Workday contains multiple approaches, is designed from the ground up to process big data at scale, has the ability to connect with different types of data sources, and has an automated data processing pipeline. We have described one aspect of this approach in this blogpost. As we continue to innovate and improve our NER models, this service has the potential to play a key role in Workday’s multi-faceted data scrubbing strategy in the future.
