Data-as-a-Service — a Global Data Platform
By Erol Guney, Principal Engineer/Data-as-a-Service Architect, Workday
The journey to build a global data platform started with a relatively simple business statement: “Workday will be unveiling a new Data-as-a-Service offering, for which benchmarking will be the first service.” The benchmarking service lets customers select metrics that they wish to contribute to the service. In return, they gain access to benchmarking data on the same metrics from peer groups.
Workday customers can answer analytical questions such as “How does my company’s turnover rate compare to small technology companies?” or “What’s the median profit margin of similar-size companies in my industry and location?”
Customers decide not only whether to participate in the benchmarking solution, via a legal agreement, but also which categories to subscribe to in order to access the respective benchmarks. Customers can view benchmarks for the metrics to which they contribute their own data.
We were challenged to build a data sharing platform that could satisfy these requirements when our fearless leader (then SVP of User Experience and now CTO) Joe Korngiebel announced Data-as-a-Service based on customer demand at Workday Rising in September of 2016.
The system that can satisfy these requirements must have these properties:
- A global data architecture (single schema) for aggregate (de-identified) data.
- Scalable and secure cloud data storage.
- Rationalized definitions across customers (a common taxonomy).
This is possible for Workday because we are on a single codeline. If customers were on different software versions, this would have been a near-impossible task. Instead, the underlying building blocks were well suited to the global analytical system that this use case required. That does not mean that creating a single global analytical system that handles commingled data was free of challenges. Maintaining architectural flexibility without introducing too much complexity, addressing privacy concerns, and meeting security requirements are some of the challenges that we have to keep at the top of our priority list at all times.
The diagram below depicts the Data-as-a-Service (DaaS) architecture in a layered structure. This layering standardizes the data collection and data access controls.
The warehouse sits on top of Amazon Redshift, which enables the warehouse to be highly scalable and to meet growing demand. In this sense, the Data-as-a-Service Platform uses scalable infrastructure components of a public cloud, such as load balancers and managed services. The cluster can also be programmatically resized at any time as the data size grows. In this architecture, Workday transactional data is the source of truth for any data that is contributed to the warehouse. The system components have been split into two main categories in the producer-consumer pattern: Push components, and Pull components.
Asynchronous Data Contribution
The Push components consist of all software subsystems that we need to curate, de-identify, validate, and contribute the customer data for the data sets that customers have opted into. The push job runs asynchronously on each customer tenant that has opted into at least one data set, and contributes only the data sets that the tenant has opted into. The data collection frequency and periodicity are governed by each data set. This model builds resilience into the architecture, and handles:
- Privacy, Ethics and Compliance (PEC) requirements, allowing customers to opt out and be forgotten.
- Disaster recovery, including long-running jobs and intermittent job failures.
- Schema changes and new data sets with every Workday deployment, and bug fixes.
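The push flow described above (curate, de-identify, validate, contribute, and only for opted-in data sets) can be sketched roughly as follows. This is a minimal illustration, not Workday's implementation: the field names in `IDENTIFYING_FIELDS`, the `Contribution` record, and the validation rule are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical field names; the article does not describe the real contribution schema.
IDENTIFYING_FIELDS = {"tenant_name", "employee_id", "employee_name"}


@dataclass(frozen=True)
class Contribution:
    data_set: str
    period: str      # e.g. "2018-03"
    metrics: dict


def de_identify(record: dict) -> dict:
    """Drop fields that could identify a tenant or a worker."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def validate(record: dict) -> bool:
    """Accept only non-empty records whose remaining values are numeric (assumed rule)."""
    return bool(record) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in record.values()
    )


def push_job(opted_in: set, curated: dict, period: str) -> list:
    """Build contributions only for the data sets this tenant has opted into."""
    contributions = []
    for data_set in sorted(opted_in):
        record = curated.get(data_set)
        if record is None:
            continue  # nothing curated for this period; a later run can retry
        cleaned = de_identify(record)
        if validate(cleaned):
            contributions.append(Contribution(data_set, period, cleaned))
    return contributions
```

Because each run derives its output only from the opt-in set and the curated data, a failed or interrupted job can simply be run again without special recovery logic.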
Realtime Reporting Requirements
The Pull components consist of the runtime query request, query parameters, privacy controls, and the query DSL (domain-specific language). Workday applications connect to the DaaS Data Warehousing System to issue real-time analytical queries using the Workday microservice for DaaS. Customers can run reports within their Workday application against the global warehouse with the dimensional slicing and filtering they choose. The query request is built and executed in real time without any caching.
The microservice interfaces with other Workday services that are only within the Workday network, and with Amazon Simple Storage Service and Redshift services that are within the Workday Amazon VPC (Virtual Private Cloud). This API-driven access allows any Workday application to interact with the data in the DaaS Platform within Workday data centers, providing flexibility to other Workday services. Several API layers can be used to interact with DaaS data sets:
- Native REST call: Service-to-service access.
- Application layer access: Internal Workday XpressO API.
- Framework services access: Low level Java API access.
The Data Ingestion (Rebuild the World)
The microservice periodically monitors the buckets for their corresponding storage locations, and decides when to trigger the rebuild process. The rebuild process is atomic, in that while the rebuild is happening the queries can still be served against the existing data with read consistency. Once the new version of the data set is published, any new query then uses the latest published data set.
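One common way to get this kind of atomic publish is to build the new version off to the side and then swap a single reference. The sketch below illustrates that idea in memory; the `VersionedDataSet` class and its methods are hypothetical names for the example, and the article does not say how the actual warehouse implements the swap.

```python
import threading


class VersionedDataSet:
    """Readers always see a fully built version; publishing swaps one reference."""

    def __init__(self, rows):
        self._lock = threading.Lock()
        self._version = 1
        self._rows = tuple(rows)

    def snapshot(self):
        """Return (version, rows) for the currently published version."""
        with self._lock:
            return self._version, self._rows

    def rebuild(self, build_rows):
        """Build the next version without holding the lock, then publish it in one step."""
        new_rows = tuple(build_rows())  # may be slow; readers are not blocked
        with self._lock:
            self._version += 1
            self._rows = new_rows
            return self._version
```

A query that took its snapshot before the publish keeps reading the old, immutable tuple, while any query that starts afterwards sees the new version.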
At every rebuild, a rolling 12 months of data sets are published for each category. In this scenario, if a new customer opts into a data set, that customer will be contributing 12 months of their data (if it exists). Similarly, if a customer opts out of a data set, all of their contributions for that data set will be removed. This process enables the warehouse to grow organically with each data set contribution. The stateless nature of the warehouse rebuild process makes it possible to increase how far back the data collection goes and to change its periodicity.
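A stateless rebuild of this kind can be pictured as a pure function of the contributions, the current opt-in set, and the window length. The sketch below assumes a hypothetical record shape (`tenant`, `year`, `month`, `value`) and an opaque tenant key; it is only meant to show why opt-in, opt-out, and a longer look-back need no special handling.

```python
from datetime import date


def month_window(as_of: date, months: int = 12) -> set:
    """Return the (year, month) pairs for the `months` months ending at `as_of`."""
    window = set()
    year, month = as_of.year, as_of.month
    for _ in range(months):
        window.add((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return window


def rebuild(contributions, opted_in, as_of, months=12):
    """Recompute the published rows from scratch; nothing carries over between rebuilds."""
    window = month_window(as_of, months)
    return [
        c for c in contributions
        if c["tenant"] in opted_in and (c["year"], c["month"]) in window
    ]
```

Dropping a tenant from `opted_in` removes its rows at the next rebuild, and passing a larger `months` value extends the history without any migration step.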
Privacy and Security
On a software-as-a-service platform, tenant data is strictly segregated to maintain separation between the data of each tenant. In the benchmarking use case, it is desirable to share certain measure data for comparison purposes and to get a more complete view of a situation (for example, salary surveys or other industry benchmarks). This sharing requires the tenant's data to be scrubbed of any proprietary or sensitive information first. Because of this, the sharing takes place periodically through an extract and transformation job, so that the data passes through the de-identification filters and aggregation functions before the contribution is sent.
In addition, the query processing unit automatically determines whether report data can be linked back to a tenant, or whether a tenant can be inferred from the report output, based on the report data source, aggregation functions, and query parameters being used. For example, suppose a user runs a report to determine the median high-potential turnover of female full-time employees in the United States working in technology companies. The query DSL evaluates the report parameters and determines which contributed data matches the request. It then passes the results to a privacy function in the microservice, which determines whether enough contributors participated in the data set that an individual tenant cannot be inferred from the aggregation. This means the data cannot be attributed to the tenant requesting the report or to the contributing tenants.
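The contributor check can be illustrated with a small gate in front of the aggregation. This is a sketch under assumptions: the threshold value, the opaque contributor keys, and the `private_median` name are invented for the example, and the article does not state the real rule.

```python
from statistics import median
from typing import Optional

MIN_CONTRIBUTORS = 5  # hypothetical value; the article does not state the real threshold


def private_median(rows, min_contributors: int = MIN_CONTRIBUTORS) -> Optional[float]:
    """Return the median only when enough distinct contributors are present.

    `rows` is a list of (contributor_key, value) pairs that already match the
    report's filters; the contributor key is assumed to be an opaque identifier.
    """
    contributors = {contributor for contributor, _ in rows}
    if len(contributors) < min_contributors:
        return None  # suppress the result rather than risk exposing a tenant
    return median(value for _, value in rows)
```

Counting distinct contributors rather than rows matters here: a single tenant that supplies many rows should not be able to satisfy the threshold on its own.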
Permission and Configurable Security
The framework supports segmented security at subcategory granularity. More enhancements in this area, such as intersection security for geographic location or worker type, will surface as we add more functional areas. The diagram below depicts the level of control the customers’ administrators can implement for their tenant.
The Data-as-a-Service Platform is only accessible via the Workday microservice. Authorization is enforced at the service level between Workday services and the microservice. The connection from the microservice to Redshift stays within a Virtual Private Cloud and uses an SSL certificate with username and password authentication.
The data is transferred from the customer via TLS (Transport Layer Security, HTTPS). It is then encrypted at rest with AWS KMS keys.
After much research, we have identified two methods of protecting the customers’ anonymity:
- Differential privacy and error injection.
- Thresholding as a strategy.
We have received customer feedback against injecting error into the benchmark results. This made the thresholding strategy more desirable. It also fits well with the way the quantiles are computed. Below is an example report output with mock data using the thresholding strategy:
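To make the trade-off concrete, the sketch below contrasts the two approaches on a mean of values bounded in [0, 1], such as a turnover rate. It is an illustration only: the Laplace noise scale, the `epsilon` parameter, and both function names are assumptions, and nothing here reflects the parameters Workday evaluated.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_mean(values, epsilon: float, rng: random.Random) -> float:
    """Error injection: mean of values in [0, 1] plus Laplace noise.

    Replacing one value changes the mean by at most 1/n, so the noise scale
    is (1/n) / epsilon.
    """
    n = len(values)
    return sum(values) / n + laplace_noise(1.0 / (n * epsilon), rng)


def thresholded_mean(values, min_count: int):
    """Thresholding: return the exact mean, or nothing when too few values are present."""
    if len(values) < min_count:
        return None
    return sum(values) / len(values)
```

With error injection every answer is slightly wrong, which is the behavior customers objected to. With thresholding an answer is either exact or withheld.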
Use Cases That Are Prime for the DaaS Platform
Benchmarking solution: A global automated data collection and analytical system that allows Workday customers to compare their company performance indicators to their peers in their industry.
Single version for 3rd-party data: The DaaS Platform can act as a single global, secure and performant storage with key-value pair search capabilities. For example, Workday applications can utilize publicly available data for supply chain vendor integration or the geo-location data sets.
Marketplace for configuration data across customer tenants: The DaaS Platform can be enhanced to store and allow sharing of configuration data across customer tenants. For example, professional services firms can share high-value custom report definitions, or a customer can share configurations that they find useful.
Billing and metering: The Workday Cloud Platform API usage and metering data sets are currently hosted on the DaaS Platform. This can be expanded to be a single billing system for all Workday applications that require a billing solution.
Near-real time performance optimizations: Workday services can register incremental usage statistics to the DaaS platform. Those services can then make data-driven runtime analysis and adjustments by utilizing the real-time aspect of the DaaS query capabilities. Examples are dynamic ordering of report filters at report runtime based on field execution statistics, or transaction commit logic that can be optimized per task.
The Road Ahead
As for the product road map, we’re looking at:
- Custom benchmarking.
- More enhancements on Workday Cloud Platform API usage and metering.
- Global data sets, such as geo-location data and Japanese and Korean postal codes.
- Machine learning and taxonomy mapping.
Customers can join the Benchmarking solution by signing the Innovation Services Order form and opting into the Categories within their tenant.
We’re excited for you to join our Data-as-a-Service family! Thank you for reading.