Multi-tenancy for Big Data: Part 1

Modern businesses understand that data is not just important to your business, it is your business.

Published in Crunchyroll · 8 min read · Nov 26, 2019

The importance of data to the operation of a business has been known since the days of bookkeepers laboriously entering business transactions into paper ledgers with quill pens. Modern businesses understand that data is not just important to your business, it is your business. By extension, corporations that operate multiple business entities need to support data across those individual lines of business and develop meaningful insights both within and across them.

This requires a multi-tenant data environment. Each business unit (BU) runs its business across its brands with individual data ecosystems both to operate the business and conform to the data governance requirements appropriate for the business. Each brand will have additional requirements unique to its purpose. Each brand then rolls up to a BU and each BU rolls up brand aggregates to the business. Finally, the business aggregates BUs to build insights.

Brand-aware Multi-tenancy

A multi-tenant data architecture that both supports building insights across BUs and honors the data governance model of each business is called Brand-aware Multi-tenancy. The Crunchyroll multi-tenant data architecture looks something like this:

Multi-tenant data lifecycle

The multi-tenant data lifecycle defines the tenancy throughout the evolution from raw data to refined business aggregations. For ingestion, client events and transactional data are typically tenant-specific. This simplifies the processing, normalization, and aggregation of this data for individual tenants. However, third-party data may arrive with several tenants intermingled, so it can be important to build tenant-agnostic ingestion and normalization that produces the tenant-specific copies of those data sets.
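As a minimal sketch of that tenant-agnostic step, the function below splits a mixed batch of records into per-tenant copies. The `tenant_id` field, the tenant names, and the record layout are assumptions made for illustration; they do not describe Crunchyroll's actual pipeline.

```python
from collections import defaultdict


def segment_by_tenant(records, tenant_key="tenant_id"):
    """Split a mixed batch into per-tenant lists.

    Records that carry no tenant marker are returned separately so they
    can be reviewed instead of being silently assigned to a tenant.
    """
    by_tenant = defaultdict(list)
    unassigned = []
    for record in records:
        tenant = record.get(tenant_key)
        if tenant is None:
            unassigned.append(record)
        else:
            # Copy the record so each tenant's data set is independent.
            by_tenant[tenant].append(dict(record))
    return dict(by_tenant), unassigned


# Hypothetical third-party batch that mixes two tenants.
batch = [
    {"tenant_id": "tenant_1", "event": "purchase", "amount": 7.99},
    {"tenant_id": "tenant_2", "event": "purchase", "amount": 4.99},
    {"tenant_id": "tenant_1", "event": "refund", "amount": -7.99},
    {"event": "purchase", "amount": 1.99},  # no tenant marker
]
per_tenant, unassigned = segment_by_tenant(batch)
```

Keeping the unassigned records visible, rather than dropping them, is a design choice in this sketch: a record with no tenant is usually a sign that the ingestion does not yet understand the source's tenancy model.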

Perhaps this is better explained through the use of a table:

The cross-tenant data lifecycle begins with the tenant trusted data and builds cross-tenant trusted aggregations. In principle, these can be as simple as rolling up the individual tenants across their BUs. In practice, cross-tenant trusted aggregations for similar tenants rely on business-defined metrics as well as quality and governance requirements which are agreed upon across the businesses. Even across similar lines of business, the differences are often significant. This fact is often overlooked in the rush to “put all of your data in one place” when in reality there’s a whole lot of work to figure out what subset of that data even makes sense across tenants.

Multi-tenant Data Lifecycle Example

The tenancy model is complex and a sample visualization might help:

This Sankey diagram visualizes the tenancy model using some sample data:

All of the data from tenants 1 and 2 flows into the raw zone
The multi-tenant ingested data is segmented by tenant and then also written to the raw zone
All of the data is cleaned and repartitioned for analytics by tenant
Some of each tenant’s data is processed into the trusted and derived zones according to the business use case
For trusted data that is agreed across tenants, that data is rolled up into cross-tenant aggregations
Like any real environment, not all of the data is useful

Note that multi-tenant and cross-tenant mean two different things. The former is tenant-specific data that is intermingled with data from other tenants. The latter is data that has been combined according to agreed-upon business semantics across tenants.

An example might help here. Most of the brands at Crunchyroll and our sister companies offer a subscription video on demand (SVOD) service as their primary business model. While every brand reports on viewership of its SVOD content, each brand also has its own brand-specific reporting requirements. For roll-ups of viewership across multiple brands to be meaningful, the differences between the individual brands must first be resolved into a common set of insights.
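The viewership example can be sketched roughly as follows. The brand names, field names, and the agreed metric (minutes watched) are all hypothetical; the point is only that each brand's data is mapped onto a shared definition before anything is summed.

```python
# Hypothetical per-brand viewership rows; field names and units differ by brand.
brand_a_rows = [
    {"user": "u1", "seconds_watched": 1800},
    {"user": "u2", "seconds_watched": 600},
]
brand_b_rows = [
    {"viewer": "v9", "minutes_viewed": 45},
]

# Each brand maps its own fields onto the agreed-upon definition
# (assumed here to be viewer_id plus minutes_watched).
NORMALIZERS = {
    "brand_a": lambda r: {
        "viewer_id": r["user"],
        "minutes_watched": r["seconds_watched"] / 60,
    },
    "brand_b": lambda r: {
        "viewer_id": r["viewer"],
        "minutes_watched": float(r["minutes_viewed"]),
    },
}


def roll_up(rows_by_brand):
    """Sum minutes watched per brand, then across brands."""
    totals = {}
    for brand, rows in rows_by_brand.items():
        normalize = NORMALIZERS[brand]  # KeyError if no agreed mapping exists
        totals[brand] = sum(normalize(r)["minutes_watched"] for r in rows)
    totals["all_brands"] = sum(totals.values())
    return totals


totals = roll_up({"brand_a": brand_a_rows, "brand_b": brand_b_rows})
```

A brand without an entry in `NORMALIZERS` fails loudly in this sketch, which mirrors the idea that data only joins a cross-tenant aggregation once its semantics have been agreed upon.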

Data access versus governance

Data access is conceptually as simple as making the right data available to the right personas. Data governance is much more complicated, but typically it involves:

Data policies
Data quality
Regulatory compliance
Risk management
Business policies

As a modern business, Crunchyroll must operate within the regulatory environments in which it does business. Regulatory compliance results in well-defined data and business policies which specify how to classify any data managed by Crunchyroll. These policies result in access controls that are used to uniformly enforce these policies.

In order to implement data access and governance in a durable manner, one needs to define personas that map to end-user roles used for access control across the data life cycle. At Crunchyroll, we have the following personas defined:

Multi-tenancy complicates this. Remember that the notion of trust is defined by the business, and multi-tenancy inherently requires a notion of trust that’s agreed upon across BUs. From there, the agreed-upon information can be combined across BUs. So in general, cross-tenant access is limited, and is only useful for data sets which are trusted or refined (fit for purpose) across multiple businesses.
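One way to picture the mapping from personas to access rules is the sketch below. The persona names, the zone names (raw, trusted, derived), and the `cross_tenant` marker are illustrative assumptions rather than Crunchyroll's actual persona definitions.

```python
from dataclasses import dataclass

CROSS_TENANT = "cross_tenant"  # assumed marker for cross-tenant data sets


@dataclass(frozen=True)
class Persona:
    name: str
    tenant: str              # the tenant this persona belongs to
    zones: frozenset         # lifecycle zones the persona may read
    cross_tenant: bool = False  # whether cross-tenant aggregates are visible


def can_read(persona, dataset_tenant, dataset_zone):
    """Return True if the persona may read a data set in the given zone."""
    if dataset_zone not in persona.zones:
        return False
    if dataset_tenant == CROSS_TENANT:
        # Cross-tenant data must be granted explicitly.
        return persona.cross_tenant
    return persona.tenant == dataset_tenant


# Hypothetical personas for one tenant.
engineer = Persona("data_engineer", "tenant_1",
                   frozenset({"raw", "trusted", "derived"}))
analyst = Persona("business_analyst", "tenant_1",
                  frozenset({"trusted", "derived"}), cross_tenant=True)
```

In this sketch the engineer can read every zone of tenant 1 but no cross-tenant data, while the analyst can read trusted and derived data for tenant 1 as well as cross-tenant aggregates in those zones.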

All of this requires a major investment in the metadata, which is discussed in more detail in the Metadata Management section below.

Third-party considerations

No enterprise of any scale lives entirely without external data sources and Crunchyroll is no different. Simply within the Crunchyroll ecosystem, we source data from a number of third parties such as the Apple Store, Google Play and other third parties that gather additional information across our users. Furthermore, this information may span multiple tenants and include information that is useful to other tenants. Ingestion of this data must understand the tenancy model of the data being ingested and have sufficient tenant awareness to correctly distribute data for tenant processing.

Consumption is no different. We leverage third parties such as Amplitude and Braze primarily for marketing and customer engagement metrics. Having a contractual relationship with a third party does not remove your need to practice good data hygiene with that third party. Understanding the tenancy model for handling data as it leaves the corporate boundary is just as important as understanding it within the data lake.

Security and Compliance

With the advent of GDPR, CCPA, and the considerable financial risk to enterprises from poor data hygiene, security is of paramount importance. Multi-tenancy complicates the matter because the security policies must conform both within the individual BUs and across the corporate and external regulatory and compliance environment. The Crunchyroll approach to security involves:

Building security practices within the BU as required by the operation of the business
Establishing data governance and access policies within the business that are conformant

As you build your data governance model across the tenants, this will eventually lead to unified security practices which are rooted in the best practices of the individual tenants. Just as building meaningful multi-tenant data sets relies on agreed-upon business definitions, data governance also relies on agreement across tenants as well as conformance across all of the regulatory and compliance environments for those tenants.

Metadata Management

The entire lifecycle of metadata management is core to building a successful data lake. Without metadata, a data lake does indeed become the proverbial data swamp. Metadata fulfills a wide variety of different business functions:

Data classification and compliance
Data inventory, lineage and data discovery
Unification of the data landscape

Like any other aspect of a data practice, metadata creation and management must be automated to the extent possible. However, generation of useful metadata at ingestion as well as data classification is very difficult to automate. A simpler approach is to enforce the creation of analyst-friendly metadata and data classification by the teams that build these capabilities and build automation to support the propagation thereof.

The format on which we’ve settled at Crunchyroll involves building metadata that describes:

Per-attribute data classification
Lineage or source table for derived attributes
Domain values/ranges for the attributes
A business analyst-friendly description of the attribute, i.e. plain English
Links to specific documentation which has more detail
Schema objects above attributes should contain the actual business purpose for the object
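The items above could be captured in a structure along these lines. This is only a sketch: the class names, field names, and the completeness check are assumptions for illustration, not the format Crunchyroll actually uses.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AttributeMetadata:
    name: str
    classification: str                  # per-attribute data classification
    description: str                     # plain-English, analyst-friendly
    source_table: Optional[str] = None   # lineage for derived attributes
    domain: Optional[str] = None         # allowed values or ranges
    doc_links: List[str] = field(default_factory=list)


@dataclass
class TableMetadata:
    name: str
    business_purpose: str                # why the object exists
    attributes: List[AttributeMetadata] = field(default_factory=list)

    def missing_fields(self):
        """List required metadata that has been left empty."""
        problems = []
        if not self.business_purpose.strip():
            problems.append(f"{self.name}: business purpose")
        for attr in self.attributes:
            if not attr.classification.strip():
                problems.append(f"{self.name}.{attr.name}: classification")
            if not attr.description.strip():
                problems.append(f"{self.name}.{attr.name}: description")
        return problems


# Hypothetical table with one fully described attribute.
subscriptions = TableMetadata(
    name="subscriptions",
    business_purpose="One row per active or lapsed subscription.",
    attributes=[
        AttributeMetadata(
            name="plan",
            classification="Proprietary",
            description="The subscription plan the customer chose.",
            domain="monthly | annual",
        ),
    ],
)
```

A check such as `missing_fields()` is one way a team could enforce that metadata is written by the people who build the data set, leaving automation to propagate it.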

Crunchyroll is rolling out support for a tool called Alation which will support the full lifecycle of metadata management. Look for another blog post on this channel soon detailing our use of this technology!

Data classification and compliance

Crunchyroll has invested heavily in business and data policies to facilitate compliance within the Crunchyroll regulatory environment. Core to this is the notion of data classification, by which any datum can be attributed to a sensitivity which affects its handling during the data lifecycle. Crunchyroll has defined the following 5 levels of data classification:

Public — says it all
Confidential — matters of interest solely to Crunchyroll
Proprietary — specific behavior relating to customers or developers
Internal — typically referred to as PII — subject to compliance controls
Secret — highly sensitive personal or financial information

Without data classification, it’s impossible to implement a compliance program with any credibility. Similarly, unless you can classify your data, the metadata will not be useful for compliance efforts.
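A rough sketch of how the five levels might be represented in code follows. It assumes that the order in which the levels are listed runs from least to most sensitive, and that a data set takes the classification of its most sensitive attribute; neither assumption is stated in the policy described here.

```python
from enum import IntEnum


class Classification(IntEnum):
    # Assumption: listed order reflects increasing sensitivity.
    PUBLIC = 1
    CONFIDENTIAL = 2
    PROPRIETARY = 3
    INTERNAL = 4
    SECRET = 5


def dataset_classification(attribute_levels):
    """Classify a data set by its most sensitive attribute.

    An empty list means nothing has been classified, which is treated
    as an error rather than defaulting to PUBLIC.
    """
    levels = list(attribute_levels)
    if not levels:
        raise ValueError("data set has no classified attributes")
    return max(levels)


# Hypothetical data set with one PII attribute among public ones.
level = dataset_classification(
    [Classification.PUBLIC, Classification.INTERNAL, Classification.PUBLIC]
)
```

Raising an error for unclassified data, instead of picking a default, reflects the point that unclassified data cannot support a credible compliance program.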

Data Inventory, lineage and data discovery

In order to have any chance of maintaining the structure of the data lake, you need to enforce a strict metadata requirement. This implies that metadata will be kept throughout the entire data lifecycle, i.e. from ingestion to consumption. The hardest part is ingestion: the ingestion process needs to build the metadata that describes the data at the point of ingestion. As mentioned before, it’s critical to generate metadata synchronously with the creation of the data set, as this is the only way to build lineage and facilitate discovery.
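To illustrate what "synchronously" could mean in practice, the sketch below writes a data set and its catalog entry in the same call and refuses to write data whose columns are undocumented. The in-memory `store` and `catalog` dictionaries, the function name, and the catalog layout are stand-ins chosen for illustration.

```python
from datetime import datetime, timezone


def ingest(name, rows, source, descriptions, store, catalog):
    """Write a data set and its metadata together, or write neither."""
    columns = sorted({key for row in rows for key in row})
    undocumented = [c for c in columns if c not in descriptions]
    if undocumented:
        # Reject before anything is written so data never lands without metadata.
        raise ValueError(f"missing metadata for columns: {undocumented}")
    store[name] = list(rows)
    catalog[name] = {
        "source": source,  # lineage: where the data came from
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
        "columns": {c: descriptions[c] for c in columns},
    }
    return catalog[name]


# Hypothetical ingestion of two purchase records.
store, catalog = {}, {}
ingest(
    name="raw.purchases",
    rows=[{"sku": "a1", "amount": 7.99}, {"sku": "b2", "amount": 4.99}],
    source="third_party_store_export",
    descriptions={"sku": "Product identifier", "amount": "Price paid in USD"},
    store=store,
    catalog=catalog,
)
```

Because the catalog entry records the source at write time, downstream jobs that derive new data sets can point back to `raw.purchases` and lineage accumulates without a separate backfill.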

Unification of the data landscape

Unification of the data landscape is often viewed as the holy grail of data spread across multiple BUs. As we’ve already discussed, there are many practical challenges to simply throwing all the data together and having everything work properly and in a fully-compliant manner. However, unification of the metadata landscape is a good first step towards unification of the data landscape.

Data landscape unification falls broadly into the following areas:

Visibility
Comprehension
Collaboration

Visibility into the metadata gives you insights into the data managed by individual BUs and is the foundation on which you build those trusted cross-tenant understandings. Comprehension — how the data is accessed and used — provides insight into the core value of individual data sets as well as the quality and fit-for-purpose. Collaboration ultimately makes data more accessible to more people, thus facilitating institutional knowledge sharing across the tenants.

This concludes part 1 of the blog. Look soon for part 2 where I discuss the technology that supports our multi-tenant data lake!