Analytics Challenges — Big Data Management & Governance

Cengiz Kayay (CK) · Analytics Vidhya · Jan 11, 2020

In the previous post, I listed the analytics challenges that most organizations face today. In this article, I will talk about Data Management & Governance.

Data Governance is required to manage the lifecycle of all data in the enterprise. It should therefore define the processes and rules that govern how data is retrieved, validated, integrated, maintained, secured, found, accessed, shared, and retired.

Data is usually sourced in several forms from legacy applications, systems of record, external cloud systems, and internal systems.
In an enterprise with several hundred such systems, managing and maintaining the whole data asset properly is the subject of Information Architecture and Data Governance. Most organizations do not run a Data Governance program and do not have an Information Architecture, as these are complex and costly initiatives.

Instead, all data can be pulled into a centralized Big Data & Analytics Platform, where it can be cleaned, controlled, managed, and processed.

In short, this article discusses Big Data Governance.

Big Data Governance & Management Architecture

A Big Data Governance solution might consider the following…

Governance:

  • Data Governance has always been considered an overhead.
  • Data Governance is not a technical solution but an enforcement of authority over the management of data.
  • Your organization may already have some Data Management and Governance practices that can be improved, or it may be a green-field Analytics Platform where DG is being established for the first time. Either way, a strategy is required to initiate and maintain Data Governance; one such approach is non-invasive Data Governance.
  • The non-invasive approach suggests embedding DG processes into existing ones rather than introducing new processes, so that little overhead is added.
  • Non-invasive DG defines data domains/subject areas with the important data and its owners, together with the data's impact on the business, to determine the cost incurred if this data is not governed properly. This creates the business case.
  • Once the business case is approved and the DG initiative has support, a Data Governance Operating Model is defined.
  • The data governance body determines the DG Principles and Data Policies such as:
    * Data Management Policy
    * Data Protection Policy
    * Data Compliance Requirements & GDPR Policy
    * Data Quality Standards
  • The Data Governance body assigns the Data Stewards and identifies the authority (Data Owner) for each domain.
  • To drive data interoperability, Data Stewards define the Data Glossary (aka taxonomy) based on enterprise-wide agreements and populate the Business Data Catalogue (see the sketch after this list).
  • Data Stewards would also manage the Reference Data.
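
As a rough illustration of the kind of entries a Data Steward might maintain, the sketch below shows a glossary term and a catalogue entry as plain Python records. The field names are hypothetical and not tied to any particular catalogue product.

```python
# Hypothetical glossary term and Business Data Catalogue entry, shown as plain
# records. Field names are illustrative, not a specific catalogue product's schema.
glossary_term = {
    "term": "Customer",
    "definition": "A person or organization that has purchased at least one product.",
    "steward": "sales_data_steward",
    "synonyms": ["Client", "Account Holder"],
}

catalogue_entry = {
    "dataset": "crm.customers",
    "domain": "Sales",
    "data_owner": "head_of_sales",
    "glossary_terms": ["Customer"],
    "sensitivity": "contains PII",
}
```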

On-boarding:

  • Assume a Business Analyst needs to develop a report and searches the Business Data Catalogue (where all reports are described by metadata extracted from the reporting tool). If the data is not there, s/he requests the new data set required by the report, which initiates the on-boarding workflow.
  • The Business Analyst specifies the Data Owner for the new data asset. If a new data asset has to be created to deliver the report, s/he also provides the related technical metadata (schema) and business metadata (for each column: context, taxonomy, data sensitivity, data quality validation rules) for the new data sources required (see the sketch after this list).
  • The Data Owner approves the data access request after the related report is created by the Data Engineering Squad as part of the workflow.
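
A minimal sketch of the metadata such an on-boarding request might carry is shown below; the data set, columns, and rule names are hypothetical examples, not a prescribed format.

```python
# Hypothetical on-boarding request: the technical and business metadata a
# Business Analyst might supply for a new data source. Field and rule names
# are illustrative only.
onboarding_request = {
    "dataset": "crm.customer_orders",
    "data_owner": "head_of_sales",
    "requested_by": "business_analyst_1",
    "columns": [
        {
            "name": "customer_email",
            "type": "string",
            "context": "Customer contact e-mail",
            "taxonomy_term": "Customer",
            "sensitivity": "PII",
            "quality_rules": ["not_null", "matches:^[^@]+@[^@]+$"],
        },
        {
            "name": "order_amount",
            "type": "decimal(18,2)",
            "context": "Gross order value",
            "taxonomy_term": "Order",
            "sensitivity": "internal",
            "quality_rules": ["not_null", "non_negative"],
        },
    ],
}
```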

Ingestion, Validation, Storage:

  • The Data Engineering Squad creates the related data pipeline using the schema and business metadata provided by the Business Analyst.
  • During ingestion, sensitive data is tagged as PII based on the business metadata and is protected (data masking, stateless tokenization, etc.) by the Data Privacy/Protection solution according to its policies.
  • A UUID for the feed and a surrogate key for each row are created and appended to the data set so the data can be identified later during transformation (see the ingestion sketch at the end of this section).
  • The operational metadata created during ingestion is captured in the Metadata Repository together with its lineage.
  • Finally, the data lands in the staging area as raw data and is tagged as Bronze. Ingestion cannot guarantee the absence of duplicates at this stage.
  • The raw data is quality-validated against the schema and the data quality validation rules defined in the business metadata, using a Data Quality Framework (Deequ running on Spark, Apache Griffin, etc.).
  • The ingested data is tagged as Bronze and as one of Good, Bad, or Ugly based on the validation outcome.
  • Alternatively, you might consider an ML-based solution that extracts the metadata and tags, quality-validates, and profiles the data automatically.
  • At this stage, the Business Data Catalogue displays the Data Source, as shown below:
Business Data Catalogue Example
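
The sketch below illustrates the ingestion steps above with PySpark: a feed UUID, a per-row surrogate key, masking of a PII column, and rule-based Good/Bad tagging before landing the data as Bronze. Paths, column names, and rules are hypothetical, and hashing stands in for whatever masking/tokenization solution is actually in place.

```python
# PySpark sketch of the Bronze ingestion step described above.
# Paths, column names, and quality rules are hypothetical examples.
import uuid

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-ingestion").getOrCreate()

feed_id = str(uuid.uuid4())  # one UUID per incoming feed

raw = (
    spark.read.option("header", True)
    .csv("/landing/crm/customer_orders/")
    .withColumn("order_amount", F.col("order_amount").cast("decimal(18,2)"))
)

bronze = (
    raw
    .withColumn("feed_uuid", F.lit(feed_id))                       # identifies the feed
    .withColumn("surrogate_key", F.monotonically_increasing_id())  # identifies each row
    # Mask the column flagged as PII in the business metadata
    # (hashing used here as a stand-in for the masking/tokenization solution).
    .withColumn("customer_email", F.sha2(F.col("customer_email"), 256))
    # Simple rule-based quality tagging; a real setup might use Deequ or Apache Griffin.
    .withColumn(
        "dq_tag",
        F.when(F.col("order_amount").isNull() | (F.col("order_amount") < 0), F.lit("Bad"))
         .otherwise(F.lit("Good")),
    )
)

bronze.write.mode("append").partitionBy("dq_tag").parquet("/lake/bronze/customer_orders/")
```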

Curation:

  • The data needs to be promoted from the staging (raw) layer into the curated layer: duplicates are removed (using the surrogate keys), the data is standardized based on rules or ML algorithms (if not already done during ingestion), and it is integrated into a common model (see the sketch after this list).
  • During curation, tables are created in the curated layer and tagged as Silver; a Data Steward may be required to approve the data in these tables (for example, when ML algorithms are involved).
  • All the related metadata, lineage, and tags are captured in the Metadata Repository.
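
A minimal PySpark sketch of this promotion from Bronze to Silver, assuming a hypothetical business key and a simple standardization rule:

```python
# PySpark sketch of promoting validated Bronze data into the curated (Silver) layer:
# deduplicate per business key, standardize a column, and write the result.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("silver-curation").getOrCreate()

bronze = spark.read.parquet("/lake/bronze/customer_orders/").filter("dq_tag = 'Good'")

# Keep one row per business key; the surrogate key added at ingestion breaks ties.
latest_first = Window.partitionBy("order_id").orderBy(F.col("surrogate_key").desc())

silver = (
    bronze
    .withColumn("row_rank", F.row_number().over(latest_first))
    .filter("row_rank = 1")
    .drop("row_rank")
    # Example standardization rule: normalize country codes to upper case.
    .withColumn("country_code", F.upper(F.trim(F.col("country_code"))))
)

silver.write.mode("overwrite").parquet("/lake/silver/customer_orders/")
```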

Transformation & Reconciliation:

  • During ETL processing (transformation), additional tables are created, and their technical and operational metadata, together with the related lineage, populate the Metadata Repository.
  • During the ETL, data quality rules validate the data quality dimensions and report issues to the Data Quality dashboards.
  • As the last step, reconciliation jobs run against the KPIs captured from the source system with each data feed (see the sketch after this list). Each reconciliation failure is reported to the Data Quality dashboards.
  • After all validations, the data is promoted into the Semantic Layer as data marts, OLAP cubes, or denormalized entities and tagged as Gold. Gold data is ready to be served to external users.
  • The Silver and Gold tags need to be validated by Data Stewards periodically. Tagging can also be performed in the Data Catalogue, and such tags should be reflected back to the Big Data Platform's Access Management component.
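
The reconciliation step could look roughly like the sketch below: the KPIs delivered with the feed (here a hypothetical control file) are compared against aggregates computed on the curated data, and any mismatch is reported.

```python
# PySpark sketch of a reconciliation job: compare row count and total amount
# in the curated table against KPIs shipped with the source feed.
import json

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reconciliation").getOrCreate()

# Hypothetical control file delivered by the source system with each feed,
# e.g. {"row_count": 10000, "total_amount": 123456.78}
with open("/landing/crm/customer_orders/_control.json") as f:
    source_kpis = json.load(f)

silver = spark.read.parquet("/lake/silver/customer_orders/")
actual = silver.agg(
    F.count(F.lit(1)).alias("row_count"),
    F.sum("order_amount").alias("total_amount"),
).first()

issues = []
if actual["row_count"] != source_kpis["row_count"]:
    issues.append(f"row count {actual['row_count']} != expected {source_kpis['row_count']}")
if abs(float(actual["total_amount"]) - source_kpis["total_amount"]) > 0.01:
    issues.append("total_amount mismatch")

# A real pipeline would push these to the Data Quality dashboard; here we just print.
for issue in issues:
    print("RECONCILIATION FAILURE:", issue)
```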

Data Access:

  • Every access to the data should be protected with MFA and logged. The data in the Bronze, Silver, and Gold layers is protected and can be accessed only through Hive views governed by access policies (see the sketch below).
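
As a minimal sketch, curated data could be exposed only through a restricted Hive view like the one below; the access policies that govern who may query the view (for example in Apache Ranger) are managed outside this snippet, and all names are hypothetical.

```python
# Sketch of exposing Gold data only through a restricted Hive view.
# The view hides the PII column; policies on the view itself are not shown.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("access-views").enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE OR REPLACE VIEW gold.v_customer_orders AS
    SELECT
        order_id,
        country_code,
        order_amount
        -- customer_email is deliberately not exposed to consumers
    FROM gold.customer_orders
""")
```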

Data Lineage & Find-ability:

  • All feeds and entities (tables created during ETL) and their related metadata and tags can be searched in the Business Data Catalogue, and the lineage from the source (Bronze) to the final data product (Gold) can be inspected (see the sketch after this list).
  • All users can further tag the data in a crowdsourced fashion to improve the quality and trustworthiness of the data.
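
For illustration only, a lineage record in the Metadata Repository might carry information along these lines; this is a plain data structure, not the API of any specific lineage tool.

```python
# Hypothetical lineage record: which job and feed produced a Gold entity,
# from which inputs, and which tags (including crowdsourced ones) it carries.
lineage_record = {
    "entity": "gold.customer_orders_mart",
    "produced_by": "etl_customer_orders_v3",
    "inputs": ["silver.customer_orders", "silver.customers"],
    "feed_uuid": "placeholder-feed-uuid",
    "tags": ["Gold", "finance", "crowdsourced:trusted"],
    "created_at": "2020-01-11T10:00:00Z",
}
```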

Compliance:

  • Related audit reports are pulled from the Access Management component to comply with SOC 2. GDPR compliance requires searching the Business Data Catalogue for the PII data of a specific customer in order to locate where that data resides (see the sketch below).
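
A toy sketch of such a PII lookup against catalogue entries (represented here as plain records, not a specific product's API):

```python
# Toy GDPR-style lookup: use the catalogue's sensitivity tags to find which
# data sets contain PII, so a specific customer's records can then be located.
catalogue_entries = [
    {"dataset": "silver.customer_orders", "column": "customer_email", "sensitivity": "PII"},
    {"dataset": "silver.customer_orders", "column": "order_amount", "sensitivity": "internal"},
    {"dataset": "gold.customer_orders_mart", "column": "customer_email", "sensitivity": "PII"},
]

pii_locations = [e for e in catalogue_entries if e["sensitivity"] == "PII"]
for entry in pii_locations:
    print(f"PII column '{entry['column']}' found in {entry['dataset']}")
```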

Data Security:

  • Sensitive data is already masked during ingestion, as explained above.
  • The data is additionally protected with automatic encryption at rest.

Data Retention & Retirement:

  • Based on the business metadata captured during ingestion, retention jobs move or delete older data at each layer (raw, curated, semantic), sending it to archive/cold storage (see the sketch below).
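
A minimal PySpark sketch of such a retention job, assuming a hypothetical ingestion_date column and a retention period taken from the business metadata:

```python
# PySpark sketch of a retention job: archive rows older than the retention
# period and keep only recent rows in the hot layer.
from datetime import date, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("retention").getOrCreate()

RETENTION_DAYS = 365  # hypothetical value taken from the data set's business metadata
cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()

silver = spark.read.parquet("/lake/silver/customer_orders/")

# Copy expired rows to cold storage.
silver.filter(f"ingestion_date < '{cutoff}'") \
    .write.mode("append").parquet("/archive/silver/customer_orders/")

# Write the retained rows to a separate path to avoid overwriting the input
# while it is still being read.
silver.filter(f"ingestion_date >= '{cutoff}'") \
    .write.mode("overwrite").parquet("/lake/silver/customer_orders_retained/")
```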

Conclusion:

In this article, I have outlined how data is on-boarded, validated, stored, integrated, secured, found, accessed, archived, and retired as part of Data Management.

The Data Governance team will also need to define rules and policies for Data Management, such as:

  • Reference Data Management — definition of how reference data is created, managed, etc.
  • Rules, processes for on-boarding internal/external data, etc.
  • Policies for Data Management, Protection, Compliance, etc.

See the related Analytics Challenges post for more articles in this series.

Contact for questions: ckayay@gmail.com
