February 17, 2016
In our previous blog post, Bhooshan talked about the importance of metadata and the value it provides to enterprises. Data discovery and lineage become increasingly complex when multiple storage layers are used across several frameworks. Datasets might reside in different storage systems: HDFS (batch use case), HBase (realtime use case) or an RDBMS. Making relevant data sources easy to identify through the combination of technical and business metadata gives an organization greater agility and makes it easier to apply data governance and compliance rules.
Cloudera Navigator is one such product that provides a self-service data discovery platform through which users can easily explore and tag data using a search interface. It supports multiple source types in the Hadoop ecosystem, like HDFS, Hive, Pig and more. This blog post will talk about how CDAP users can leverage Cloudera Navigator to search and view metadata managed on CDAP.
In CDAP v3.2 we introduced the Metadata feature, giving users the ability to add their business metadata to CDAP entities such as Applications, Programs and Datasets. In CDAP v3.3 this feature was enhanced to add System Metadata, which the platform adds automatically when an entity is created in CDAP. With a powerful search capability, users are able to query metadata tags and properties. But CDAP metadata still resides within the platform.
As we discussed previously, a number of use cases warrant the ability to search across the different stacks of the Big Data ecosystem for entities that have particular metadata. Making CDAP metadata available in Cloudera Navigator for viewing and searching provides that bridge, and users can then simply use Cloudera Navigator as their one-stop big data metadata management system.
CDAP Metadata provides an option to publish metadata updates to Kafka, including the addition and deletion of CDAP business and system metadata. Once this option is enabled in CDAP, interested parties can subscribe to these updates. We used this capability to build the CDAP-Navigator Integration Application. It is a regular CDAP application with a flow that subscribes to the metadata update Kafka topic, converts the updates to Cloudera Navigator entities and writes them to the Cloudera Navigator Metadata server.
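To make the shape of that flow a little more concrete, here is a minimal Python sketch of the "convert" step only. It is an illustration, not the integration's actual code: the message layout (entity_id, metadata, tags, properties) and the output fields (identity, sourceType) are hypothetical names chosen for this example, and it assumes each update carries the entity's current metadata. Reading from Kafka and writing to the Navigator Metadata server are left out.

```python
import json
from typing import Any, Dict, List


def to_navigator_entity(update: Dict[str, Any]) -> Dict[str, Any]:
    """Map one decoded metadata update to a flat, Navigator-style record.

    All field names here are hypothetical; they are not the real CDAP
    or Cloudera Navigator schema.
    """
    current = update.get("metadata", {})
    return {
        "identity": update["entity_id"],
        "sourceType": "CDAP",
        # Sort tags so the output is deterministic.
        "tags": sorted(current.get("tags", [])),
        "properties": dict(current.get("properties", {})),
    }


def process_batch(raw_messages: List[str]) -> Dict[str, Dict[str, Any]]:
    """Decode JSON messages in order; the latest update per entity wins."""
    latest: Dict[str, Dict[str, Any]] = {}
    for raw in raw_messages:
        entity = to_navigator_entity(json.loads(raw))
        latest[entity["identity"]] = entity
    return latest
```

In a real flow, the strings passed to process_batch would come from the metadata update topic, and each resulting record would be handed to whatever client writes to Navigator.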
Steps to download and create the application can be found here. Once the application has been created, the ‘MetadataFlow’ can be started. The flow will then push the new Metadata updates to Cloudera Navigator and the user can now use the Cloudera Navigator UI to view and query CDAP Metadata along with metadata from other Hadoop/big data entities.
Note that the current integration only provides a read-only view of the metadata from CDAP in Cloudera Navigator. That is, any changes made to CDAP entity metadata in Cloudera Navigator are not reflected back in CDAP. This limitation will be addressed in a future release of the Integration. In addition, creating logical-to-physical entity relationships, creating lineage links between CDAP datasets and programs, and surfacing CDAP audit logs in Cloudera Navigator are some of the items on the roadmap.