Apache Atlas — Using the v2 Rest API

Putting Data Governance and Metadata Management to Work

by Venkatesh Sekar

The Need for Data Governance

Data governance is an important paradigm especially for the part of the organization with ultimate responsibility for data. A data governance solution helps answer questions such as:

  • Which sources are feeding data?
  • What is the data schema of these sources?
  • Which process reads the data and how is it transformed?
  • When was the data last updated?
  • Can we classify the data as private, public, etc.?
  • Based on classification, can we know which users and processes are accessing private data?

Data governance provides the ability to comprehend metadata and then take appropriate actions as required. Actions could be as simple as answering the above questionnaire or as complex as defining access policies on confidential datasets so that only specific users or groups can view a particular dataset.

You might convince yourself that you are already providing appropriate data governance measures via a wiki, spreadsheets, and various documentation. That’s a start, but those tools fail to give a complete end-to-end picture, and if they are not updated and continuously contributed to, they fall out of sync with what is happening day to day as the datasets in your environment constantly change and grow.

How Can Apache Atlas Help?

Apache Atlas is a data governance tool which facilitates gathering, processing, and maintaining metadata. Unlike spreadsheets and wiki docs, it has functioning components which can monitor your data processes, data stores, and files, and record updates in a metadata repository.

Additionally, system notification functionality is provided so that entities get updated when columns are added or dropped. In combination with a security solution such as Apache Ranger, Atlas can be used to define access policies for users and processes.

While Apache Atlas is typically used with Hadoop environments, it can be integrated into other environments as well; however, certain functionality may be limited.

Defining Metadata in Apache Atlas Using the Rest API v2

In this post, I’ll walk you through the process of defining metadata in Apache Atlas using the REST API v2. I won’t be reviewing or explaining the Apache Atlas architecture or solution; if you are looking for that type of information, you can visit the Apache Atlas documentation site.

Today, documentation and examples covering the format of the request JSON are minimal, and I hope this provides some assistance and acceleration to your Apache Atlas projects. For reference, I’ll be using Apache Atlas v0.8.

Let’s Review a Hypothetical Scenario

The following is a very simple data ingestion process scenario…

A source system (news site scraper) uploads a CSV data file to your landing zone. You have a Python process that reads the file, formats it to JSON, and publishes the record to a Kafka topic. A streaming process consumes the message from the Kafka topic, enriches it with some data and stores it into a database (HBase).

In the example, it’s assumed that the Storm Atlas Hook for HBase is not configured. I am not going to show the actual code of these artifacts (Python script, Storm code, etc.), but I am going to demonstrate what you would need to do to define them in Atlas.

Some Basics

We are focusing on using the REST API v2 endpoints; the documentation for the REST endpoint is at Apache Atlas REST API v2.

As per the Apache documentation, a ‘Type’ in Atlas is a definition of how particular types of metadata objects are stored and accessed. A type represents one or a collection of attributes that define the properties of the metadata object.

From a developer’s viewpoint, a Type is analogous to a class definition, for example:

Type : Kafka_topic

  • Broker
  • Topic name
  • Topic configuration
  • Key schema
  • Value schema

Type : Data_File

  • File name pattern
  • Directory
  • Server
  • Format
  • Data schema

An Entity is an instance of a Type, for example:

Entity : Kafka_topic : news_topic

  • Broker => analytics_topics_broker
  • Topic name => news_topic
  • Topic configuration
  • Key schema => Id : String
  • Value schema => Id : String, Url : String, Headline : String

You can get further information on the above concepts by referring to the Apache Atlas Type System documentation.

Interacting with Atlas

For the solution that we are defining in Atlas, we are going to be defining the following:

Type Definitions:

  • Node
  • File
  • Python script
  • Kafka topic with schema definition

Entities:

  • DataFile
  • Kafka topic
  • HBase table

Process:

  • Python script
  • Storm topology

REST Endpoint

To interact with the Atlas REST API v2 endpoint, I am going to be using curl. Below is a sample interaction which retrieves all of the type definitions in Atlas:

ATLAS_BASE_URL=https://myatlas-server:21443/api/atlas/v2
curl --negotiate -u venkat -X GET -H 'Content-Type: application/json' -H 'Accept: application/json' "$ATLAS_BASE_URL/types/typedefs"

Note: My Atlas instance was Kerberos protected and therefore the negotiate flag was used.

I am defining multiple definitions in a single request, and for this I am using the bulk API. For submitting types/definitions in bulk, we wrap the various definitions inside an array element. For example, to define multiple entity types at the same time:

{
  "entityDefs" : [
    …
  ]
}
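
For reference, the bulk typedefs payload can carry several kinds of definitions side by side. To the best of my understanding, the overall body is shaped roughly like this, with any arrays you don’t need left empty or omitted:

{
  "enumDefs": [ ],
  "structDefs": [ ],
  "classificationDefs": [ ],
  "entityDefs": [ ]
}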

Type Definition : Node

In our solution we want to represent a server where files will be uploaded by the source system. The server is also the machine where our Python script will execute. Out of the box, Atlas does not have a type definition representing a server, so we are defining one for it.

I would also like to classify these servers for various contexts, for instance, landing zones, worker nodes, etc. Since the server is an infrastructure component, it inherits from the “infrastructure” type; inheritance is defined via the “superTypes” key.
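
I won’t reproduce the full file, but a trimmed-down sketch of what a typedef-node.json along these lines might contain is shown below. The attribute names (host_name, ip_address) and the LandingZone classification name are my own illustrative choices, and the stock supertype is spelled “Infrastructure” in the Atlas base model, so adjust to whatever your instance exposes:

{
  "classificationDefs": [
    { "name": "LandingZone", "description": "Servers acting as landing zones for incoming data" }
  ],
  "entityDefs": [
    {
      "name": "server",
      "description": "A server/host in our environment",
      "superTypes": [ "Infrastructure" ],
      "attributeDefs": [
        { "name": "host_name", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "isUnique": false, "isIndexable": true },
        { "name": "ip_address", "typeName": "string", "isOptional": true, "cardinality": "SINGLE", "isUnique": false, "isIndexable": false }
      ]
    }
  ]
}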

The request JSON is stored in the file typedef-node.json, and we invoke the REST endpoint as per the below:

curl --negotiate -u venkat -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' "$ATLAS_BASE_URL/types/typedefs" -d "@./typedef-node.json"

On a successful response, you can query the type definitions and see the new definition in the response, as shown below:

You can now find the tags in the Atlas GUI; below is a screenshot:

Type Definition : File

In our solution, we want to represent a data file, which is the file the source system uploads. The data file is uploaded to a specific server under a specific directory, and the data follows a specific schema. Out of the box, Atlas does not have a type definition representing a file, so we are defining one for the data file.

To demonstrate, I have defined the schema as a struct element, “schema_col”. I have also defined classification tags by which we can tag the data file; these are defined under the “classificationDefs” element.
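
As a hedged illustration only, the pieces of typedef-file.json might be laid out along these lines; the column fields inside schema_col, the classification names, and the DataFile attributes are assumptions of mine rather than the exact file used in this post:

{
  "structDefs": [
    {
      "name": "schema_col",
      "attributeDefs": [
        { "name": "col", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "isUnique": false, "isIndexable": true },
        { "name": "data_type", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "isUnique": false, "isIndexable": false }
      ]
    }
  ],
  "classificationDefs": [
    { "name": "PRIVATE", "description": "Data that must not be shared outside the organization" },
    { "name": "PUBLIC", "description": "Data that can be shared freely" }
  ],
  "entityDefs": [
    {
      "name": "DataFile",
      "superTypes": [ "DataSet" ],
      "attributeDefs": [
        { "name": "server", "typeName": "server", "isOptional": false, "cardinality": "SINGLE", "isUnique": false, "isIndexable": false },
        { "name": "directory", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "isUnique": false, "isIndexable": true },
        { "name": "format", "typeName": "string", "isOptional": true, "cardinality": "SINGLE", "isUnique": false, "isIndexable": false },
        { "name": "schema", "typeName": "array<schema_col>", "isOptional": true, "cardinality": "LIST", "isUnique": false, "isIndexable": false }
      ]
    }
  ]
}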

The entity definition “DataFile” is the type definition for the file type. Since the file holds the data, it inherits the functionality of “DataSet”. The request JSON is stored in the file typedef-file.json. We invoke the REST endpoint as below:

curl --negotiate -u venkat -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' "$ATLAS_BASE_URL/types/typedefs" -d "@./typedef-file.json"

On a successful response, you can query the type definitions and see the new definition in the response.

You can also find the type definition in the Atlas GUI, as per the below.

Type Definition : Kafka Topic with Schema

Kafka is part of our overall solution, and an out-of-the-box definition for a Kafka topic does exist in Atlas; however, there is no facility to define the schema for the topic key and topic value. Yes, you could argue that a schema registry is the appropriate solution for this function. For the purposes of this post, I want to facilitate this in Atlas itself, assuming that I don’t have a schema registry available.

We are also using this example to demonstrate that you can extend an entity that is defined in Atlas. The type “kafka_topic_and_schema” extends from “kafka_topic” type.

For defining the message schema, I have declared a type definition, “kafka_value_message_schema”, rather than a struct like the one used in the “DataFile” type. By doing this, you can mark/tag individual fields during entity definition. The type definition is stored in the file typedef-kafka_with_schema.json.
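
For orientation, one plausible shape for typedef-kafka_with_schema.json is sketched below. The attribute names are my own guesses rather than the author’s exact file; the point to take away is that kafka_topic_and_schema extends the stock kafka_topic type, and that each schema field becomes its own referenceable entity rather than a struct:

{
  "entityDefs": [
    {
      "name": "kafka_value_message_schema",
      "superTypes": [ "DataSet" ],
      "attributeDefs": [
        { "name": "field_name", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "isUnique": false, "isIndexable": true },
        { "name": "field_type", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "isUnique": false, "isIndexable": false }
      ]
    },
    {
      "name": "kafka_topic_and_schema",
      "superTypes": [ "kafka_topic" ],
      "attributeDefs": [
        { "name": "key_schema", "typeName": "string", "isOptional": true, "cardinality": "SINGLE", "isUnique": false, "isIndexable": false },
        { "name": "value_schema", "typeName": "array<kafka_value_message_schema>", "isOptional": true, "cardinality": "LIST", "isUnique": false, "isIndexable": false }
      ]
    }
  ]
}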

We invoke the REST endpoint as below:

curl --negotiate -u venkat -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' "$ATLAS_BASE_URL/types/typedefs" -d "@./typedef-kafka_with_schema.json"

Type Definition : Python

We now need to represent the Python script which ingests data from the data file into the Kafka topic. The script executes on a specific server. Atlas does not have a type definition representing a Python script, so we will define one.

The entity definition “python_script” is the type definition for the Python script. Since this is a process, it inherits the functionality of “Process”.
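
A bare-bones sketch of what a python_script definition in this spirit could look like is below; the script_path and server attributes are illustrative additions on my part, while the inputs and outputs used for lineage are inherited from the built-in Process type:

{
  "entityDefs": [
    {
      "name": "python_script",
      "superTypes": [ "Process" ],
      "attributeDefs": [
        { "name": "script_path", "typeName": "string", "isOptional": false, "cardinality": "SINGLE", "isUnique": false, "isIndexable": true },
        { "name": "server", "typeName": "server", "isOptional": true, "cardinality": "SINGLE", "isUnique": false, "isIndexable": false }
      ]
    }
  ]
}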

The request JSON is stored in the file typedef-python_process.json, and we invoke the REST endpoint as per the below:

curl --negotiate -u venkat -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' "$ATLAS_BASE_URL/types/typedefs" -d "@./typedef-python_process.json"

On a successful response, you can query the type definitions and see the new definition in the response. Now that we have reviewed type definitions, let’s move on to defining entities, which, as mentioned earlier, are instances of the types we defined.

Entity : Server

We will now define server entities for both the landing zone and the Storm node, and classify the landing zone server using the “classifications” element.
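
As a rough sketch, and assuming the server type and LandingZone classification from earlier, infrastructure-entities.json could look something like this; the qualified name landing_zone_server_1@dev matches the reference used later in this post, while the Storm node name and the host names are my own placeholders:

{
  "entities": [
    {
      "typeName": "server",
      "attributes": {
        "qualifiedName": "landing_zone_server_1@dev",
        "name": "landing_zone_server_1",
        "host_name": "lz1.dev.internal"
      },
      "classifications": [ { "typeName": "LandingZone" } ]
    },
    {
      "typeName": "server",
      "attributes": {
        "qualifiedName": "storm_node_1@dev",
        "name": "storm_node_1",
        "host_name": "storm1.dev.internal"
      }
    }
  ]
}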

The request JSON is stored in the file infrastructure-entities.json, and we invoke the REST endpoint as below:

curl --negotiate -u venkat -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' "$ATLAS_BASE_URL/entity/bulk" -d "@./infrastructure-entities.json"

On successful response, you can see the entities via the Atlas GUI as in the below screenshot.

Additional properties are provided as well by clicking on the name:

The classifications are displayed under the Tag tab:

The Audit tab provides information on when the entity was created and last updated:

DataSet Entities

Let’s now define the entities that hold data. In our solution these are the data file, the Kafka topics, and the HBase table.

Referring to previously defined entities

The data file has a “server” attribute which needs to refer to the previously created “landing_zone_server_1” entity. By default, the REST API prefers such references to be made via the “guid”.

The guid is assigned by Atlas when the entity is created. This works fine when defining entities via the Atlas GUI, but it’s not very friendly from a script. To handle this, we first define the “referredEntities” map in the request. In this map we give a placeholder (negative) guid, but set the attributes that match the exact entity.

"referredEntities": {
"-100": {
"guid": "-100",
"typeName": "server",
"attributes": {
"qualifiedName": "landing_zone_server_1@dev",
...
}

On the data-file entity itself, we set the server attribute to reference that key, as per the below:

{
  "typeName": "DataFile",
  ...
  "server": { "guid": "-100", "typeName": "server" },
  ...
}

Referring to entities defined in the same json

In cases where the referenced entity is defined as part of the same request, we cannot use the “referredEntities” map. In this case, we assign a temporary negative guid to the entity and use that guid wherever it needs to be referenced.

An example of this is the definition of hbase_table. The HBase table is referenced by the hbase_column_family entity.

{
  "typeName": "hbase_table",
  "createdBy": "ingestors",
  "guid": "-110",
  ...
},
{
  "typeName": "hbase_column_family",
  "createdBy": "ingestors",
  "guid": "-111",
  ...
  "table": { "qualifiedName": "news:news_from_reuters", "guid": "-110", "typeName": "hbase_table" }
}

You will also find similar examples with the kafka_topic_and_schema and kafka_value_message_schema entities. The request JSON is stored in the file news-ingestion-dataset.json. We invoke the REST endpoint as per the below:

curl --negotiate -u venkat -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' "$ATLAS_BASE_URL/entity/bulk" -d "@./news-ingestion-dataset.json"

On a successful response, you can see the entities via the Atlas GUI, as in the below screenshot.

The data file entity:

The details of the data file entity:

Note that the lineage is empty, as no lineage has been defined yet. The Kafka topic entity is below:

Note that each field (for example, id, url, etc.) is an entity in its own right; this differs from the data-file entity, where the schema is a struct. I would encourage you to explore the HBase table definitions in the Atlas GUI.

Process Entities

Above we reviewed the DataSet entities; now let’s look at defining the Process entities, which provide the interconnection/lineage between the defined datasets. For our solution these are the python-script and the storm-topology.
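
Before looking at the actual file, here is a hedged sketch of what the python_script process entity could look like; the qualified names and the file/topic references are my own placeholders, but the overall pattern of inputs, outputs, and the referredEntities map (as described in the previous section) is what drives the lineage:

{
  "referredEntities": {
    "-200": {
      "guid": "-200",
      "typeName": "DataFile",
      "attributes": { "qualifiedName": "news_feed.csv@landing_zone_server_1" }
    },
    "-201": {
      "guid": "-201",
      "typeName": "kafka_topic_and_schema",
      "attributes": { "qualifiedName": "news_topic@dev" }
    }
  },
  "entities": [
    {
      "typeName": "python_script",
      "attributes": {
        "qualifiedName": "news_file_to_kafka.py@landing_zone_server_1",
        "name": "news_file_to_kafka.py",
        "inputs": [ { "guid": "-200", "typeName": "DataFile" } ],
        "outputs": [ { "guid": "-201", "typeName": "kafka_topic_and_schema" } ]
      }
    }
  ]
}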

The request JSON is stored in the file news-ingestion-process.json. As explained in the previous section, the dataset entities are referred to in the request via the “referredEntities” map, and we invoke the REST endpoint as below:

curl --negotiate -u venkat -X POST -H 'Content-Type: application/json' -H 'Accept: application/json' "$ATLAS_BASE_URL/entity/bulk" -d "@./news-ingestion-process.json"
WARNING: I have purposely set an easter egg in the request JSON. Find it ;-)

On a successful response, you can now see the lineage in the Atlas GUI by looking at the data file:

To explore further, and just to provide a challenge, I defined and updated additional entities and processes to build out a more complex process flow like the one below:

Retrospective

What was demonstrated is just a small slice of the interaction possible with Atlas via its REST endpoint. Similar interactions can be done via its Kafka notification topic too.

Atlas does not really validate whether the server, directory, HBase table, etc. actually exist; it just takes the data definition and stores it in its repository.

You might also wonder whether a data analyst would be able to accomplish what we did above, and it’s a fair question. I would recommend that the type of interaction we just went through with Atlas be performed by more of a DevOps-style resource.

Also, if Atlas hooks are available, they should ideally create the entities without you ever having to craft a JSON request. For example, Hive and HDFS entities get defined in Atlas with no manual interaction.

It’s also true that the above exercise is a static definition and does not change as the data and processes evolve. To support that, we would need to configure the available hooks, or develop a custom hook, which can then interact with Atlas.

Atlas is evolving, and more type definitions are being added in future releases. Should an Atlas upgrade not be on your immediate roadmap, this example can be used as an experiment to test out in your current Atlas implementation.

Functionalities such as tagging, classification, and user policies were not addressed here, and we will save those for a future Atlas post.

Lastly, if you’d like to better understand best practices for REST API design, be sure to check out Jay Kapadnis’ recent post.



Venkat Sekar is Regional Director for Hashmap Canada and is an architect and consultant providing Big Data, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers.
