Here’s a Quick Start Guide on Hangar

Hanoona Rasheed
Published in FSE.ai
4 min read · Feb 10, 2020

Version control systems like Git make it possible for software developers to track changes in their source code and coordinate work among programmers. Now think about a version control system built solely for tensor data. That's Hangar! It is a data handling toolkit designed for large-scale numerical data, making data collection and management easier. It lets you commit, branch, merge, revert, and collaborate in the data-defined software era. So let’s dive right into it!

Installation of Hangar

You can install Hangar using Anaconda with conda install -c conda-forge hangar. For a pip installation, use pip install hangar. For a source installation, clone the repository from GitHub and install Hangar:

git clone https://github.com/tensorwerk/hangar-py.git
cd hangar-py
python setup.py install

And you can get started with:

from hangar import Repository

Hangar Workflow

So let me first brief you on the Hangar workflow. It begins with creating a repository and initializing it, then activating a checkout and creating a branch. You can then access the samples and make changes, such as adding, removing, or changing samples.

Data Storage in Hangar

In Hangar, a Dataset is made of one or more Arraysets, each of which stores a collection of samples. The samples in an arrayset are numeric arrays of the same data type. A sample's dimensions can also be set smaller than or equal to the maximum shape of the arrayset itself.
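To make the shape rule concrete, here is a minimal plain-Python sketch of it. The helper fits_max_shape is hypothetical and written only for illustration; it is not part of Hangar, and it assumes that a sample must have the same number of dimensions as the arrayset's maximum shape.

```python
def fits_max_shape(sample_shape, max_shape):
    """Return True if every dimension of the sample is within the maximum.

    Hypothetical helper for illustration only; not Hangar's own validation.
    """
    # Assumption: the number of dimensions must match before sizes are compared.
    if len(sample_shape) != len(max_shape):
        return False
    # Each dimension must be smaller than or equal to its maximum.
    return all(s <= m for s, m in zip(sample_shape, max_shape))


max_shape = (784,)  # maximum shape of a flattened 28x28 image

print(fits_max_shape((784,), max_shape))    # exactly the maximum
print(fits_max_shape((500,), max_shape))    # smaller than the maximum
print(fits_max_shape((1000,), max_shape))   # too large
print(fits_max_shape((28, 28), max_shape))  # wrong number of dimensions
```

With NumPy data you would pass the shape attribute of the array, for example the shape of a single image. Whether an arrayset actually accepts samples smaller than its maximum depends on how it was created, so treat this as a reading aid rather than a description of Hangar's internals.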

Working with Hangar Repositories

In this blog, we are going to a) create a repository and initialize it, b) check out the repository, c) create arraysets in the repository, and d) add data into the arraysets.

a. Creating & Initializing Repository

As we know from the workflow, the very first step is to create a repository and initialize it. You use Repository() and provide the path where the repository should live. Note that the first time you do this, Python raises a warning asking you to run the initialization with the init() method.

repo = Repository(path='D:/hangar_doc/repos')

Out:

UserWarning: No repository exists at D:/repos\.hangar,please use `repo.init()` method
warnings.warn(msg, UserWarning)

You just created a repository object and assigned it to the name ‘repo’. Let’s go ahead and initialize it. Here’s how you can do it.

repo.init(user_name='Name of User', user_email='some_email.com', remove_old=True)

Out:

Hangar Repo initialized at: D:/repos\.hangar
'D:/repos\\.hangar'

Now that we have initialized our repository, let’s get to know how to work with Arraysets, activating the checkout and creating a branch.

b. Repository Checkout

A repository can be checked out in two different modes, namely ‘write-enabled’ and ‘read-only’. We need a ‘write-enabled’ checkout in order to initialize the arraysets and write into them. This can be done with:

co = repo.checkout(write=True)
co

Out:

Hangar WriterCheckout
Writer : True
Base Branch : master
Num Arraysets : 0
Num Metadata : 0

You can access and inspect the contents of the checkout through the co.arraysets and co.metadata attributes.

co.arraysets
co.metadata

Out:

Hangar Arraysets
Writeable: True
Arrayset Names / Partial Remote
Hangar Metadata
Writeable: True
Number of Keys: 0

c. Arrayset Initialization

Now that we have a write-enabled checkout, we need to initialize an arrayset with init_arrayset() in order to add data to the repository. Before we do this, let us first load some NumPy data. I’ll be using the MNIST data set and show you how to add it to the repository we just created. You can download the dataset from ‘mnist.pkl.gz’.

import gzip
import pickle
import numpy as np
# Open the gzip-compressed pickle; the with-block closes the file when done
with gzip.open('D:/hangar_doc/dataset/mnist.pkl.gz', 'rb') as file_:
    # The standard mnist.pkl.gz stores its splits in (train, validation, test) order
    train_DS, validation_DS, test_DS = pickle.load(file_, encoding='bytes')
# Extracting training images and labels into numpy arrays
image_arr = train_DS[0]  # image_arr.shape: (50000, 784)
label_arr = train_DS[1]  # label_arr.shape: (50000,)
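If you don’t have the MNIST file at hand, you can try the same loading pattern on a tiny stand-in file. The sketch below is only an illustration: the file name tiny.pkl.gz and the values in it are made up, and plain Python lists stand in for the NumPy arrays. It writes three (images, labels) pairs to a gzip-compressed pickle, reads them back, and checks that every image has a label.

```python
import gzip
import os
import pickle
import tempfile

# Stand-in data laid out as three (images, labels) pairs.
# The values are invented purely for illustration.
splits = (
    ([[0, 1], [1, 0], [1, 1]], [0, 1, 1]),  # training split
    ([[0, 0]], [0]),                        # second split
    ([[1, 1]], [1]),                        # third split
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny.pkl.gz")  # hypothetical file name

    # Write the gzip-compressed pickle.
    with gzip.open(path, "wb") as file_:
        pickle.dump(splits, file_)

    # Read it back; the with-block closes the file automatically.
    with gzip.open(path, "rb") as file_:
        train_split, second_split, third_split = pickle.load(file_)

images, labels = train_split

# Every image needs exactly one label before both go into arraysets.
assert len(images) == len(labels)
print(len(images), len(labels))
```

Running the same length check on the real image and label arrays is a cheap way to catch a mismatched download before any data is written to the repository.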

Now that we have the data ready, let us initialize an arrayset for the training images and another for their targets (image_arr & label_arr). Note that during initialization, we can provide a name for the arrayset and a prototype that determines the maximum size of the arrayset. For the prototype, you just provide a sample of the data, and Hangar calculates the size and allocates the space for you. I’m gonna name the arraysets train_image & train_label.

# sample value in image_arr --> image_arr[0] (shape: (784,))
arr_set_img = co.arraysets.init_arrayset(name='train_image', prototype=np.array(image_arr[0]))
# sample value in label_arr--> label_arr[0]
arr_set_label = co.arraysets.init_arrayset(name='train_label', prototype=np.array(label_arr[0]))

Now that we have initialized two arraysets, let us look at the contents of the checkout and of each arrayset. The arraysets object in the checkout summarizes its contents and state.

co.arraysets

Out:

Hangar Arraysets
Writeable: True
Arrayset Names / Partial Remote References:
- train_image / False
- train_label / False

Let’s inspect the train_image arrayset:

arr_set_img

Out:

Hangar ArraysetDataWriter
    Arrayset Name            : train_image
    Schema Hash              : 1=8022802f13bb
    Variable Shape           : False
    (max) Shape              : (784,)
    Datatype                 : <class 'numpy.uint8'>
    Named Samples            : True
    Access Mode              : a
    Number of Samples        : 0
    Partial Remote Data Refs : False

Once the task is complete, close the checkout with co.close().

d. Adding Data to Arraysets

To add data to the initialized arraysets, use the add() method. It's as simple as add(data=sample_array, name='name to be given').

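co = repo.checkout(write=True)
# Fetch the arraysets from the new checkout; handles taken from an
# earlier, already closed checkout should not be reused
arr_set_img = co.arraysets['train_image']
arr_set_label = co.arraysets['train_label']
with arr_set_img, arr_set_label:
    for index, image in enumerate(image_arr):
        # Use the same name for an image and its label so they can be paired later
        arr_set_img.add(data=image, name=str(index))
        arr_set_label.add(data=np.array(label_arr[index]), name=str(index))
co.close()  # Closing the checkout after completion of task.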

We have now successfully created our repository, learned to use checkouts, created arraysets, and added data to them. With that, we have come to the end of the article. I am excited to know how it worked for you. The entire notebook is available here. And don’t forget to give your 👏! More cool articles are lined up. Stay tuned!
