Extending Activeloop Hub capabilities to handle Waymo Open Dataset

Activeloop · Apr 1, 2020

Too much time is spent setting up data. With well-designed data pipelines, rapid iteration on machine learning experiments gets you to highly accurate models much faster.

We are releasing simple yet powerful, Python-native access to the Waymo Open Dataset [1]. The Hub package lets you stream any chunk of the data to your local machine. You can use it not only for fast exploration and visualization but also to train machine learning models directly.

Waymo Open Dataset

Waymo published one of the first large-scale autonomous driving datasets for the research community: a high-quality multimodal sensor dataset covering a wide variety of environments. The goal is to help researchers advance 2D and 3D perception, including scene understanding and behavior prediction.

The data is about 2 TB after compression. To access it the conventional way, you download 89 tar files, uncompress them into 1,950 .tfrecords files, and then load the data with the waymo_open_dataset package. Snark Hub simplifies access to this data and significantly reduces download and access time.
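
For comparison, the conventional path looks roughly like this, following the pattern from Waymo's published quickstart (the file name below is a placeholder, not a real segment):

import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

# 'segment-XXXX.tfrecord' stands in for one of the 1,950 record files
dataset = tf.data.TFRecordDataset('segment-XXXX.tfrecord', compression_type='')
for data in dataset:
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    print(len(frame.images))  # one CameraImage per camera
    break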

Snark Hub

At Snark, we have released an open-source package called Hub to manage large-scale datasets. The package lets you work with large arrays on cloud or remote storage as if they were local NumPy arrays. Our goal is to simplify access to data for exploration and ML training.

https://github.com/snarkai/Hub

Get Access

1. Register at Waymo

To access the data, you will need to register at Waymo Open and accept the license agreement. As noted on their website, it may take up to 2 business days to be granted access to their Google Cloud Storage Bucket.

Then, authenticate with Google Cloud inside your terminal by running:

gcloud auth application-default login
gcloud init

2. Install the Hub Package

Install the Python package by running:

pip3 install hub==0.5

3. Stream the data

Enjoy simple yet powerful access to the data inside your Python script.

import hub

# Connect to the public bucket and open the training split
waymo = hub.gs('waymo_open_dataset_snark').connect()
ds = waymo.dataset_open('v1/training')

ds['images'].shape         # [158361, 5, 1280, 1920, 3]
ds['images'][0, 0].mean()  # 106.92709309895834
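
Because indexing is lazy, you can pull just the slice you need. As a sketch, assuming slicing returns NumPy arrays (as the .mean() call above suggests), fetching a small batch from one camera downloads only those chunks; which physical camera index 0 corresponds to is not specified here:

# Pull only the first 8 frames from camera 0; nothing else is downloaded
batch = ds['images'][0:8, 0]
print(batch.shape)  # expected: (8, 1280, 1920, 3)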

4. Visualize

To visualize a single frame's images without downloading the rest of the dataset, simply run:

import hub
from PIL import Image

waymo = hub.gs('waymo_open_dataset_snark').connect()
camera = waymo.array_open('v1/training/images')

# Save the five camera views of frame 10000
for i in range(0, 5):
    img = camera[10000, i]
    Image.fromarray(img, 'RGB').save(f'image-{i}.jpg')

5. Laser point clouds

You can also go beyond images and access laser point clouds and labels.

import hub
waymo = hub.gs('waymo_open_dataset_snark').connect()

# Open the training and validation splits
ds_train = waymo.dataset_open('v1/training')
ds_val = waymo.dataset_open('v1/validation')

# Print the shape of every array in the dataset
print(ds_train['images'].shape)
print(ds_train['lasers_range_image'].shape)
print(ds_train['lasers_camera_projection'].shape)
print(ds_train['labels'].shape)
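
To peek at the laser data itself, slice a single frame the same way. How the axes map to lidars and returns is our reading of the v1.2 shape shown in the next section, so treat it as an assumption:

# Range image of the first frame; judging from the v1.2 shape below,
# the leading axes are presumably (lidar, return) -- an assumption
ri = ds_train['lasers_range_image'][0]
print(ri.shape)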

6. Accessing the v1.2 version for the Waymo Dataset Challenge

In the same way, you can access and explore the v1.2 version, used for the Waymo dataset challenge, by just changing the version in the path.

> ds_train = waymo.dataset_open('v1.2/training')
> print(ds_train.paths.keys())
dict_keys(['labels', 'lasers_camera_projection', 'images', 'lasers_range_image'])
> ds_train['lasers_range_image'].shape
[158081, 5, 2, 200, 2650, 4]

You can also access the domain adaptation datasets:

> waymo.dataset_open('v1.2/domain_adaptation/training')
> waymo.dataset_open('v1.2/domain_adaptation/training/unlabeled')
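
As before, you can list which arrays a split exposes, assuming the domain adaptation splits share the same interface:

> ds_da = waymo.dataset_open('v1.2/domain_adaptation/training')
> print(ds_da.paths.keys())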

Next Step

In future tutorials, we plan to show how to train machine learning models directly while streaming the data through Hub's data pipelines.
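
As a preview, here is a minimal, hypothetical sketch of what that could look like with PyTorch. The Hub calls are the ones used above; the PyTorch wrapper and the assumption that slicing returns NumPy arrays are ours:

import hub
import torch
from torch.utils.data import Dataset, DataLoader

class WaymoImages(Dataset):
    # Sketch: streams one camera's images frame by frame from Hub
    def __init__(self):
        waymo = hub.gs('waymo_open_dataset_snark').connect()
        self.images = waymo.dataset_open('v1/training')['images']

    def __len__(self):
        return self.images.shape[0]

    def __getitem__(self, idx):
        # Assumes slicing returns a NumPy array; only this chunk is fetched
        img = self.images[idx, 0]
        return torch.from_numpy(img).permute(2, 0, 1)  # HWC -> CHW

loader = DataLoader(WaymoImages(), batch_size=8)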

Acknowledgment

Thanks to Waymo for hosting the data backend.

[1] Sun, Pei, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo et al. “Scalability in Perception for Autonomous Driving: An Open Dataset Benchmark.” arXiv preprint arXiv:1912.04838 (2019).
