Extending Activeloop Hub capabilities to handle Waymo Open Dataset
Too much time is spent on setting up the data. With well-designed data pipelines, rapid iterations of machine learning experiments will result in models with superhuman accuracy much faster.
We are releasing simple yet powerful Python-native access to the Waymo Open Dataset. The Hub package lets you stream any chunk of the data to your local machine. It can be used not only for fast exploration and visualization but also for directly training machine learning models.
Waymo Open Dataset
Waymo published one of the very first large-scale autonomous driving datasets for the research community. It includes a high-quality multimodal sensor dataset that covers a wide variety of environments. The goal is to help researchers make advances in 2D and 3D perception, including scene understanding and behavior prediction.
The data is about 2TB after compression. To access it, you download 89 tar files, uncompress them into 1950 .tfrecords files, and then use the waymo_open_dataset package to load the data. Snark Hub simplifies access to this data and can significantly reduce download and access time.
At Snark, we have released an open-source package called Hub to manage large-scale datasets. The package lets you represent large arrays on the cloud or on remote storage as if they were local NumPy arrays. We want to simplify access to the data for exploration and ML training purposes.
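To make the idea of "remote arrays that behave like local ones" concrete, here is a minimal sketch of on-demand chunk fetching. It is not how Hub is implemented: the class `ChunkedRemoteArray`, the helper `make_demo_array`, and the in-memory dictionary standing in for cloud storage are all hypothetical, and the example is one-dimensional and pure Python for brevity.

```python
# Hypothetical sketch: none of these names come from the Hub package.
class ChunkedRemoteArray:
    """A 1-D array split into fixed-size chunks that are fetched on demand."""

    def __init__(self, remote_chunks, chunk_size, length):
        self._remote = remote_chunks   # stands in for cloud storage: chunk id -> list of values
        self._chunk_size = chunk_size
        self._length = length
        self._cache = {}               # chunks that were already "downloaded"
        self.fetch_count = 0           # number of simulated downloads

    def _fetch(self, chunk_id):
        # Download a chunk only the first time it is needed.
        if chunk_id not in self._cache:
            self.fetch_count += 1
            self._cache[chunk_id] = self._remote[chunk_id]
        return self._cache[chunk_id]

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(self._length))]
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("index out of range")
        chunk_id, offset = divmod(index, self._chunk_size)
        return self._fetch(chunk_id)[offset]


def make_demo_array(length=100, chunk_size=10):
    """Build a demo array whose 'remote' store holds the values 0..length-1."""
    remote = {
        chunk_id: list(range(start, min(start + chunk_size, length)))
        for chunk_id, start in enumerate(range(0, length, chunk_size))
    }
    return ChunkedRemoteArray(remote, chunk_size, length)


demo = make_demo_array()
values = demo[23:27]               # touches only the chunk holding indices 20..29
print(values, demo.fetch_count)    # prints: [23, 24, 25, 26] 1
```

Reading four elements triggers a single simulated download, which is the property that makes streaming a small slice of a multi-terabyte dataset practical.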
1. Register at Waymo
To access the data, you will need to register at Waymo Open and accept the license agreement. As noted on their website, it may take up to 2 business days to be granted access to their Google Cloud Storage Bucket.
Then authenticate with Google Cloud from your terminal by running:
gcloud auth application-default login
2. Install the Hub Package
Install the Python package by running:
pip3 install hub==0.5
3. Stream the data
Enjoy simple yet powerful access to the data inside your Python script.
import hub
waymo = hub.gs('waymo_open_dataset_snark').connect()
ds = waymo.dataset_open('v1/training')
ds['images'].shape # [158361, 5, 1280, 1920, 3]
ds['images'][0,0].mean() # 106.92709309895834
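The shape above gives a sense of why streaming matters. The following back-of-the-envelope sketch computes the raw size of the image array from that shape. It assumes one byte per value (uint8 pixels); the post does not state the dtype, so treat the numbers as an estimate under that assumption.

```python
from math import prod

# Shape reported above for ds['images']:
# (frames, cameras, height, width, channels)
images_shape = (158361, 5, 1280, 1920, 3)

# Assumption: one byte per value (uint8 pixels); the dtype is not stated in the post.
BYTES_PER_VALUE = 1

bytes_per_image = prod(images_shape[2:]) * BYTES_PER_VALUE   # a single camera image
bytes_per_frame = prod(images_shape[1:]) * BYTES_PER_VALUE   # all five cameras of one frame
bytes_total = prod(images_shape) * BYTES_PER_VALUE           # the whole array

print(f"one image : {bytes_per_image / 1e6:.1f} MB")
print(f"one frame : {bytes_per_frame / 1e6:.1f} MB")
print(f"all frames: {bytes_total / 1e12:.2f} TB")
```

Under that assumption a single image is about 7.4 MB while the full array is several terabytes uncompressed, so fetching only the indexed slice, as in `ds['images'][0,0]`, avoids moving almost all of the data.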
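To visualize a single image without downloading the rest of the data, run:
import hub
from PIL import Image
waymo = hub.gs('waymo_open_dataset_snark').connect()
camera = waymo.array_open('v1/training/images')
# Show the five camera views of frame 10000
for i in range(0, 5):
    img = camera[10000, i]
    # Assumption: img is a uint8 RGB array; this display call is not in the original snippet
    Image.fromarray(img).show()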
5. Laser point clouds
You can also go beyond images and access laser point clouds and labels.
# Open the dataset
ds_train = waymo.dataset_open('v1/training')
ds_val = waymo.dataset_open('v1/validation')
# Get all arrays from the dataset
6. Accessing v1.2 version for Waymo Dataset Challenge
In the same way, you can access and explore the v1.2 version, which is used for the Waymo dataset challenge, by changing the version prefix in the dataset path.
ds_train = waymo.dataset_open('v1.2/training')
dict_keys(['labels', 'lasers_camera_projection', 'images', 'lasers_range_image'])
[158081, 5, 2, 200, 2650, 4]
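The paths used in this post follow a version/split pattern. The helper below is a small sketch of that naming; `dataset_path` is a hypothetical function, not part of Hub, and only 'v1/training', 'v1/validation' and 'v1.2/training' actually appear in this post, so any other combination is an assumption to check against the bucket.

```python
def dataset_path(version: str, split: str) -> str:
    """Join a dataset version and a split name into a path such as 'v1.2/training'."""
    return f"{version}/{split}"


# The three paths that appear in this post.
paths = [
    dataset_path("v1", "training"),
    dataset_path("v1", "validation"),
    dataset_path("v1.2", "training"),
]
print(paths)
```

Each resulting string can be passed to `waymo.dataset_open(...)` in the same way as the literal paths shown earlier.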
You can also access the domain adaptation datasets.
We plan to provide future tutorials to let you directly train machine learning models while streaming the data through data pipelines from Hub.
Thanks to Waymo for hosting the data backend.
 Sun, Pei, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo et al. “Scalability in Perception for Autonomous Driving: An Open Dataset Benchmark.” arXiv preprint arXiv:1912.04838 (2019).