Data selection for machine learning — a challenge for automated driving data

Mark Pfeiffer
SiaSearch (now Scale Nucleus)
3 min read · Jul 24, 2020

Applications in the field of computer vision depend heavily on high-quality data. Data-driven development offers huge potential, but the performance of these systems is determined largely by the data provided at training time. To ensure reliable system performance, the right data needs to be selected for model training and testing.

The majority of machine learning applications rely on supervised approaches for model training. In some industries (e.g. e-commerce, online ads) the generated data is automatically labeled by customers during normal usage. Computer vision applications, however, typically require a significant amount of manual work to encode human knowledge into the systems and generate labeled data to learn from. This slow and expensive process is the bottleneck of most modern machine learning pipelines.

Data selection for perception model training

Although massive amounts of data are required to train typical computer vision models, it is not just about the sheer volume of data. Some samples add more value to the training process than others and should therefore be prioritized for labeling, especially when budget and time are limited.

We therefore need to decide which data to label without manually inspecting or labeling it first. Content-based analysis, also referred to as preliminary feature extraction, gives a good indication of which data will be most useful and where it is located.

Here’s a simple example: you want to train a pedestrian detector but only have the budget to label 5,000 images. When images are recorded from a vehicle, subsampling in space or time would yield many images with no pedestrians at all, or only ones that are very easy to detect. While such images are also needed, it is desirable to quickly identify images with a high probability of containing pedestrians, for example by cross-referencing other sensors or by context understanding on the image itself. In the case of automated driving, this challenge is intensified by two factors:

  1. Data is collected at an entirely new scale
    A single intelligent vehicle can produce more than 4 TB of data per day.
  2. The data is multimodal and unstructured
    Unstructured data (video, GPS, radar, etc.) in a variety of formats cannot easily be searched or aggregated, let alone fed directly into machine learning pipelines.

What the perception engineer needs is a way to search, access and deploy the raw data based on higher level features or metadata.
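In code, this selection step might look like the following minimal sketch. The `pedestrian_score` function stands in for any cheap proxy signal (a lightweight detector, map context, other sensors); the function and field names are illustrative assumptions, not an actual SiaSearch implementation.

```python
import heapq
import random

def pedestrian_score(frame):
    """Hypothetical cheap proxy for the probability that a frame contains
    pedestrians. In practice this could be a lightweight detector or
    context cues; here we simulate a deterministic score in [0, 1]."""
    random.seed(frame["id"])
    return random.random()

def select_for_labeling(frames, budget):
    """Return the `budget` frames most likely to contain pedestrians,
    instead of uniformly subsampling in space or time."""
    return heapq.nlargest(budget, frames, key=pedestrian_score)

# Simulated recording: 10,000 frames, but budget for only 50 labels.
frames = [{"id": i} for i in range(10_000)]
selected = select_for_labeling(frames, budget=50)
```

Uniform subsampling would mostly return frames with low scores; ranking by the proxy concentrates the labeling budget on the frames the detector can learn the most from.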

Although large amounts of data are being recorded, only small shares are easily available to engineers for development.

The challenge for machine learning engineers

These factors combine into a unique challenge and, for ML engineers in particular, an obstacle to improving training data.
Without the relevant metadata, running SQL-like searches on the raw unstructured data to identify interesting situations (e.g. busy intersections, jaywalking, …) is simply not possible. Only simple queries, e.g. on velocity or acceleration values, are feasible, and even those are typically very slow. Since ML engineers have to cope with petabytes of data, finding the right sequence manually can take days.
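To make this concrete, here is a minimal sketch of the kind of metadata-backed query described above, using an in-memory SQLite table. The table and column names (`frame_metadata`, `num_pedestrians`, `scene_type`, `ego_velocity`) are purely illustrative assumptions, not SiaSearch's actual schema.

```python
import sqlite3

# Hypothetical metadata table produced by automatic feature extraction;
# one row per recorded frame, columns chosen for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE frame_metadata (
        frame_id INTEGER PRIMARY KEY,
        ego_velocity REAL,        -- m/s
        num_pedestrians INTEGER,
        scene_type TEXT           -- e.g. 'intersection', 'highway'
    )
""")
conn.executemany(
    "INSERT INTO frame_metadata VALUES (?, ?, ?, ?)",
    [(1, 2.5, 6, "intersection"),
     (2, 33.0, 0, "highway"),
     (3, 1.0, 9, "intersection")],
)

# With metadata in place, "busy intersection" becomes a cheap SQL query
# instead of a manual scan through petabytes of raw sensor data.
busy = conn.execute(
    "SELECT frame_id FROM frame_metadata "
    "WHERE scene_type = 'intersection' AND num_pedestrians >= 5"
).fetchall()
print(busy)  # [(1,), (3,)]
```

The point is not the specific database: once higher-level features are extracted and indexed, queries that would otherwise require watching hours of footage reduce to milliseconds of metadata lookup.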

The result is manual work at large scale, which creates inefficiencies in the workflow and lower-quality models. A new type of data platform software is needed — this is what SiaSearch is designed for.

Get in touch

Do you want to learn how we at SiaSearch help ML engineers overcome this challenge with automatic metadata generation? Sign up for our newsletter or for a demo, and stay tuned!
