How to Efficiently Manage Storage for High-Volume Data Annotation Projects

See how Xtreme1 approaches the challenge of storing and managing the large volumes of annotated data needed to train AI models.

Xtreme1
Multimodal Data Training
4 min read · Feb 27, 2023

AI is transforming everything, and the future looks bright for a more efficient, productive, and connected world. With autonomous vehicles, robots, and smart homes on the horizon, we can expect safer, more efficient, and more convenient ways of living. However, as we look at these advancements from a technical perspective, there are still challenges to overcome.

Whether for ADAS or robots, no AI application can take shape without trained AI models, and those models cannot be trained without a large amount of high-quality, manually annotated data. Two kinds of data are involved:

  1. Data to be annotated (point clouds, images, speech, text, and so on), and
  2. Annotated objects (2D/3D bounding boxes, polygons, point clouds, and so on).

The data to be annotated consists of files, often quite large, so object storage is the conventional choice. The annotated objects are stored in JSON format, with a single object ranging from several hundred bytes to several hundred KB. Point cloud segmentation sits at the upper end of that range, because the information for every point in the segmented area has to be saved. The traditional approach is to store these object records in a database, which supports more complex queries but faces the following challenges:

Storage Challenges:

For 1,000 tenants, each with 1 million Data and 10 annotated Objects per Data, the total number of Objects is 10 billion. Assuming an Object size of 1 KB, the required storage space is 10 TB. That is the volume generated by a single annotation source; with multiple annotation sources, the data volume multiplies accordingly.
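The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures from the paragraph; the function name, the `sources` parameter, and the use of decimal units (1 KB = 1,000 bytes) are assumptions made for illustration.

```python
# Back-of-the-envelope estimate using the figures quoted in the text.
TENANTS = 1_000
DATA_PER_TENANT = 1_000_000
OBJECTS_PER_DATA = 10
BYTES_PER_OBJECT = 1_000  # 1 KB, decimal units (assumption)


def estimate_storage(sources: int = 1) -> tuple[int, float]:
    """Return (object count, terabytes) for a given number of annotation sources."""
    objects = TENANTS * DATA_PER_TENANT * OBJECTS_PER_DATA * sources
    terabytes = objects * BYTES_PER_OBJECT / 1e12
    return objects, terabytes


objects, tb = estimate_storage()
print(f"{objects:,} objects, {tb:.0f} TB")  # 10,000,000,000 objects, 10 TB
```

Calling `estimate_storage(sources=3)` shows how the volume scales with the number of annotation sources: three sources would need roughly 30 TB before any database replication is counted.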

Read/Write Challenges:

While storage challenges can be addressed by adding nodes and disks, read/write challenges are trickier because read/write performance does not scale linearly with the number of nodes. For frame series annotation, several hundred Data with thousands of Objects are loaded at once when the annotation tool is opened, which means several thousand records (from several MB to several hundred MB in total) are read from the database in one go. Submitting creates the same pressure on the write side. A database solution that can barely support low-concurrency private deployments becomes impractical for SaaS, where thousands of users work simultaneously.

Faced with various types of data and a large amount of annotation work, how does the Xtreme1 team design a storage solution?

Solution:

Dealing with large data volumes and read/write loads can be challenging, even when using distributed relational or NoSQL databases. The maintenance cost can be prohibitively high, and these distributed databases typically rely on multiple replicas, leading to a storage expansion of two or three times the original size.

In such cases, abandoning database storage and keeping annotated Objects in files may be the better option. The annotation results of each annotation source are stored as a separate file, and every read or submission of annotation results works at that file granularity. Object storage is designed to hold massive numbers of small files, which makes it a good fit for annotation results.
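One way to picture this file granularity is sketched below. The key layout, function names, and JSON shape are hypothetical and are not taken from the Xtreme1 codebase; the sketch also assumes one file per Data per annotation source, and uses a local temporary directory as a stand-in for an object storage bucket.

```python
import json
import tempfile
from pathlib import Path


def result_key(tenant_id: int, dataset_id: int, data_id: int, source_id: str) -> str:
    """Build a hypothetical object key: one file per Data per annotation source."""
    return f"tenant-{tenant_id}/dataset-{dataset_id}/data-{data_id}/{source_id}.json"


def write_result(root: Path, key: str, objects: list[dict]) -> None:
    """Overwrite the whole result file; there are no per-Object updates."""
    path = root / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"objects": objects}))


def read_result(root: Path, key: str) -> list[dict]:
    """Load every Object of one Data from one source in a single read."""
    return json.loads((root / key).read_text())["objects"]


# A local directory stands in for the object storage bucket (assumption).
root = Path(tempfile.mkdtemp())
key = result_key(tenant_id=1, dataset_id=7, data_id=42, source_id="manual")
write_result(root, key, [{"id": "a1", "type": "3D_BOX", "class": "car"}])
loaded = read_result(root, key)
```

Because each submission replaces the whole file, there is no need to track which individual Objects changed; the trade-off is that even a small edit rewrites the full result for that Data.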

Object Storage Solution

In this design, the browser reads and writes annotation result files directly against the object storage service, which significantly reduces the load on the API service and the database. In a continuous frame annotation scenario, the browser may need to download or upload several hundred files at once, and the browser's limit on concurrent connections to a single host can lengthen the overall transfer time. If this becomes a performance bottleneck, a “batch request proxy service” can be added to download and upload files in batches.
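The core of such a batch proxy could look roughly like the sketch below: the server receives a list of keys in one request and fetches them in parallel, free of the browser's connection limit. The function name `batch_download`, the `fetch` callable, and the in-memory `fake_bucket` are all hypothetical; a real service would plug in its object storage client in place of `fetch`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def batch_download(
    keys: list[str],
    fetch: Callable[[str], bytes],
    max_workers: int = 32,
) -> dict[str, bytes]:
    """Fetch many result files in parallel and return them keyed by object key."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so keys and bodies line up.
        bodies = pool.map(fetch, keys)
        return dict(zip(keys, bodies))


# In-memory stand-in for a bucket holding 300 result files (assumption).
fake_bucket = {f"data-{i}/manual.json": b'{"objects": []}' for i in range(300)}
results = batch_download(list(fake_bucket), fake_bucket.__getitem__)
```

Threads suit this workload because the time is spent waiting on network I/O rather than on computation. The browser then makes one request to the proxy instead of several hundred to the storage service.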

The file storage solution offers several benefits compared to the database storage solution.

  1. The file storage solution turns random database reads and writes into sequential file reads and writes. For instance, with 300 Data in a frame series, each containing 30 Objects, a single load or submission of continuous frame annotation results requires random reads or writes of 9,000 Objects in the database storage solution. The file storage solution only needs to read or write 300 files, which is a large performance difference.
  2. Removing the annotation results from the database reduces the number of records and storage size by 1–2 orders of magnitude, leading to significant cost reductions and preventing any impact on the read and write of other business data.
  3. The file storage solution supports massive annotation result storage. Public cloud object storage services such as AWS S3 or Alibaba Cloud OSS offer practically unlimited storage space at a lower cost than database SSD disks. For private deployments, open-source MinIO can be used as a replacement.
  4. The browser can read and write the annotation results directly from the object storage service, reducing the bandwidth demand on the API service, which is especially useful when using public cloud object storage services with larger bandwidth and lower prices.
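The operation counts in the first benefit follow directly from the example figures. The helper below is only an illustrative sketch (the function name and dictionary keys are made up); it counts I/O operations and says nothing about the latency of each one.

```python
def io_operations(data_count: int, objects_per_data: int) -> dict[str, int]:
    """Count the reads (or writes) needed to load (or submit) one frame series."""
    return {
        # Database storage: one record per annotated Object.
        "database_records": data_count * objects_per_data,
        # File storage: one result file per Data.
        "result_files": data_count,
    }


ops = io_operations(data_count=300, objects_per_data=30)
print(ops)  # {'database_records': 9000, 'result_files': 300}
```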

Of course, every solution has its pros and cons. One disadvantage of the file storage solution is that it cannot support annotation result queries at the Object granularity, such as displaying the Object list under a specific dataset and filtering it by certain attributes. To support such queries, a dedicated search system such as Elasticsearch can be added, indexing only the fields that need to be searched, which keeps the indexed data far smaller than the full annotation results.
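The idea of indexing only the searchable fields can be sketched as a simple projection step that runs before a document is sent to the search system. The field names, the `SEARCHABLE_FIELDS` tuple, and the `to_index_doc` helper are hypothetical; the actual call to the search system is left out on purpose.

```python
# Hypothetical set of Object attributes that users filter on.
SEARCHABLE_FIELDS = ("id", "class", "type")


def to_index_doc(obj: dict, dataset_id: int, data_id: int) -> dict:
    """Keep only searchable fields, plus the ids needed to locate the result file."""
    doc = {field: obj[field] for field in SEARCHABLE_FIELDS if field in obj}
    doc.update(dataset_id=dataset_id, data_id=data_id)
    return doc


# A segmentation Object whose bulk is the point list, which is never searched.
full_object = {
    "id": "a1",
    "class": "car",
    "type": "SEGMENTATION",
    "points": [[0.1, 0.2, 0.3]] * 10_000,
}
doc = to_index_doc(full_object, dataset_id=7, data_id=42)
```

The index document carries a few short fields while the heavy geometry stays in the result file, so a query can return matching Objects and the `data_id` tells the client which file to open for the full details.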


Xtreme1 - World's 1st Open-Source Platform for Multisensory Training Data. Find us on GitHub: https://github.com/xtreme1-io/xtreme1