3 Things You Need to Deal with in Data Management to Create the Best Dataset

Kenichi Higuchi
5 min read · Apr 30, 2022


Hi, this is Kenny, a product manager at Adansons Inc.

We are working on building a tool to automate tedious tasks that are often encountered in machine learning projects. We are looking for test users, and we would very much appreciate your feedback!

Get Invitation Form ↓↓↓

It takes too much effort before data is ready to be analyzed with code!

It is often said that annotating data takes a lot of manpower. However, I noticed from my own experience that it also takes a lot of manual effort to connect data files with metadata such as annotations.

This is especially true for unstructured data like images and audio, where metadata cannot easily be embedded in the files themselves: you either have to implement a complex data loader that selects data while cross-referencing multiple CSV files, or encode the metadata in folder structures and file names.

That works if the metadata is well structured, but different annotators sometimes use slightly different column names and orderings. When that happens, someone has to manually correct the structure of the data files or write a data loader filled with if-statements, which leads to complex, hard-to-maintain code.

In addition, encoding metadata in a folder structure not only makes the metadata difficult to update and modify, but also breaks the code and data loader functions whenever project members place the data differently in their local environments.
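To make this concrete, here is a minimal sketch of the kind of loader we mean. The file layout, column names, and labels below are made-up examples, but the pattern should look familiar:

```python
import csv
from pathlib import Path

def load_samples(data_dir: str, meta_csv: str) -> list[dict]:
    # Read annotations from an external CSV into a lookup table.
    metadata = {}
    with open(meta_csv, newline="") as f:
        for row in csv.DictReader(f):
            # Different annotators used different column names,
            # so the loader has to branch on whichever exists.
            file_id = row.get("file_id") or row.get("FileID") or row.get("id")
            label = row.get("label") or row.get("Label") or row.get("class")
            metadata[file_id] = label

    samples = []
    for path in Path(data_dir).rglob("*.wav"):
        # More metadata is encoded in the folder structure itself,
        # e.g. data/<speaker>/<session>/<file>.wav
        speaker, session = path.parts[-3], path.parts[-2]
        samples.append({
            "path": str(path),
            "speaker": speaker,
            "session": session,
            "label": metadata.get(path.stem),
        })
    return samples
```

This code only works as long as every CSV keeps the same columns and every machine keeps the same folder layout; change either and it silently breaks.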

This kind of preprocessing is not something data scientists want to do (nor should be doing), and yet many people do it manually, and it turns into technical debt. We think that the ability to add and edit metadata more flexibly and to create datasets more easily would not only solve this issue but also enable data scientists to gain deeper insights in a data-centric way, by filtering data on more detailed criteria.

Andrew Ng, “MLOps: From Model-centric to Data-centric AI”, 2021, https://www.deeplearning.ai/wp-content/uploads/2021/06/MLOps-From-Model-centric-to-Data-centric-AI.pdf (p. 20)

We made a tool that lets you easily filter through data without implementing code that reads folder names and external files like CSVs!

In short, it is a tool to create an organized database by extracting metadata contained in external files and folder structures.

You might think this doesn’t sound much different from a so-called data catalog, so let us introduce a few points that set our product apart from other services.

Point 1: Extract metadata while handling fluctuations in the structure

A structure that is easy for humans to read is not necessarily easy for machines to analyze (a machine-readable structure).

Even when data is automatically generated by machines, the data pipeline might suddenly break when specification changes are miscommunicated between teams.

Since different users prefer different formats, there should be an interface that converts between the formats that are convenient for each data generator and each data analyzer.

A sample of an unstructured CSV or Excel file
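One way to absorb this kind of fluctuation is to map every source’s columns onto one canonical schema. Here is a simplified sketch in plain Python (not our actual implementation; the alias table is a made-up example):

```python
import pandas as pd

# Map each source's column names onto one canonical schema.
# The alias table below is a hypothetical example.
CANONICAL_ALIASES = {
    "file_id": ["file_id", "FileID", "id", "filename"],
    "label":   ["label", "Label", "class", "annotation"],
    "subject": ["subject", "Subject", "patient_id"],
}

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    rename = {}
    for canonical, aliases in CANONICAL_ALIASES.items():
        for col in df.columns:
            if col.strip() in aliases:
                rename[col] = canonical
    # reindex gives every source the same machine-readable structure,
    # filling columns a source lacks with NaN.
    return df.rename(columns=rename).reindex(columns=list(CANONICAL_ALIASES))

# Two files with different layouts collapse into one clean table:
# table = pd.concat([normalize(pd.read_csv(p)) for p in ("a.csv", "b.csv")])
```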

Point 2: Detect spelling inconsistencies and duplicates in metadata and automatically integrate additional metadata to an existing DB

Another thing that requires a lot of manual work in data preparation is handling spelling inconsistencies in metadata.

For example, columns of a data table might have slightly different names or data types (string, numbers, …) depending on the annotator or the machine that generated the data.

It is definitely not easy to fix all these errors by hand, and it is not good practice to write code that handles these inconsistencies with many if-statements when linking information across metadata tables.

We need a system that is tolerant of these notational fluctuations.
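As a rough illustration of what “tolerant” means here (again a plain-Python sketch, not Base’s internals; the vocabulary is hypothetical), fuzzy string matching can collapse near-duplicate spellings into one canonical value:

```python
import difflib

# Canonical vocabulary for one metadata column (hypothetical example).
canonical_values = ["healthy", "pneumonia", "unknown"]

def canonicalize(value: str) -> str | None:
    # get_close_matches performs fuzzy matching, so misspellings and
    # stray whitespace/casing still map to the canonical spelling.
    matches = difflib.get_close_matches(
        value.strip().lower(), canonical_values, n=1, cutoff=0.8
    )
    return matches[0] if matches else None

print(canonicalize("Pnuemonia"))  # -> "pneumonia"
print(canonicalize("helathy "))   # -> "healthy"
```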

Point 3: Upload only hash values of data files such as images to make it a secure, light, and fast database

In many cases, metadata is stored in formats that are relatively small in size, like CSV, Excel, and JSON. Data files such as images and audio, however, are often large both in number and size.

It is not realistic to upload a large amount of data every time we create datasets, and in terms of confidentiality, the data files should be kept securely under the user’s own control.

The IDs of data files alone are enough to link metadata to the actual files, and using hash values of the file contents as those IDs lets you edit folder structures and file names freely without breaking the links. This way, we can prevent the problem of code failing due to mismatched folder structures between different environments.
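In plain Python, the idea looks roughly like this (a sketch for illustration, not Base’s actual implementation; the function names are just for this example):

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    # The database stores only this content hash per file, so moving
    # or renaming files locally never breaks the metadata link.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large audio/image files stay memory-friendly.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def resolve(record_hash: str, data_dir: str) -> Path | None:
    # Find whichever local path currently holds that content,
    # regardless of folder structure or file name.
    for p in Path(data_dir).rglob("*"):
        if p.is_file() and file_hash(p) == record_hash:
            return p
    return None
```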

How to use this tool to organize the dataset for your own project

We are currently making some features of this product publicly available for free and are looking for test users!

  1. Register your email address on the Form
  2. Join our Slack workspace
  3. Receive an access key from the bot on Slack
  4. Install Base from GitHub with pip and register the access key the first time you run the command
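(For step 4, the command is likely along the lines of `pip install git+https://github.com/adansons/base` — see the GitHub README for the exact invocation.)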

And please give us feedback!

Get Invitation Form ↓↓↓

Form

https://github.com/adansons/base


Kenichi Higuchi

PdM & Engineer & Director at Adansons Inc. / Medical Student at Tohoku Univ. / pursuing the next generation of AI / 1st product → https://adansons.wraptas.site