How To Create a Custom COCO Dataset from Scratch

In this tutorial, you will learn how to collaboratively create a custom COCO dataset, starting with ideation.

Published in DataTorch · 12 min read · Jan 20, 2021

Our Mission: Create a COCO dataset for Lucky Charms detection and classification.

Sometimes, you just want to use neural nets to build something cool. But one of the biggest barriers to entry for deep learning projects is the lack of relevant datasets to train your models on.

Building a custom dataset for computer vision isn’t actually that difficult, with the right tools. And if you have collaborators helping you, it can be downright easy.

That’s why, after stepping on the 1000th piece of Lucky Charms breakfast cereal left on the ground by my baby nephew, I had the idea to create a dataset for training models to automatically classify and detect these pieces of cereal.

This article is a step-by-step guide to building a COCO dataset from scratch with two collaborators (although you can do this on your own as well). As a running example, I use the Lucky Charms dataset that I created with help from my little brother and sister.

Tools

Team of 3 people (optional)
DataTorch
iPhone X

We used a free tool called DataTorch as our primary dataset creation tool. It makes it extremely easy to consolidate data and distribute annotation tasks, and it’s free to use collaboratively for teams of up to 3 people. Other than that, we will just be using our iPhone cameras to gather data.

Step 1: Plan Dataset

Sit down and really think about what you want your dataset to look like. What exactly do you want your annotation classes to be? You want numerous high-quality representations of each class (in our case, the 12 different types of cereal bits). Each class should be represented in a roughly equal distribution, so the model can train in an unbiased manner.

Our dataset had 12 classes total: 4 cereal classes (fish, cross, tree, bell) and 8 marshmallow classes (moon, unicorn, rainbow, balloon, heart, star, horseshoe, clover).
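If it helps to pin the plan down before touching any tool, the class list can be written out as plain data and sanity-checked. The snippet below is only a sketch: the class names come from the list above, while the parent group names ("cereal", "marshmallow") and the helper functions are my own illustrative choices, not anything DataTorch requires.

```python
# Planned classes, grouped under two parent names chosen for illustration.
LABEL_HIERARCHY = {
    "cereal": ["fish", "cross", "tree", "bell"],
    "marshmallow": ["moon", "unicorn", "rainbow", "balloon",
                    "heart", "star", "horseshoe", "clover"],
}


def flatten_labels(hierarchy):
    """Return (label, parent) pairs in the order they were declared."""
    return [(child, parent)
            for parent, children in hierarchy.items()
            for child in children]


def validate_labels(hierarchy):
    """Return the number of classes, raising ValueError on duplicate names."""
    names = [name for name, _ in flatten_labels(hierarchy)]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate labels: {duplicates}")
    return len(names)


print(validate_labels(LABEL_HIERARCHY))  # 12
```

Keeping the plan in one place like this makes it easy to confirm that everyone on the team is labeling against the same 12 names.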

The amount of variation you want with regard to properties such as background, distance, and noise depends on your expected use case for the data. Generally speaking, the more flexible you want your computer vision algorithm to be, the more variation you need to include in the training dataset. If you know the algorithm will be deployed in a tightly controlled environment, then simple examples that closely match that environment might work without the need for extra data.

For the Lucky Charms dataset, we decided it would be easiest to have several instances of cereal per photo (instead of just one piece per image), so each photo would just be a handful of Lucky Charms, around 20 or so, scattered on the ground. Since our house had both carpet and hardwood floors, we also wanted our dataset to include these backgrounds, and for each photo to be taken consistently at a distance of around 1 foot.

Step 2: Set Up Project

Before taking pictures, you will need to create a project on DataTorch and add your collaborators, which will give you a place to consolidate all of your images.

Create one account for each of your team members on DataTorch. Make sure they remember their usernames!

In one of the accounts, create a new project. In DataTorch, projects are like folders that encapsulate datasets, data sources, and annotations for a machine learning model. Follow the project creation steps and click “Create”.

Once this is done, you should be taken to the project summary page. Now you need to link your collaborators to the project, so they can also contribute data to the dataset.

To do this, click the “Members” tab in the bottom left corner of the sidebar, type in the usernames of your partners under the “Invite Members” section, and click the “add” icon on the right side to confirm these choices. Set the permission to Maintainer, then click “Add”.

This is where you invite members to collaborate on your project.

Now, all of your collaborators will see your Project in their list as soon as they log in, and will be able to contribute data.

Step 2.5: Define Annotation Classes

While you are setting up your project, you can also define your annotation classes. This doesn’t necessarily have to be done right now; it can wait until you are actually annotating the data. However, defining the classes as part of your planning process is good practice and can also put you in the right mindset for the data collection step.

In DataTorch, annotation classes are called labels, and can be organized hierarchically.

In the sidebar, click “Labels”, then click the green plus sign to create a new label. Name your label whatever you want, don’t worry about the Custom ID section, set the color and parent if you want, and click “Create”. Do this for each of your classes until you have labels for all of them.

Click the “Hierarchy” tab to see a cool tree view of how your labels are organized. Here is what ours looks like:

To get a teaser of where we are going next, click Files (on the sidebar) > DataTorch Storage > Upload. This is where everyone will put their files for the dataset!

Step 3: Gather Data

Now you will need to actually gather the data for the dataset. If you had sound planning in Step 1, then this is just a matter of dividing up the work equally among your team members and executing.

In our case, we all used iPhone X cameras as our hardware and split the work among the three of us by assigning each person one of the three types of backgrounds we wanted in our dataset: one person would take pictures on carpets, one person would take pictures on hard floors, and one person would take pictures on anything else (tables, books, boxes, etc). This way, we would get a good mix of backgrounds that properly represented the floors of our house.

With each person assigned to take 50 images, and each image having at least 20 pieces of cereal, we would end up with at least 1000 instances of cereal per person, or 3000 total in the dataset, before augmentation. Not bad!
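The arithmetic behind that estimate is simple enough to keep as a tiny helper, which is handy if you want to try different team sizes or photo counts. This is just a sketch; the function name and parameters are made up for illustration.

```python
def minimum_instances(people, images_per_person, min_instances_per_image):
    """Return (minimum instances per person, minimum instances in total)."""
    per_person = images_per_person * min_instances_per_image
    return per_person, people * per_person


per_person, total = minimum_instances(
    people=3, images_per_person=50, min_instances_per_image=20
)
print(per_person, total)  # 1000 3000
```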

We also made sure that each picture used a different handful of cereal that was randomly selected from the cereal box population, but this ended up being somewhat of a mistake. Although this random sampling led to a distribution of classes similar to that of the actual population, the distribution of classes in the population was not equal. This goes against the principle of a good dataset having a roughly equal distribution of classes.

Simply put, we wanted our dataset to have an equal number of each type of cereal. But a box of Lucky Charms has a lot more grains than marshmallows. Since we just took a handful of Lucky Charms for each picture, we ended up having a lot more grains than marshmallows in our dataset.

Lesson learned. Hopefully, since the marshmallows look so different from the grains, this won’t affect training so much. We can always add more data later.
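If you do decide to top up the dataset later, it helps to know how far each class is from the most common one. Here is a minimal sketch of that calculation. The label list is made-up toy data for illustration only, not counts from our real dataset, and the function name is my own.

```python
from collections import Counter


def shortfall_to_balance(labels):
    """For each class, how many more instances it needs to match the largest class."""
    counts = Counter(labels)
    if not counts:
        return {}
    target = max(counts.values())
    return {name: target - n for name, n in counts.items()}


# Toy labels, invented purely to show the output shape.
toy_labels = ["fish"] * 5 + ["cross"] * 4 + ["moon"] * 2 + ["heart"] * 1
print(shortfall_to_balance(toy_labels))
# {'fish': 0, 'cross': 1, 'moon': 3, 'heart': 4}
```

Classes with a large shortfall are the ones worth photographing deliberately instead of relying on random handfuls.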

Step 4: Upload Files

DataTorch actually provides each new project with free Azure Blob Storage in the cloud and automatically hooks it up without any prior setup. Every user gets up to 10 GB for free, so we will be using this to store our data.

When each team member is done gathering data, have them log into DataTorch and upload their images to the project’s provided DataTorch cloud storage:

Click Files (on the sidebar) > DataTorch Storage > Upload. Then, have each team member drag their images (as JPGs or PNGs) into the storage.

If you go to the “Explorer” tab next to “Upload” in the DataTorch Storage overview, you should see all of the files that have been uploaded.

Step 5: Organize Files into Datasets

Now that your files are in DataTorch, you need to organize them into a dataset.

Wait a second, you might be thinking…isn’t it already a dataset?

Not exactly. In DataTorch, dataset structure is organized separately from file storage. This is because DataTorch lets you plug in multiple cloud storages, and this allows you to create a dataset with files that span separate storages without needing to reorganize the actual location of the file.
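One way to picture this separation is a dataset that stores references to files instead of the files themselves. The sketch below is only a mental model under that assumption. The class names, storage names, and paths are all hypothetical and say nothing about how DataTorch is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileRef:
    """Points at a file where it already lives (hypothetical storage name and path)."""
    storage: str
    path: str


@dataclass
class Dataset:
    """A named collection of file references; assigning a file never moves it."""
    name: str
    files: list = field(default_factory=list)

    def assign(self, ref):
        # Skip references that are already part of this dataset.
        if ref not in self.files:
            self.files.append(ref)


dataset = Dataset("lucky-charms")
dataset.assign(FileRef("datatorch-storage", "carpet/img_001.jpg"))
dataset.assign(FileRef("another-bucket", "hardwood/img_001.jpg"))
print(len(dataset.files))  # 2
```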

Click “Datasets” on the sidebar, then click the green plus icon to create a new dataset. Name your dataset, then click “Create”.

You should see this after creating a new dataset.

You should see an upload modal similar to the one in the Files tab. There are multiple ways to add files to a dataset, so you could have created the dataset first, then uploaded the files directly here. But since we uploaded the files first, we will have to go back and assign the files to the dataset we just created.

Go back to Files > DataTorch Storage > Explorer. Select the check mark box to check all of the files in the storage, then click on the blue database icon to assign the selected files to a dataset.

Pick the dataset you just created, click “Update” and you should see all of your files get assigned. Go back to the Datasets tab and check it out!

If you want to organize your files into multiple datasets or rearrange files, DataTorch makes it really easy to do so. And best of all, your files don’t ever need to move from their original locations.

Step 6: Distribute Annotation Tasks

Right now, your dataset is essentially just a folder of raw images. You need to annotate the images, so that your classes can be identified to train a computer vision model.

Collaborating on annotation work with multiple people can get messy pretty quickly, especially with larger teams, but luckily DataTorch has a feature that simplifies the process of divvying up annotation work, called Jobs.

Go to “Jobs” in the sidebar, then click on “New Job”. Enter a name and due date. You can skip the Description and Details sections (but feel free to put something there, if you want). Click through to “Members”.

A job with three labelers and one reviewer.

Here, you will assign Labelers and Reviewers to the job. A Labeler is someone who does annotations. A Reviewer is an optional worker who must approve annotations before the work is considered complete. This is useful where there is a high level of quality control or domain knowledge necessary in the annotation work, such as with medical data.

Assign yourself and your two collaborators as Labelers, and add Reviewers if you want (but it isn’t necessary). Click through to “Files”, select your dataset, and click through to “Distribute”.

This last tab determines how much overlap you want each Labeler to have. Setting the number of passthroughs to 1 means the work will be evenly divided up between all of the labelers. On the other end, you can move the slider all the way to the right and set the number of passthroughs to the number of labelers, which will make each labeler annotate the entire dataset, giving the maximum amount of overlap.

Setting some overlap is good for calculating consensus, especially in cases where it can be easy to misinterpret or overlook an annotation. For basic annotation jobs, just 1 passthrough should be fine, so keep the slider all the way to the left and click “Create”.
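To get a feel for what the passthrough setting means, here is a rough sketch of one way files could be split so that every file is seen by a chosen number of labelers. This is a simple round-robin scheme I wrote for illustration, not DataTorch's actual distribution logic, and the labeler and file names are placeholders.

```python
def distribute(files, labelers, passthroughs=1):
    """Assign each file to `passthroughs` distinct labelers, round-robin."""
    if not 1 <= passthroughs <= len(labelers):
        raise ValueError("passthroughs must be between 1 and the number of labelers")
    tasks = {name: [] for name in labelers}
    for i, file_name in enumerate(files):
        # Consecutive offsets guarantee the chosen labelers are distinct.
        for k in range(passthroughs):
            tasks[labelers[(i + k) % len(labelers)]].append(file_name)
    return tasks


files = [f"img_{i:03d}.jpg" for i in range(150)]
labelers = ["me", "brother", "sister"]

no_overlap = distribute(files, labelers, passthroughs=1)
full_overlap = distribute(files, labelers, passthroughs=3)
print([len(v) for v in no_overlap.values()])    # [50, 50, 50]
print([len(v) for v in full_overlap.values()])  # [150, 150, 150]
```

With 1 passthrough each of the three labelers gets a third of the 150 images. With 3 passthroughs everyone annotates all of them.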

You should see something similar to this after creating your job.

You will see the job conveniently split into separate annotation tasks. You should be able to see the files and comments surrounding each task, but you will only be able to annotate within the task assigned to you. Click the task assigned to you, then click the “Annotate” button in the top right corner to start working!

Step 7: Annotate Data

The annotator in DataTorch has many advanced features, but you can use it as a simple tool as well. Since you already created your labels previously, you should be able to start annotating right away.

You can read more in-depth coverage of the tools in this article, but for now, we will just be covering the basics for bounding box annotations.

Annotator with key portions highlighted.

In the top right corner, you should see the currently selected label. This indicates what label any annotation you create will have. To change the currently selected label, press “A” on your keyboard and select the label you want from the dropdown.

Go ahead and switch the label to a class that is present in your image. On the right side of the annotator, you should see a few tool icons. Select the bounding box tool (it looks like a square). Draw a box around the instance of the class in your image. You should see the annotation pop up in the “Stack” module of your annotator, if it is present in your view. If it isn’t, go to Window > Stack to enable it.

Repeat this process until all of the annotations are complete for all the classes in that image. If you mess up, click the little trash can next to the annotation to delete it. When you are finished, click the check mark in the upper right hand corner to indicate that the image is complete.

Once you mark all of your assigned files as complete, you should see your task also marked as complete:

When all of the tasks in the job are done, the job will be marked as complete, and your dataset will be fully annotated!

Step 8: Export Annotations

Once all of your data has been annotated, you will probably be excited to put your new dataset to good use. DataTorch lets you easily export annotations made on the platform to different annotation formats (in this case COCO) in just a few simple clicks.

DataTorch generates exports based on something called schemas, which are essentially conditional statements that define the annotations and files which you want to include in your export.

Click “Exports” in the sidebar and click the green “New Schema” button.

Name the new schema whatever you want, and change the Format to COCO. Leave Storage as is, then click the plus sign under “Where annotations” to create a new condition.

Set the condition to Created, Before, and set the date to a week in the future, so that every annotation made so far matches the condition. You can click on “Calculate Schema Metrics” to get an idea of what your output will look like.
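Conceptually, a "Created, Before" condition behaves like a date filter over your annotations. The sketch below shows why a cutoff a week in the future keeps everything. The `created_at` field name and the sample records are assumptions made up for this example, not DataTorch's real data model.

```python
from datetime import datetime, timedelta


def created_before(annotations, cutoff):
    """Keep annotations whose (hypothetical) created_at timestamp is earlier than cutoff."""
    return [a for a in annotations if a["created_at"] < cutoff]


now = datetime(2021, 1, 20)
annotations = [
    {"label": "moon", "created_at": now - timedelta(days=2)},
    {"label": "fish", "created_at": now - timedelta(hours=1)},
]

# A cutoff in the future matches everything annotated so far.
print(len(created_before(annotations, now + timedelta(weeks=1))))  # 2
```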

When you are done, click “Create” to create your schema. Once it is created, click on it, then click “Run Export” to tell it to execute.

Depending on the size of the schema, it might take some time to run, but the State indicator will change depending on the status. When it says “Success”, click on it and then click the “Download” button in the top right corner. Your COCO JSON file will download with all of the annotations included!
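Once you have the file, a quick sanity check is to count images and annotations per class. The sketch below relies only on the standard COCO keys (`images`, `annotations`, `categories`, `category_id`, and `bbox` as `[x, y, width, height]`). The toy export, the file name in the comment, and the helper names are placeholders I made up, so adapt them to your own download.

```python
import json
from collections import Counter


def load_coco(path):
    """Read a COCO JSON file into a dict."""
    with open(path) as f:
        return json.load(f)


def summarize_coco(coco):
    """Count images, annotations, and annotations per category name."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    per_class = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
    return {
        "images": len(coco["images"]),
        "annotations": len(coco["annotations"]),
        "per_class": dict(per_class),
    }


def bbox_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


# Tiny made-up export, standing in for e.g. load_coco("lucky_charms_coco.json").
toy_export = {
    "images": [{"id": 1, "file_name": "carpet_001.jpg", "width": 4032, "height": 3024}],
    "categories": [{"id": 1, "name": "moon"}, {"id": 2, "name": "fish"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [100, 200, 50, 40]},
        {"id": 2, "image_id": 1, "category_id": 2, "bbox": [300, 120, 30, 30]},
        {"id": 3, "image_id": 1, "category_id": 2, "bbox": [10, 15, 28, 32]},
    ],
}

print(summarize_coco(toy_export))
print(bbox_to_corners(toy_export["annotations"][0]["bbox"]))  # (100, 200, 150, 240)
```

The per-class counts are also a convenient way to see how unbalanced the classes ended up after annotation.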

Conclusion

In this article, you learned how to collaborate on a COCO dataset from scratch using nothing but a few friends, a smartphone camera, and free online software. We:

planned a dataset,
set up a project,
gathered data,
uploaded files,
organized files into datasets,
distributed the annotation tasks,
annotated data,
exported annotations,

and created a cool dataset for building a Lucky Charms cereal classifier in the process.

I hope you found this tutorial useful, and I hope you use it to build some cool new datasets for your own projects! I encourage you to leave a comment, or get in touch with me directly through email at michael@datatorch.io.