The Custom Parser: A key to a good data harvest
PlantDoc Dataset in an IceVision Framework
Object Detection deals with both the classification and localization of objects against background (non-objects). For this, a data scientist needs to know what kind of data to look for and how to route it correctly into the framework. In this blog, we will discuss data formats, loading, and customized parsing. The insights were gathered from the IceVision docs, GitHub code, and forum discussions.
For an intro/refresher on Object Detection, please refer here.
We will follow this Outline:
A. Set-up
B. Gathering Data
B.1. Data Source and Format
B.1.a. Github
B.1.b. Roboflow
C. Data Loading
C.1. Local computer
C.2. Git Clone
D. Data Glimpse and Routing
D.1. Main directory
D.2. Annotations
D.3. Class mapping
E. Custom Parsing
F. Visualization
Open your Notebook and let’s plant some code!
A. Set-up
I used the IceVision framework on Colab Pro with the GPU runtime and standard RAM.
!wget https://raw.githubusercontent.com/airctic/icevision/master/install_colab.sh
!bash install_colab.sh
Wait for the above installations to finish before running the next.
from icevision.all import *
# import icevision
B. Gathering Data
B.1. Data Source and Format
We will use the PlantDoc dataset which was generated to help improve early detection of plant diseases. There are various sources for this. We will focus on two source options: Github and Roboflow.
B.1.a. Github source
In the Github repository, the TRAIN folder contains the images as JPG files, along with individual annotation files in XML. There is also a separate train_labels.csv file that contains the compiled annotations (filename, image size, class, and bounding box coordinates). For our purposes, we will deal with the annotations in the CSV file, not the XML files.
If you prefer to use this source, proceed to Section C.2.
B.1.b. Roboflow source
The Roboflow website offers different download formats for the same dataset, resulting in a variety of file arrangements and file types.
We will utilize the Tensorflow Object Detection CSV format. This has a train and a test folder, and the annotations are available as a CSV file inside each folder.
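For illustration, a row in the Tensorflow Object Detection CSV layout can be read with Python's csv module. The snippet below is a sketch with made-up values, not actual PlantDoc annotations:

```python
import csv
import io

# A made-up annotations snippet in the Tensorflow Object Detection CSV layout.
sample = """filename,width,height,class,xmin,ymin,xmax,ymax
leaf_01.jpg,416,416,Apple leaf,23,40,210,390
"""

# DictReader maps each row to a dict keyed by the header columns.
rows = list(csv.DictReader(io.StringIO(sample)))
print(rows[0]["class"], rows[0]["xmin"])  # Apple leaf 23
```

Note that the csv module reads every field as a string; pandas (used below) infers numeric types for the coordinate columns automatically.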
It is best to have a working knowledge of the file locations and formatting because this will help define the routing and parsing in Sections D and E.
C. Data Loading
C.1. Steps for Loading Data on a Local computer
- Proceed to the Roboflow website,
- Choose the PlantDoc dataset,
- Click on Download for the resize-416x416 version,
- Choose the Tensorflow Object Detection CSV format,
- Export via Download zip to computer,
- Upload the zip file into Colab.
The Colab storage for the zip file is temporary, but reloading is usually not necessary if you follow along with this code.
If you are happy with this manual style, proceed to Section D, otherwise see C.2. for an alternative style of loading.
C.2. Loading data using Git Clone approach
There are some differences between loading data with a git clone and the manual loading described above. The advantage is a direct route to the source, with no intermediary downloading and uploading. The steps for using the Github PlantDoc repo and loading the data by git clone are elaborated on here.
D. Data Glimpse and Routing
D.1. Establish the Main directory
Using a Colab cell, we can check our directories so that we can trace the data path.
%pwd # present working directory, output: '/content'
!ls
!unzip \*.zip && rm *.zip
!ls
After unzipping the file, the main directory now shows the train folder that contains our data.
data_dir = Path('./train')
- This will be our information highway, and is an important route to identify for the parsing step later.
D.2. Annotations
import pandas as pd
train_labels = pd.read_csv('/content/train/_annotations.csv', sep = ',', error_bad_lines=False)
After looking at the locations of the directories and files, we can now give directions on how to find the annotations data:
- /content/train for the main directory
- /_annotations.csv inside the train folder
- Some of the annotation rows deviate from the expected format, resulting in tokenization errors. The error_bad_lines=False parameter skips over these rows so that you can use the rest of the data. (In pandas 1.3 and later, this parameter is deprecated in favor of on_bad_lines='skip'.)
train_labels.rename(columns={'class':'label'}, inplace=True)
- The column name ‘class’ is renamed to ‘label’ to avoid the confusion with the ‘class’ method.
train_labels.sample(3)
The CSV file contains all the relevant information for object detection:
- image filename
- image width and height (416 for this, as resized by Roboflow)
- label or ‘class’
- coordinates: xmin, ymin, xmax, ymax. xmin is the smallest x-value of the bounding box, i.e. its left edge; since image coordinates place the origin at the top-left, (xmin, ymin) is the box’s top-left corner.
train_labels.info() # 8,329 observations, no null values
train_labels.filename.nunique() # 2,315 unique filenames
train_labels.label.nunique() # 30 labels (13 species, 17 diseases)
- There were 30 different classes of plant leaves found among more than 2,000 images. There were more than 8,000 bounding boxes classified.
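The checks above can be tried out on a miniature stand-in for the annotations table. The filenames, labels, and coordinates below are made up for illustration:

```python
import pandas as pd

# Tiny synthetic table in the same shape as _annotations.csv (values made up).
df = pd.DataFrame({
    "filename": ["a.jpg", "a.jpg", "b.jpg"],
    "width": [416, 416, 416],
    "height": [416, 416, 416],
    "class": ["Apple leaf", "Apple leaf", "Corn rust leaf"],
    "xmin": [10, 50, 5], "ymin": [10, 60, 5],
    "xmax": [100, 150, 200], "ymax": [120, 160, 210],
})

# Rename 'class' to 'label' to avoid clashing with the Python keyword.
df = df.rename(columns={"class": "label"})

print(df.filename.nunique())  # 2 unique images
print(df.label.nunique())     # 2 unique labels
print(len(df))                # 3 bounding boxes
```

As in the real dataset, one image (one filename) can carry several bounding-box rows.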
D.3. Class mapping
_CLASSES = train_labels.label.unique().tolist()
class_map = ClassMap(_CLASSES)
class_map.get_by_name('Apple leaf') # 25; code to be used in parser
- Examples of labels are ‘Blueberry leaf’, ‘Corn rust leaf’, and ‘Tomato two spotted spider mites leaf’.
- ClassMap creates number id’s in association with a specific label. This facilitates numerical labelling for the model.
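As a mental model, ClassMap behaves roughly like the sketch below. This is a hypothetical simplification, not IceVision’s actual code; IceVision reserves id 0 for the background class, which is why ‘Apple leaf’ maps to 25 rather than 24 above:

```python
# Simplified sketch of a label-to-id mapping (hypothetical, not IceVision's ClassMap).
class SimpleClassMap:
    def __init__(self, classes):
        # Reserve id 0 for "background"; real labels start at 1.
        self._id2class = ["background"] + list(classes)
        self._class2id = {name: i for i, name in enumerate(self._id2class)}

    def get_by_name(self, name):
        return self._class2id[name]

    def get_by_id(self, idx):
        return self._id2class[idx]

cmap = SimpleClassMap(["Blueberry leaf", "Corn rust leaf"])
print(cmap.get_by_name("Blueberry leaf"))  # 1
print(cmap.get_by_id(0))                   # background
```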
E. Custom Parsing
template_record = ObjectDetectionRecord()
- ObjectDetectionRecord gathers information for the path, bounding boxes and labels.
- A significant feature of this function is the autofix where illogical coordinates are fixed to more logical values. For example, if the ymax indicates a position higher than the image height, the value is changed to correspond to the image height instead.
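The autofix idea can be sketched as a simple clamp of each coordinate to the image bounds. This helper is a simplified illustration, not IceVision’s implementation:

```python
def autofix_bbox(xmin, ymin, xmax, ymax, img_w, img_h):
    """Clamp out-of-bounds coordinates into the image, similar in spirit
    to IceVision's record autofix (simplified sketch)."""
    xmin = max(0, min(xmin, img_w))
    ymin = max(0, min(ymin, img_h))
    xmax = max(0, min(xmax, img_w))
    ymax = max(0, min(ymax, img_h))
    return xmin, ymin, xmax, ymax

# A ymax of 500 on a 416x416 image is clamped down to 416.
print(autofix_bbox(10, 20, 430, 500, 416, 416))  # (10, 20, 416, 416)
```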
class PlantParser(Parser):
    def __init__(self, template_record, data_dir):
        super().__init__(template_record=template_record)
        self.data_dir = data_dir    # Path('./train')
        self.df = train_labels      # pd.read_csv('./train/_annotations.csv')
        self.class_map = class_map  # ClassMap(_CLASSES)
- The Parser class gives the directions for arranging the data.
- We have already established the data highway, annotations path and class assignments in Sections D1–3.
    def __iter__(self) -> Any:
        for o in self.df.itertuples():
            yield o

    def __len__(self) -> int:
        return len(self.df)
- The data is processed by rows, for the whole dataset.
def record_id(self, o) -> Hashable:
return o.filename
- Observations with the same filename are gathered into one record.
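The grouping done via record_id can be illustrated with a toy sketch (made-up filenames) that collects annotation rows sharing a filename into one record:

```python
from collections import defaultdict

# Three annotation rows, two of which share a filename (values made up).
rows = [
    {"filename": "leaf_01.jpg", "label": "Apple leaf"},
    {"filename": "leaf_02.jpg", "label": "Corn rust leaf"},
    {"filename": "leaf_01.jpg", "label": "Apple leaf"},
]

# Rows with the same record id (filename) accumulate into one record.
records = defaultdict(list)
for row in rows:
    records[row["filename"]].append(row)

print(len(records))  # 2 records from 3 annotation rows
```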
    def parse_fields(self, o, record, is_new):
        if is_new:
            record.set_filepath(self.data_dir / o.filename)
            record.set_img_size(ImgSize(width=o.width, height=o.height))
            record.detection.set_class_map(self.class_map)
        record.detection.add_bboxes([BBox.from_xyxy(o.xmin, o.ymin, o.xmax, o.ymax)])
        record.detection.add_labels([o.label])
- Each row determined by the __iter__ is passed on to the parse_fields. The records are gathered in this step.
- The image size is determined based on the width and height values for each row.
- The bounding box is constructed based on the coordinates given for each row.
- Corresponding labels are attached to each bounding box inside a record.
Putting the custom parser together:
class PlantParser(Parser):
    def __init__(self, template_record, data_dir):
        super().__init__(template_record=template_record)
        self.data_dir = data_dir
        self.df = train_labels
        self.class_map = class_map

    def __iter__(self) -> Any:
        for o in self.df.itertuples():
            yield o

    def __len__(self) -> int:
        return len(self.df)

    def record_id(self, o) -> Hashable:
        return o.filename

    def parse_fields(self, o, record, is_new):
        if is_new:
            record.set_filepath(self.data_dir / o.filename)
            record.set_img_size(ImgSize(width=o.width, height=o.height))
            record.detection.set_class_map(self.class_map)
        record.detection.add_bboxes([BBox.from_xyxy(o.xmin, o.ymin, o.xmax, o.ymax)])
        record.detection.add_labels([o.label])
We will use the custom parser to arrange the information in the dataset to build the records that can be used for modelling.
parser = PlantParser(template_record, data_dir)
train_records, valid_records = parser.parse()
The parse function randomly splits the data into train and validation records at an 80/20 ratio by default.
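Under the hood, the default split amounts to a seeded shuffle-and-cut over record ids. The helper below is a hypothetical sketch of that behavior, not IceVision’s RandomSplitter:

```python
import random

def random_split(record_ids, probs=(0.8, 0.2), seed=42):
    """Sketch of an 80/20 random split by record id (simplified illustration)."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)      # seeded shuffle for reproducibility
    cut = round(len(ids) * probs[0])      # 80% boundary
    return ids[:cut], ids[cut:]

train_ids, valid_ids = random_split([f"img_{i}.jpg" for i in range(100)])
print(len(train_ids), len(valid_ids))  # 80 20
```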
F. Visualization
show_record(train_records[0], class_map= class_map, font_size= 25,
label_color= '#ffff00')
train_records[0]
Each record contains a filename with a path to the image, the coordinates for each box, and the label for the object within the box. We have a visual output that shows the leaf identity and bounding box frame.
Summary:
An external dataset containing images and aggregated annotation information was identified and uploaded. Custom parsing enabled us to arrange data to create records that can be used for modelling.
Future Play:
Model the parsed data!
I hope you enjoyed planting codes :)
Maria
LinkedIn: https://www.linkedin.com/in/rodriguez-maria/
Twitter: https://twitter.com/Maria_Rod_Data