iMaterialist (Fashion) 2019 at FGVC6

rashi mathur
Nerd For Tech


Fine-grained segmentation task for fashion and apparel

Content

  1. Introduction
  2. Problem Statement
  3. Performance Metrics
  4. About Data
  5. Objective
  6. Exploratory Data Analysis
  7. tf.data Pipeline
  8. U-Net Standard Architecture
  9. Training U-Net
  10. Predictions
  11. Deployment Video
  12. Conclusion
  13. Future Work
  14. References

1. Introduction

Designers know what they are creating, but what, and how, do people really wear their products? What combinations of products are people using? In this case study, we develop an algorithm that will help with an important step towards automatic product detection — to accurately assign segmentations for fashion images.

Visual analysis of clothing is a topic that has received increasing attention in recent years. Being able to recognize apparel products from pictures could enhance the shopping experience for consumers, and increase work efficiency for fashion professionals.

This case study uses a new clothing dataset whose goal is to introduce a novel fine-grained segmentation task by joining forces between the fashion and computer vision communities. The proposed task combines categorization and segmentation of fashion apparel, an important step toward real-world applications.

2. Problem Statement

The task is to perform the categorization and segmentation of fashion apparel. Given an image of apparel, the model has to perform image segmentation and categorization. To capture the complex structure of fashion objects and ambiguity in descriptions obtained from crawling the web, our standardized taxonomy contains 46 apparel objects (27 main apparel items and 19 apparel parts) and 92 related fine-grained attributes.

3. Performance Metrics

IoU (intersection over union):

The metric sweeps over a range of IoU thresholds, at each point calculating an average precision value. The threshold values range from 0.5 to 0.95 with a step size of 0.05: (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95). In other words, at a threshold of 0.5, a predicted object is considered a "hit" if its intersection over union with a ground truth object is greater than 0.5.

source: https://www.kaggle.com/c/imaterialist-fashion-2019-FGVC6/overview/evaluation
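The hit test described above can be sketched in a few lines of NumPy. This is only an illustration of IoU and the threshold sweep on a single pair of binary masks, not the full Kaggle scoring code; the function names `iou` and `hits_per_threshold` and the toy masks are assumptions made for this example.

```python
import numpy as np

def iou(pred, truth):
    """Intersection over union of two boolean masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 0.0  # both masks empty: treated as no overlap in this sketch
    return float(np.logical_and(pred, truth).sum() / union)

# Thresholds 0.5, 0.55, ..., 0.95, as listed above.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def hits_per_threshold(pred, truth):
    """For each threshold, is the prediction a 'hit' (IoU greater than threshold)?"""
    score = iou(pred, truth)
    return {t: score > t for t in THRESHOLDS}

# Toy example: the ground truth covers 12 pixels, the prediction covers all 16.
truth = np.zeros((4, 4), dtype=bool)
truth[:, :3] = True
pred = np.ones((4, 4), dtype=bool)

print(iou(pred, truth))                 # 12 / 16 = 0.75
print(hits_per_threshold(pred, truth))  # hit for 0.5 to 0.7, miss from 0.75 up
```

With an IoU of 0.75, the prediction counts as a hit at the five thresholds below 0.75 and as a miss at the remaining five.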

4. About Data

In this dataset, we are provided with a large number of images and the corresponding fashion/apparel segmentations. Images are named with a unique image id. Segmentations are given in the form of EncodedPixels. For more details on EncodedPixels, refer to the competition's evaluation page. The dataset contains the following files:

  • train/ — The training images
  • test/ — The test images (you are segmenting and classifying these images)
  • train.csv — Training annotations, contains images with both segmented apparel categories and fine-grained attributes; and images with segmented apparel categories only.
  • label_descriptions.json — A file giving the apparel categories and fine-grained attributes descriptions.

The columns in train.csv are as follows:

  • ImageId - the unique Id of an image
  • EncodedPixels - masks in a run-length encoded format (please refer to evaluation page for details).
  • ClassId - the class id for this mask. We concatenate both category and attributes (if any) together.
  • Height - the height of the image
  • Width - the width of the image

source: link

5. Objective

The objective of this case study is to perform simple image segmentation of apparel images by apparel category. It applies image segmentation and computer vision concepts for learning purposes. The case study is based on the Kaggle problem described above. The model used is a simple U-Net, chosen to see how effective it is at segmenting apparel images.

6. Exploratory Data Analysis

a. Basic Statistics

In train.csv:

  • The number of unique ImageId values is 45,195.
  • The number of features is 5.
  • The number of categories with attributes is 11,499.
  • The number of categories without attributes is 319,714.

b. count_attributes vs number of images plot

We first split the ClassId given in train.csv into categories and attributes. A mask may or may not have attributes. In our dataset, if the ClassId has the form 35_24_51_69_88_195_210_306, the first number denotes the category and the rest denote attributes, separated by '_'. In this example, the category is 35 and the attributes are [24, 51, 69, 88, 195, 210, 306]. If the ClassId is a single number, it denotes the category id and there are 0 attributes.
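The split can be sketched in plain Python. The helper name `split_class_id` and the three sample ClassId values are assumptions made for illustration; only the first one comes from the text above.

```python
from collections import Counter

def split_class_id(class_id):
    """Split a ClassId such as '35_24_51' into (category, [attributes])."""
    parts = str(class_id).split("_")
    return int(parts[0]), [int(p) for p in parts[1:]]

category, attributes = split_class_id("35_24_51_69_88_195_210_306")
print(category)    # 35
print(attributes)  # [24, 51, 69, 88, 195, 210, 306]

# Number of attributes per row, for a few made-up ClassId values.
class_ids = ["35_24_51_69_88_195_210_306", "31", "10"]
count_attributes = Counter(len(split_class_id(c)[1]) for c in class_ids)
print(count_attributes)  # Counter({0: 2, 7: 1})
```

Counting the attribute list length per row is one way to obtain the count_attributes values used in the bar plot.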

We plotted a simple barplot of count_attributes vs the number of images.

The plot shows that most images have no attributes. Only a few images have a non-zero number of attributes: fewer than 5000. The distribution is skewed.

c. KDE plot of the number of images per attribute

We also plotted a KDE plot, which shows a right-skewed distribution with its peak at 0.

d. number of images per category plot

  • Category 31 (sleeve) appears in the highest number of images, around 6000.
  • Categories 12, 20, 26, 41, and 45 appear in the lowest number of images.
  • Categories 1, 10, 23, 32, and 33 appear in a fairly high number of images.

e. RLE to mask conversion:

An RLE (run-length encoding) is given for each mask in the train.csv file. Each RLE is converted to a segmentation mask and stored on disk. For this, we grouped the rows with the same image_id and obtained the list of categories belonging to that image. The mask uses the multi-label segmentation concept: each mask pixel stores a category number, because we are using the sparse categorical cross-entropy loss function. For example, refer to the image below.

Multi-label segmentation
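The conversion can be sketched as follows. This is a minimal sketch, not the original code: it assumes the usual Kaggle convention for this competition (pairs of 1-indexed start position and run length, with pixels numbered top to bottom, then left to right), and it assumes that pixels without any apparel get a separate background id of 46, one past the 46 category ids. The names `rle_to_mask` and `build_label_mask` are hypothetical.

```python
import numpy as np

def rle_to_mask(encoded_pixels, height, width):
    """Decode 'start length start length ...' into a binary (height, width) mask."""
    values = [int(v) for v in encoded_pixels.split()]
    starts, lengths = values[0::2], values[1::2]
    flat = np.zeros(height * width, dtype=np.uint8)
    for start, length in zip(starts, lengths):
        flat[start - 1 : start - 1 + length] = 1  # starts are 1-indexed
    # Pixels are numbered down each column first, hence column-major order.
    return flat.reshape((height, width), order="F")

def build_label_mask(rows, height, width, background=46):
    """Merge all (encoded_pixels, category) rows of one image into one mask.

    Each pixel stores a single category number; where masks overlap,
    later rows overwrite earlier ones (an assumption of this sketch).
    """
    label_mask = np.full((height, width), background, dtype=np.uint8)
    for encoded_pixels, category in rows:
        label_mask[rle_to_mask(encoded_pixels, height, width) == 1] = category
    return label_mask

# Toy 3x3 image with two made-up masks.
print(rle_to_mask("1 2 6 1", 3, 3))
print(build_label_mask([("1 2", 31), ("2 2", 10)], 3, 3))
```

Storing one integer per pixel is what allows the mask to be used directly as the target for sparse categorical cross-entropy.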

7. tf.data Pipeline

The tf.data pipeline follows the guide on tensorflow.org. tf.data improves performance by prefetching the next batch of data asynchronously, so that the GPU does not have to wait for data. Preprocessing and loading of the dataset can also be parallelized.

tf-data pipeline effect

a. Creating Datasets:

  • from_tensor_slices: The TensorFlow dataset is created with from_tensor_slices, which accepts one or more NumPy arrays or tensors. Here it accepts the file names of images and masks stored on local disk. from_tensor_slices slices the data along its first dimension, so it can combine different elements, such as features and labels, into one dataset (which is why the first dimension of the tensors must be the same). That is, the dataset becomes "wider".

b. Operation On Dataset:

  • Batch: dataset.batch() combines consecutive elements of the dataset into batches of batch_size entries. Training on smaller batches helps avoid out-of-memory errors. Here we use a batch_size of 5 for training.
  • Map: dataset.map() applies a user-defined function to each element so that the data can be fed to the model. Here we map the parse_image function and a random augmentation function onto the training dataset, and only the parse_image function onto the validation and test datasets. Because input elements are independent of one another, the preprocessing can be parallelized across multiple CPU cores; the num_parallel_calls argument of map sets the level of parallelism.
  • Prefetch: dataset.prefetch() overlaps the preprocessing and model execution of a training step. While the model is executing training step s, the input pipeline is reading the data for step s+1. This reduces the step time to the maximum (as opposed to the sum) of the training time and the time it takes to extract the data. The number of elements to prefetch should be equal to (or possibly greater than) the number of batches consumed by a single training step. Typically it is most useful to add a small prefetch buffer (perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element varies. Here we use a buffer size of 3 in prefetch.
  • Shuffle: shuffle() should be called before batch(), because we want to shuffle records, not batches. The buffer is first filled by adding records in order; once it is full, a random record is selected and emitted, and a new record is read from the original source.

Similarly, we create tf.data datasets for validation and test.
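The steps above can be put together roughly as follows. This is a minimal sketch rather than the original pipeline: to stay self-contained it slices in-memory arrays instead of file names, `normalize` stands in for the parse_image function, `random_flip` is one assumed example of a random augmentation, and the shuffle buffer of 100 is an arbitrary choice. The batch size of 5 and the prefetch buffer of 3 are the values mentioned above.

```python
import numpy as np
import tensorflow as tf

def normalize(image, mask):
    """Stand-in for parse_image: scale pixel values to [0, 1]."""
    return tf.cast(image, tf.float32) / 255.0, mask

def random_flip(image, mask):
    """Flip image and mask together so they stay aligned."""
    flip = tf.random.uniform(()) > 0.5
    image = tf.cond(flip, lambda: tf.image.flip_left_right(image), lambda: image)
    mask = tf.cond(flip, lambda: tf.image.flip_left_right(mask), lambda: mask)
    return image, mask

def build_dataset(images, masks, batch_size=5, training=False):
    ds = tf.data.Dataset.from_tensor_slices((images, masks))
    ds = ds.map(normalize, num_parallel_calls=tf.data.AUTOTUNE)
    if training:
        ds = ds.map(random_flip, num_parallel_calls=tf.data.AUTOTUNE)
        ds = ds.shuffle(buffer_size=100)  # shuffle records before batching
    ds = ds.batch(batch_size)
    return ds.prefetch(3)                 # prefetch at the end of the pipeline

# Synthetic data: 12 RGB images with one-channel integer masks.
images = np.zeros((12, 32, 32, 3), dtype=np.uint8)
masks = np.zeros((12, 32, 32, 1), dtype=np.uint8)

train_ds = build_dataset(images, masks, training=True)
val_ds = build_dataset(images, masks)  # no augmentation or shuffling
```

Augmentation and shuffling are applied only when `training=True`, which mirrors the split between the training dataset and the validation and test datasets described above.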

Now our tf.data pipeline is created and ready to use! Let's display some images and true masks:

8. U-Net Standard Architecture

U-Net

U-Net was first designed and applied in 2015 to process biomedical images. In biomedical cases, it is required not only to determine whether there is a disease but also to localize the area of abnormality, and U-Net serves this purpose. It classifies every pixel, so the input and output share the same size. The U-Net architecture is symmetric: it has a contracting path and an expansive path. The left part is the contracting path, which is a normal convolution process, and the right part is the expansive path, which is built from transposed 2D convolution layers. For a complete explanation of the architecture, refer to the link.

The standard U-Net architecture is used for image segmentation. The code for this is as follows:
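As an illustration, a U-Net of this shape can be sketched in Keras as below. This is a hedged sketch and not necessarily the model used in this case study: the function names, the filter counts, the depth of 4, the use of "same" padding (so that input and output share the same size), and the 47 output classes (46 categories plus an assumed background class) are all assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """Two 3x3 convolutions, as in each stage of U-Net."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(256, 256, 3), num_classes=47, base_filters=64, depth=4):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    skips = []

    # Contracting path: convolutions followed by 2x2 max pooling.
    for i in range(depth):
        x = conv_block(x, base_filters * 2 ** i)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)

    # Bottleneck between the two paths.
    x = conv_block(x, base_filters * 2 ** depth)

    # Expansive path: transposed convolutions plus skip connections.
    for i in reversed(range(depth)):
        x = layers.Conv2DTranspose(base_filters * 2 ** i, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skips[i]])
        x = conv_block(x, base_filters * 2 ** i)

    # Per-pixel class probabilities.
    outputs = layers.Conv2D(num_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

# A small instance for a quick shape check.
model = build_unet(input_shape=(64, 64, 3), base_filters=4)
print(model.output_shape)  # (None, 64, 64, 47)
```

The softmax output over classes at every pixel pairs with integer-valued masks and the sparse categorical cross-entropy loss.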

9. Training U-Net

The model is trained for 80 epochs. It is compiled with the Adam optimizer with a learning rate of 1e-3, and the loss is sparse categorical cross-entropy. We also use Keras callbacks: a TensorBoard callback, early stopping if the validation loss does not improve for 10 consecutive epochs, and model checkpointing that saves weights only.
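This setup could look roughly like the sketch below. The helper name `compile_for_training`, the log directory, and the weights file name are assumptions; the optimizer, learning rate, loss, and early stopping patience follow the description above.

```python
import tensorflow as tf

def compile_for_training(model, log_dir="logs", weights_path="unet.weights.h5"):
    """Compile the model and return the callbacks described above."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    )
    return [
        tf.keras.callbacks.TensorBoard(log_dir=log_dir),
        # Stop if the validation loss has not improved for 10 epochs.
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10),
        # Save weights only, not the full model.
        tf.keras.callbacks.ModelCheckpoint(weights_path, save_weights_only=True),
    ]

# Hypothetical usage, assuming `model`, `train_ds` and `val_ds` already exist:
# callbacks = compile_for_training(model)
# model.fit(train_ds, validation_data=val_ds, epochs=80, callbacks=callbacks)
```

The weights file name ends in ".weights.h5" because recent Keras versions require that suffix when only weights are saved.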

We use batch_size of 128 for training.

We can do hyperparameter tuning of these parameters and improve model performance.

We also computed the get_mean_iou score, a custom metric that serves as our business metric.
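The implementation of get_mean_iou is not shown here. As a rough idea of what a mean IoU over label masks computes, here is a NumPy sketch; it is an assumption about the general technique, not the author's metric, and the name `mean_iou` is hypothetical.

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    """Average the per-class IoU over classes present in either mask."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    ious = []
    for c in range(num_classes):
        true_c = y_true == c
        pred_c = y_pred == c
        union = np.logical_or(true_c, pred_c).sum()
        if union == 0:
            continue  # class absent from both masks: skipped
        ious.append(np.logical_and(true_c, pred_c).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy 2x2 label masks with two classes.
y_true = np.array([[0, 0], [1, 1]])
y_pred = np.array([[0, 1], [1, 1]])
score = mean_iou(y_true, y_pred, num_classes=2)
print(score)  # class 0: 1/2, class 1: 2/3, so the mean is about 0.583
```

For model outputs, `y_pred` would be the argmax over the class axis of the predicted probabilities.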

The score obtained after training for 80 epochs is as follows:

score

The training plot for get_mean_iou and loss is as follows:

get_mean_iou score for the last 39 epochs
Epoch loss for the last 39 epochs

10. Predictions

Let's look at some model predictions.

11. Deployment Video

The model was deployed as a Streamlit app. Have a look at the video.

12. Conclusion

  • U-Net is a simple model that is easy to implement.
  • The model can be fine-tuned further to give better predictions.
  • The procedure adopted to perform segmentation was simple and easy to implement.
  • The tf.data pipeline makes the training process faster and more optimized.

13. Future Work

  • The Mask R-CNN model can be used, as it is a more complex and robust model.
  • The images can be resized to 512*512 instead of 256*256.
  • The tf.data pipeline batch size can be increased from 5.

14. References

Please check out my GitHub code and LinkedIn profile for further discussion.
