What is Semantic Segmentation?

Mariia Krasavina
Published in CVAT.ai
6 min read · Jun 17, 2024

First, let’s begin by offering a formal definition of the term and its underlying principles. If it seems complex at first, hang tight; we’ll delve deeper into the details.

Semantic segmentation is a method in computer vision that leverages deep learning algorithms to allocate class labels to individual pixels in an image. This technique segments an image into various areas of interest, categorizing each area into a distinct class.

Now, we will unpack this idea gradually, using a straightforward example to illustrate its application.

We are introducing Alex, a young, passionate urban planner with an ambitious goal: to create smarter, more efficient urban environments.

To grasp what constitutes a “smarter” or “more efficient” city, Alex must first understand the dynamics of urban landscapes. He needs to identify and differentiate various land covers such as buildings, roads, waterways, and green spaces. This knowledge enables him to measure the extent of greenery, assess the health of vegetation, and strategize on developing or conserving parks and natural spaces. Moreover, by categorizing areas according to pedestrian traffic, Alex can pinpoint both heavily trafficked and underused areas, planning strategic improvements to boost accessibility and safety by adding amenities like benches, improved lighting, or pedestrian pathways.

These methods represent just a handful of ways Alex can leverage data to make well-informed decisions and architect improved urban spaces.

With these objectives in mind, Alex contemplates his next steps: How will he analyze an urban area? What strategies will guide his decisions to enhance urban life?

Thus, his adventure in urban planning begins.

Step 1: Understanding Semantic Segmentation

Semantic segmentation gives a computer the ability to see and understand images the same way humans do. Rather than simply identifying an entire image as a “cityscape” or “street view,” it dissects the image into minute components, assigning labels to each segment. Each pixel within the image receives a specific designation: one pixel might be identified as part of a road, another as a building, and others as trees. This detailed breakdown allows for a deeper, more nuanced understanding of the visual data presented.
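To make this concrete, here is a minimal sketch (with made-up class names and a toy 4×6 "image") of what a semantic segmentation output actually is: a grid of class IDs, one per pixel.

```python
import numpy as np

# Illustrative class names; real datasets define their own label sets.
CLASSES = {0: "road", 1: "building", 2: "tree"}

# A toy 4x6 segmentation map: instead of colors, every pixel
# holds the ID of the class it belongs to.
segmentation_map = np.array([
    [1, 1, 1, 2, 2, 2],
    [1, 1, 1, 2, 2, 2],
    [0, 0, 0, 0, 0, 0],  # a road crossing the bottom of the image
    [0, 0, 0, 0, 0, 0],
])

# Pixel-level statistics fall out of the map almost for free.
for class_id, name in CLASSES.items():
    share = (segmentation_map == class_id).mean()
    print(f"{name}: {share:.0%} of the image")
```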

With this approach, Alex can automatically categorize and label every pixel in an image. He then utilizes these labeled images to train a machine-learning algorithm, which extracts valuable insights into urban functionality. These algorithms sift through the labeled data, detect patterns in urban practices, and produce practical outcomes. Such insights are crucial for making informed urban planning choices, optimizing traffic flows, and improving the design of public spaces.

Step 2: Preparing the Data

Alex discovers that training a computer to interpret images requires many examples. He therefore compiles a dataset of urban images like the one above and organizes them into a dedicated folder. This stage is known as the data collection phase.

At this stage, Alex may encounter several challenges:

  • Gathering enough data can be difficult, for example because of privacy concerns around images from certain sources.
  • Selecting the most effective data for training a deep-learning model requires care.
  • Duplicate data must be removed to guarantee the quality of the dataset (a minimal deduplication sketch follows this list).
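As a taste of the deduplication step, here is a minimal sketch that flags byte-identical copies by hashing file contents. The folder name is hypothetical, and near-duplicates (resized or re-encoded images) would need perceptual hashing instead, e.g. with the imagehash library.

```python
import hashlib
from pathlib import Path

def find_duplicates(folder: str) -> list[Path]:
    """Return images whose bytes exactly match an earlier file."""
    seen: dict[str, Path] = {}
    duplicates: list[Path] = []
    for path in sorted(Path(folder).glob("*.jpg")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append(path)  # identical to seen[digest]
        else:
            seen[digest] = path
    return duplicates

# Hypothetical folder name:
# for dup in find_duplicates("urban_images"):
#     dup.unlink()  # delete the duplicate file
```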

Once the data is in order, the next task is labeling the objects in the images. These steps and their challenges will be further discussed in upcoming articles.

Step 3: Labeling the Data

Alex uploads the folder with data into the Computer Vision Annotation Tool (CVAT.ai); he can upload it manually or attach it from cloud storage. Then he carefully labels each pixel in the images, categorizing elements such as roads, buildings, and trees.

For this task, Alex can use various annotation tools, such as the Polygon and Brush tools. Here is how it looks when he assigns buildings to one category and pools to another.
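Behind the scenes, polygon annotations like these are eventually rasterized into per-pixel masks. Here is a rough sketch of that conversion using Pillow; the class IDs, image size, and polygon coordinates are all made up for illustration.

```python
import numpy as np
from PIL import Image, ImageDraw

WIDTH, HEIGHT = 640, 480
CLASS_IDS = {"background": 0, "building": 1, "pool": 2}

# Hypothetical polygon annotations, as exported from a labeling tool:
# each entry is a class name plus a list of (x, y) vertices.
annotations = [
    ("building", [(50, 40), (300, 40), (300, 260), (50, 260)]),
    ("pool",     [(400, 300), (560, 300), (560, 420), (400, 420)]),
]

# Paint each polygon's class ID onto a single-channel mask.
mask_img = Image.new("L", (WIDTH, HEIGHT), CLASS_IDS["background"])
draw = ImageDraw.Draw(mask_img)
for class_name, vertices in annotations:
    draw.polygon(vertices, fill=CLASS_IDS[class_name])

mask = np.array(mask_img)  # shape (HEIGHT, WIDTH), one class ID per pixel
print(np.unique(mask))     # -> [0 1 2]
```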

Here’s a glimpse of what happens in the background at this moment:

Semantic segmentation models craft an intricate map of each input image by assigning a distinct category to every pixel. This results in a segmentation map where each pixel is color-coded based on its category, creating what are known as segmentation masks.

A segmentation mask distinctly highlights parts of the image, differentiating them from other areas. To accomplish this, semantic segmentation models utilize sophisticated neural networks. These networks cluster associated pixels into segmentation masks and precisely determine the real-world category for each segment.

For instance, pixels associated with the object “pool” are now categorized under “pool,” while those linked to “building” are labeled as “building.”
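Color-coding such a map is just a lookup from class ID to display color. A minimal sketch, assuming the class-ID mask from the previous example (the palette is arbitrary):

```python
import numpy as np

# One RGB color per class ID.
PALETTE = np.array([
    [0, 0, 0],       # 0: background -> black
    [220, 20, 60],   # 1: building   -> red
    [0, 130, 180],   # 2: pool       -> blue
], dtype=np.uint8)

def colorize(segmentation_map: np.ndarray) -> np.ndarray:
    """Turn an (H, W) map of class IDs into an (H, W, 3) RGB mask image."""
    return PALETTE[segmentation_map]
```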

One key point to understand is that semantic segmentation does not distinguish between instances of the same class; it only identifies the category of each pixel. This means that if there are two objects of the same category in your input image, the segmentation map will not differentiate between them as separate entities. To achieve that level of detail, instance segmentation models are used. These models can differentiate and label separate objects within the same category.
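A crude way to see the difference: connected-component labeling can split one semantic class into separate blobs, which hints at what instance segmentation models learn properly (including touching and overlapping objects, which this trick cannot handle). A toy sketch with SciPy:

```python
import numpy as np
from scipy import ndimage

# A toy semantic mask with two separate "building" regions (class ID 1).
semantic = np.zeros((8, 8), dtype=np.uint8)
semantic[1:3, 1:3] = 1
semantic[5:7, 5:7] = 1

# Semantic segmentation alone sees one undifferentiated "building" class;
# connected-component labeling splits it into instance-like blobs.
instances, count = ndimage.label(semantic == 1)
print(count)  # -> 2 separate building blobs
```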

Here is a video showing different types of segmentation applied to the same image:

Step 4: Training the Model with Annotated Data

Once the annotation is complete, Alex exports the annotated dataset from CVAT.ai. He then feeds this labeled data into a deep learning model designed for semantic segmentation, such as U-Net or DeepLabv3; the very popular YOLOv8 also offers a segmentation variant. Such models are commonly trained and benchmarked on datasets like Cityscapes and PASCAL VOC, and are usually evaluated with the Mean Intersection-Over-Union (Mean IoU) and Pixel Accuracy metrics.
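Both metrics fall out of a confusion matrix over pixels. A minimal NumPy sketch with toy data:

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, target: np.ndarray, n: int) -> np.ndarray:
    """n x n matrix: rows = ground-truth class, columns = predicted class."""
    return np.bincount(n * target.ravel() + pred.ravel(),
                       minlength=n * n).reshape(n, n)

def pixel_accuracy(cm: np.ndarray) -> float:
    """Fraction of all pixels whose predicted class matches the ground truth."""
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm: np.ndarray) -> float:
    """Average, over classes, of intersection / union of pixel sets."""
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    return np.nanmean(intersection / union)  # nanmean skips absent classes

# Toy check: 3 classes on a tiny 2x3 "image".
target = np.array([[0, 0, 1], [2, 1, 1]])
pred   = np.array([[0, 1, 1], [2, 1, 0]])
cm = confusion_matrix(pred, target, n=3)
print(pixel_accuracy(cm), mean_iou(cm))  # -> 0.667, 0.611
```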

Once Alex has chosen and trained his model, he tests it on new, previously unseen images to evaluate its effectiveness. The model, now trained on Alex’s labeled dataset, can automatically identify every object within these images and deliver detailed segmentation results.
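As an illustration of this inference step, here is a sketch using torchvision’s pretrained DeepLabV3 as a stand-in for Alex’s model; his own network would be loaded from its fine-tuned weights instead, and the image filename is hypothetical.

```python
import torch
from PIL import Image
from torchvision.models.segmentation import (
    deeplabv3_resnet50, DeepLabV3_ResNet50_Weights,
)

# Pretrained weights stand in for Alex's custom-trained model.
weights = DeepLabV3_ResNet50_Weights.DEFAULT
model = deeplabv3_resnet50(weights=weights).eval()
preprocess = weights.transforms()

image = Image.open("new_city_scene.jpg").convert("RGB")  # hypothetical file
batch = preprocess(image).unsqueeze(0)                   # (1, 3, H, W)

with torch.no_grad():
    logits = model(batch)["out"]         # (1, n_classes, H, W)

prediction = logits.argmax(dim=1)[0]     # (H, W) map of class IDs
print(prediction.shape, prediction.unique())
```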

Here are some examples of how it may look:

Step 5: Gathering Insights

From the analysis of the model’s outcomes, Alex extracts crucial insights:

  • Traffic Patterns: Reveals where traffic-light timings and road layouts can be optimized to improve flow and reduce congestion.
  • Green Space Distribution: Identifies areas that could benefit from additional green spaces to enhance urban environmental health.
  • Public Space Utilization: Improves planning for public spaces to boost accessibility and increase their use.
  • Infrastructure Development: Enables efficient monitoring of ongoing construction and better planning for future infrastructure projects.
  • Urban Heat Islands: Proposes cooling strategies to combat the effects of urban heat islands.

With this processed data, Alex is now equipped to make well-informed urban planning decisions.

Conclusion

Thanks to semantic segmentation, Alex can transform raw images into valuable insights without spending countless hours analyzing each one manually. The technology not only saves time but also enhances the accuracy of Alex’s work, making the dream of designing smarter cities a reality. In the end, semantic segmentation turns complex visual data into actionable knowledge and helps create a better urban environment for everyone.

And Alex couldn’t be happier with the results.
