Recommended ailia MODELS by usage

David Cochard
Published in axinc-ai
12 min read, Sep 23, 2023

ailia MODELS is an open-source model library provided by ax Inc. This article introduces recommended models for typical application purposes.

About ailia MODELS

ailia MODELS offers more than 300 models at the time of writing, all converted to ONNX and ready to be used out of the box with ailia SDK, so you can easily try out different models on images or videos.

ailia MODELS (Images from Pixabay)

Object Detection

Object detection computes the 2D bounding boxes of objects in an image and assigns a class label to each of them.

YOLOX

YOLOX is the 2021 iteration of the YOLO model series, widely used in object detection. It is easy to use under the Apache License, offers a strong augmentation mechanism applied at training time, and has excellent generalization performance.

Six variants of the model are available: nano, tiny, s, m, l, and x, which can be switched depending on the required accuracy. For example, yolox-s has a reference inference speed of 20 ms on an M1 Mac using ailia SDK.

Image from https://pixabay.com/ja/photos/%E3%83%AD%E3%83%B3%E3%83%89%E3%83%B3%E5%B8%82-%E9%8A%80%E8%A1%8C-%E3%83%AD%E3%83%B3%E3%83%89%E3%83%B3-4481399/

For a slightly different use, the -dw and -dh options can be given to change the recognition resolution. Use these options when you want to detect small objects.

$ python3 yolox.py -dw 1280 -dh 1280

DETIC

If you want to recognize objects other than those in the 80 categories of YOLOX, DETIC can be used. The base model takes longer to infer, but it can recognize far more object classes. Labels that can be assigned are taken from the LVIS (1200+ categories, 164K images) and ImageNet21k (21,843 categories, 14M images) datasets. Inference with this model on an M1 Mac (CPU) with ailia SDK takes 4110 ms.

Image from https://web.eecs.umich.edu/~fouhey/fun/desk/desk.jpg

Although Detic is a heavy model, the -dw option can be used to reduce the recognition resolution and speed up the process.

$ python3 detic.py -dw 320

It is also possible to use ONNX’s GridSampler instead of Torch’s GridSampler by using the --opset16 option.

$ python3 detic.py --opset16

Object Tracking

Object tracking computes the 2D bounding boxes of objects in an image and tracks each box across subsequent frames by assigning it a tracking ID.

DeepSort

This object tracking model combines the SORT algorithm, which uses a Kalman filter, with person re-identification (ReID) to determine whether a detection is the same person. After a person is detected using YOLO, the SORT algorithm and the ReID model assign an ID to the person. It can be used, for example, to count the number of people passing through or to detect flow lines. On an M1 Mac with ailia SDK, the ReID model takes 4 ms per feature extraction.

Video from http://www.robots.ox.ac.uk/~lav/Research/Projects/2009bbenfold_headpose/project.html

ByteTrack

This object tracking model improves accuracy by adding matching logic for bounding boxes with low confidence values on top of the Kalman filter tracking. It can track targets other than people. Because ByteTrack itself runs very fast, the processing time is approximately equal to the YOLO processing time.

Video from https://vimeo.com/60139361
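
The two-stage matching idea can be sketched as follows. This is only a minimal illustration under several assumptions, not ByteTrack's actual implementation: the predicted track boxes are taken as given (in the real tracker they come from the Kalman filter), pairs are matched greedily by IoU instead of with an optimal assignment, very low scores are not filtered out, and the function names and thresholds are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Pair track IDs with detections, best IoU first (hypothetical helper)."""
    pairs = sorted(
        ((iou(box, det["box"]), tid, di)
         for tid, box in tracks.items()
         for di, det in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, tid, di in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little
        if tid in used_tracks or di in used_dets:
            continue
        matches[tid] = detections[di]
        used_tracks.add(tid)
        used_dets.add(di)
    return matches


def associate(tracks, detections, high=0.5, min_iou=0.3):
    """Two-stage association: confident boxes first, then low-confidence ones."""
    high_dets = [d for d in detections if d["score"] >= high]
    low_dets = [d for d in detections if d["score"] < high]
    # Stage 1: match tracks against confident detections.
    matches = greedy_match(tracks, high_dets, min_iou)
    # Stage 2: unmatched tracks get a second chance with low-confidence boxes.
    remaining = {t: b for t, b in tracks.items() if t not in matches}
    matches.update(greedy_match(remaining, low_dets, min_iou))
    return matches


# Predicted track boxes (assumed to come from the motion model).
tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [
    {"box": (1, 1, 11, 11), "score": 0.9},
    {"box": (21, 21, 31, 31), "score": 0.2},  # e.g. a partially occluded object
]
result = associate(tracks, detections)
```

In this toy input, track 1 is matched in the first stage, while track 2 keeps its ID only thanks to the second stage, which is the situation the low-confidence matching is meant to handle.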

Segmentation

Segmentation is a task that finds region contours of an object from a certain category in an image.

HRNet

This segmentation model achieves high accuracy by connecting high-resolution and low-resolution streams in parallel. The inference runs in 31.25ms on an M1 Mac using ailia SDK.

Image from https://www.cityscapes-dataset.com/

PaddleSeg

This is a highly accurate segmentation model developed by Baidu. It is quite heavy, so it should be used when high precision is required. The inference runs in 19555ms on an M1 Mac using ailia SDK.

Image from https://www.cityscapes-dataset.com/downloads/

2D Pose Estimation

Pose estimation aims at computing the position of limbs and joints in an image.

LightWeightHumanPoseEstimation

This model uses a bottom-up approach to compute and merge multiple keypoints from a single frame. It should be used when inference speed matters more than precision. Note that the processing time does not depend on the number of people visible in the image. The inference runs in 11ms on an M1 Mac using ailia SDK.

Image from https://pixabay.com/ja/photos/%E5%A5%B3%E3%81%AE%E5%AD%90-%E7%BE%8E%E3%81%97%E3%81%84-%E8%8B%A5%E3%81%84-%E3%83%9B%E3%83%AF%E3%82%A4%E3%83%88-5204299/

PoseResNet

This model, developed by Microsoft and based on ResNet, uses a top-down approach. It should be used when high precision is required. The inference runs in 27.5ms on an M1 Mac using ailia SDK, including the processing time of the YOLO model used internally.

Image from https://pixabay.com/ja/photos/%E5%A5%B3%E3%81%AE%E5%AD%90-%E7%BE%8E%E3%81%97%E3%81%84-%E8%8B%A5%E3%81%84-%E3%83%9B%E3%83%AF%E3%82%A4%E3%83%88-5204299/

MoveNet

This is another model using a top-down approach developed by Google, mostly used for fitness applications because it is robust to fast-moving video. There are two models available, Thunder and Lightning. The inference runs in 27.75ms for Thunder and 23.25ms for Lightning on an M1 Mac using ailia SDK.

Image from https://pixabay.com/ja/photos/%E5%A5%B3%E3%81%AE%E5%AD%90-%E7%BE%8E%E3%81%97%E3%81%84-%E8%8B%A5%E3%81%84-%E3%83%9B%E3%83%AF%E3%82%A4%E3%83%88-5204299/

AnimalPose

This pose estimation model can be applied to animals such as dogs, cats, cattle, horses, and sheep. After animals are detected with YOLO, the pose estimation is performed using HRNet. The inference runs in 40ms on an M1 Mac using ailia SDK, including the processing time of the YOLO model used internally.

Image from https://pixabay.com/ja/photos/%e7%89%9b-%e5%ae%b6%e7%95%9c-%e4%b9%b3%e7%89%9b-%e4%b9%b3%e7%94%a8%e7%89%9b-%e5%8b%95%e7%89%a9-5717276/

3D Pose Estimation

This task is similar to the previous one but gives joint positions in 3D. Inferring 3D positions from 2D data is usually strongly dependent on camera parameters and other parameters of the dataset. A popular example of 3D pose estimation is Google’s MediaPipe model.

BlazePoseFullbody

This model, developed by Google, gives joint positions including Z-values, which are expressed in a local coordinate system. The inference runs in 59.5ms on an M1 Mac using ailia SDK.

Image from https://pixabay.com/ja/photos/%E5%A5%B3%E3%81%AE%E5%AD%90-%E7%BE%8E%E3%81%97%E3%81%84-%E8%8B%A5%E3%81%84-%E3%83%9B%E3%83%AF%E3%82%A4%E3%83%88-5204299/

MediaPipeWorldLandmarks

This model, also developed by Google, gives joint positions including Z-values in world coordinates, expressed in meters. The inference runs in 175.25ms on an M1 Mac using ailia SDK.
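
Because world coordinates are metric, distances between joints can be read directly in meters. The snippet below is a small sketch of that idea; the landmark values are made-up numbers and the (x, y, z) tuple layout is an assumption for illustration, not the output format of the ailia sample.

```python
import math

# Hypothetical world landmarks in meters, as (x, y, z) tuples.
left_shoulder = (0.15, -0.45, -0.05)
left_elbow = (0.20, -0.20, -0.02)


def segment_length_m(joint_a, joint_b):
    """Euclidean distance between two 3D joints, in meters."""
    return math.dist(joint_a, joint_b)


upper_arm_m = segment_length_m(left_shoulder, left_elbow)
print(f"upper arm length: {upper_arm_m:.3f} m")
```

With a local coordinate system such as the one returned by BlazePoseFullbody, the same computation would only give a relative value, not a physical length.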

Face and Hand Detection

FaceMesh

This model, developed by Google, is capable of computing 468 key points on the face. It offers 2 models, one using a typical approach and a newer one using an attention mechanism. Inference with the typical model runs in 17.0ms per person, and the model using attention runs in 98.0ms per person, on an M1 Mac using ailia SDK.

Image from https://pixabay.com/photos/person-human-male-face-man-view-829966/

BlazeHand

This is another model developed by Google to compute keypoints on a hand. The inference runs in 11.75ms on an M1 Mac using ailia SDK.

Image from https://pixabay.com/photos/stop-no-photo-no-photographing-hand-565609/

MediapipeHolistic

This model combines BlazePose, FaceMesh and BlazeHand to efficiently compute the face, hand and body keypoints in a single pass.

Source: https://colab.research.google.com/drive/1uCuA6We9T5r0WljspEHWPHXCT_2bMKUy

Multiple detectors can be enabled with the --detector option.

$ python3 mediapipe_holistic.py --detector

Road Detection

The task consists of segmenting the parts of the image that represent the road.

RoadSegmentationAdas

This model, developed by Intel, is able to detect the drivable area and lanes with high accuracy even on Japanese roads. The inference runs in 43.5ms on an M1 Mac using ailia SDK.

Image from https://www.pexels.com/ja-jp/video/854669/

Anomaly Detection

Anomaly or product defect detection is based on learning from images of normal products and segmentation of defective areas.

PaDiM

This model can detect defects using the Mahalanobis distance and a covariance matrix after being trained on only about 200 images of products without defects. ResNet is used for feature extraction. The inference runs in 379.25ms on an M1 Mac using ailia SDK, where 6ms is spent running ResNet18, getting the embedding vectors takes 40ms, and computing the Mahalanobis distance takes 333ms.

Image from MVTec AD datasets https://www.mvtec.com/company/research/datasets/mvtec-ad/
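
The scoring step can be illustrated with a toy example. This is a sketch under strong simplifying assumptions, not the PaDiM implementation: features are only 2-dimensional so that the covariance matrix can be inverted by hand, the sample values are made up, and the function names are hypothetical. In the real model the embeddings are high-dimensional and one distribution is fitted per patch position.

```python
import math


def fit_gaussian(samples):
    """Estimate the mean and 2x2 covariance of 2-D feature vectors."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    eps = 0.01  # small diagonal term keeps the matrix invertible
    return (mx, my), ((sxx + eps, sxy), (sxy, syy + eps))


def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from a 2-D Gaussian (mean, cov)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # d^T * inverse(cov) * d, with the 2x2 inverse written out explicitly.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Features extracted from defect-free products (made-up values).
normal_features = [(1.0, 2.0), (1.2, 2.1), (0.8, 1.9), (1.1, 2.2), (0.9, 1.8)]
mean, cov = fit_gaussian(normal_features)

score_normal = mahalanobis((1.05, 2.05), mean, cov)
score_defect = mahalanobis((2.0, 1.0), mean, cov)
print(f"normal: {score_normal:.2f}, defect: {score_defect:.2f}")
```

A feature that is far from the distribution learned on normal products gets a large distance, and thresholding that distance per patch gives the defect map.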

PaDiM can also be executed using a GUI originally developed by ax Inc. The results of defect detection can be obtained by giving images of products without anomalies in the Train images section, images presenting defects in Test images, and pressing the Train button. Finally, the Test button lets you run the inference on a new image.

PaDiM GUI

The GUI can be started with the following command.

$ python3 padim_gui.py

Background Removal

This category of models segments the foreground elements of an image and separates them from the background.

U2Net

This model can remove the background from images of people as well as other generic objects. The inference runs in 47.5ms using the base model, and 24.0ms using the U2NetP model, on an M1 Mac using ailia SDK.

Image from https://github.com/NathanUA/U-2-Net/blob/master/test_data/test_images/girl.png

Using the --composite option you can also generate a PNG file with the background removed.

$ python3 u2net.py --input input.png --savepath output.png --composite

RemBG

This model creates a trimap (a three-valued image with background, middle, and foreground regions) from the output of U2Net, and then uses alpha matting to increase the accuracy. The inference runs in 511.75ms on an M1 Mac using ailia SDK, where 50.25ms is spent on running U2Net.

Image from https://github.com/danielgatis/rembg/blob/main/examples/animal-1.jpg
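
The trimap step can be sketched as a simple thresholding of the mask. This is a minimal illustration, assuming a 0 to 255 mask stored as a list of rows; the function name and the threshold values are illustrative choices, and it leaves out any morphological clean-up that a real pipeline may apply before alpha matting.

```python
FOREGROUND, UNKNOWN, BACKGROUND = 255, 128, 0


def make_trimap(mask, fg_threshold=240, bg_threshold=10):
    """Turn a 0-255 saliency mask into a three-valued trimap."""
    trimap = []
    for row in mask:
        out = []
        for value in row:
            if value >= fg_threshold:
                out.append(FOREGROUND)   # confidently foreground
            elif value <= bg_threshold:
                out.append(BACKGROUND)   # confidently background
            else:
                out.append(UNKNOWN)      # left for alpha matting to resolve
        trimap.append(out)
    return trimap


# Tiny made-up mask: background on the left, object on the right.
mask = [
    [0, 5, 120, 250],
    [3, 90, 245, 255],
]
trimap = make_trimap(mask)
for row in trimap:
    print(row)
```

Only the pixels marked as unknown need a fractional alpha value, which is why matting mostly improves hair, fur and other soft edges.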

Depth Estimation

Midas

This model estimates depth from monocular images. It is trained by mixing multiple datasets and has high generalization performance. The inference runs in 60.75ms on an M1 Mac using ailia SDK.

Image from KITTI dataset http://www.cvlibs.net/datasets/kitti/raw_data.php

OCR

The purpose of OCR is to read text in an image.

PaddleOCR

This is a real-time OCR model developed by Baidu that can also recognize Japanese characters. CRAFT is used to detect the position of characters, and the detected characters are then read. For Japanese characters, a highly accurate server-side model trained independently by ax Inc. can also be used. The inference runs in 2667ms on an M1 Mac using ailia SDK, where 148ms is spent detecting the position of characters, 437ms detecting the direction of 54 words, and 1398ms identifying (“reading”) those 54 words. Processing time for word orientation and identification is proportional to the number of words.

Image from https://github.com/PaddlePaddle/PaddleOCR/tree/dygraph/doc/imgs
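
Since the orientation and identification stages scale with the number of words, the figures above can be turned into a rough estimate for other images. The sketch below only extrapolates the three itemized stages (148ms, 437ms and 1398ms for 54 words, which add up to 1983ms); the rest of the 2667ms total is not broken down in the measurements, so it is left out. The function name is hypothetical and the linear model is an assumption.

```python
DETECTION_MS = 148.0                   # text position detection, once per image
DIRECTION_MS_PER_WORD = 437.0 / 54     # orientation classification
RECOGNITION_MS_PER_WORD = 1398.0 / 54  # reading the characters


def estimate_stage_time_ms(num_words):
    """Rough time for the three itemized stages, assuming linear scaling."""
    per_word = DIRECTION_MS_PER_WORD + RECOGNITION_MS_PER_WORD
    return DETECTION_MS + num_words * per_word


for words in (10, 54, 100):
    print(f"{words} words: about {estimate_stage_time_ms(words):.0f} ms")
```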

The high-precision Japanese model trained by ax Inc. can be selected with the -c server option.

$ python paddleocr.py -i input.png -c server

Speech Recognition

Speech recognition involves speech-to-text transcription as well as speaker identification from voice recordings.

Whisper

This speech recognition model developed by OpenAI was trained on 680,000 hours of speech data and can perform speech-to-text transcription in 99 languages, including Japanese. It takes an audio file as input and generates the corresponding text.

AutoSpeech

This is a voice recognition model that can determine the identity of a person based on their voice. Two recordings are attributed to the same person by extracting a feature vector from each voice and calculating the distance between the feature vectors. It can also be used with Whisper for separating different speakers’ voices in a conversation.
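
The comparison of feature vectors can be sketched as below. This is a minimal example assuming cosine similarity as the measure and a hand-picked threshold; the vectors are made-up 3-dimensional values, whereas real speaker embeddings have many more dimensions, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def same_speaker(u, v, threshold=0.7):
    """Decide whether two embeddings belong to the same voice (assumed threshold)."""
    return cosine_similarity(u, v) >= threshold


# Made-up embeddings: one enrolled speaker and two new recordings.
enrolled = [0.9, 0.1, 0.4]
recording_a = [0.8, 0.2, 0.5]
recording_b = [-0.2, 0.9, 0.1]

print(same_speaker(enrolled, recording_a))
print(same_speaker(enrolled, recording_b))
```

In practice the threshold would be tuned on recordings of known speakers, since it trades false matches against missed matches.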

About ailia MODELS

Getting started

Please refer to the tutorial below.

Launcher

ailia MODELS includes a simple GUI to easily run any model on an image or a video.

ailia MODELS Launcher

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.

David Cochard
axinc-ai

Engineer with 10+ years in game engines & multiplayer backend development. Now focused on machine learning, computer vision, graphics and AR