Recommended ailia MODELS by usage
ailia MODELS is an open-source model library provided by ax Inc. This article introduces recommended models for typical application purposes.
About ailia MODELS
ailia MODELS offers more than 300 models at the time of writing, converted to ONNX and ready to be used out-of-the-box with ailia SDK, allowing you to easily try out different models on images or videos.
Object Detection
Object detection computes 2D bounding boxes for the objects in an image and assigns a class label to each of them.
YOLOX
YOLOX is the 2021 iteration of the YOLO model series, widely used in object detection. It is easy to use under the Apache License, offers a strong augmentation mechanism applied at training time, and has excellent generalization performance.
Six variants of the model are available: nano, tiny, s, m, l, and x, which can be switched depending on the required accuracy. yolox-s, for example, has a reference inference speed of 20 ms on an M1 Mac using ailia SDK.
For a slightly different use, the -dw and -dh options can be given to change the recognition resolution. Use these options when you want to detect small objects.
$ python3 yolox.py -dw 1280 -dh 1280
DETIC
If you want to recognize objects other than those in the 80 categories of YOLOX, DETIC can be used. The base model takes longer to infer, but it can recognize many more object classes. Labels that can be assigned are taken from the LVIS (1200+ categories, 164K images) and ImageNet21k (21 843 categories, 14M images) datasets. Inference with this model on an M1 Mac (CPU) with ailia SDK takes 4110 ms.
Although Detic is a heavy model, the -dw option can be used to reduce the recognition resolution and speed up the process.
$ python3 detic.py -dw 320
It is also possible to use ONNX’s GridSampler instead of Torch’s GridSampler by using the --opset16 option.
$ python3 detic.py --opset16
Object Tracking
Object tracking is a task that computes 2D bounding boxes for the objects in an image and tracks each bounding box across subsequent frames by assigning it a tracking ID.
DeepSort
This object tracking model combines the SORT algorithm, which uses a Kalman filter, with person re-identification (ReID) to determine whether a detection belongs to the same person. After a person is detected using YOLO, the SORT algorithm and the ReID model assign an ID to that person. It can be used, for example, to count the number of people passing through or to detect flow lines. Inference for the ReID model takes 4 ms per feature extraction on an M1 Mac with ailia SDK.
ByteTrack
This object tracking model improves accuracy by adding matching logic for bounding boxes with low confidence values on top of the Kalman filter tracking. It can be applied to tracking targets other than people. Because ByteTrack itself runs very fast, the processing time is approximately equal to the YOLO processing time.
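The idea of matching low-confidence boxes in a second pass can be sketched as follows. This is a simplified illustration, not the ByteTrack implementation: it assumes the predicted track boxes are already available (in ByteTrack they come from the Kalman filter), it uses greedy IoU matching where a real tracker would typically solve an assignment problem, and the names `associate`, `greedy_match`, `high_thresh`, and `min_iou` as well as the threshold values are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, boxes, min_iou):
    """Greedily pair track IDs with boxes in order of descending IoU.

    tracks: dict of track ID -> predicted box; boxes: list of boxes.
    Returns ({track_id: box_index}, [unmatched track IDs]).
    """
    pairs = sorted(
        ((iou(tbox, box), tid, j)
         for tid, tbox in tracks.items()
         for j, box in enumerate(boxes)),
        key=lambda p: p[0],
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched = [tid for tid in tracks if tid not in matches]
    return matches, unmatched


def associate(tracks, detections, high_thresh=0.5, min_iou=0.3):
    """Match high-score boxes first, then give leftover tracks a second
    chance against the low-score boxes. detections: list of (box, score)."""
    high = [box for box, score in detections if score >= high_thresh]
    low = [box for box, score in detections if score < high_thresh]

    first, leftover = greedy_match(tracks, high, min_iou)
    result = {tid: high[j] for tid, j in first.items()}

    remaining = {tid: tracks[tid] for tid in leftover}
    second, _ = greedy_match(remaining, low, min_iou)
    result.update({tid: low[j] for tid, j in second.items()})
    return result


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [
    ((1, 1, 11, 11), 0.9),    # confident detection near track 1
    ((21, 20, 31, 30), 0.2),  # low-confidence detection near track 2
]
assigned = associate(tracks, detections)
```

In this toy input, track 2 would be lost if only confident detections were considered; the second pass keeps it alive by matching it to the low-confidence box.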
Segmentation
Segmentation is a task that finds region contours of an object from a certain category in an image.
HRNet
This segmentation model achieves high accuracy by connecting high-resolution and low-resolution information in parallel. The inference runs in 31.25ms on an M1 Mac using ailia SDK.
PaddleSeg
This is a highly accurate segmentation model developed by Baidu. It is quite heavy, so it should be used when high precision is required. The inference runs in 19555ms on an M1 Mac using ailia SDK.
2D Pose Estimation
Pose estimation aims at computing the position of limbs and joints in an image.
LightWeightHumanPoseEstimation
This model uses a bottom-up approach to compute and merge multiple keypoints from a single frame. It should be used when inference speed matters more than precision. Note that the processing time does not depend on the number of people visible in the image. The inference runs in 11ms on an M1 Mac using ailia SDK.
PoseResNet
This model, developed by Microsoft and based on ResNet, uses a top-down approach. It should be used when high precision is required. The inference runs in 27.5ms on an M1 Mac using ailia SDK, including the processing time of the YOLO model used internally.
MoveNet
This is another model using a top-down approach, developed by Google and mostly used for fitness applications because it is robust to fast-moving video. There are two models available, Thunder and Lightning. The inference runs in 27.75ms for Thunder and 23.25ms for Lightning on an M1 Mac using ailia SDK.
AnimalPose
This pose estimation model can be applied to animals such as dogs, cats, cattle, horses, and sheep. After animals are detected with YOLO, the pose estimation is performed using HRNet. The inference runs in 40ms on an M1 Mac using ailia SDK, including the processing time of the YOLO model used internally.
3D Pose Estimation
This task is similar to the previous one but gives joint positions in 3D. Inferring 3D positions from 2D data is usually strongly dependent on camera parameters and other parameters of the dataset. A popular example of 3D pose estimation is Google’s MediaPipe model.
BlazePoseFullbody
This model, developed by Google, gives joint positions including Z-values, which are expressed in a local coordinate system. The inference runs in 59.5ms on an M1 Mac using ailia SDK.
MediaPipeWorldLandmarks
This model, also developed by Google, gives joint positions including Z-values in world coordinates, in meters. The inference runs in 175.25ms on an M1 Mac using ailia SDK.
Face and Hand Detection
FaceMesh
This model, developed by Google, is capable of computing 896 key points on the face. It offers two models: one using a typical approach and a newer one using an attention mechanism. On an M1 Mac using ailia SDK, inference runs in 17.0ms per person with the typical model and 98.0ms per person with the attention model.
BlazeHand
This is another model developed by Google to compute keypoints on a hand. The inference runs in 11.75ms on an M1 Mac using ailia SDK.
MediapipeHolistic
This model combines BlazePose, FaceMesh, and BlazeHand to efficiently compute the face, hand, and body keypoints in a single pass.
Multiple detectors can be used with the --detector option.
$ python3 mediapipe_holistic.py --detector
Road Detection
The task consists of segmenting the parts of the image that represent the road.
RoadSegmentationAdas
This model, developed by Intel, is able to detect the drivable area and lanes with high accuracy even on Japanese roads. The inference runs in 43.5ms on an M1 Mac using ailia SDK.
Anomaly Detection
Anomaly or product defect detection is based on learning from images of normal products and segmentation of defective areas.
PaDiM
This model can detect defects using the Mahalanobis distance and a covariance matrix after being trained on only about 200 images of products without defects. ResNet is used for feature extraction. The inference runs in 379.25ms on an M1 Mac using ailia SDK, where 6ms is spent running ResNet18, getting the embedding vectors takes 40ms, and computing the Mahalanobis distance takes 333ms.
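To make the role of the Mahalanobis distance concrete, here is a toy sketch. It assumes 2-D feature vectors with made-up values so that the covariance matrix can be inverted by hand; the actual model works on much higher-dimensional ResNet embeddings, and the helper names below are hypothetical.

```python
import math


def mean_and_covariance(samples):
    """Mean vector and 2x2 sample covariance of 2-D feature vectors."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis(v, mean, cov):
    """Mahalanobis distance of v from the distribution (mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = v[0] - mean[0], v[1] - mean[1]
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(q)


# Toy 2-D "embeddings" of defect-free samples (made-up values).
normal = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (2.5, 2.5)]
mean, cov = mean_and_covariance(normal)

near = mahalanobis((2.5, 2.6), mean, cov)  # resembles the normal samples
far = mahalanobis((8.0, 0.0), mean, cov)   # unlike anything seen in training
```

A feature vector that resembles the defect-free samples gets a small distance, while an unusual one gets a large distance, which is what allows a threshold on this score to flag defective regions.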
PaDiM can also be executed using a GUI originally developed by ax Inc. To obtain defect detection results, provide images of products without anomalies in the Train images section and images presenting defects in the Test images section, then press the Train button. Finally, the Test button lets you run the inference on a new image.
The GUI can be started with the following command:
$ python3 padim_gui.py
Background Removal
This category of models segments the foreground elements in an image and separates them from the background.
U2Net
This model can remove the background from images of people as well as other generic objects. The inference runs in 47.5ms using the base model, and 24.0ms using the U2NetP model, on an M1 Mac using ailia SDK.
Using the --composite option, you can also generate a PNG file with the background removed.
$ python3 u2net.py --input input.png --savepath output.png --composite
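One common way to produce an image with the background removed is to attach the predicted mask to the RGB pixels as an alpha channel. The sketch below illustrates that idea on plain Python lists; whether u2net.py does exactly this internally is an assumption, and `composite_rgba` is a hypothetical helper.

```python
def composite_rgba(rgb_pixels, mask):
    """Attach a mask (0-255 per pixel) as the alpha channel of RGB pixels."""
    return [
        [(r, g, b, alpha) for (r, g, b), alpha in zip(rgb_row, mask_row)]
        for rgb_row, mask_row in zip(rgb_pixels, mask)
    ]


# A 1x2 image: the first pixel is foreground, the second is background.
image = [[(200, 10, 10), (10, 200, 10)]]
mask = [[255, 0]]
rgba = composite_rgba(image, mask)
```

Pixels where the mask is 0 become fully transparent, so the background disappears when the result is saved in a format with an alpha channel, such as PNG.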
RemBG
This model creates a trimap (a tri-valued image with background, middle, and foreground regions) from the output of U2Net, and then uses alpha matting to increase the accuracy. The inference runs in 511.75ms on an M1 Mac using ailia SDK, where 50.25ms is spent running U2Net.
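The trimap step can be sketched as a simple thresholding of the soft mask. This is a minimal illustration with assumed threshold values and a hypothetical `make_trimap` helper; an actual implementation may refine the regions further before alpha matting.

```python
def make_trimap(mask, fg_thresh=240, bg_thresh=10):
    """Turn a soft mask (values 0-255) into a trimap.

    Pixels >= fg_thresh become 255 (foreground), pixels <= bg_thresh
    become 0 (background), and everything else becomes 128 (middle).
    """
    trimap = []
    for row in mask:
        out = []
        for value in row:
            if value >= fg_thresh:
                out.append(255)
            elif value <= bg_thresh:
                out.append(0)
            else:
                out.append(128)
        trimap.append(out)
    return trimap


soft_mask = [
    [0, 5, 120, 250],
    [3, 90, 245, 255],
]
trimap = make_trimap(soft_mask)
```

Alpha matting then only has to resolve the middle region, where the mask alone cannot decide between foreground and background.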
Depth Estimation
Midas
This model estimates depth from monocular images. It is trained by mixing multiple datasets and has high generalization performance. The inference runs in 60.75ms on an M1 Mac using ailia SDK.
OCR
The purpose of OCR is to read text in an image.
PaddleOCR
This is a real-time OCR model developed by Baidu that can also recognize Japanese characters. CRAFT is used to detect the position of characters, and the detected characters are then read. For Japanese characters, a highly accurate server-side model trained independently by ax Inc. can also be used. The inference runs in 2667ms on an M1 Mac using ailia SDK, where 148ms is spent detecting the position of characters, 437ms detecting the direction of 54 words, and 1398ms identifying (“reading”) those 54 words. Processing time for word orientation and identification is proportional to the number of words.
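Since orientation and identification scale with the number of words, the figures above give a rough per-word cost. The sketch below assumes detection time stays constant per image and that the other two stages scale linearly. It covers only the three itemized stages, because the rest of the 2667ms total is not broken down here.

```python
# Figures quoted above for an image containing 54 words (M1 Mac, ailia SDK).
DETECTION_MS = 148     # text position detection, assumed constant per image
ORIENTATION_MS = 437   # direction detection for 54 words
RECOGNITION_MS = 1398  # reading 54 words
WORDS_MEASURED = 54


def estimate_stage_time_ms(num_words):
    """Estimate detection + orientation + recognition time for num_words."""
    per_word = (ORIENTATION_MS + RECOGNITION_MS) / WORDS_MEASURED
    return DETECTION_MS + per_word * num_words


per_word_ms = (ORIENTATION_MS + RECOGNITION_MS) / WORDS_MEASURED
estimate_10 = estimate_stage_time_ms(10)
```

With these numbers, each word costs about 34ms, so an image with 10 words would need a little under 500ms for the three itemized stages.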
PaddleOCR can also use a high precision model for Japanese that was originally trained by ax Inc.
$ python paddleocr.py -i input.png -c server
Speech Recognition
Speech recognition involves speech-to-text transcription as well as identification from audio files of voices.
Whisper
This speech recognition model, developed by OpenAI, was trained on 680 000 hours of speech data and can perform speech-to-text transcription in 99 languages, including Japanese. Text is generated from an audio file given as input.
AutoSpeech
This is a voice recognition model that can determine the identity of a person based on their voice. Whether two recordings come from the same person is determined by extracting feature vectors from the voices and calculating the distance between those vectors. It can also be used with Whisper to separate different speakers’ voices in a conversation.
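The comparison of feature vectors can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the metric and a made-up threshold; the vectors are toy values rather than real model outputs, and `same_speaker` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def same_speaker(u, v, threshold=0.7):
    """Treat two voices as the same person if their vectors are similar."""
    return cosine_similarity(u, v) >= threshold


enrolled = [0.9, 0.1, 0.4]      # feature vector of a known speaker
utterance_a = [0.8, 0.2, 0.5]   # points in a similar direction
utterance_b = [-0.3, 0.9, 0.1]  # points in a different direction

match_a = same_speaker(enrolled, utterance_a)
match_b = same_speaker(enrolled, utterance_b)
```

In practice the threshold would be tuned on recordings of known speakers, trading false matches against missed matches.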
About ailia MODELS
Getting started
Please refer to the tutorial below.
Launcher
ailia MODELS includes a simple GUI to easily run any model on an image or a video.
ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.
ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.