Recommended ailia MODELS by usage

Published in axinc-ai · 12 min read · Sep 23, 2023

ailia MODELS is an open-source model library provided by ax Inc. This article introduces recommended models for typical application purposes.

About ailia MODELS

ailia MODELS offers more than 300 models at the time of writing, all converted to ONNX and ready to use out of the box with ailia SDK, so you can easily try different models on images or videos.

ax Inc.

A future in which all devices carry AI We believe such a future is on the way. To usher in that day, we will keep…

axinc.jp

Object Detection

Object detection computes 2D bounding boxes for the objects in an image and assigns a class label to each of them.

YOLOX

YOLOX is the 2021 iteration of the YOLO model series, which is widely used in object detection. It is released under the Apache License, which makes it easy to adopt, uses strong data augmentation at training time, and has excellent generalization performance.

Six variants of the model are available: nano, tiny, s, m, l, and x, which can be switched depending on the required accuracy. yolox-s, for example, has a reference inference speed of 20 ms on an M1 Mac using ailia SDK.

ailia-models/object_detection/yolox at master · axinc-ai/ailia-models

(Image from…

github.com

YOLOX : Object detection model exceeding YOLOv5

This is an introduction to「YOLOX」, a machine learning model that can be used with ailia SDK. You can easily use this…

medium.com

For a slightly different use, the -dw and -dh options can be given to change the recognition resolution. Use these options when you want to detect small objects.

$ python3 yolox.py -dw 1280 -dh 1280
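To see why a larger recognition resolution helps with small objects, the following back-of-the-envelope sketch estimates how many pixels an object occupies once the frame has been resized for inference. It assumes a plain proportional resize to the detection width. The function name, the 1920-pixel frame, and the 640-pixel baseline are illustrative values, not settings taken from yolox.py.

```python
def apparent_size(object_px: float, image_width: int, detect_width: int) -> float:
    """Width in pixels of an object after the image is resized to detect_width.

    Assumes a simple proportional resize (no letterboxing or cropping).
    """
    if image_width <= 0 or detect_width <= 0:
        raise ValueError("widths must be positive")
    return object_px * detect_width / image_width


if __name__ == "__main__":
    # Hypothetical case: a 24-pixel-wide object in a 1920-pixel-wide frame.
    for detect_width in (640, 1280):
        print(detect_width, apparent_size(24, 1920, detect_width))
```

Under these assumptions, doubling the detection width doubles the number of pixels the detector sees for the same object, at the cost of a larger input to process.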

DETIC

If you want to recognize objects outside the 80 categories supported by YOLOX, DETIC can be used. The base model takes longer to run, but it can recognize far more object classes. The labels that can be assigned are taken from the LVIS (1200+ categories, 164K images) and ImageNet21k (21,843 categories, 14M images) datasets. Inference with this model takes 4110 ms on an M1 Mac (CPU) using ailia SDK.

ailia-models/object_detection/detic at master · axinc-ai/ailia-models

(Image from https://web.eecs.umich.edu/~fouhey/fun/desk/desk.jpg, credit David Fouhey) Automatically downloads the onnx…

github.com

Detic : Object Detection and Segmentation of 21k Classes with High Accuracy

This is an introduction to「Detic」, a machine learning model that can be used with ailia SDK. You can easily use this…

medium.com

Although Detic is a heavy model, the -dw option can be used to reduce the recognition resolution and speed up the process.

$ python3 detic.py -dw 320

It is also possible to use ONNX’s GridSampler instead of Torch’s GridSampler by using the --opset16 option.

$ python3 detic.py --opset16

Object Tracking

Object tracking is a task that computes 2D bounding boxes for the objects in an image and follows each bounding box through subsequent frames by assigning it a tracking ID.

DeepSort

This object tracking model combines the SORT algorithm, which uses a Kalman filter, with person re-identification (ReID) to determine whether two detections show the same person. After a person is detected using YOLO, the SORT algorithm and the ReID model assign an ID to that person. It can be used, for example, to count the number of people passing through or to detect flow lines. Inference for the ReID model takes 4 ms per feature extraction on an M1 Mac using ailia SDK.
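The "same person" decision made from ReID features can be pictured as a similarity test between two feature vectors. The sketch below uses cosine similarity with a fixed threshold. Both the metric and the 0.8 threshold are assumptions chosen for illustration, not values taken from the DeepSort sample.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def same_person(a: Sequence[float], b: Sequence[float], threshold: float = 0.8) -> bool:
    """Decide whether two ReID features belong to the same person.

    The default threshold is an arbitrary illustrative value.
    """
    return cosine_similarity(a, b) >= threshold
```

In a tracker, a test like this would be applied between the feature of a new detection and the features stored for each existing track.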

Video from http://www.robots.ox.ac.uk/~lav/Research/Projects/2009bbenfold_headpose/project.html

ailia-models/object_tracking/deepsort at master · axinc-ai/ailia-models

DeepSORT original mode: compare image mode: [‘correct_32_1.jpg’ , ‘correct_32_2.jpg’ ]: SAME person (confidence…

github.com

DeepSort : A Machine Learning Model for Tracking People

medium.com

ByteTrack

This object tracking model improves accuracy by adding matching logic for bounding boxes with low confidence values on top of Kalman filter tracking. It can track targets other than people. Because ByteTrack itself runs very fast, the total processing time is approximately equal to the YOLO processing time.
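The idea of giving low-confidence boxes a second chance can be sketched as a two-stage association. This is a simplified illustration: it matches greedily by IoU against track boxes that are assumed to be already predicted, and all thresholds and function names are hypothetical. It is not the matching code used by the ByteTrack sample.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: Dict[int, Box], dets: List[int], boxes: List[Box],
                 min_iou: float) -> Dict[int, int]:
    """Greedily pair track IDs with detection indices, best IoU first."""
    pairs = sorted(((iou(tb, boxes[d]), tid, d)
                    for tid, tb in tracks.items() for d in dets), reverse=True)
    matched: Dict[int, int] = {}
    used = set()
    for score, tid, d in pairs:
        if score < min_iou:
            break
        if tid not in matched and d not in used:
            matched[tid] = d
            used.add(d)
    return matched


def associate(tracks: Dict[int, Box], boxes: List[Box], scores: List[float],
              high: float = 0.5, low: float = 0.1,
              min_iou: float = 0.3) -> Dict[int, int]:
    """Two-stage association: high-confidence boxes first, then low-confidence ones."""
    high_dets = [i for i, s in enumerate(scores) if s >= high]
    low_dets = [i for i, s in enumerate(scores) if low <= s < high]
    matched = greedy_match(tracks, high_dets, boxes, min_iou)
    # Tracks left unmatched get a second chance against low-confidence boxes.
    remaining = {tid: b for tid, b in tracks.items() if tid not in matched}
    matched.update(greedy_match(remaining, low_dets, boxes, min_iou))
    return matched
```

The returned dictionary maps each track ID to the index of the detection assigned to it. A track whose object is partially occluded, and therefore detected with a low score, can still keep its ID in the second stage.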

ailia-models/object_tracking/bytetrack at master · axinc-ai/ailia-models

(Video from https://vimeo.com/60139361) This model requires additional module. Automatically downloads the onnx and…

github.com

ByteTrack : Tracking model that also considers low accuracy bounding boxes

This is an introduction to「ByteTrack」, a machine learning model that can be used with ailia SDK. You can easily use…

medium.com

Segmentation

Segmentation is a task that finds region contours of an object from a certain category in an image.
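As a rough picture of how a region is obtained, many segmentation networks output one score map per category, and each pixel is then assigned to the category with the highest score. The sketch below assumes that output format. It is an illustration with hypothetical names, not the post-processing of any specific model listed here.

```python
from typing import List

ScoreMap = List[List[float]]  # scores[y][x] for one category


def label_map(class_scores: List[ScoreMap]) -> List[List[int]]:
    """Return, for every pixel, the index of the category with the highest score."""
    height = len(class_scores[0])
    width = len(class_scores[0][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            pixel_scores = [scores[y][x] for scores in class_scores]
            row.append(pixel_scores.index(max(pixel_scores)))
        labels.append(row)
    return labels
```

The boundary between pixels with different labels is what appears as the region contour in the visualized result.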

HRNet

This segmentation model achieves high accuracy by connecting high-resolution and low-resolution representations in parallel. Inference runs in 31.25ms on an M1 Mac using ailia SDK.

Image from https://www.cityscapes-dataset.com/

ailia-models/image_segmentation/hrnet_segmentation at master · axinc-ai/ailia-models

(from https://www.cityscapes-dataset.com/) Ailia input shape: (1, 3, 512, 1024) Range:[0, 1] Normal output Smoothed…

github.com

PaddleSeg

This is a highly accurate segmentation model developed by Baidu. It is quite heavy, so it should be used when high precision is required. Inference runs in 19555ms on an M1 Mac using ailia SDK.

ailia-models/image_segmentation/paddleseg at master · axinc-ai/ailia-models

(Image from https://www.cityscapes-dataset.com/downloads/) Automatically downloads the onnx and prototxt files on the…

github.com

PaddleSeg: Highly Accurate Segmentation Model Using Hierarchical Attention

This is an introduction to「PaddleSeg」, a machine learning model that can be used with ailia SDK. You can easily use…

medium.com

2D Pose Estimation

Pose estimation aims at computing the position of limbs and joints in an image.

LightWeightHumanPoseEstimation

This model uses a bottom-up approach to compute and merge multiple keypoints from a single frame. It should be used when inference speed matters more than precision. Note that the processing time does not depend on the number of people visible in the image. Inference runs in 11ms on an M1 Mac using ailia SDK.

ailia-models/pose_estimation/lightweight-human-pose-estimation at master · axinc-ai/ailia-models

(Image from…

github.com

LightWeightHumanPose : A Machine Learning Model for Fast Multi-person Skeleton Detection.

This is an introduction to「LightWeightHumanPose」, a machine learning model that can be used with ailia SDK. You can…

medium.com

PoseResNet

This model uses a top-down approach developed by Microsoft and based on ResNet. It should be used when high precision is required. Inference runs in 27.5ms on an M1 Mac using ailia SDK, including the processing time of the YOLO model used internally.

ailia-models/pose_estimation/pose_resnet at master · axinc-ai/ailia-models

(Image from…

github.com

MoveNet

This is another model using a top-down approach developed by Google, mostly used for fitness applications because it is robust to fast-moving video. Two models are available, Thunder and Lightning. Inference runs in 27.75ms for Thunder and 23.25ms for Lightning on an M1 Mac using ailia SDK.

ailia-models/pose_estimation/movenet at master · axinc-ai/ailia-models

(Image from…

github.com

MoveNet : Pose Estimation for Video with Intense Motion

This is an introduction to「MoveNet」, a machine learning model that can be used with ailia SDK. You can easily use this…

medium.com

AnimalPose

This pose estimation model can be applied to animals such as dogs, cats, cattle, horses, and sheep. After animals are detected with YOLO, pose estimation is performed using HRNet. Inference runs in 40ms on an M1 Mac using ailia SDK, including the processing time of the YOLO model used internally.

ailia-models/pose_estimation/animalpose at master · axinc-ai/ailia-models

(Image from…

github.com

AnimalPose : Pose Esimation for Animals

This is an introduction to「AnimalPose」, a machine learning model that can be used with ailia SDK. You can easily use…

medium.com

3D Pose Estimation

This task is similar to the previous one but gives joint positions in 3D. Inferring 3D positions from 2D data is usually strongly dependent on camera parameters and other parameters of the dataset. A popular example of 3D pose estimation is Google’s MediaPipe model.

BlazePoseFullbody

This model, developed by Google, gives joint positions including Z-values, which are expressed in a local coordinate system. Inference runs in 59.5ms on an M1 Mac using ailia SDK.

ailia-models/pose_estimation_3d/blazepose-fullbody at master · axinc-ai/ailia-models

(Image from…

github.com

BlazePose : A 3D Pose Estimation Model

This is an introduction to「BlazePose」, a machine learning model that can be used with ailia SDK. You can easily use…

medium.com

MediaPipeWorldLandmarks

This model, also developed by Google, gives joint positions including Z-values in world coordinates, expressed in meters. Inference runs in 175.25ms on an M1 Mac using ailia SDK.
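Because the coordinates are metric, physical quantities such as the distance between two joints can be computed directly. A minimal sketch, assuming each landmark is available as an (x, y, z) tuple in meters. The function name and the sample coordinates are hypothetical.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in meters


def joint_distance(a: Point3D, b: Point3D) -> float:
    """Euclidean distance in meters between two 3D joint positions."""
    return math.dist(a, b)


if __name__ == "__main__":
    # Hypothetical shoulder and elbow positions, not real model output.
    shoulder = (0.15, -0.45, -0.05)
    elbow = (0.20, -0.20, 0.00)
    print(f"upper arm length: {joint_distance(shoulder, elbow):.3f} m")
```

The same computation on local, non-metric coordinates would only give relative lengths.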

ailia-models/pose_estimation_3d/mediapipe_pose_world_landmarks at master · axinc-ai/ailia-models

(Image from https://mediapipe.page.link/pose_py_colab) Automatically downloads the onnx and prototxt files on the first…

github.com

Face and Hand Detection

FaceMesh

This model, developed by Google, is capable of computing 468 key points on the face. Two models are offered, one using a typical approach and a newer one using an attention mechanism. Inference runs in 17.0ms per person with the typical model and in 98.0ms per person with the attention model, on an M1 Mac using ailia SDK.

ailia-models/face_recognition/facemesh at master · axinc-ai/ailia-models

(Image from https://pixabay.com/photos/person-human-male-face-man-view-829966/) ailia input shape: (1, 3, 128, 128) RGB…

github.com

FaceMesh : Detecting Key Points on Faces in Real Time

This is an introduction to「FaceMesh」, a machine learning model that can be used with ailia SDK. You can easily use this…

medium.com

BlazeHand

This is another model developed by Google, used to compute keypoints on a hand. Inference runs in 11.75ms on an M1 Mac using ailia SDK.

ailia-models/hand_recognition/blazehand at master · axinc-ai/ailia-models

(Image from https://pixabay.com/photos/stop-no-photo-no-photographing-hand-565609/) ailia input shape: (1, 3, 256, 256)…

github.com

BlazeHand : A Machine Learning Model for Detecting Hand Key Points

This is an introduction to「BlazeHand」, a machine learning model that can be used with ailia SDK. You can easily use…

medium.com

MediapipeHolistic

This model combines BlazePose, FaceMesh and BlazeHand to efficiently compute the face, hand and body keypoints in a single pass.

Source: https://colab.research.google.com/drive/1uCuA6We9T5r0WljspEHWPHXCT_2bMKUy

ailia-models/pose_estimation/mediapipe_holistic at master · axinc-ai/ailia-models

(Image from https://mediapipe.page.link/pose_py_colab) Automatically downloads the onnx and prototxt files on the first…

github.com

Multiple detectors can be enabled with the --detector option.

$ python3 mediapipe_holistic.py --detector

Road Detection

The task consists of segmenting the parts of the image that represent the road.

RoadSegmentationAdas

This model, developed by Intel, is able to detect the drivable area and lanes with high accuracy, even on Japanese roads. Inference runs in 43.5ms on an M1 Mac using ailia SDK.

ailia-models/road_detection/road-segmentation-adas at master · axinc-ai/ailia-models

(Image from https://www.pexels.com/ja-jp/video/854669/) Shape : (1, 512, 896, 3) BGR channel order Shape : (1, 512…

github.com

Anomaly Detection

Anomaly or product defect detection is based on learning from images of normal products and segmentation of defective areas.

PaDiM

This model can detect defects using the Mahalanobis distance and a covariance matrix after being trained on only about 200 images of products without defects. ResNet is used for feature extraction. Inference runs in 379.25ms on an M1 Mac using ailia SDK, of which 6ms is spent running ResNet18, 40ms getting the embedding vectors, and 333ms computing the Mahalanobis distance.
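The role of the Mahalanobis distance can be illustrated with a toy example: a Gaussian is fitted to features of normal samples, and a test feature is scored by its distance to that distribution. The sketch below works on 2D features in pure Python for readability. The real model uses much higher-dimensional embeddings, and the function names here are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]
Mat2 = Tuple[Tuple[float, float], Tuple[float, float]]


def fit_gaussian_2d(samples: Sequence[Vec2]) -> Tuple[Vec2, Mat2]:
    """Estimate the mean and sample covariance of 2D features from normal samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(x: Vec2, mean: Vec2, cov: Mat2) -> float:
    """Mahalanobis distance of x from a 2D Gaussian with the given mean and covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]]
    q = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return math.sqrt(q)
```

A feature far from the normal distribution yields a large distance, which can then be thresholded to flag a region as defective.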

ailia-models/anomaly_detection/padim at master · axinc-ai/ailia-models

Normal images (Image from MVTec AD datasets https://www.mvtec.com/company/research/datasets/mvtec-ad/) Original image…

github.com

PaDiM : A machine learning model for detecting defective products without retraining

This is an introduction to「PaDiM」, a machine learning model that can be used with ailia SDK. You can easily use this…

medium.com

PaDiM can also be run through a GUI developed by ax Inc. The results of defect detection can be obtained by supplying images of products without anomalies in the Train images section and images presenting defects in the Test images section, then pressing the Train button. Finally, the Test button lets you run inference on a new image.

The GUI can be started with the command

$ python3 padim_gui.py

Background Removal

This category of model segments the foreground elements of an image and separates them from the background.

U2Net

This model can remove the background from images of people as well as other generic objects. Inference runs in 47.5ms using the base model and in 24.0ms using the U2NetP model, on an M1 Mac using ailia SDK.

ailia-models/background_removal/u2net at master · axinc-ai/ailia-models

(Image from https://github.com/NathanUA/U-2-Net/blob/master/test_data/test_images/girl.png) Ailia input shape: (1, 3…

github.com

U2Net : A machine learning model that performs object cropping in a single shot

Introducing U2Net, a machine learning model that can be used with ailia SDK. You can easily implement AI features in…

medium.com

Using the --composite option you can also generate a PNG file with the background removed.

$ python3 u2net.py --input input.png --savepath output.png --composite
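Conceptually, a background-removed PNG is the original image with the predicted foreground mask attached as its alpha channel. The following sketch shows that step on plain nested lists. It assumes a mask with values between 0 and 1, and is an illustration of the idea rather than the code run by the --composite option.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def apply_mask(pixels: List[List[RGB]], mask: List[List[float]]) -> List[List[RGBA]]:
    """Attach a foreground mask (values in [0, 1]) to an RGB image as an alpha channel."""
    if len(pixels) != len(mask):
        raise ValueError("image and mask must have the same height")
    out = []
    for pixel_row, mask_row in zip(pixels, mask):
        if len(pixel_row) != len(mask_row):
            raise ValueError("image and mask must have the same width")
        # Clamp the mask to [0, 1] and scale it to the 0..255 alpha range.
        out.append([(r, g, b, round(255 * min(1.0, max(0.0, m))))
                    for (r, g, b), m in zip(pixel_row, mask_row)])
    return out
```

Pixels with alpha 0 become fully transparent when the result is saved in a format that supports transparency, such as PNG.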

RemBG

This model creates a trimap (a three-valued image with background, middle, and foreground regions) from the output of U2Net, and then uses alpha matting to increase the accuracy. Inference runs in 511.75ms on an M1 Mac using ailia SDK, of which 50.25ms is spent running U2Net.
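A trimap can be pictured as a thresholded version of the mask: confident foreground, confident background, and an uncertain band in between that alpha matting then resolves. The sketch below assumes mask values between 0 and 1. The thresholds and the 255/128/0 encoding are illustrative choices, and practical implementations often refine the regions further, which is omitted here.

```python
from typing import List

FOREGROUND, UNKNOWN, BACKGROUND = 255, 128, 0


def make_trimap(mask: List[List[float]], fg_threshold: float = 0.9,
                bg_threshold: float = 0.1) -> List[List[int]]:
    """Map each mask value to foreground, background, or unknown."""
    if bg_threshold >= fg_threshold:
        raise ValueError("bg_threshold must be below fg_threshold")

    def classify(p: float) -> int:
        if p >= fg_threshold:
            return FOREGROUND
        if p <= bg_threshold:
            return BACKGROUND
        return UNKNOWN  # left for alpha matting to decide

    return [[classify(p) for p in row] for row in mask]
```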

ailia-models/background_removal/rembg at master · axinc-ai/ailia-models

(Image from https://github.com/danielgatis/rembg/blob/main/examples/animal-1.jpg) This model requires additional…

github.com

Depth Estimation

Midas

This model estimates depth from monocular images. It is trained on a mix of multiple datasets and generalizes well. Inference runs in 60.75ms on an M1 Mac using ailia SDK.
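The output of a depth model is usually inspected as a grayscale or color-mapped image. A minimal sketch of the normalization step, assuming the model returns a 2D array of relative depth values. The function name is hypothetical and the code is not taken from midas.py.

```python
from typing import List


def normalize_depth(depth: List[List[float]]) -> List[List[int]]:
    """Rescale a 2D depth map linearly to integers in the range 0..255."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant map carries no contrast, so return all zeros.
        return [[0 for _ in row] for row in depth]
    scale = 255.0 / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in depth]
```

Because the values are rescaled per image, the resulting brightness shows relative ordering within one frame and is not comparable across frames.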

ailia-models/depth_estimation/midas at master · axinc-ai/ailia-models

(Image from kitti dataset http://www.cvlibs.net/datasets/kitti/raw_data.php) Shape : (1, 3, h, w) Shape : (1, h, w)…

github.com

Midas : A Machine Learning Model for Depth Estimation

medium.com

OCR

The purpose of OCR is to read text in an image.

PaddleOCR

This is a real-time OCR model developed by Baidu that can also recognize Japanese characters. CRAFT is used to detect the position of characters, and the detected characters are then read. For Japanese characters, a highly accurate server-side model trained independently by ax Inc. can also be used. Inference runs in 2667ms on an M1 Mac using ailia SDK, of which 148ms is spent detecting the position of characters, 437ms detecting the direction of 54 words, and 1398ms identifying (“reading”) those 54 words. Processing time for word orientation and identification is proportional to the number of words.
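Since the orientation and identification stages scale with the number of words, the figures above can be turned into a rough estimate for other images. The sketch below is a linear extrapolation from the 54-word measurement and covers only the three itemized stages (148ms + 437ms + 1398ms = 1983ms for 54 words). The reported total of 2667ms is larger, so the remaining overhead is not modeled. The function name is hypothetical.

```python
DETECTION_MS = 148.0        # text position detection, once per image
ORIENTATION_MS_54 = 437.0   # direction detection, measured for 54 words
RECOGNITION_MS_54 = 1398.0  # reading, measured for 54 words
REFERENCE_WORDS = 54


def estimate_stage_time_ms(num_words: int) -> float:
    """Estimate detection + orientation + recognition time for num_words words.

    Assumes the two per-word stages scale linearly with the word count.
    """
    if num_words < 0:
        raise ValueError("num_words must be non-negative")
    per_word = (ORIENTATION_MS_54 + RECOGNITION_MS_54) / REFERENCE_WORDS
    return DETECTION_MS + per_word * num_words


if __name__ == "__main__":
    for words in (10, 54, 100):
        print(words, round(estimate_stage_time_ms(words), 1))
```

Under this assumption each additional word costs roughly 34ms on the reference machine, so short labels are processed far faster than dense documents.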

ailia-models/text_recognition/paddleocr at master · axinc-ai/ailia-models

(from https://github.com/PaddlePaddle/PaddleOCR/tree/dygraph/doc/imgs) Automatically downloads the onnx and prototxt…

github.com

PaddleOCR: The latest lightweight OCR system

This is an introduction to「PaddleOCR」, a machine learning model that can be used with ailia SDK. You can easily use…

medium.com

PaddleOCR can also use a high-precision model for Japanese trained independently by ax Inc.

$ python paddleocr.py -i input.png -c server

Speech Recognition

Speech recognition covers speech-to-text transcription as well as speaker identification from voice audio files.

Whisper

This speech recognition model developed by OpenAI was trained on 680,000 hours of speech data and can perform speech-to-text transcription in 99 languages, including Japanese. It takes an audio file as input and generates text.

ailia-models/audio_processing/whisper at master · axinc-ai/ailia-models

Audio file Recognized speech text He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and…

github.com

AutoSpeech

This is a voice recognition model that can determine the identity of a person based on their voice. Whether two recordings come from the same person is determined by extracting feature vectors from the voice and calculating the distance between them. It can also be used with Whisper to separate different speakers’ voices in a conversation.
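Separating speakers with feature vectors can be sketched as a simple grouping problem: each audio segment gets an embedding, and segments whose embeddings are close are attributed to the same speaker. The following is a minimal greedy sketch with hypothetical names. The Euclidean metric and the 0.5 distance threshold are assumptions, not values from the AutoSpeech sample.

```python
import math
from typing import List, Sequence


def assign_speakers(embeddings: Sequence[Sequence[float]],
                    max_distance: float = 0.5) -> List[int]:
    """Give each segment a speaker index by comparing it with earlier speakers.

    A segment joins the closest known speaker if the Euclidean distance to that
    speaker's first embedding is at most max_distance; otherwise it starts a
    new speaker.
    """
    references: List[Sequence[float]] = []
    labels: List[int] = []
    for emb in embeddings:
        distances = [math.dist(emb, ref) for ref in references]
        if distances and min(distances) <= max_distance:
            labels.append(distances.index(min(distances)))
        else:
            references.append(emb)
            labels.append(len(references) - 1)
    return labels
```

The resulting speaker indices could then be attached to the text transcribed for each segment.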

ailia-models/audio_processing/auto_speech at master · axinc-ai/ailia-models

Audio file Wav file from The VoxCeleb1 Dataset https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html Default input…

github.com

About ailia MODELS

Getting started

Please refer to the tutorial below.

ailia-models/TUTORIAL_jp.md at master · axinc-ai/ailia-models

This tutorial explains how to use ailia from the Python language.…

github.com

Launcher

ailia MODELS includes a simple GUI to easily run any model on an image or a video.

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.