ActionClip : Action detection model that can detect arbitrary actions

Takehiko TERADA
axinc-ai
Jan 23, 2023

This is an introduction to ActionClip, a machine learning model that can be used with the ailia SDK. You can easily use this model to create AI applications with the ailia SDK, as well as many other ready-to-use ailia MODELS.

Overview

ActionClip is an action detection model released in September 2021. By combining language processing with image recognition, and by leveraging CLIP models pre-trained on a vast number of images from the web, it can detect arbitrary actions specified as text.

ActionClip overview (Source : https://github.com/sallymmx/ActionCLIP)

Architecture

Traditional action detection assigns a simple numeric ID to each action label and learns which label best matches the input video.

Rather than simply numbering the labels, ActionClip encodes the label text into features and learns to match the video to that text. Because the features are extracted from the label text itself, the model can also detect labels that were never used during training, enabling zero-shot action detection.

ActionClip architecture (Source : https://github.com/sallymmx/ActionCLIP)

ActionClip takes 8 frames as input to estimate the action. A Vision Transformer is used as the backbone.
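Conceptually, the matching step reduces to comparing one video embedding against one text embedding per label. The following is a minimal NumPy sketch of that comparison, not the actual ActionClip implementation; video_feature and text_features are hypothetical placeholders for the encoder outputs, and the learned temperature scaling used in CLIP-style models is omitted.

import numpy as np

def classify_actions(video_feature, text_features):
    # video_feature: pooled embedding of the 8 sampled frames, shape (d,)
    # text_features: one embedding per label text, shape (num_labels, d)
    # L2-normalize so that dot products become cosine similarities
    v = video_feature / np.linalg.norm(video_feature)
    t = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    logits = t @ v                     # similarity of the video to each label
    logits -= logits.max()             # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over labels
    return probs

The label with the highest probability is taken as the detected action; because any label text can be encoded, new actions can be added without retraining.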

ActionClip uses CLIP weights, pre-trained on a vast number of images from the web, as its initial values. This paradigm is called "pre-train, prompt and fine-tune". Initializing from the pre-trained CLIP weights greatly improves performance compared with training from scratch.
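In the prompt step, each label is typically expanded into a natural-language sentence before being passed to the text encoder. The templates below are illustrative examples only, not the exact set used by ActionClip.

# Illustrative prompt templates (hypothetical, for explanation only)
TEMPLATES = [
    "a video of a person {}",
    "the person is {}",
    "human action of {}",
]

def build_prompts(label):
    return [t.format(label) for t in TEMPLATES]

print(build_prompts("drinking"))
# ['a video of a person drinking', 'the person is drinking', 'human action of drinking']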

Impact of CLIP weights on performance (Source: https://github.com/sallymmx/ActionCLIP)

In addition, ActionClip not only enables zero-shot action detection but also delivers top performance on existing action detection tasks through end-to-end fine-tuning. A Top-1 accuracy of 83.8% is achieved when the backbone is fine-tuned as well.

Impact of fine-tuning on performance (Source: https://github.com/sallymmx/ActionCLIP)
Impact of zero-shot and few-shot learning on performance (Source: https://github.com/sallymmx/ActionCLIP)

Performance

ActionClip achieves SoTA accuracy on the Kinetics-400 dataset.

Performance comparison on the Kinetics-400 dataset (Source: https://github.com/sallymmx/ActionCLIP)
Example of the Kinetics dataset (Source: https://arxiv.org/pdf/1705.06950.pdf)

Usage

ActionClip can be applied to any video with the following command. The actions to be detected are specified as text.

$ python3 action_clip.py --video VIDEO_PATH --text "drinking" --text "eating" --text "laughing"

When a video is given as input, 8 frames are sampled at random to detect the action. For this reason, webcam input is not currently supported.
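The sampling step could look like the following sketch, which assumes OpenCV is available; the actual logic in action_clip.py may differ.

import cv2
import numpy as np

def sample_frames(video_path, num_frames=8):
    # Pick num_frames random frame indices from the whole clip
    # (assumes the clip has at least num_frames frames);
    # indices are sorted so temporal order is preserved
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.sort(np.random.choice(total, num_frames, replace=False))
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames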

An example of execution is shown below.

Source:https://pixabay.com/videos/running-people-sports-run-walk-294/
$ python3 action_clip.py -v Running\ -\ 294.mp4 --text "running" --text "walking" --text "sitting" --text "dancing" --text "standing"
INFO action_clip.py (202) : class_count = 5
INFO action_clip.py (205) : + idx = 0
INFO action_clip.py (206) : category = 1 [walking]
INFO action_clip.py (207) : prob = 0.5208446979522705
INFO action_clip.py (205) : + idx = 1
INFO action_clip.py (206) : category = 0 [running]
INFO action_clip.py (207) : prob = 0.47726234793663025
INFO action_clip.py (205) : + idx = 2
INFO action_clip.py (206) : category = 3 [dancing]
INFO action_clip.py (207) : prob = 0.0013572901953011751
INFO action_clip.py (205) : + idx = 3
INFO action_clip.py (206) : category = 4 [standing]
INFO action_clip.py (207) : prob = 0.000507637916598469
INFO action_clip.py (205) : + idx = 4
INFO action_clip.py (206) : category = 2 [sitting]
INFO action_clip.py (207) : prob = 2.8073100111214444e-05
INFO action_clip.py (208) :
INFO action_clip.py (238) : Script finished successfully.
Source:https://pixabay.com/videos/coffee-woman-girl-drink-cup-youth-20564/
$ python3 action_clip.py -v Coffee\ -\ 20564.mp4 --text "running" --text "walking" --text "sitting" --text "dancing" --text "standing"
INFO action_clip.py (202) : class_count = 5
INFO action_clip.py (205) : + idx = 0
INFO action_clip.py (206) : category = 2 [sitting]
INFO action_clip.py (207) : prob = 0.975949227809906
INFO action_clip.py (205) : + idx = 1
INFO action_clip.py (206) : category = 4 [standing]
INFO action_clip.py (207) : prob = 0.01414870098233223
INFO action_clip.py (205) : + idx = 2
INFO action_clip.py (206) : category = 0 [running]
INFO action_clip.py (207) : prob = 0.0058464109897613525
INFO action_clip.py (205) : + idx = 3
INFO action_clip.py (206) : category = 1 [walking]
INFO action_clip.py (207) : prob = 0.0027284801471978426
INFO action_clip.py (205) : + idx = 4
INFO action_clip.py (206) : category = 3 [dancing]
INFO action_clip.py (207) : prob = 0.0013270939234644175
INFO action_clip.py (208) :
INFO action_clip.py (238) : Script finished successfully.

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Please feel free to contact us with any inquiries.
