Action Recognition: Model Zoo
ActivityNet
Play Video Intelligence draws on the ActivityNet dataset, a comprehensive benchmark for activity recognition, to train and validate its models. This dataset includes a wide range of activities, providing a rich source of data for developing and refining activity recognition algorithms.
Kinetics-400
The Kinetics-400 dataset is also utilized to enhance the activity recognition capabilities of Play Video Intelligence. This dataset consists of 400 human action classes, offering extensive examples that help improve the accuracy and robustness of the recognition models.
We stand on the shoulders of giants: these models are based on MMAction2.
Action Recognition
For action recognition, models are trained on Kinetics-400 unless otherwise specified. The version of Kinetics-400 we used contains 240,436 training videos and 19,796 testing videos. We also train TSN on UCF101, initialized with ImageNet-pretrained weights, and provide transfer learning results on UCF101 and HMDB51 for some algorithms. Models marked with * are converted from other repositories (including VMZ and kinetics_i3d); the others are trained by ourselves.
For data preprocessing, all Kinetics-400 models are trained on videos whose short edges are resized to 256 px.
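As a rough illustration of the short-edge resize described above, the following sketch computes the target frame size for a given input resolution. It assumes the aspect ratio is preserved and that sizes are rounded to the nearest pixel; neither detail is stated here, and the function name `short_edge_resize_shape` is hypothetical rather than part of any library.

```python
from typing import Tuple


def short_edge_resize_shape(width: int, height: int, short_edge: int = 256) -> Tuple[int, int]:
    """Return (new_width, new_height) so that the shorter side equals short_edge.

    Assumption: the aspect ratio is kept and sizes are rounded to the nearest
    pixel. An actual preprocessing pipeline may round differently.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Scale factor that maps the shorter side onto the target length.
    scale = short_edge / min(width, height)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height


# A 1280x720 landscape video: the 720 px side becomes 256 px.
print(short_edge_resize_shape(1280, 720))  # → (455, 256)
# A 480x640 portrait video: the 480 px side becomes 256 px.
print(short_edge_resize_shape(480, 640))   # → (256, 341)
```

The resulting shape can be passed to whichever video or image resizing tool is used for preprocessing.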
TSN
Kinetics
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
|---|---|---|---|---|---|
| RGB | ImageNet | ResNet50 | 3seg | 70.6 | 89.4 |
UCF101
| Modality | Pretrained | Backbone | Input | Top-1 |
|---|---|---|---|---|
| RGB | ImageNet | BNInception | 3seg | 86.4 |
| TV-L1 | ImageNet | BNInception | 3seg | 87.7 |
C3D
Sports-1M
| Modality | Pretrained | Backbone | Input | Top-1 |
|---|---|---|---|---|
| RGB | None | C3D | 16x1 | N/A |
UCF101
| Modality | Pretrained | Backbone | Input | Top-1 |
|---|---|---|---|---|
| RGB | Sports-1M | C3D | 16x1 | 82.26 |
I3D
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
|---|---|---|---|---|---|
| RGB | ImageNet | Inception-V1 | 64x1 | 71.1 | 89.3 |
| RGB | ImageNet | ResNet50 | 32x2 | 72.9 | 90.8 |
| Flow | ImageNet | Inception-V1 | 64x1 | 63.4 | 84.9 |
| Two-Stream | ImageNet | Inception-V1 | 64x1 | 74.2 | 91.3 |
SlowOnly
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
|---|---|---|---|---|---|
| RGB | None | ResNet50 | 4x16 | 72.9 | 90.9 |
| RGB | ImageNet | ResNet50 | 4x16 | 73.8 | 90.9 |
| RGB | None | ResNet50 | 8x8 | 74.8 | 91.9 |
| RGB | ImageNet | ResNet50 | 8x8 | 75.7 | 92.2 |
| RGB | None | ResNet101 | 8x8 | 76.5 | 92.7 |
| RGB | ImageNet | ResNet101 | 8x8 | 76.8 | 92.8 |
SlowFast
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
|---|---|---|---|---|---|
| RGB | None | ResNet50 | 4x16 | 75.4 | 92.1 |
| RGB | ImageNet | ResNet50 | 4x16 | 75.9 | 92.3 |
R(2+1)D
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
|---|---|---|---|---|---|
| RGB | None | ResNet34 | 8x8 | 63.7 | 85.9 |
| RGB | IG-65M | ResNet34 | 8x8 | 74.4 | 91.7 |
| RGB | None | ResNet34 | 32x2 | 71.8 | 90.4 |
| RGB | IG-65M | ResNet34 | 32x2 | 80.3 | 94.7 |
CSN
| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
|---|---|---|---|---|---|
| RGB | IG-65M | irCSN-152 | 32x2 | 82.6 | 95.7 |
| RGB | IG-65M | ipCSN-152 | 32x2 | 82.7 | 95.6 |
OmniSource
| Modality | Pretrained | Backbone | Input | Top-1: Baseline / OmniSource (delta) | Top-5: Baseline / OmniSource (delta) |
|---|---|---|---|---|---|
| RGB | ImageNet | ResNet50 | 3seg | 70.6 / 73.6 (+ 3.0) | 89.4 / 91.0 (+ 1.6) |
| RGB | IG-1B | ResNet50 | 3seg | 73.1 / 75.7 (+ 2.6) | 90.4 / 91.9 (+ 1.5) |
| RGB | Scratch | ResNet50 | 4x16 | 72.9 / 76.8 (+ 3.9) | 90.9 / 92.5 (+ 1.6) |
| RGB | Scratch | ResNet101 | 8x8 | 76.5 / 80.4 (+ 3.9) | 92.7 / 94.4 (+ 1.7) |
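Each accuracy cell in the OmniSource table lists the baseline score, the OmniSource score, and the difference between them in parentheses. The short sketch below recomputes the Top-1 differences from the table values as a sanity check; the helper name `accuracy_gain` is hypothetical and only serves this illustration.

```python
def accuracy_gain(baseline: float, omnisource: float) -> float:
    """Difference in percentage points, rounded to one decimal as in the table."""
    return round(omnisource - baseline, 1)


# (baseline Top-1, OmniSource Top-1) pairs copied from the table rows above.
top1_rows = [
    (70.6, 73.6),  # ResNet50, ImageNet, 3seg
    (73.1, 75.7),  # ResNet50, IG-1B, 3seg
    (72.9, 76.8),  # ResNet50, Scratch, 4x16
    (76.5, 80.4),  # ResNet101, Scratch, 8x8
]

gains = [accuracy_gain(baseline, omni) for baseline, omni in top1_rows]
print(gains)  # → [3.0, 2.6, 3.9, 3.9]
```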
Transfer Learning
| Model | Modality | Pretrained | Backbone | Input | UCF101 | HMDB51 |
|---|---|---|---|---|---|---|
| I3D | RGB | Kinetics | I3D | 64x1 | 94.8 | 72.6 |
| I3D | Flow | Kinetics | I3D | 64x1 | 96.6 | 79.2 |
| I3D | Two-Stream | Kinetics | I3D | 64x1 | 97.8 | 80.8 |
Action Detection
For action detection, we release models trained on THUMOS14.
SSN
| Modality | Pretrained | Backbone | mAP@0.10 | mAP@0.20 | mAP@0.30 | mAP@0.40 | mAP@0.50 |
|---|---|---|---|---|---|---|---|
| RGB | ImageNet | BNInception | 43.09% | 37.95% | 32.56% | 25.71% | 18.33% |
Spatio-Temporal Action Detection
For spatio-temporal action detection, we release models trained on AVA.
| Modality | Model | Pretrained | Backbone | mAP@0.5 |
|---|---|---|---|---|
| RGB | Fast-RCNN | Kinetics | NL-I3D R50 | 21.2 |