Action Recognition: Model Zoo

ActivityNet

Play Video Intelligence trains and validates its models on ActivityNet, a benchmark dataset for activity recognition that covers a wide range of activities.

Kinetics-400

Play Video Intelligence also uses the Kinetics-400 dataset, which covers 400 human action classes, to improve the accuracy and robustness of its recognition models.

We stand on the shoulders of giants: these models are based on MMAction2.

Action Recognition

For action recognition, models are trained on Kinetics-400 unless otherwise specified. The version of Kinetics-400 we used contains 240436 training videos and 19796 testing videos. We also train TSN on UCF101, initialized with ImageNet-pretrained weights, and provide transfer learning results on UCF101 and HMDB51 for some algorithms. Models marked with * are converted from other repositories (including VMZ and kinetics_i3d); the others are trained by ourselves.

For data preprocessing, all Kinetics-400 models are trained on videos whose short edge is resized to 256 px.
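As a minimal sketch of the short-edge resize described above, the helper below computes an output size that keeps the aspect ratio while scaling the shorter side to 256 px. The function name `short_edge_resize` and the rounding rule are illustrative assumptions, not the actual preprocessing code used to train these models.

```python
from typing import Tuple


def short_edge_resize(width: int, height: int, short_edge: int = 256) -> Tuple[int, int]:
    """Return (new_width, new_height) with the shorter side scaled to `short_edge`."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = short_edge / min(width, height)
    # The short side is set exactly; the long side is rounded to the nearest pixel.
    if width <= height:
        return short_edge, round(height * scale)
    return round(width * scale), short_edge


if __name__ == "__main__":
    # A 1280x720 landscape video and a 720x1280 portrait video.
    print(short_edge_resize(1280, 720))
    print(short_edge_resize(720, 1280))
```

Real pipelines may round differently or snap sizes to a multiple of some block size, so treat the exact long-side value as approximate.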

TSN

Kinetics

| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| RGB | ImageNet | ResNet50 | 3seg | 70.6 | 89.4 |
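Top-1 and Top-5 in the tables on this page are classification accuracies in percent. The sketch below shows how such a metric is commonly computed from per-video class scores. `top_k_accuracy` is a hypothetical helper, and the exact evaluation protocol (for example, how many clips or crops are averaged per video) is not stated on this page.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 1) -> float:
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels) or len(labels) == 0:
        raise ValueError("scores and labels must be non-empty and of equal length")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += int(label in top_k)
    return 100.0 * hits / len(labels)


if __name__ == "__main__":
    # Three toy videos, three classes (made-up scores for illustration only).
    toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
    toy_labels = [1, 1, 0]
    print(top_k_accuracy(toy_scores, toy_labels, k=1))
    print(top_k_accuracy(toy_scores, toy_labels, k=2))
```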

UCF101

| Modality | Pretrained | Backbone | Input | Top-1 |
| --- | --- | --- | --- | --- |
| RGB | ImageNet | BNInception | 3seg | 86.4 |
| TV-L1 | ImageNet | BNInception | 3seg | 87.7 |

C3D

Sports-1M

| Modality | Pretrained | Backbone | Input | Top-1 |
| --- | --- | --- | --- | --- |
| RGB | None | C3D | 16x1 | N/A |

UCF101

| Modality | Pretrained | Backbone | Input | Top-1 |
| --- | --- | --- | --- | --- |
| RGB | Sports-1M | C3D | 16x1 | 82.26 |

I3D

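| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| RGB | ImageNet | Inception-V1 | 64x1 | 71.1 | 89.3 |
| RGB | ImageNet | ResNet50 | 32x2 | 72.9 | 90.8 |
| Flow | ImageNet | Inception-V1 | 64x1 | 63.4 | 84.9 |
| Two-Stream | ImageNet | Inception-V1 | 64x1 | 74.2 | 91.3 |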
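The Input column is not explained on this page. Assuming it follows the common "clip length x frame interval" notation (so 32x2 would mean 32 frames sampled every 2nd frame), the sketch below expands such a spec into frame indices. Both helper names are hypothetical, and specs such as 3seg (segment-based sampling) are deliberately not handled.

```python
from typing import List, Tuple


def parse_input_spec(spec: str) -> Tuple[int, int]:
    """Split a spec such as '32x2' into (clip_len, frame_interval)."""
    clip_len, frame_interval = spec.split("x")
    return int(clip_len), int(frame_interval)


def clip_frame_indices(clip_len: int, frame_interval: int, start: int = 0) -> List[int]:
    """Indices of the frames in one clip, beginning at frame `start`."""
    return [start + i * frame_interval for i in range(clip_len)]


if __name__ == "__main__":
    for spec in ("64x1", "32x2", "4x16"):
        clip_len, interval = parse_input_spec(spec)
        indices = clip_frame_indices(clip_len, interval)
        # Temporal span covered by the clip, counted in source frames.
        span = indices[-1] - indices[0] + 1
        print(spec, "->", clip_len, "frames spanning", span, "source frames")
```

Under this reading, 64x1 and 32x2 cover almost the same temporal window while feeding the network different numbers of frames.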

SlowOnly

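| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| RGB | None | ResNet50 | 4x16 | 72.9 | 90.9 |
| RGB | ImageNet | ResNet50 | 4x16 | 73.8 | 90.9 |
| RGB | None | ResNet50 | 8x8 | 74.8 | 91.9 |
| RGB | ImageNet | ResNet50 | 8x8 | 75.7 | 92.2 |
| RGB | None | ResNet101 | 8x8 | 76.5 | 92.7 |
| RGB | ImageNet | ResNet101 | 8x8 | 76.8 | 92.8 |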

SlowFast

| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| RGB | None | ResNet50 | 4x16 | 75.4 | 92.1 |
| RGB | ImageNet | ResNet50 | 4x16 | 75.9 | 92.3 |

R(2+1)D

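| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| RGB | None | ResNet34 | 8x8 | 63.7 | 85.9 |
| RGB | IG-65M | ResNet34 | 8x8 | 74.4 | 91.7 |
| RGB | None | ResNet34 | 32x2 | 71.8 | 90.4 |
| RGB | IG-65M | ResNet34 | 32x2 | 80.3 | 94.7 |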

CSN

| Modality | Pretrained | Backbone | Input | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| RGB | IG-65M | irCSN-152 | 32x2 | 82.6 | 95.7 |
| RGB | IG-65M | ipCSN-152 | 32x2 | 82.7 | 95.6 |

OmniSource

| Modality | Pretrained | Backbone | Input | Top-1 (Baseline / OmniSource delta) | Top-5 (Baseline / OmniSource delta) |
| --- | --- | --- | --- | --- | --- |
| RGB | ImageNet | ResNet50 | 3seg | 70.6 / 73.6 (+ 3.0) | 89.4 / 91.0 (+ 1.6) |
| RGB | IG-1B | ResNet50 | 3seg | 73.1 / 75.7 (+ 2.6) | 90.4 / 91.9 (+ 1.5) |
| RGB | Scratch | ResNet50 | 4x16 | 72.9 / 76.8 (+ 3.9) | 90.9 / 92.5 (+ 1.6) |
| RGB | Scratch | ResNet101 | 8x8 | 76.5 / 80.4 (+ 3.9) | 92.7 / 94.4 (+ 1.7) |
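The delta in each OmniSource cell is the OmniSource accuracy minus the baseline accuracy, in percentage points. As a quick arithmetic check, the sketch below recomputes the deltas from the Top-1 and Top-5 pairs listed above. The helper name `omnisource_delta` is illustrative only.

```python
def omnisource_delta(baseline: float, omnisource: float) -> float:
    """Improvement in percentage points, rounded to one decimal place."""
    return round(omnisource - baseline, 1)


# (baseline, OmniSource, reported delta) copied from the OmniSource table.
TOP1_ROWS = [(70.6, 73.6, 3.0), (73.1, 75.7, 2.6), (72.9, 76.8, 3.9), (76.5, 80.4, 3.9)]
TOP5_ROWS = [(89.4, 91.0, 1.6), (90.4, 91.9, 1.5), (90.9, 92.5, 1.6), (92.7, 94.4, 1.7)]

if __name__ == "__main__":
    for baseline, omni, reported in TOP1_ROWS + TOP5_ROWS:
        computed = omnisource_delta(baseline, omni)
        print(f"{baseline} -> {omni}: computed +{computed}, reported +{reported}")
```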

Transfer Learning

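| Model | Modality | Pretrained | Backbone | Input | UCF101 | HMDB51 |
| --- | --- | --- | --- | --- | --- | --- |
| I3D | RGB | Kinetics | I3D | 64x1 | 94.8 | 72.6 |
| I3D | Flow | Kinetics | I3D | 64x1 | 96.6 | 79.2 |
| I3D | TwoStream | Kinetics | I3D | 64x1 | 97.8 | 80.8 |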

Action Detection

For action detection, we release models trained on THUMOS14.

SSN

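| Modality | Pretrained | Backbone | mAP@0.10 | mAP@0.20 | mAP@0.30 | mAP@0.40 | mAP@0.50 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RGB | ImageNet | BNInception | 43.09% | 37.95% | 32.56% | 25.71% | 18.33% |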
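The thresholds in mAP@0.10 through mAP@0.50 are assumed here to be temporal IoU (tIoU) thresholds, as is conventional for THUMOS14: a predicted segment counts as a match only if its overlap with a ground-truth segment reaches the threshold. The sketch below computes tIoU for one pair of segments. It is an illustrative helper, not the evaluation code behind the numbers above.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Made-up segments: prediction 2-6 s, ground truth 4-8 s.
    tiou = temporal_iou((2.0, 6.0), (4.0, 8.0))
    for threshold in (0.1, 0.2, 0.3, 0.4, 0.5):
        print(f"tIoU={tiou:.3f} matches at {threshold}: {tiou >= threshold}")
```

Because the same prediction can match at a loose threshold and fail at a strict one, mAP typically decreases as the threshold increases, which is the trend in the SSN row.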

Spatial Temporal Action Detection

For spatial temporal action detection, we release models trained on AVA.

| Modality | Model | Pretrained | Backbone | mAP@0.5 |
| --- | --- | --- | --- | --- |
| RGB | Fast-RCNN | Kinetics | NL-I3D R50 | 21.2 |
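For AVA, mAP@0.5 is assumed here to use a spatial IoU threshold of 0.5 between predicted and ground-truth person boxes. The sketch below computes box IoU for axis-aligned boxes in (x1, y1, x2, y2) form. The helper name and box format are illustrative assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x2 >= x1, y2 >= y1


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Made-up boxes: the second pair just reaches the 0.5 threshold.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)) >= 0.5)
    print(box_iou((0, 0, 2, 2), (0, 0, 2, 1)) >= 0.5)
```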