Action-Recognition: Model Zoo

ActivityNet

Play Video Intelligence trains and validates its models on ActivityNet, a large-scale benchmark for human activity recognition. The dataset covers a wide range of activities, which makes it a useful source of data for developing and refining activity recognition algorithms.

Kinetics-400

Play Video Intelligence also uses the Kinetics-400 dataset, which covers 400 human action classes. Its many examples per class help improve the accuracy and robustness of the recognition models.

We stand on the shoulders of giants: these models are based on MMAction2.

Action Recognition

For action recognition, unless otherwise specified, models are trained on Kinetics-400. The version of Kinetics-400 we used contains 240436 training videos and 19796 testing videos. We also train TSN on UCF101, initialized with ImageNet-pretrained weights, and provide transfer learning results on UCF101 and HMDB51 for some algorithms. Models marked with * are converted from other repositories (including VMZ and kinetics_i3d); the others are trained by ourselves.
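The Top-1 and Top-5 columns in the tables that follow are classification accuracies in percent. As a minimal sketch of how such a number is typically computed (the scores and labels here are made-up toy values, not results from this page), a prediction counts as a hit when the true class is among the k highest-scoring classes:

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return 100.0 * hits / len(labels)


# Toy data: 4 samples, 3 classes (illustrative only).
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top-1: {top1:.1f}%  top-2: {top2:.1f}%")
```

With 400 classes, Top-5 uses the same function with k=5.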

For data preprocessing, all Kinetics-400 models are trained on videos whose short edges are resized to 256px.
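As a rough illustration of the short-edge resize, the sketch below computes the target frame size while preserving aspect ratio. The function name and the rounding rule are assumptions made for this example; the actual preprocessing tool may round differently.

```python
def resize_short_edge(width, height, short_edge=256):
    """Return (new_width, new_height) with the shorter side scaled to `short_edge`."""
    scale = short_edge / min(width, height)
    # Round to the nearest pixel; the exact rounding rule is an assumption.
    return round(width * scale), round(height * scale)


landscape = resize_short_edge(1280, 720)  # short edge is the height
portrait = resize_short_edge(480, 640)    # short edge is the width
print(landscape, portrait)
```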

TSN

Kinetics

Modality	Pretrained	Backbone	Input	Top-1	Top-5
RGB	ImageNet	ResNet50	3seg	70.6	89.4

UCF101

Modality	Pretrained	Backbone	Input	Top-1
RGB	ImageNet	BNInception	3seg	86.4
TV-L1	ImageNet	BNInception	3seg	87.7

C3D

Sports-1M

Modality	Pretrained	Backbone	Input	Top-1
RGB	None	C3D	16x1	N/A

UCF101

Modality	Pretrained	Backbone	Input	Top-1
RGB	Sports-1M	C3D	16x1	82.26

I3D

Modality	Pretrained	Backbone	Input	Top-1	Top-5
RGB	ImageNet	Inception-V1	64x1	71.1	89.3
RGB	ImageNet	ResNet50	32x2	72.9	90.8
Flow	ImageNet	Inception-V1	64x1	63.4	84.9
Two-Stream	ImageNet	Inception-V1	64x1	74.2	91.3
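The Two-Stream row combines the RGB and Flow models. This page does not say how the two are combined, so the sketch below assumes a common late-fusion scheme: each stream's class scores are turned into probabilities and then averaged. The function names, the equal weighting, and the toy logits are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(rgb_logits, flow_logits, rgb_weight=0.5):
    """Weighted average of per-class probabilities from the two streams."""
    rgb = softmax(rgb_logits)
    flow = softmax(flow_logits)
    return [rgb_weight * r + (1.0 - rgb_weight) * f for r, f in zip(rgb, flow)]


# Toy logits for 3 classes: RGB favors class 0, Flow strongly favors class 1.
fused = fuse_two_stream([2.0, 1.0, 0.1], [0.5, 2.5, 0.3])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(predicted_class, [round(p, 3) for p in fused])
```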

SlowOnly

Modality	Pretrained	Backbone	Input	Top-1	Top-5
RGB	None	ResNet50	4x16	72.9	90.9
RGB	ImageNet	ResNet50	4x16	73.8	90.9
RGB	None	ResNet50	8x8	74.8	91.9
RGB	ImageNet	ResNet50	8x8	75.7	92.2
RGB	None	ResNet101	8x8	76.5	92.7
RGB	ImageNet	ResNet101	8x8	76.8	92.8
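The Input column is not defined on this page. Assuming it follows the usual "clip length x frame interval" convention, 4x16 means 4 frames taken every 16th frame and 8x8 means 8 frames taken every 8th frame. Under that assumption, the sampled frame indices could be sketched as follows (the helper name and the fixed start offset are hypothetical):

```python
def clip_frame_indices(clip_len, frame_interval, start=0):
    """Indices of `clip_len` frames spaced `frame_interval` apart, beginning at `start`."""
    return [start + i * frame_interval for i in range(clip_len)]


idx_4x16 = clip_frame_indices(4, 16)
idx_8x8 = clip_frame_indices(8, 8)
print(idx_4x16)
print(idx_8x8)
```

Under this reading, both settings draw from a window of 64 raw frames; 8x8 samples that window twice as densely as 4x16.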

SlowFast

Modality	Pretrained	Backbone	Input	Top-1	Top-5
RGB	None	ResNet50	4x16	75.4	92.1
RGB	ImageNet	ResNet50	4x16	75.9	92.3

R(2+1)D

Modality	Pretrained	Backbone	Input	Top-1	Top-5
RGB	None	ResNet34	8x8	63.7	85.9
RGB	IG-65M	ResNet34	8x8	74.4	91.7
RGB	None	ResNet34	32x2	71.8	90.4
RGB	IG-65M	ResNet34	32x2	80.3	94.7

CSN

Modality	Pretrained	Backbone	Input	Top-1	Top-5
RGB	IG-65M	irCSN-152	32x2	82.6	95.7
RGB	IG-65M	ipCSN-152	32x2	82.7	95.6

OmniSource

Modality	Pretrained	Backbone	Input	Top-1 (Baseline / OmniSource (Delta))	Top-5 (Baseline / OmniSource (Delta))
RGB	ImageNet	ResNet50	3seg	70.6 / 73.6 (+ 3.0)	89.4 / 91.0 (+ 1.6)
RGB	IG-1B	ResNet50	3seg	73.1 / 75.7 (+ 2.6)	90.4 / 91.9 (+ 1.5)
RGB	Scratch	ResNet50	4x16	72.9 / 76.8 (+ 3.9)	90.9 / 92.5 (+ 1.6)
RGB	Scratch	ResNet101	8x8	76.5 / 80.4 (+ 3.9)	92.7 / 94.4 (+ 1.7)
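The value in parentheses in each cell is the OmniSource result minus the baseline. A short check of the Top-1 column, using the numbers from the table above:

```python
# (pretrained, backbone, input, baseline top-1, OmniSource top-1), copied from the table.
top1_rows = [
    ("ImageNet", "ResNet50", "3seg", 70.6, 73.6),
    ("IG-1B", "ResNet50", "3seg", 73.1, 75.7),
    ("Scratch", "ResNet50", "4x16", 72.9, 76.8),
    ("Scratch", "ResNet101", "8x8", 76.5, 80.4),
]

# Round to one decimal to match the table's precision.
deltas = [round(omni - base, 1) for *_, base, omni in top1_rows]

for (pretrained, backbone, inp, base, omni), delta in zip(top1_rows, deltas):
    print(f"{pretrained:8s} {backbone:9s} {inp:5s} {base} -> {omni} (+{delta})")
```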

Transfer Learning

Model	Modality	Pretrained	Backbone	Input	UCF101	HMDB51
I3D	RGB	Kinetics	I3D	64x1	94.8	72.6
I3D	Flow	Kinetics	I3D	64x1	96.6	79.2
I3D	TwoStream	Kinetics	I3D	64x1	97.8	80.8

Action Detection

For action detection, we release models trained on THUMOS14.

SSN

Modality	Pretrained	Backbone	mAP@0.10	mAP@0.20	mAP@0.30	mAP@0.40	mAP@0.50
RGB	ImageNet	BNInception	43.09%	37.95%	32.56%	25.71%	18.33%
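The mAP@0.10 to mAP@0.50 columns are read here as mAP at increasing temporal IoU thresholds, the usual THUMOS14 protocol. A minimal sketch of the temporal IoU between a predicted and a ground-truth segment, with made-up segment times, shows why the score falls as the threshold rises: the same prediction counts as a match at 0.30 but not at 0.40.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments, both in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Illustrative segments: 5 s of overlap over a 15 s union.
iou = temporal_iou((10.0, 20.0), (15.0, 25.0))

# Whether this prediction would count as a match at each threshold.
matches = {t: iou >= t for t in (0.1, 0.2, 0.3, 0.4, 0.5)}
print(round(iou, 3), matches)
```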

Spatial Temporal Action Detection

For spatial temporal action detection, we release models trained on AVA.

Modality	Model	Pretrained	Backbone	mAP@0.5
RGB	Fast-RCNN	Kinetics	NL-I3D R50	21.2
