Given two action sequences, we are interested in spotting/co-segmenting all pairs of sub-sequences that represent the same action. We propose a totally unsupervised solution to this problem: no a priori model of the actions is assumed to be available, and the number of common sub-sequences is not known in advance. The sub-sequences can be located anywhere in the original sequences, may differ in duration, and the corresponding actions may be performed by a different person, in a different style. We treat this type of temporal action co-segmentation as a stochastic optimization problem that is solved by employing Particle Swarm Optimization (PSO). The objective function that is minimized by PSO capitalizes on Dynamic Time Warping (DTW) to compare two action sequences. Due to the generic problem formulation and solution, the proposed method can be applied to motion capture (i.e., 3D skeletal) data as well as to conventional RGB video data acquired in the wild. We present extensive quantitative experiments on several standard, ground-truthed datasets. The obtained results demonstrate that the proposed method achieves a remarkable increase in co-segmentation quality compared to all tested existing state-of-the-art methods.
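To make the core idea concrete, here is a minimal Python sketch (not the authors' implementation; the particle encoding, function names, and parameter values are illustrative assumptions): each PSO particle encodes the boundaries of one candidate sub-sequence per input sequence, and the normalized DTW distance between the two candidate sub-sequences is the cost being minimized.

import numpy as np

def dtw_distance(X, Y):
    # Classic DTW between feature sequences X (n x d) and Y (m x d),
    # normalized by n + m so that segments of different lengths are comparable.
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def objective(p, A, B, min_len=10):
    # A particle p = (startA, lenA, startB, lenB) proposes one candidate
    # sub-sequence per input sequence; their DTW distance is the cost.
    sa, la, sb, lb = np.round(p).astype(int)
    if la < min_len or lb < min_len:          # reject degenerate segments
        return np.inf
    if sa + la > len(A) or sb + lb > len(B):  # reject out-of-bounds segments
        return np.inf
    return dtw_distance(A[sa:sa + la], B[sb:sb + lb])

def pso(A, B, n_particles=30, n_iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    # Canonical PSO over the 4D space of segment boundaries.
    rng = np.random.default_rng(seed)
    hi = np.array([len(A), len(A), len(B), len(B)], dtype=float)
    pos = rng.uniform(0, hi, size=(n_particles, 4))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([objective(p, A, B) for p in pos])
    g = pbest[np.argmin(pbest_f)].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, 4))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, 0, hi)
        f = np.array([objective(p, A, B) for p in pos])
        better = f < pbest_f
        pbest[better], pbest_f[better] = pos[better], f[better]
        g = pbest[np.argmin(pbest_f)].copy()
    return np.round(g).astype(int), pbest_f.min()

This sketch recovers a single best-matching pair of sub-sequences; handling multiple commonalities (as the method does) would additionally require repeated runs or a multi-modal search strategy.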
Given two image sequences that share common actions, our goal is to automatically co-segment them in a totally unsupervised manner. In this example, there are four common actions and two non-common actions. Notice that the first action of sequence A (green segments) appears twice in sequence B. Each point of the grayscale background encodes the pairwise distance between the corresponding frames of the two sequences.
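The grayscale background of such a figure is simple to reproduce. A minimal sketch, assuming each frame is represented by a feature vector and using Euclidean distance (the actual frame representation depends on the data modality):

import numpy as np

def pairwise_frame_distances(A, B):
    # A: (n, d) and B: (m, d) arrays of per-frame feature vectors.
    # Returns the (n, m) matrix of Euclidean distances between all frame
    # pairs; rendered as a grayscale image, it forms the figure background.
    diff = A[:, None, :] - B[None, :, :]   # broadcast to (n, m, d)
    return np.linalg.norm(diff, axis=2)

# e.g., matplotlib.pyplot.imshow(pairwise_frame_distances(A, B), cmap='gray')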
-
MHAD101-s
Downloads: Data, ReadMe
Link to the original MHAD dataset: here.
-
MHAD101-v
Downloads: Data, ReadMe
Link to the original MHAD dataset: here.
-
80-pair
Downloads: Data, ReadMe
Links to the original 80-pair dataset: data, code, webpage.
The dataset was originally introduced in:
"Video Co-segmentation for Meaningful Action Extraction", ICCV 2013
-
CMU86-91
Downloads: Data, ReadMe
The CMU mocap data of Subject 86 was originally used in "Segmenting Motion Capture Data into Distinct Behaviors", Graphics Interface 2004. Download frames.
Features and ground truth used in our work were originally presented in:
"Video Summarization by Visual Co-occurrence", ICCV 2015,
"Unsupervised Temporal Commonality Discovery", ECCV 2012.

-
Results of the SEVACO and UEVACO variants on all datasets: Download
-
Temporal Co-Segmentation Accuracy (Recall, Precision, F1, Overlap)

- TCD: Chu et al., "Unsupervised Temporal Commonality Discovery", ECCV 2012
- Guo: Guo et al., "Video Co-segmentation for Meaningful Action Extraction", ICCV 2013
- S/U-SDTW: Park et al., "Unsupervised Pattern Discovery in Speech", IEEE Transactions on Audio, Speech, and Language Processing, 2008
-
Runtime Performance
Runtime performance is reported in seconds and was measured for a Python implementation of SEVACO running on an Intel Core i7 CPU with 12 GB of RAM.

K. Papoutsakis, C. Panagiotakis and A.A. Argyros, "Temporal Action Co-Segmentation in 3D Motion Capture Data and Videos", in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), IEEE, Honolulu, Hawaii, USA, July 2017
BibTex
@inproceedings{PapoutsakisPanagiotakisArgyros2017,
author = {Papoutsakis, Konstantinos and Panagiotakis, Costas and Argyros, Antonis A.},
title = {Temporal Action Co-Segmentation in 3D Motion Capture Data and Videos},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017)},
publisher = {IEEE},
year = {2017},
month = {July},
address = {Honolulu, Hawaii, USA},
}
Contact:
papoutsa@ics.forth.gr,
Personal Webpage.
Acknowledgements:
This work was partially supported by
H2020 projects
Co4Robots and
ACANTO.
Copyright © 2017 Konstantinos Papoutsakis, ICS-FORTH