Action Recognition

In the fourth and final world (MonsterLand), the kids encounter the long-awaited monsters. The monsters first teach the kids the moves that make up the monster dance and compliment them when they do a good job. After teaching the moves, the monsters "dance" with the kids by copying the moves that the kids do. There are four moves the monsters and kids can do: crouching, throwing their arms up to make a "Y", flapping, and spinning. We chose these moves because they would be fun for the participants and because they let us demonstrate a few different recognition techniques using computer vision.

All of the vision processes use background-subtracted images, which contain only a silhouette of the person. Background subtraction removes the room from the imagery, leaving only the image of the person. Since it may be difficult to remove shadows induced by strong lighting, shadows have been incorporated into the models.
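
As a rough illustration of the background-subtraction step, the sketch below marks a pixel as part of the silhouette when it differs from a stored background image by more than a threshold. The function name `silhouette`, the grayscale list-of-rows image format, and the threshold value are assumptions made for illustration; the subtraction method the KidsRoom system actually uses, including how shadows enter the models, is not specified on this page.

```python
def silhouette(frame, background, threshold=30):
    """Return a binary mask: 1 where the frame differs from the background.

    frame, background: grayscale images as lists of rows of ints (assumed format).
    threshold: hypothetical intensity difference that counts as "foreground".
    """
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]


# Tiny made-up example: an empty "room" and a frame with a bright figure in it.
background = [[10, 10, 10, 10],
              [10, 10, 10, 10],
              [10, 10, 10, 10]]
frame = [[10, 200, 12, 10],
         [11, 210, 205, 10],
         [10, 190, 9, 10]]

mask = silhouette(frame, background)
# mask == [[0, 1, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]
```

A fixed threshold like this would also pick up strong shadows as foreground, which is consistent with the choice above to model shadows rather than remove them.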

Each of the real-time recognition approaches described below runs in parallel as the kids perform dance moves. If one method matches well, and matches better than the other methods, the system reports the corresponding dance move.
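
The "matches well and matches better than the others" rule can be sketched as a threshold plus a margin over the runner-up. Everything named here is hypothetical: `pick_move`, the `min_score` and `margin` values, and the assumption that every recognizer reports a score in [0, 1] where higher is better. The actual scoring used in the KidsRoom is not described on this page.

```python
def pick_move(scores, min_score=0.7, margin=0.1):
    """Return the winning move name, or None if no move wins clearly.

    scores: dict mapping move name -> match score (assumed in [0, 1]).
    min_score: hypothetical bar for "matches well".
    margin: hypothetical lead required over the second-best score.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_move, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    if best >= min_score and best - runner_up >= margin:
        return best_move
    return None


clear_win = pick_move({"crouch": 0.9, "make-a-Y": 0.3, "flap": 0.2, "spin": 0.1})
too_close = pick_move({"crouch": 0.9, "make-a-Y": 0.85})
too_weak = pick_move({"crouch": 0.5, "make-a-Y": 0.1})
```

Here `clear_win` is `"crouch"`, while `too_close` and `too_weak` are both `None`: the first fails the margin test and the second fails the "matches well" test.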

Example of action recognition in MonsterLand
(MonsterLand world video example)


Blob Characteristics (For "crouching")

The first and simplest technique, for detecting crouching, uses the shape characteristics of the background-difference blob. The "standing" blob shape for a person is initialized as soon as the person gets on the rug in the fourth world. Then, the blob shape, which is modeled using an ellipse matched to the blob data, is compared with the "standing" model. If the elongation of the blob changes significantly, the algorithm will signal that a crouch has taken place.
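
One way to sketch the elongation test is to fit the ellipse from the second-order moments of the blob pixels and take the ratio of its major to minor axis. This is an assumption about how the ellipse is "matched to the blob data"; the names `elongation` and `is_crouch` and the `ratio` value of 0.6 are hypothetical.

```python
import math


def elongation(mask):
    """Major/minor axis ratio of the ellipse with the blob's second moments."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(pts)
    if n == 0:
        raise ValueError("mask contains no blob pixels")
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts) / n
    syy = sum((y - my) ** 2 for _, y in pts) / n
    sxy = sum((x - mx) * (y - my) for x, y in pts) / n
    # Eigenvalues of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_major = half_trace + root
    lam_minor = max(half_trace - root, 1e-9)  # guard against a one-pixel-wide blob
    return math.sqrt(lam_major / lam_minor)


def is_crouch(mask, standing_elongation, ratio=0.6):
    """Signal a crouch when elongation drops well below the standing model."""
    return elongation(mask) < ratio * standing_elongation


# Made-up blobs: a tall, thin "standing" block and a squatter "crouching" one.
standing = [[1, 1]] * 8        # 2 pixels wide, 8 tall
crouching = [[1, 1, 1]] * 4    # 3 pixels wide, 4 tall

standing_model = elongation(standing)   # initialized once, on entering the world
crouched = is_crouch(crouching, standing_model)
```

With these toy blobs the standing elongation is about 4.6 and the crouching elongation about 1.4, so `crouched` is `True`.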



Blob of a person standing
Standing position


Blob of a person crouching
Crouching position

Pose Recognition (For "throw your arms up and make a Y")

This next technique uses the shape (or pose) of the person to identify when the person's arms are up in the air (in a Y shape). Here we use a pattern recognition approach to classify the background-subtracted images of the person. Seven moment-based shape features are computed from these images and are statistically compared to training examples of people "making-a-Y". This approach is reasonable when particular body configurations are what needs to be recognized. (For details see the papers on the Info page.)
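
To make the idea concrete, the sketch below computes the seven Hu invariant moments of a silhouette and measures how far a new silhouette is from a set of training examples, using a per-feature variance. Both choices are assumptions: Hu moments are one well-known set of seven moment-based shape features, and the page does not say which seven features or which statistical comparison the system uses. The names `hu_moments`, `fit_model`, and `distance` are hypothetical.

```python
import math


def hu_moments(mask):
    """Seven Hu invariant moments of a binary mask (list of rows of 0/1)."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    m00 = len(pts)
    if m00 == 0:
        raise ValueError("mask contains no blob pixels")
    cx = sum(x for x, _ in pts) / m00
    cy = sum(y for _, y in pts) / m00

    def eta(p, q):
        # Central moment, normalized by area so the value is scale-invariant.
        mu = sum((x - cx) ** p * (y - cy) ** q for x, y in pts)
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03
    c, d = n30 - 3 * n12, 3 * n21 - n03
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]


def fit_model(examples):
    """Per-feature mean and variance of the training feature vectors."""
    n = len(examples)
    dims = range(len(examples[0]))
    mean = [sum(e[i] for e in examples) / n for i in dims]
    # Small floor keeps the division in distance() defined for tiny training sets.
    var = [sum((e[i] - mean[i]) ** 2 for e in examples) / n + 1e-12 for i in dims]
    return mean, var


def distance(features, model):
    """Variance-scaled distance from the training mean (smaller = better match)."""
    mean, var = model
    return math.sqrt(sum((f - m) ** 2 / v for f, m, v in zip(features, mean, var)))


# Made-up "Y" silhouette and the same shape shifted by one pixel.
y_shape = [[1, 0, 0, 0, 1],
           [0, 1, 0, 1, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0]]
y_shifted = [[0] * 6] + [[0] + row for row in y_shape]

model = fit_model([hu_moments(y_shape), hu_moments(y_shifted)])
```

Because the features are built from central moments, `hu_moments(y_shape)` and `hu_moments(y_shifted)` agree up to floating-point error, so the match does not depend on where the person stands in the image. A real model would need many training examples with genuine variation in pose.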



Blob of a person making a Y
Arms up, making a "Y"

Action Understanding (For "flap your arms", "spin like a top")

The last technique used to recognize monster moves is a variant of a new action recognition technique. An in-depth description of the full approach is given in the papers on the Info page; the technique is also demonstrated online. In this method, the background subtracted images of the people are temporally integrated to yield a "temporal template" of the action. These template descriptions collapse the action over time down to a single image. The temporal extent (or range) of integration is determined by training examples of the actions. A statistical moment-based description of the action templates is used for recognition of the action. Some temporal templates are shown below.
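
One simple way to integrate silhouettes over time, sketched below, is a decaying history image: pixels covered by the current silhouette are set to a maximum value `tau`, and all other pixels fade by one per frame. This is an illustrative assumption, not the documented KidsRoom variant (see the papers on the Info page for that). The names `update_history` and `temporal_template` are hypothetical, and `tau` stands in for the temporal extent that the text says is determined from training examples.

```python
def update_history(history, mask, tau):
    """One integration step: refresh covered pixels, decay the rest."""
    return [
        [tau if m else max(0, h - 1) for h, m in zip(hrow, mrow)]
        for hrow, mrow in zip(history, mask)
    ]


def temporal_template(masks, tau):
    """Collapse a sequence of binary silhouettes into a single image."""
    height, width = len(masks[0]), len(masks[0][0])
    history = [[0] * width for _ in range(height)]
    for mask in masks:
        history = update_history(history, mask, tau)
    return history


# Made-up sequence: a one-pixel "blob" sweeping left to right over four frames.
frames = [
    [[1, 0, 0, 0]],
    [[0, 1, 0, 0]],
    [[0, 0, 1, 0]],
    [[0, 0, 0, 1]],
]

template = temporal_template(frames, tau=3)
# template == [[0, 1, 2, 3]]
```

The resulting single image records both where the silhouette has been and how recently, with the oldest position already faded out because the sweep lasted longer than `tau` frames. Moment-based features like those used for pose recognition could then be computed on the template instead of on a single silhouette.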


Blob image of person flapping



Blob for person spinning with arms out



The KidsRoom - Perceptual Computing Group - MIT Media Laboratory