TR#481: Light-years from Lena: Video and Image Libraries of the Future

Yuri Ivanov

MS Thesis

This thesis addresses the problem of using probabilistic formal languages to describe and understand actions with explicit structure. It explores several mechanisms for parsing an uncertain input string with the aid of a stochastic context-free grammar. This method, originating in speech recognition, lets us combine a statistical recognition approach with a syntactic one in a unified syntactic-semantic framework for action recognition. The basic approach is to design the recognition system as a two-level architecture. The first level, a set of independently trained component event detectors, produces the likelihood of each component model. The outputs of these detectors provide the input stream for a stochastic context-free parsing mechanism. Any decisions about the supposed structure of the input are deferred to the parser, which attempts to combine the maximum number of candidate events into the most likely sequence according to a given Stochastic Context-Free Grammar (SCFG). The grammar and parser enforce longer-range temporal constraints, disambiguate or correct uncertain or mislabeled low-level detections, and allow the inclusion of a priori knowledge about the structure of temporal events in a given domain. The method takes into account the continuous character of the input and performs "structural rectification" of it to account for misalignments and ungrammatical symbols in the stream. The rectification technique presented here uses maximization of the structure probability to drive the segmentation. We describe a method for achieving this goal, with the aim of increasing recognition rates for complex sequential actions, in the setting of segmenting and annotating input streams from various sensor modalities.
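The two-level idea in the abstract can be illustrated with a small Python sketch. It is not the parser used in the thesis: for brevity it swaps in a probabilistic CYK (Viterbi) parse over a grammar in Chomsky normal form, and it assumes the input is already segmented into time slots, so it does not attempt the structural rectification step. The function name `viterbi_cyk`, the toy grammar, and the detector likelihoods are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def viterbi_cyk(obs, lexical, binary, start="S"):
    """Most likely parse of an uncertain input stream (illustrative sketch).

    obs:     list of dicts, one per time slot: {terminal: detector likelihood}
    lexical: {(A, terminal): prob} for rules A -> terminal
    binary:  {(A, B, C): prob} for rules A -> B C
    Returns (probability, terminal labels), or (0.0, None) if no parse exists.
    """
    n = len(obs)
    # best[(i, j)][A] = (prob, labels) of the best derivation of obs[i:j] from A
    best = defaultdict(dict)

    # Level 1 -> level 2: weight each lexical rule by the detector likelihood.
    for i, scores in enumerate(obs):
        for (A, term), p in lexical.items():
            cand = p * scores.get(term, 0.0)
            if cand > best[(i, i + 1)].get(A, (0.0, None))[0]:
                best[(i, i + 1)][A] = (cand, [term])

    # Combine adjacent spans bottom-up, keeping the most probable derivation.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if B in best[(i, k)] and C in best[(k, j)]:
                        pb, left = best[(i, k)][B]
                        pc, right = best[(k, j)][C]
                        cand = p * pb * pc
                        if cand > best[(i, j)].get(A, (0.0, None))[0]:
                            best[(i, j)][A] = (cand, left + right)

    return best[(0, n)].get(start, (0.0, None))


# Hypothetical toy domain: a gesture is always "up" followed by "down".
lexical = {("U", "up"): 1.0, ("D", "down"): 1.0}
binary = {("S", "U", "D"): 1.0}

# Made-up detector outputs; slot 1 slightly prefers the ungrammatical "up".
obs = [{"up": 0.9, "down": 0.1}, {"up": 0.6, "down": 0.4}]

prob, labels = viterbi_cyk(obs, lexical, binary)
greedy = [max(slot, key=slot.get) for slot in obs]
print("greedy labels:", greedy)
print("parsed labels:", labels)
```

Picking the top detector output per slot gives "up, up", which the toy grammar cannot derive. Deferring the decision to the parser yields "up, down", the only labeling consistent with the grammar, which is the kind of correction of low-level detections the abstract describes.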