A state-based method for learning visual behavior from image sequences is presented. The technique is novel in its incorporation of multiple representations into the hidden Markov model (HMM) framework. Independent representations of the instantaneous visual input at each state of the Markov model are estimated concurrently with the learning of the temporal characteristics. Measures of the degree to which each representation describes the input are combined to determine the input's overall membership in a state. We exploit two constraints that allow the technique to be applied to view-based gesture recognition: gestures are modal in the space of possible human motion, and gestures are viewpoint-dependent. We demonstrate the recovery of the visual behavior of several simple gestures from a small number of low-resolution example image sequences.
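As a concrete illustration of how such combined memberships might enter the HMM machinery, the following sketch scores each frame under several independent per-state representations, multiplies the resulting densities into a single membership, and evaluates a sequence with the standard forward algorithm. This is a minimal sketch under assumed conventions (a diagonal-Gaussian fit per representation, the names `membership` and `forward_score`), not the paper's implementation.

```python
# Minimal sketch (not the paper's implementation): combine independent
# per-state representation scores into one state membership, then score
# an observation sequence with the standard HMM forward algorithm.
# The diagonal-Gaussian membership model and all names are assumptions.
import numpy as np

def membership(obs, state_models):
    """Combine each representation's fit into one membership score.

    obs          : dict mapping representation name -> feature vector
    state_models : dict mapping representation name -> (mean, var) of an
                   assumed diagonal Gaussian for that representation
    Treating the representations as independent lets us multiply the
    individual densities (summing in log space for stability).
    """
    log_m = 0.0
    for name, x in obs.items():
        mean, var = state_models[name]
        log_m += -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))
    return np.exp(log_m)

def forward_score(sequence, pi, A, models):
    """Likelihood of an observation sequence under the HMM.

    sequence : list of per-frame observation dicts
    pi       : (S,) initial state probabilities
    A        : (S, S) state transition matrix
    models   : list of S per-state representation model dicts
    """
    S = len(pi)
    b = np.array([membership(sequence[0], models[s]) for s in range(S)])
    alpha = pi * b  # initialize forward variable
    for obs in sequence[1:]:
        b = np.array([membership(obs, models[s]) for s in range(S)])
        alpha = (alpha @ A) * b  # propagate and weight by membership
    return alpha.sum()
```

In the setting the abstract describes, a Baum-Welch-style procedure would re-estimate both the transition matrix and the per-state representation models concurrently; the sketch above shows only the evaluation side, where the combined memberships play the role of the HMM output probabilities.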