During the past thirty years there have been many attempts to build systems to ‘conduct’ music using electronics. These have varied widely in their methods, gesture sensors, and quality. Some focus more on the algorithms and software, whereas others concentrate on the interface hardware. I detailed several conducting interfaces in my master’s thesis, including systems using light-pens, radio transmitters, ultrasound reflections, sonar, video tracking, the VPL Research DataGlove, and accelerometers. Others have used keyboards and mice, pressure sensors, and infrared tracking systems. Many of these projects were disappointing; in my opinion this is because they were not designed by or for conductors. That is, they were built by engineers who have little or no conducting experience, and the assumptions they make are therefore often simplistic or impractical for a real conductor. Here I will instead emphasize the software systems that accompany conducting applications.
Tod Machover’s numerous recent Hyperinstruments projects demonstrate an extensive array of ‘conducting’ systems for both expert performers and the general public. While many of these instruments do not explicitly resemble or mimic ‘conducting’, they make use of musical behaviors that lie in the continuum between actuating discrete notes and shaping their higher-level behaviors. Since a performer does not have to literally ‘play’ every note directly, he behaves more as a conductor would. Several instruments from Professor Machover’s Brain Opera are well suited to the more continuous, higher-level, conducting-style behaviors -- particularly the Sensor Chair, Gesture Wall, and Digital Baton. For these instruments, Machover has written mappings that feature what he calls ‘shepherding’ behaviors; these are used to shape higher-level features in the music without controlling every discrete event. He describes the artistry of mapping design as arising from the complex interactions between gesture and sound -- at one extreme the relationships are literal and clear, and at the other they are layered and complex. With the Hyperinstruments project Machover has explored the range of mapping possibilities, and feels that the best ones lie somewhere between the two poles.
Max Mathews’ Radio Baton and Radio Drum, built in collaboration with Bob Boie, are significant because they were the earliest conducting interfaces; their only predecessor was Mathews’ mechanical baton (also called the "Daton"), a stick that struck a pressure-sensitive plate. The Radio Baton system consists of two or more radio-transmitting batons, each of which transmits a distinct frequency and is tracked in three dimensions above a flat, sensitive plane.
Mathews’ numerous early publications introduced the basic issues of interactive music to the computer music community, along with his solutions to these problems. For example, Mathews soon realized that a measure of force would be needed in combination with the beat detection. For this, he implemented a velocity algorithm that determined the ‘hardness’ of the stroke by measuring the velocity of the stick as it crossed a trigger plane, a fixed distance above the surface of the table. He also realized that double triggering was a problem, and added an additional threshold (a short distance above the trigger threshold) to reset the trigger detector. Two successive beats would have to be separated by lifting the baton or stick just high enough to reset the mechanism. This meant a little extra effort for the user, but it also removed a significant source of error in the system. The Radio Drum has been used by numerous composers and performers over the years, but the Radio Baton has not enjoyed as much widespread success. According to one professional conductor, this is because the Radio Baton’s sensing mechanism requires that the baton remain above a small table-like surface to generate every beat; this is not natural for someone who has been trained in traditional conducting technique, and is impractical if she also has to communicate with the assembled musicians.
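The trigger-and-reset logic Mathews describes can be sketched in a few lines of code; the plane heights, the units, and the BatonTrigger class below are illustrative assumptions of mine, not his actual implementation.

```python
# Illustrative sketch of the trigger logic described above.
# The plane heights and units are hypothetical, not Mathews' values.

TRIGGER_PLANE = 5.0   # height (cm) above the table at which a beat fires
RESET_PLANE = 7.0     # baton must rise above this height to re-arm the trigger

class BatonTrigger:
    def __init__(self):
        self.armed = True
        self.prev_z = None
        self.prev_t = None

    def update(self, z, t):
        """Feed one height sample z (cm) at time t (s); return beat 'hardness' or None."""
        beat = None
        if self.prev_z is not None:
            if self.armed and self.prev_z >= TRIGGER_PLANE > z:
                # Downward crossing of the trigger plane: fire a beat whose
                # strength is the vertical speed at the crossing.
                beat = (self.prev_z - z) / (t - self.prev_t)
                self.armed = False
            elif not self.armed and z > RESET_PLANE:
                # Hysteresis: the baton must be lifted past the reset plane
                # before another beat can fire (avoids double triggers).
                self.armed = True
        self.prev_z, self.prev_t = z, t
        return beat
```

The hysteresis between the two planes is what prevents a single stroke from firing twice: once a beat fires, no new beat can be detected until the baton has risen back past the higher reset plane.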
The Virtual Orchestra is a Cincinnati-based commercial venture run by the composer/sound designer team of Fred Bianchi and David Smith, who began integrating their technology into professional performing arts productions in 1989. According to their commercial website,
2.5.4 A MultiModal Conducting Simulator
Perhaps the most advanced work in automatic conductor recognition has been done by Satoshi Usa of the Yamaha Musical Instrument Research Lab in Hamamatsu, Japan. At Kogakuin University in 1997-98, Usa implemented a system that used Hidden Markov Models to track conducting gestures. His hardware consisted of two electrostatic accelerometers in a small hand-held device, which detected vertical and horizontal accelerations of the right hand. In the resulting paper, "A conducting recognition system on the model of musicians’ process," he described his five-stage process. In stage one, the data are sampled at a minimum rate of 100 Hz, band-pass filtered using a 12th-order moving average, and stripped of their DC component. In stage two, an HMM is used to recognize beats; Usa uses a 5-state HMM with 32 labels to describe all of the different possible gestures, trained on 100 samples with the Baum-Welch algorithm. In stage three, a fuzzy logic system decides whether the recognized beat is valid; if it comes too soon after the previous beat it is discarded, which removes problematic double triggers. The fourth stage determines where the system is in relation to the score and whether the beat falls on 1, 2, 3, or 4. The fifth stage synthesizes the previous three stages together and outputs MIDI with the appropriate tempo and dynamics. Other features of the system include a preparatory beat at the beginning of every piece, a variable output delay based on the tempo, different following modes (loosely or tightly coupled to the beats), proportional dynamics (the loudness of notes is determined by the absolute acceleration magnitude), and appropriate differentiation between staccato and legato gestures.
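A skeleton of this five-stage pipeline might look like the following sketch. The function names, the simple peak test standing in for the trained HMM, and the fixed refractory interval standing in for the fuzzy-logic check are all placeholders of mine, not Usa’s components.

```python
import numpy as np

def preprocess(accel, rate=100):
    """Stage 1: 12th-order moving-average smoothing and DC removal (a rough
    stand-in for Usa's filtering; his exact filter design is not reproduced)."""
    kernel = np.ones(12) / 12.0
    smoothed = np.convolve(accel, kernel, mode='same')
    return smoothed - smoothed.mean()

def hmm_beat_likelihood(window):
    """Stage 2 placeholder: in Usa's system a 5-state, 32-label HMM trained with
    Baum-Welch scores each window; here a simple peak test stands in for it."""
    return float(window[-1] > window[:-1].max())

def accept_beat(t, last_beat_t, min_interval=0.2):
    """Stage 3: stand-in for the fuzzy-logic check -- reject beats that follow
    the previous one too closely (removes double triggers)."""
    return (t - last_beat_t) > min_interval

class ConductorFollower:
    """Stages 4-5: track the beat position within a 4/4 bar and emit tempo/dynamics."""
    def __init__(self):
        self.beat_index = 0
        self.last_beat_t = -1.0

    def on_window(self, window, t):
        if hmm_beat_likelihood(window) > 0.5 and accept_beat(t, self.last_beat_t):
            tempo = 60.0 / (t - self.last_beat_t) if self.last_beat_t > 0 else None
            dynamics = float(np.abs(window).max())   # loudness ~ acceleration magnitude
            self.beat_index = self.beat_index % 4 + 1
            self.last_beat_t = t
            return {'beat': self.beat_index, 'tempo_bpm': tempo, 'dynamics': dynamics}
        return None
```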
His assumptions about conducting technique came from the rule-based system proposed by Max Rudolf in "The Grammar of Conducting." Usa’s results were extremely strong; his beat recognition rates ranged from 98.95% to 99.74%. Much of this success can be attributed to his multi-stage HMM process, which allowed each successive stage to correct errors made by its predecessors. Usa later incorporated pulse, eye tracking (gaze point, blinking), GSR, and respiration sensing into extensions of this system.
2.5.5 The Conductor Follower of the MIT Electronic Music Studio
At the MIT Electronic Music Studio in the early 1980s, Stephen Haflich and Mark Burns developed a sonar-based conductor-following device. It used inexpensive ultrasonic rangefinder units that had been developed by Polaroid for their automatic cameras. They mounted the two sonar devices in separate wooden frames that sat on the floor and aimed the sonar beams upward toward the conductor’s arm at an angle of about 45 degrees. Since the devices were too directional, a dispersion fan was built to spread the signal in front of the conductor. The conductor had to be careful not to move forward or back and to keep his arm extended. The device would track the arm in two dimensions to an accuracy of about one inch at better than 10 readings per second. Haflich and Burns modified the device's circuit board to produce a much softer click so that it would not interfere with the music, and were able to sense within a five-foot range, which corresponded to a rapid update rate of approximately 10-20 Hz. To increase the sensitivity they raised the DC voltage on the devices from approximately 9 to 45 volts.
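The two-dimensional tracking amounts to intersecting the two measured ranges. A minimal sketch of that geometry, assuming the two rangefinders sit a known distance apart on the floor and ignoring the 45-degree upward tilt of the actual beams:

```python
import math

def arm_position(r1, r2, d=1.0):
    """Estimate the 2-D position of the conductor's arm from two sonar range
    readings r1, r2 (meters), assuming the rangefinders sit d meters apart
    at (0, 0) and (d, 0).  Illustrative geometry only, not the actual device."""
    x = (r1**2 - r2**2 + d**2) / (2 * d)
    y_sq = r1**2 - x**2
    if y_sq < 0:
        return None          # inconsistent readings (noise or a dropped echo)
    return x, math.sqrt(y_sq)
```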
One very nice feature of the Haflich and Burns device was its unobtrusiveness -- no wand or baton was necessary. However, Max Mathews, in residence at MIT that summer, suggested that they use a baton with a corner reflector on its tip; this improved the sensitivity of the device and reduced the number of dropped beats. Unfortunately, their device was never used further to study or exploit conducting gesture; they implemented only one function for it, which detected the conductor's tactus and used it to control a synthesizer.
2.5.6 Gesture Recognition and Computer Vision
I originally based my search on the premise that current methods used by the gesture-recognition, pattern-recognition, and computer vision communities might be useful for developing mappings for new musical instruments. This premise proved fruitful, because the search turned up numerous techniques that are otherwise not used by musicians or composers. Gesture recognition researchers have also developed methods for simplifying the inherent problems of recognition. Some of these techniques have great potential value for musical structures, such as determining the meter and tempo of a composition. For example, Bobick and Wilson have defined gestures as sequences of configuration states in a measurement space that can be captured with both repeatability and variability by either narrowing or widening the state-space. They have provided a powerful model for abstracting away the difficult aspects of the recognition problem.
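Their formulation can be caricatured in a few lines: a gesture is an ordered list of states, each a region of measurement space whose radius can be widened to admit more variability or narrowed to demand more repeatability, and a trajectory matches when it visits the states in order. The GestureModel class below is my own illustration of that idea, not Bobick and Wilson’s representation.

```python
import numpy as np

class GestureModel:
    """Caricature of a gesture as an ordered sequence of configuration states.
    Each state is a point in measurement space plus a radius; widening the radii
    admits more variability, narrowing them demands more repeatability."""
    def __init__(self, centers, radii):
        self.centers = [np.asarray(c, dtype=float) for c in centers]
        self.radii = list(radii)

    def matches(self, trajectory):
        """Return True if the trajectory passes through every state in order."""
        state = 0
        for sample in trajectory:
            if state == len(self.centers):
                break
            if np.linalg.norm(np.asarray(sample) - self.centers[state]) <= self.radii[state]:
                state += 1
        return state == len(self.centers)
```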
In addition, music requires very quick response times, absolutely repeatable "action-response mechanisms," high sampling rates, almost no hysteresis or external noise, and the recognition of highly complex, time-varying functions. For example, most musical performances demand response rates on the order of 1 kHz, almost two orders of magnitude faster than the 10-30 Hz update rates of current gesture-recognition systems. Also, many gesture-recognition systems either use encumbering devices such as gloves, which limit the expressive power of the body, or low-resolution video cameras, which lose track of important gestural cues and require tremendously expensive computation. However, many of the pattern- and gesture-recognition techniques have merit, and with some adaptations they have been shown to be useful for musical applications.
While gesture recognition cannot solve all of my problems, it does offer some important and useful techniques. One such technique is the Hidden Markov Model, which is normally used to find and train for interrelated clusters of states; HMMs are also useful, although rarely used, for training on transitions. A second area involves the use of grammars (regular, stochastic) to parse the sub-pieces of a gesture language. A third is Bayesian networks. While none of these techniques is particularly optimized for real-time use or for music, I think that a combination of them will yield interesting results.
Numerous others have undertaken conducting system projects; most notable are those that have employed advanced techniques for real-time gesture recognition. Most recently, Andrew Wilson of the MIT Media Lab Vision and Modeling group built an adaptive real-time system for beat tracking using his Parametric Hidden Markov Modeling technique. This system, called "Watch and Learn," has a training algorithm that allows it to teach itself the extremes of an oscillating pattern of movement from a few seconds of video. The extremes are automatically labeled ‘upbeat’ and ‘downbeat,’ and once found they allow the system to lock onto the oscillation frequency. That frequency directly controls the tempo of the output sequence, with some smoothing. One great advantage of Wilson’s method is that it does not use prior knowledge about hands or even attempt to track them; it simply finds an oscillating pattern in the frame and locks onto it on the fly. This means that the gestures do not have to be fixed in any particular direction, unlike in many gesture recognition systems. Also, Yuri Ivanov and Aaron Bobick built a system that used probabilistic parsing methods to distinguish between different beat patterns in a passage involving free metric modulation.
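The underlying idea of locking onto an oscillation and reading tempo off its period can be illustrated with a much simpler sketch than Wilson’s Parametric HMM; the direction-reversal test and the smoothing constant below are my own assumptions, not his algorithm.

```python
class OscillationTempoTracker:
    """Simplified illustration of tempo tracking from an oscillating motion signal
    (not Wilson's Parametric HMM): treat direction reversals at the bottom extreme
    as 'downbeats' and low-pass filter the tempo they imply."""
    def __init__(self, smoothing=0.8):
        self.smoothing = smoothing
        self.tempo_bpm = None
        self.prev_y = None
        self.prev_dy = 0.0
        self.last_downbeat_t = None

    def update(self, y, t):
        """Feed one vertical-position sample y at time t (s); return smoothed tempo (bpm)."""
        if self.prev_y is not None:
            dy = y - self.prev_y
            if self.prev_dy < 0 <= dy:            # bottom extreme: a 'downbeat'
                if self.last_downbeat_t is not None:
                    raw = 60.0 / (t - self.last_downbeat_t)
                    self.tempo_bpm = (raw if self.tempo_bpm is None
                                      else self.smoothing * self.tempo_bpm
                                           + (1 - self.smoothing) * raw)
                self.last_downbeat_t = t
            self.prev_dy = dy
        self.prev_y = y
        return self.tempo_bpm
```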
Finally, Martin Friedmann, Thad Starner, and Alex Pentland, also of the MIT Media Lab Vision and Modeling Group, used Kalman filters to predict the trajectory of a motion one step ahead of the current position. Their system allowed someone to play ‘air drums’ with a Polhemus magnetic position-tracking sensor with near-instantaneous response times. Their solution provided a clever means of overcoming the inherent time delay in the sensing, data acquisition, and processing tasks. For processor-intensive computations, such as trajectory shape (curve-fitting), aspect ratio, slope, curvature, and amplitude estimation of peaks, their technique could be very useful. While the motivation behind these projects was not to build instruments or systems for performers, they chose musical application areas because they are interesting, rule-based, and complex. Their primary motivation, namely to improve vision-based gesture recognition systems, generated several advanced techniques that may prove very useful for music applications in the future.
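A one-dimensional constant-velocity Kalman filter that reports its one-step-ahead prediction illustrates the latency-hiding idea; the noise parameters and class structure are assumptions of mine, not Friedmann, Starner, and Pentland’s filter.

```python
import numpy as np

class PredictiveTracker:
    """1-D constant-velocity Kalman filter that returns the *predicted* position
    one step ahead of the latest measurement, compensating for sensing and
    processing latency (an illustration of the idea only)."""
    def __init__(self, dt=0.01, process_var=1e-2, meas_var=1e-3):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])      # state transition
        self.H = np.array([[1.0, 0.0]])                 # we measure position only
        self.Q = process_var * np.eye(2)                # process noise
        self.R = np.array([[meas_var]])                 # measurement noise
        self.x = np.zeros((2, 1))                       # state: [position, velocity]
        self.P = np.eye(2)                              # state covariance

    def update(self, z):
        """Incorporate measurement z, then return the position predicted one step ahead."""
        # Predict the current state from the previous one.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Correct with the new measurement.
        y = np.array([[z]]) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
        # Look one step into the future to hide the processing delay.
        return float((self.F @ self.x)[0, 0])
```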