2.5 Interactive systems for conductors and conductor-like gestures

During the past thirty years there have been many attempts to build systems for ‘conducting’ music electronically. These have varied widely in their methods, gesture sensors, and quality. Some focus more on the algorithms and software, whereas others concentrate on the interface hardware. I detailed several conducting interfaces in my master's thesis, including systems using light-pens, radio transmitters, ultrasound reflections, sonar, video tracking, the VPL Research DataGlove, and accelerometers. Others have used keyboards and mice, pressure sensors, and infrared tracking systems. Many of these projects were disappointing; in my opinion this is because they have not been designed by or for conductors. That is, they are built by engineers who have little or no conducting experience, and therefore the assumptions they make are often simplistic or impractical for use by a real conductor. Here I will instead emphasize the software systems that accompany conducting applications.

2.5.1 Hyperinstruments

Tod Machover’s numerous recent Hyperinstruments projects demonstrate an extensive array of ‘conducting’ systems for both expert performers and the general public. While many of these instruments do not explicitly resemble or mimic ‘conducting’, they make use of musical behaviors that lie in the continuum between actuating discrete notes and shaping their higher-level behaviors. Because the performer does not have to literally ‘play’ every note directly, he or she behaves more like a conductor. Several instruments from Professor Machover’s Brain Opera are well suited to the more continuous, higher-level, conducting-style behaviors -- particularly the Sensor Chair, Gesture Wall, and Digital Baton. For these instruments, Machover has written mappings that feature what he calls ‘shepherding’ behaviors; these are used to shape higher-level features in the music without controlling every discrete event. He describes the artistry of mapping design as coming from the complex interactions between gesture and sound -- on one extreme, the relationships are literal and clear, and on the other extreme, they are layered and complex. With the Hyperinstruments project Machover has explored the range of mapping possibilities, and feels that the best ones lie somewhere in between the two poles.
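To make the idea of a ‘shepherding’ mapping concrete, here is a minimal illustrative sketch (my own, not Machover's actual code): a single continuous gesture parameter shapes the density and loudness of a generated note stream rather than triggering each note directly. All names and values are hypothetical.

```python
# A speculative sketch of a 'shepherding'-style mapping: one continuous
# gesture parameter shapes higher-level features (density, loudness) of
# an ongoing note stream instead of actuating each note.
import random

def shepherd_step(gesture_energy, scale=(60, 62, 64, 67, 69)):
    """gesture_energy in [0, 1]; returns a list of (midi_note, velocity)."""
    density = int(1 + gesture_energy * 4)        # 1-5 notes per step
    velocity = int(40 + gesture_energy * 80)     # louder with more energy
    return [(random.choice(scale), velocity) for _ in range(density)]

for energy in (0.1, 0.5, 0.9):
    print(energy, shepherd_step(energy))
```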

2.5.2 Radio Baton

Max Mathews’ Radio Baton and Radio Drum, built in collaboration with Bob Boie, are significant because they were the earliest conducting interfaces; their only predecessor was Mathews’ mechanical baton (also called the "Daton"), which was a stick that hit a pressure-sensitive plate. The Radio Baton system consists of two or more radio-transmitting batons, each of which transmits a distinct frequency and is tracked in three dimensions above a flat, sensitive plane.

Figure 5. Max Mathews performing on the Radio Baton (photo by Pattie Wood)

Mathews’ numerous early publications introduced the basic issues of interactive music to the computer music community, along with his solutions to these problems. For example, Mathews soon realized that a measure of force would be needed in combination with the beat detection. For this, he implemented a velocity algorithm that determined the ‘hardness’ of the stroke by measuring the velocity of the stick as it crossed a trigger plane, a fixed distance above the surface of the table. He likewise found that double triggering was a problem, and added an additional threshold (a short distance above the trigger threshold) to reset the trigger detector. Two successive beats would have to be separated by lifting the baton or stick just high enough to reset the mechanism. This meant a little extra effort for the user, but also removed a significant source of error in the system. The Radio Drum has been used by numerous composers and performers over the years, but the Radio Baton has not enjoyed as much widespread success. According to one professional conductor, this is because the Radio Baton’s sensing mechanism requires that the baton remain above a small table-like surface to generate every beat; this is not natural for someone who has been trained in traditional conducting technique, and is impractical if she also has to communicate with assembled musicians.
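The trigger-plane and reset-threshold scheme can be sketched as follows. This is a hedged reconstruction from the description above, not Mathews' code; the plane heights, units, and sample values are illustrative.

```python
# Trigger-plane beat detection with a hysteresis reset: a beat fires when
# the baton crosses the trigger plane moving downward, and the detector
# does not re-arm until the baton rises above a higher reset plane.

TRIGGER_PLANE = 5.0   # cm above the table: crossing this fires a beat
RESET_PLANE = 7.0     # cm: baton must rise above this to re-arm

class BeatDetector:
    def __init__(self):
        self.armed = True

    def update(self, height_cm, velocity_cm_s):
        """Return the stroke 'hardness' (downward speed) on a beat, else None."""
        if self.armed and height_cm <= TRIGGER_PLANE and velocity_cm_s < 0:
            self.armed = False               # disarm until the baton is lifted
            return abs(velocity_cm_s)        # speed at the plane crossing
        if not self.armed and height_cm >= RESET_PLANE:
            self.armed = True                # hysteresis: re-armed above reset plane
        return None

detector = BeatDetector()
for height, velocity in [(10, -40), (4.8, -55), (4.0, -10), (8.0, 30), (4.5, -60)]:
    hardness = detector.update(height, velocity)
    if hardness is not None:
        print(f"beat, hardness ~ {hardness} cm/s")
```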

2.5.3 The Virtual Orchestra

The Virtual Orchestra is a Cincinnati-based commercial venture run by the composer/sound designer team of Fred Bianchi and David Smith, who began integrating their product into professional performing arts productions in 1989. According to their commercial website,

"the Bianchi & Smith Virtual Orchestra is a sophisticated network of computers that deliver a beautifully clear digital simulation of a live orchestra. There are no electronic keyboards, pre-recorded tapes, click-tracks, or artistic constraints. The Virtual Orchestra is a stand alone live performance instrument capable of following a conductor's tempo and subtle musical interpretation throughout a live performance." Bianchi and Smith usually run their system with a large electronic system in the pit, featuring two performers at computers, following a traditional conductor (who also conducts the performers onstage). The effect at one performance was described as follows from the Washington Post: "The orchestra pit held only the conductor, Robin Stamper, and two musically trained technicians, who stared into their video monitors with a calm that would have done credit to seasoned bank tellers, following the baton with carefully synchronized entries into the computer..." Despite the apparent lack of visual dynamism in performances of the Virtual Orchestra, the sonic result has been described as extremely realistic and professional. To date it has been used in over 800 performances of opera, musical theater, and ballet, including productions on Broadway, at the New York Shakespeare Festival, and at Lincoln Center. The company is currently engaged in an ongoing collaboration with Lucent Technologies. Not surprisingly, however, they have also generated considerable controversy. This is because they use their computer-based technology to replace the more expensive human musicians who have traditionally created the music in the pit. In a highly publicized opera production at the Kentucky Opera House, the entire pit orchestra was left out of a production of Hansel and Gretel, in favor of the Virtual Orchestra. The management of the opera house claimed that it did this to save money on production costs so as to help fund other productions with its in-house orchestra; one critic had this to say about the result: "The continuing development of this technology has ominous implications for opera and all music. The digitization process (Bianchi & Smith) is another case of the dehumanization of society and the deterioration of education." Equally withering is this description of the system by a music student: "The Virtual Orchestra, however, has been viewed as a threat to traditional musicianship…In fact, the orchestra sounds so real, that it is a low cost, effective substitute for an entire pit orchestra made up of professional musicians…While each orchestra "track" takes over three years to complete, as Bianchi puts it, "Once it’s done, it’s done." That means that popular pieces such as the Wizard of Oz can be used over and over again. All that the orchestra requires during a performance is the monitoring of a few people who constantly adjust the tempo, volume, and pitches of the electronic score. They watch the conductor and follow along, just as in any performance containing live musicians. While some purists consider this practice "ruining opera" and stealing the soul from otherwise live musical performances, Bianchi is quick to point out that "In a musical, where are the musicians? They are in a pit, inaccessible to the audience. We just take their place. We can never replace live orchestras in the sense that people will never come to see a few guys fiddle with electronic boxes. But we can fill in for the unseen musicians at a musical or opera, and at much lower of a cost." 
This brings around a sense of insecurity to the average traditional musician, despite Bianchi’s reassurances." My opinion is that the Virtual Orchestra system represents an unfortunate use of computer technology to save money by replacing human beings. The idea of computers as labor-saving devices is an age-old theme in the history of computer development, and often these ideas are short-lived. The Virtual Orchestra presents an impoverished vision about what the new technology is capable of – yes, in the short term, it can approximate traditional music well enough to replace humans. But a much richer function for the same technology would be for it to be used to create exciting new performance paradigms, not to dislocate a class of skilled professionals.

2.5.4 A MultiModal Conducting Simulator

Perhaps the most advanced work in automatic conductor recognition has been done by Satoshi Usa of the Yamaha Musical Instrument Research Lab in Hamamatsu, Japan. At Kogakuin University in 1997-98, Usa implemented a system that used Hidden Markov Models to track conducting gestures. His hardware consisted of two electrostatic accelerometers in a small hand-held device, which detected vertical and horizontal accelerations of the right hand. In the resulting paper, "A conducting recognition system on the model of musicians’ process," he described his five-stage process. In stage one, the data is sampled at a minimum rate of 100Hz, band-pass filtered using a 12th-order moving average, and stripped of its DC component. In stage two, an HMM recognizes beats; Usa uses a 5-state HMM with 32 labels to describe the different possible gestures, trained on 100 samples using the Baum-Welch algorithm. In stage three, a fuzzy logic system decides whether the recognized beat is correct; if it comes too soon after a previous beat, it is discarded, which removes problematic double-triggers. A fourth stage determines where the system is in relation to the score and whether the beat is on 1, 2, 3, or 4. The fifth stage synthesizes the previous stages together and outputs MIDI with appropriate tempo and dynamics. Other features of the system include a preparatory beat at the beginning of every piece, a variable output delay based on the tempo, different following modes (loosely or tightly coupled to the beats), proportional dynamics (the loudness of notes is determined by the absolute acceleration magnitude), and appropriate differentiations between staccato and legato gestures.
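The following sketch illustrates stages one and three as described above, with the HMM of stage two omitted. It is my reconstruction, not Usa's code; the rejection rule is a simple time gate standing in for his fuzzy-logic stage, and the interval threshold is an illustrative value.

```python
# Stage one: moving-average filtering and DC removal of accelerometer data.
# Stage three (simplified): discard beats that follow too closely.
import numpy as np

FS = 100  # Hz, the minimum sampling rate given in the text

def preprocess(accel):
    """12-point moving average, then DC removal."""
    kernel = np.ones(12) / 12.0
    smoothed = np.convolve(accel, kernel, mode="same")
    return smoothed - smoothed.mean()

def reject_double_triggers(beat_times_s, min_interval_s=0.15):
    """Keep only beats separated by at least min_interval_s (illustrative,
    not Usa's value); a crude stand-in for his fuzzy-logic stage."""
    accepted, last = [], -np.inf
    for t in beat_times_s:
        if t - last >= min_interval_s:
            accepted.append(t)
            last = t
    return accepted

accel = np.random.randn(FS * 2)              # two seconds of dummy data
filtered = preprocess(accel)
print(reject_double_triggers([0.50, 0.55, 1.00, 1.02, 1.50]))
```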

His assumptions about conducting technique came from the rule-based system proposed by Max Rudolf in "The Grammar of Conducting." Usa’s results were extremely strong; his beat recognition rates were 98.95-99.74% accurate. Much of this success can be attributed to his multi-stage process, in which each successive stage corrects errors from its predecessors. Usa later incorporated pulse, eye tracking (gaze point, blinking), GSR, and respiration sensing into extensions of this system.

2.5.5 The Conductor Follower of the MIT Electronic Music Studio

At the MIT Electronic Music Studio in the early 1980s, Stephen Haflich and Mark Burns developed a sonar-based conductor-following device. It used inexpensive ultrasonic rangefinder units that had been developed by Polaroid for their automatic cameras. They mounted the two sonar devices in separate wooden frames that sat on the floor and positioned the sonar beams upward toward the conductor's arm at an angle of about 45 degrees. Because the beams were too directional, a dispersion fan was built to spread the signal in front of the conductor. The conductor had to be careful not to move forward or back and to keep his arm extended. The device would track the arm in two dimensions to an accuracy of about one inch at better than 10 readings per second. Haflich and Burns modified the device's circuit board to create a much softer click so that it wouldn't interfere with music, and were able to sense within a five-foot range, which corresponded to an update rate of approximately 10-20 Hz. To increase the sensitivity they increased the DC voltage on the devices from approximately 9 to 45 volts.
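The geometry of the two-sonar arrangement suggests a simple localization scheme: each rangefinder reports a distance, and the arm lies at the intersection of the two range circles. The sketch below is my reconstruction under that assumption, not the original code; the sensor spacing and range values are illustrative.

```python
# Bilateration from two range readings: sensors at (0,0) and (d,0),
# the target is where the two range circles intersect (taking the
# solution above the floor).
import math

def bilaterate(d, r1, r2):
    """Sensors separated by d; ranges r1, r2. Return (x, y) with y >= 0."""
    x = (r1**2 - r2**2 + d**2) / (2 * d)
    y_sq = r1**2 - x**2
    if y_sq < 0:
        return None                 # inconsistent readings: circles don't meet
    return (x, math.sqrt(y_sq))

print(bilaterate(d=2.0, r1=1.8, r2=1.5))
```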

Figure 6.  Stephen Haflich conducting at the MIT Media Lab with his sonar device in 1985.

One very nice feature of the Haflich and Burns device was its unobtrusiveness -- no wand or baton was necessary. However, Max Mathews, in residence at MIT that summer, suggested that they use a baton with a corner reflector on its tip; this improved the sensitivity of the device and reduced the number of dropped beats. Unfortunately, their device was never used further to study or exploit conducting gesture -- they implemented only one function for it, which detected the conductor's tactus and used it to control a synthesizer.

2.5.6 Gesture Recognition and Computer Vision

I originally based my search on the premise that current methods used by the gesture-recognition, pattern-recognition, and computer vision communities might be useful for developing mappings for new musical instruments. This premise proved fruitful: the search turned up numerous techniques that are otherwise not used by musicians or composers, and gesture recognition researchers have developed methods for simplifying the inherent problems. Some of these techniques have great potential value for musical structures, such as determining the meter and tempo of a composition. For example, Bobick and Wilson have defined gestures as sequences of configuration states in a measurement space that can be captured with both repeatability and variability by either narrowing or widening the state-space. In doing so, they have provided a powerful model for abstracting away the difficult aspects of the recognition problem.
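The state-space idea can be illustrated with a toy example (mine, not Bobick and Wilson's implementation): each state is a region of the measurement space, a trajectory is collapsed to the sequence of states it visits, and recognition checks that sequence against a prototype. Widening a state's radius admits more variability; narrowing it demands more repeatability.

```python
# Gesture-as-state-sequence recognition over a 2-D measurement space.
import numpy as np

STATES = {                               # hypothetical configuration states
    "A": (np.array([0.0, 0.0]), 0.5),    # (center, radius)
    "B": (np.array([1.0, 1.0]), 0.5),
    "C": (np.array([2.0, 0.0]), 0.5),
}
PROTOTYPE = ["A", "B", "C"]              # the gesture is the visit order A -> B -> C

def state_sequence(trajectory):
    """Collapse a trajectory to the ordered list of states it passes through."""
    seq = []
    for point in trajectory:
        for name, (center, radius) in STATES.items():
            if np.linalg.norm(point - center) <= radius:
                if not seq or seq[-1] != name:    # collapse repeated visits
                    seq.append(name)
    return seq

traj = np.array([[0.1, 0.0], [0.5, 0.5], [1.0, 0.9], [1.6, 0.4], [2.0, 0.1]])
print(state_sequence(traj) == PROTOTYPE)          # True: gesture recognized
```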

"Since humans do not reproduce their gestures very precisely, natural gesture recognition is rarely sufficiently accurate due to classification errors and segmentation ambiguity." But this was also partly unsuccessful, because the requirements for musical performance represent a very specialized and demanding subset of all gestures. The state of the art in gesture recognition is predicated on simple requirements, such as detection and classification of symbolic, one-to-one mappings. For example, most gesture-recognition tasks involve pointing at objects or demonstrating predefined postures such as hand signs from a sign language. These techniques are analogous to the triggering of discrete musical events, and are much too simple to describe the complex trajectories that music takes through its multivariable state-space. Often, the recognition process itself requires that much of the minute, expressive detail in a gesture be thrown out in order to train the system to recognize the general case.

In addition, music requires very quick response times, absolutely repeatable "action-response mechanisms," high sampling rates, almost no hysteresis or external noise, and the recognition of highly complex, time-varying functions. For example, most musical performances demand a response rate on the order of 1 kHz (roughly one millisecond), nearly two orders of magnitude faster than the 10-30 Hz response rates of current gesture-recognition systems. Also, many gesture-recognition systems either use encumbering devices such as gloves, which limit the expressive power of the body, or low-resolution video cameras, which lose track of important gestural cues and require tremendously expensive computation. However, many of the pattern- and gesture-recognition techniques have merit, and with some adaptations they have been shown to be useful for musical applications.

While gesture recognition cannot solve all of my problems, it does offer some important and useful techniques. One such technique is Hidden Markov Models, which are normally used to find and train for interrelated clusters of states; they can also be used, although rarely are, to train for transitions. A second area involves the use of grammars (regular, stochastic) to parse the sub-pieces of a gesture language. A third is Bayesian networks. While none of these techniques is particularly optimized for real-time usage or music, I think that a combination of them will yield interesting results.
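As a concrete illustration of the first technique, the sketch below trains a Gaussian HMM on example gesture trajectories and scores a new trajectory against it. The hmmlearn library, the feature dimensions, and the state count are my choices for illustration, not anything prescribed by the systems discussed here.

```python
# Train a Gaussian HMM on example gesture trajectories (Baum-Welch),
# then score a candidate trajectory by its log-likelihood under the model.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
examples = [rng.standard_normal((50, 2)).cumsum(axis=0) for _ in range(10)]
X = np.concatenate(examples)                 # stacked training sequences
lengths = [len(e) for e in examples]         # sequence boundaries for hmmlearn

model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
model.fit(X, lengths)                        # Baum-Welch training

candidate = rng.standard_normal((50, 2)).cumsum(axis=0)
print(model.score(candidate))                # log-likelihood of the new gesture
```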

Numerous others have undertaken conducting system projects; most notable are the ones that have employed advanced techniques for real-time gesture recognition. Most recently, Andrew Wilson of the MIT Media Lab Vision and Modeling group built an adaptive real-time system for beat tracking using his Parametric Hidden Markov Modeling technique. This system, called "Watch and Learn," has a training algorithm that allows it to teach itself the extremes of an oscillating pattern of movement from a few seconds of video. The extremes are automatically labeled ‘upbeat’ and ‘downbeat,’ and once they are found they allow the system to lock onto the oscillating frequency. The frequency directly controls the tempo of the output sequence, with some smoothing. One great advantage of Wilson’s method is that it doesn’t use prior knowledge about hands or even attempt to track them; it simply finds an oscillating pattern in the frame and locks onto it on the fly. This means that the gestures do not have to be fixed in any particular direction, unlike many gesture recognition systems. Also, Yuri Ivanov and Aaron Bobick built a system using probabilistic parsing methods to distinguish between different beat patterns in a passage involving free metric modulation.
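A crude stand-in for the frequency-locking step might look like the following: estimate the period of an oscillating one-dimensional motion signal by autocorrelation and map it to a tempo. This is only in the spirit of Wilson's system, not his Parametric HMM method; the frame rate and test signal are illustrative.

```python
# Estimate the period of an oscillating motion signal by autocorrelation
# and convert it to a tempo in beats per minute.
import numpy as np

FS = 30.0                                    # assumed video frame rate, Hz

def tempo_from_motion(signal, fs=FS):
    """Return estimated tempo in BPM from an oscillating 1-D signal."""
    x = signal - signal.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    min_lag = int(fs * 0.2)                  # ignore lags under 0.2 s
    peak = np.argmax(ac[min_lag:]) + min_lag
    period_s = peak / fs
    return 60.0 / period_s

t = np.arange(0, 10, 1 / FS)
motion = np.sin(2 * np.pi * 1.0 * t)         # 1 Hz oscillation = 60 BPM
print(round(tempo_from_motion(motion)))      # ~60
```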

Finally, Martin Friedmann, Thad Starner, and Alex Pentland, also of the MIT Media Lab Vision and Modeling Group, used Kalman filters to predict the trajectory of a motion one step ahead of the current position. Their system allowed someone to play ‘air drums’ with a Polhemus magnetic position-tracking sensor with near-instantaneous response times. Their solution provided a clever means to overcome the inherent time delay in the sensing, data acquisition, and processing tasks. For processor-intensive computations, such as trajectory shape (curve-fitting), aspect ratio, slope, curvature, and amplitude estimation of peaks, their technique could be very useful. While the motivations behind all these projects were not to build instruments or systems for performers, their creators chose musical application areas because they are interesting, rule-based, and complex. Their primary motivation, namely, to improve upon vision-based gesture recognition systems, generated several advanced techniques that may prove to be very useful for music applications in the future.
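A textbook constant-velocity Kalman filter shows the basic idea of predicting one step ahead to mask sensing latency. This is a standard formulation, not Friedmann, Starner, and Pentland's code; it works in one spatial dimension, and the noise magnitudes are illustrative.

```python
# One-dimensional constant-velocity Kalman filter: fuse each position
# measurement, then report the position predicted one step ahead, so the
# output leads the (delayed) sensor stream.
import numpy as np

DT = 1 / 60.0                                # tracker update interval, s
F = np.array([[1.0, DT], [0.0, 1.0]])        # state transition: [pos, vel]
H = np.array([[1.0, 0.0]])                   # we measure position only
Q = np.eye(2) * 1e-4                         # process noise (illustrative)
R = np.array([[1e-2]])                       # measurement noise (illustrative)

x = np.zeros((2, 1))                         # state estimate [pos, vel]
P = np.eye(2)                                # estimate covariance

def kalman_step(z):
    """Fuse measurement z, return the position predicted one step ahead."""
    global x, P
    x = F @ x                                # predict to the current time
    P = F @ P @ F.T + Q
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (np.array([[z]]) - H @ x)    # correct with the measurement
    P = (np.eye(2) - K @ H) @ P
    return float((F @ x)[0, 0])              # position expected at t + DT

for step in range(5):
    z = 0.5 * step * DT                      # dummy constant-velocity motion
    print(kalman_step(z))
```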
