1.3 Motivation

"While the human hand is well-suited for multidimensional control due to its detailed articulation, most gestural interfaces do not exploit this capability due to a lack of understanding of the way humans produce their gestures and what meaning can be inferred from these gestures."

The strongest motivation for me to begin this project was the enormous difficulty I encountered in previous projects when attempting to map gestures to sounds. This was particularly true with my Digital Baton project, which I will discuss in detail in section 1.3.1. Secondly, a glaring lack of empirical data motivated me to gather some for myself. A visit to Professor Rosalind Picard in 1996 yielded some new ideas about how to design a data collection experiment for conductors, which we eventually implemented in the Conductor’s Jacket project. As far as I know, there have been no other quantitative studies of conductors and gesture. Even in other studies of gesture I have not come across the kind of complex, multidimensional data that were required to describe conducting. Ultimately, I came to the realization that many music researchers were going about solving the problems in the wrong way; they were designing mappings for gestural interaction without really knowing what would map most closely to the perceptions of the performer and audience. I felt that the right method would be to study conductors in their real working environments without changing anything about the situation, and to monitor the phenomena using sensors. This empirical approach informed the entire process of the thesis project.

In this section I also discuss my major influences in Tod Machover’s Hyperinstruments and Brain Opera projects. It was through my opportunities to participate in the research, performance, and public education aspects of these projects that I was able to make many of the observations that I express in this thesis. After describing aspects of the Brain Opera and the Digital Baton, I go on to explain why I have chosen conducting as the model to study, and why, in some ways, it is a bad example. Finally, I discuss the higher-level aspects of musicianship, the interpretive trajectories that performers take through musical scores, and the rules and expectations that determine a musician's skill and expressiveness.

1.3.1 Hyperinstruments, the Brain Opera, and the Digital Baton

In 1987 at the MIT Media Lab, Professor Tod Machover and his students began to bring ideas and techniques from interactive music closer to the classical performing arts traditions with the Hyperinstruments project. About his research, Machover wrote:

"Enhanced human expressivity is the most important goal of any technological research in the arts. To achieve this, it is necessary to augment the sophistication of the particular tools available to the artist. These tools must transcend the traditional limits of amplifying human gestuality, and become stimulants and facilitators to the creative process itself." Among the more popular and enduring of the resultant family of hyperinstruments have been the Hyperviolin, the Hypercello, and the Sensor Chair, all of which were designed for expert and practiced performers. For its time, the Hypercello was among the most complex of real-time digital interfaces; it measured and responded to five different continuous parameters: bow pressure, bow position, bow placement (distance from bridge), bow wrist orientation, and finger position on the strings.

In 1994, Tod Machover began developing the Brain Opera, perhaps the largest cutting-edge, multidisciplinary performance project ever attempted. A digital performance art piece in three parts that invited audiences to become active participants in the creative process, it premiered at Lincoln Center’s Summer Festival in July of 1996 and subsequently embarked on a world tour. During the following two years it was presented nearly 180 times in major venues on four continents. I’m proud to have been a member of the development and performance teams, and think that our most important collective contributions were the new instrument systems we developed. In all, seven physical devices were built: the Sensor Chair, Digital Baton, Gesture Wall, Rhythm Tree, Harmonic Driving, Singing/Speaking Trees, and Melody Easel. Those of us who were fortunate enough to have the opportunity to tour with the Brain Opera also had a chance to observe people interacting with these instruments, and got a sense for how our designs were received and used by the public.

My primary contribution to the Brain Opera was the Digital Baton, a hand-held gestural interface that was designed to be wielded like a traditional conducting baton by practiced performers. It was a ten-ounce molded polyurethane device that incorporated eleven sensory degrees of freedom: 3 degrees of position, 3 orthogonal degrees of acceleration, and 5 points of pressure. The many sensors were extremely robust and durable, particularly the infrared position tracking system that worked under a variety of stage lighting conditions. First suggested by Tod Machover, the Digital Baton was designed by me and built by Professor Joseph Paradiso; it also benefited from the collaborative input of Maggie Orth, Chris Verplaetse, Pete Rice, and Patrick Pelletier. Tod Machover wrote two pieces of original music for it and we performed them in a concert of his music in London’s South Bank Centre in March of 1996. Later, Professor Machover incorporated the Baton into the Brain Opera performance system, where it was used to trigger and shape multiple layers of sound in the live, interactive show. Having designed and contributed to the construction of the instrument, I also wielded it in nearly all of the live Brain Opera performances.

Figure 1. The Digital Baton, February 1996.

Despite the high hopes I had for the Digital Baton and the great deal of attention that it received, however, it ultimately failed to match the expectations I had for it. Perhaps because I had helped to design the device and its software mappings and then had the opportunity to perform with it, I became acutely aware of its shortcomings. From my experience, its biggest problems were:

The baton’s size and heaviness were not conducive to graceful, comfortable gestures; it was 5-10 times the weight of a normal cork-and-balsa wood conducting baton. A typical 45-minute gestural Brain Opera performance with the 10-ounce Digital Baton was often exhausting. This also meant that I couldn’t take it to orchestral conductors to try it out; it was too heavy for a conductor to use in place of a traditional baton.

Its shape, designed to conform to the inside of my palm, caused the wrist to grip in a fixed position. While this made it less likely that I might lose contact with and drop it (particularly when individual fingers were raised), it was not ideal for the individual, ‘digital’ use of the fingers.

Its accelerational data was problematic, since the accelerometers’ signal strength decreased nonlinearly as they rotated off-axis from gravity. Theoretically, with enough filtering/processing, beats can be extracted from that information, but I had trouble recognizing them reliably enough to use them for music. This was disappointing, since accelerometers seemed very promising at the outset of the project.

I initially thought that the Digital Baton’s musical software system should capture and map gestures into sound in the way that an orchestra might interpret the movements of a conductor; this turned out to be incredibly difficult to implement. It was particularly difficult to imagine how to map the positional information to anything useful other than fixed two-dimensional grids. I realized then that I did not have any insight into how conducting gestures actually communicated information.

My simple models did not allow me to extract symbolic or significant events from continuous signals. The event models I had for the baton were too simple to be useful; what the task required were higher-order, nonlinear models.

When the audience perceives a significant, expressive event in the performer’s gestures, they expect to hear an appropriate response. If it doesn’t occur, it confuses them. This causes a disembodiment problem.23 In performances with the baton, it often wasn’t obvious to audiences how the baton was controlling the sound.

The Digital Baton also suffered from the over-constrained gesture problem; brittle recognition algorithms sometimes forced performers to make exaggerated gestures in order to achieve a desired musical effect.
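The accelerometer problem listed above can be illustrated with a small sketch. It assumes an idealized single-axis accelerometer whose reading is the hand's acceleration along the sensing axis plus the projection of gravity onto that axis, which falls off with the cosine of the tilt angle. The function names, the 8 m/s^2 gesture, and the fixed threshold are hypothetical choices for illustration; they are not taken from the Digital Baton's actual software.

```python
import math

G = 9.81  # standard gravity, m/s^2

def axis_reading(motion_accel, tilt_deg):
    """Reading of an idealized single-axis accelerometer (assumed model).

    motion_accel: acceleration of the hand along the sensing axis, m/s^2
    tilt_deg: angle between the sensing axis and vertical, in degrees
    """
    # Gravity contributes G * cos(tilt): a nonlinear function of orientation.
    return motion_accel + G * math.cos(math.radians(tilt_deg))

def naive_beat(reading, threshold=15.0):
    """Hypothetical fixed-threshold beat detector."""
    return reading > threshold

# The same 8 m/s^2 gesture, performed at four different baton orientations.
for tilt in (0, 30, 60, 90):
    reading = axis_reading(8.0, tilt)
    print(f"tilt={tilt:2d} deg  reading={reading:6.2f}  beat={naive_beat(reading)}")
```

Under these assumptions an identical gesture crosses the threshold when the axis is near vertical but not when the baton is tilted, which is one way to see why beats were hard to recognize reliably without tracking orientation.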

The majority of the problems I encountered with the Digital Baton had to do with a lack of expressiveness in the mappings. At the time I lacked insight and experience in mapping complex real-time information to complex parametric structures. My first response to these problems was to attempt to formulate a general theory of mappings, which resulted in a scheme for categorizing gestures along successive layers of complexity. This allowed for creating sophisticated, high-level action-descriptions from a sequence of minute atoms and primitives, in much the same way that languages are constructed out of phonemes. At the time I also thought that defining a vocabulary of gestures, carefully constructed out of primitives that conformed easily to the information stream coming from the sensors, would be a first step. Ultimately, however, I realized that theorizing about mappings would not help me solve the fundamental problems of the Digital Baton. Instead, I decided to address the issues through an in-depth, quantitative, signal-based approach. The resultant project, which is detailed in this dissertation, was motivated and designed precisely with the previous problems in mind. The Digital Baton may have disappointed me as an instrument, but that failure generated a better concept with more scope for exploration and answers.

1.3.2 Why continue with conducting as a model?

"Too much media art is offered up as performance these days without awareness of the fact that it remains ungrounded in any performance practice." Despite the frustrations that I encountered with the Digital Baton, I still felt that the powerful gestural language of conducting was an area that might yield interesting results for sensor-based interfaces. Conducting is a gestural art form, a craft for skilled practitioners. It resembles dance in many ways, except it is generative, and not reflective of, the music that accompanies it. Also, without an instrument to define and constrain the gestures, conductors are free to express themselves exactly as they wish to, and so there is enormous variety in the gestural styles of different individuals.

In addition, conducting is a mature form that has developed over 250 years and has an established, documented technique. The gesture language of conducting is understood and practiced by many musicians, and is commonly used as a basis for evaluating the skill and artistry of conductors. In order to be able to understand the meaning and significance of gestures, it helps to have a shared foundation of understanding. The technique of conducting conveniently provides such a foundation in its widely understood, pre-existing symbol system.

One reason to use older techniques is that they allow us to have performances by expert, talented musicians instead of inventors; inevitably, the result is stronger. Secondly, there are many subtle things that trained musicians do with their gestures that could be neatly leveraged by sensor systems. As Tod Machover wrote,

"one must consider if it is easier for the person to use the technique that they know, or perhaps examine another way to control the musical gesture…the smart thing to do is keep with the technique that can evolve slowly, no matter how far away the mapping goes." I agree with Professor Machover that with the established technique as a model, one can slowly develop and extend it with sensor-based systems. For example, some future, hybrid form of conducting might keep the basic vocabulary of conducting gestures, while sensing only the degree of verticality in the conductor’s posture. Such a system might use his posture to detect his interest and emotional connection to the musicians, and use the information to guide a graphical response that might be projected above the orchestra.

1.3.3 Conducting Technique

While styles can vary greatly across individuals, conductors do share an established technique. That is, any skilled conductor is capable of conducting any ensemble; the set of rules and expectations is roughly consistent across all classical music ensembles. Conducting technique involves gestures of the whole body: posture in the torso, rotations and hunching of the shoulders, large arm gestures, delicate hand and finger movements, and facial expressions. Conductors’ movements sometimes have the fluidity and naturalness of master Stanislavskian actors, combined with musical precision and score study. It is a gestalt profession; it involves all of the faculties simultaneously, and cannot be done halfheartedly. Leonard Bernstein once answered the question, "How does one conduct?" with the following:

"Through his arms, face, eyes, fingers, and whatever vibrations may flow from him. If he uses a baton, the baton itself must be a living thing, charged with a kind of electricity, which makes it an instrument of meaning in its tiniest movement. If he does not use a baton, his hands must do the job with equal clarity. But baton or no baton, his gestures must be first and always meaningful in terms of the music." The skill level of a conductor is also easily discernable by musicians; they evaluate individuals based on their technical ability to convey information. The conducting pedagogue, Elizabeth Greene, wrote that skillful conductors have a certain ‘clarity of technique,’ and described it in this way: "While no two mature conductors conduct exactly alike, there exists a basic clarity of technique that is instantly -- and universally -- recognized. When this clarity shows in the conductor’s gestures, it signifies that he or she has acquired a secure understanding of the principles upon which it is founded and reasons for its existence, and that this thorough knowledge has been accompanied by careful, regular, and dedicated practice." The presence of a shared set of rules and expectations, most of which are not cognitively understood or consciously analyzed by their practitioners, is a rich, largely untapped resource for the study of emotional and musical communication.

Another reason to stay with the model of conducting is that conductors themselves are inherently interesting as subjects. They represent a small minority of the musical population, and yet stand out for the following reasons:

  1. they are considered to be among the most skillful, expert, and expressive of all musicians
  2. they have to amplify their gestures in order to be easily seen by many people
  3. they have free motion of their upper body. The baton functions merely as an interface and extension of the arm, providing an extra, elongated limb and an extra joint with which to provide expressive effects
  4. their actions influence and facilitate the higher-level functions of music, such as tempo, dynamics, phrasing, and articulation. Their efforts are not expended in the playing of notes, but in the shaping of them.
  5. conductors are trained to imagine sounds and convey them ahead of time in gestures.
  6. conductors have to manipulate reality; they purposefully (if not self-consciously) modulate the apparent viscosity of the air around them in order to communicate expressive effects. Two gestures might have the same trajectory and same velocity, but different apparent frictions, which give extremely different impressions.

Conducting itself is also interesting as a method for broadcasting and communicating information in real-time. It is an optimized language of signals, and in that sense is almost unique. Its closest analogues are sign and semaphore languages, and mime. John Eliot Gardiner, the well-known British conductor, describes it in electrical terms:

"the word ‘conductor’ is very significant because the idea of a current being actually passed from one sphere to another, from one element to another is very important and very much part of the conductor’s skill and craft."

Finally, conducting as a human behavior has almost never been studied quantitatively, and so I wanted to use empirical methods to understand it and push it in new directions.

1.3.4 Why conducting might not be a good model for interactive music systems

Conducting is often associated with an old-fashioned, paternalistic model of an absolute dictator who has power over a large group of people. By the beginning of the eighteenth century when orchestras evolved into more standard forms, this hierarchical model was generally accepted in Western culture. But this model has come under increasing scrutiny and disfavor with the emergence and empowerment of the individual in modern societies. The notion that conductors have a right to be elitist, arrogant, and dictatorial no longer holds true in today’s democratic world-view.

In fact, it seems that even some of the choices that have been made in the development of protocols and standards for electronic music have been informed by anti-conductor sentiments. For example, the chairman of the group that developed the General MIDI standard had this to say about what MIDI could offer to replace the things that were lacking in classical music:

"The old molds to be smashed tell us that music sits in a museum behind a locked case. You are not allowed to touch it. Only the appointed curator of the museum -- the conductor -- can show it to you. Interactively stretching the boundaries of music interpretation is forbidden. Nonsense! The GM standard lets you make changes to what you hear as if you were the conductor or bandleader, or work with you to more easily scratch-pad any musical thought." Secondly, many interactive music systems use the solo instrument paradigm; they are designed to be performed by one player, in much the same way that a traditional instrumentalist might perform on her instrument. However, the model of conducting assumes that the performer is communicating with other people; the gesture language has evolved in order to be optimally visible and discernable by a large ensemble. As the conductor Adrian Boult suggested, you only need the extra appendage of the baton if the extra leverage buys you something by allowing you to communicate more efficiently with others. Therefore it seems unnecessary to make large, exaggerated gestures or use a baton when much less effort could be used to get the computer to recognize the signal.

Thirdly, many conductors spend most of their time working to keep the musicians together and in time, which is basically a mechanical, not an expressive, job. In that sense their primary function is that of a musical traffic cop. Finally, traditional conductors don’t themselves make any sound, so the image of a conductor directly creating music seems incongruous. It causes confusion in the minds of people who expect the gestures to be silent. As a result, it is probably not ideal to redefine the conducting baton as a solo instrument, since the result will cause cognitive dissonance or disconnect in the audience. An alternative to this would be to use a sensory baton like a traditional baton but extend its vocabulary. That is, a conducting model should be used when an ensemble is present that needs a conductor – the conductor will continue to perform the traditional conducting functions, without overhauling the technique. But she would also simultaneously perform an augmented role by, for example, sending signals to add extra sampled sounds or cue lighting changes in time to the music.

1.3.5 Interpretive variation as the key to emotion in music

"Notes, timbre, melody, rhythm, and other musical constructs cannot function simply as ends in themselves. Embedded in these objects is a more complex, indirect, powerful signal that we must train ourselves to detect, and that will one day be the subject of an expanded notion of music theory." From the performer’s perspective, the thing that makes live performances most powerfully expressive, aside from accuracy and musicianship, is the set of real-time choices they make to create a trajectory through the range of interpretive variation in the music. Techniques for creating this variation involve subtle control over aspects such as timing, volume, timbre, accents, and articulation, which are often implemented on many levels simultaneously. Musicians intentionally apply these techniques in the form of time-varying modulations on the structures in the music in order to express feelings and dramatic ideas. Some of these are pre-rehearsed, but some of them also change based on the performer’s feelings and whims during the moment. Techniques for creating these trajectories of variation involve subtle control over aspects such as timing, volume, timbre, accents, and articulation -- sometimes implemented on many levels simultaneously. Musicians intentionally apply these techniques in the form of time-varying modulations on the structures in the music in order to express feelings and dramatic ideas -- some of which are pre-rehearsed and some of which change based on their own moods and whims.

This idea, while supported in the recent literature of computational musicology and musical research, is perhaps controversial. For one thing, some might argue that there is no inherent meaning in this variation, since musicians are not able to verbally articulate what it is that they do. That is, since people intuitively and un-analytically perform these variations, then they cannot be quantified or codified. However, it has been shown that there are rules and expectations for musical functions like tempo and dynamics, and recent research has uncovered underlying structure behind these variations. I describe the work of several scientists and musicologists on this subject in Chapter 2.

Secondly, it might be countered that the dynamic range of such variation is relatively small, compared with the scale of the piece. For example, a very widely interpreted symphonic movement by Mahler might only vary between 8 and 9 minutes in length. The maximum variability in timing would reflect a ratio of 9:8 or 8:7 36. However, this is perhaps an inappropriate level at which to be scrutinizing the issue of timing variation – instead of generalizing across the macrostructure of an entire movement, one should look for the more significant events on the local, microstructural level. For example, rubato might be taken at a particular point in a phrase in order to emphasize those notes, but then the subsequent notes might accelerando to catch up to the original tempo. Thus, on the macrostructural level, the timing between a highly rubato phrase and a strict-tempo phrase might look the same, but on the microstructural level they differ tremendously. Robert Rowe gave an example of this by suggesting the comparison between two performances of a Bach cello suite -- one with expression, and one absolutely quantized: "They could be of exactly equal length, but the difference comes with the shaping of phrases and other structural points. The issue is not 8 minutes or 9 minutes, but 1 second or 2 seconds at the end of a phrase."37
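The distinction between macrostructural and microstructural timing can be made concrete with a small numerical sketch. The inter-onset intervals below are invented for illustration and do not come from any measured performance: one phrase is played in strict tempo, the other lingers on two notes and then hurries to catch up. The two phrases have the same total length, yet individual notes differ by far more than the 9:8 ratio discussed above.

```python
import math

# Hypothetical inter-onset intervals: seconds between successive note onsets.
strict = [0.5] * 8                                   # steady tempo
rubato = [0.5, 0.5, 0.7, 0.8, 0.4, 0.3, 0.4, 0.4]    # linger, then catch up

def total_duration(iois):
    """Macrostructural view: overall length of the phrase in seconds."""
    return sum(iois)

def largest_local_ratio(iois, reference):
    """Microstructural view: largest note-by-note timing ratio (always >= 1)."""
    return max(max(a / b, b / a) for a, b in zip(iois, reference))

macro_ratio = total_duration(rubato) / total_duration(strict)
micro_ratio = largest_local_ratio(rubato, strict)

print(f"whole-movement example (9 vs 8 minutes): {9 / 8:.3f}")
print(f"phrase-level ratio, rubato vs strict:    {macro_ratio:.3f}")
print(f"largest single-note ratio:               {micro_ratio:.3f}")
```

A measure that sums over the whole phrase cannot tell the two performances apart, while a note-by-note comparison separates them clearly. This is only a toy comparison under the stated assumptions, not a model of expressive timing.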

1.3.6 The Significance of Music for Us

"music is significant for us as human beings principally because it embodies movement of a specifically human type that goes to the roots of our being and takes shape in the inner gestures which embody our deepest and most intimate responses. This is of itself not yet art; it is not yet even language. But it is the material of which musical art is made, and to which musical art gives significance."38 Having described the significance of interpretive variation in musical structure, I have to also acknowledge that, for myself, the significance of a great performance does not strictly lie in the microstructural variation alone. Instead, I think that great performers are marked by their abilities as storytellers and dramatists. Great musicians have the ability to capture an audience’s attention and lead them spellbound through the material.39 Of course, this is not something that could be easily proven or discussed empirically. It might be that the dramatic aspect of great performances could be modeled in terms of the microstructural variation, but it’s far from clear that we could determine this. Another possibility is that great performers hear the ratios between contrasting sections and feel pulse differences more sensitively than others, or that the proportions of the expressive relationships work out in fractal patterns. However, it would be very difficult to measure this. Therefore, for practical purposes, I chose not to study it. It’s possible that we may one day be able to explain why one musician is masterful, and why another is merely earnest, but that is beyond the scope of the present project. "Music is that art form that takes a certain technique, requires a certain logical approach, but at the same time, needs subconscious magic to be successful. 
In our art form, there is a balance between logic and intuition."40 Aside from the issue of quantifying the microstructural variations and determining the ‘rules’ of musicality, there is another dimension to music that must be acknowledged: the magical, deeply felt, emotional (some might call it spiritual) aspect that touches the core of our humanity. Many dedicated musicians believe that this aspect is not quantifiable. I tend to agree. I also think that it is the basic reason why we as a species have musical behaviors. And I think that our current technologies are not yet, for the most part, able to convey this aspect.41 This is one of their most damning flaws. However, I also think that if pieces of wood and metal can be carefully designed and constructed so as to be good conveyors of this magic, then there is no reason that we can’t do the same with silicon and electrons. It just might take more time to figure out how.
 
