The LAFCam project addresses the scenario of shooting home videos. When you have a large amount of video, can the cameraman's laughter indicate points of interest? To explore this possibility we built a system that recognizes the cameraman's laughter and uses it to index the video in an editing application.
We collected data by having the cameraman wear a speech-recognition microphone and a Sony digital voice recorder. The microphone is a noise-canceling headset boom microphone, chosen to maximize the signal-to-noise ratio. The audio was transferred to a PC and segmented into 1-4 second clips of laughter and speech.
With this data we trained two Hidden Markov Models (HMMs), one for laughter and one for all other speech. The observation at each state of a model is the vector of 64 spectral coefficients taken from the spectrogram of the audio signal at that point in time. The laughter model has three states, with the output of each state modeled as a mixture of two Gaussian distributions. The speech model has five states and also uses a two-Gaussian mixture per state.
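To make the model structure concrete, the sketch below shows how such models could be trained. It is not the code we used (our training was done in Matlab); it is an illustrative Python version built on SciPy and the hmmlearn library, and the file names and parameters (e.g. the spectrogram frame size) are assumptions.

```python
# Sketch only: same model structure as described above (64 spectral coefficients
# per frame, a 3-state laughter model and a 5-state speech model, each state's
# output a 2-component Gaussian mixture), implemented here with hmmlearn rather
# than the project's original Matlab code. File lists are hypothetical.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
from hmmlearn.hmm import GMMHMM

def spectral_features(path):
    """Return a (num_frames, 64) log-spectrogram for one audio segment."""
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)          # mix stereo down to mono
    # nperseg=126 gives 126 // 2 + 1 = 64 frequency bins per frame
    _, _, spec = spectrogram(samples.astype(float), fs=rate, nperseg=126)
    return np.log(spec + 1e-10).T               # frames as rows

def train_model(paths, n_states):
    """Fit one GMM-HMM (two Gaussians per state) on a list of labeled clips."""
    feats = [spectral_features(p) for p in paths]
    X = np.concatenate(feats)
    lengths = [len(f) for f in feats]           # frame count of each clip
    model = GMMHMM(n_components=n_states, n_mix=2,
                   covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

# Hypothetical training file lists
laugh_model  = train_model(["laugh_001.wav", "laugh_002.wav"], n_states=3)
speech_model = train_model(["speech_001.wav", "speech_002.wav"], n_states=5)
```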
We performed two analyses of these HMMs. In the first, we took a corpus of 40 laughter examples and 210 speech examples, trained the models on 70% of the data, and tested on the remaining 30%, achieving 88% accuracy. The limitation of this analysis is that it deals with "ideal" data: the test segments were cut by hand and guaranteed to fall into one of the two classes.
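The evaluation step amounts to scoring each held-out clip under both models and picking the more likely one. A minimal sketch, continuing from the models above (the test file list here is hypothetical):

```python
# Each held-out segment is scored under both HMMs and labeled with the
# higher-likelihood model; accuracy is tallied over the 30% test split.
def classify(segment_path):
    feats = spectral_features(segment_path)
    if laugh_model.score(feats) > speech_model.score(feats):
        return "laughter"
    return "speech"

test_set = [("laugh_030.wav", "laughter"), ("speech_150.wav", "speech")]
correct = sum(classify(path) == label for path, label in test_set)
print("accuracy: %.0f%%" % (100.0 * correct / len(test_set)))
```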
For use in video editing, we need automatic segmentation and classification of a continuous audio file. To do this we step through the file in Matlab, testing consecutive two-second windows and sliding the window forward by half a second each trial. The window size was chosen from the average length of the training examples; the half-second step was chosen arbitrarily and turned out to work reasonably well. This method of segmentation and classification did not perform as well as the first analysis, producing a 35% false-positive rate. The segments incorrectly labeled as laughter were mostly sounds that could be considered outside the model vocabulary: there were very few cases of speech being labeled as laughter, but coughs, loud background noises such as cars and trains, and other unknown sounds were misclassified. This problem could be addressed by filtering the audio before processing, training additional models for the unknown sounds, or defining a criterion for labeling a segment as "out of vocabulary".
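The sliding-window pass itself is straightforward; the following sketch shows one way to express it, again in Python rather than our Matlab code, reusing the models and feature helper from the earlier sketches (the input file name is hypothetical):

```python
# Two-second windows advanced by half a second over a continuous recording,
# each window scored under both HMMs and labeled with the more likely class.
def label_windows(path, win_s=2.0, hop_s=0.5):
    rate, samples = wavfile.read(path)
    if samples.ndim > 1:
        samples = samples.mean(axis=1)
    win, hop = int(win_s * rate), int(hop_s * rate)
    labels = []
    for start in range(0, len(samples) - win + 1, hop):
        chunk = samples[start:start + win].astype(float)
        _, _, spec = spectrogram(chunk, fs=rate, nperseg=126)
        feats = np.log(spec + 1e-10).T
        is_laugh = laugh_model.score(feats) > speech_model.score(feats)
        labels.append((start / rate, "laughter" if is_laugh else "speech"))
    return labels        # list of (window start time in seconds, label)

timeline = label_windows("home_video_audio.wav")
```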
The video editing application was written in Isis. It displays the whole video and lets the user move through it with a scrollbar. The laughter-detection results are displayed directly under the video sequence, visually showing the user where the points of interest may be.
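The idea of that display can be approximated outside Isis as well; as a rough illustration (not the Isis application itself), the detection track from the sliding-window sketch could be drawn as a strip under a timeline:

```python
# Rough matplotlib sketch of a laughter-detection strip the user can scan
# for points of interest; "timeline" comes from the sliding-window sketch above.
import matplotlib.pyplot as plt

times  = [t for t, label in timeline]
laughs = [1 if label == "laughter" else 0 for t, label in timeline]

plt.figure(figsize=(10, 1.5))
plt.step(times, laughs, where="post")        # 1 = window classified as laughter
plt.yticks([0, 1], ["speech", "laughter"])
plt.xlabel("time (s)")
plt.tight_layout()
plt.show()
```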
Future work will incorporate other forms of affective feedback. In addition to the speech data, we collected the cameraman's skin conductivity and video of the cameraman's face for facial-expression analysis. These two channels will be combined with the laughter detection to further analyze the cameraman's interest and arousal, working toward an accurate automated editing system.
Andrea Lockerd & Florian 'Floyd' Mueller