3.5 Formatting, Timing, Graphing and Filtering the Data

Before the data could be analyzed, an enormous amount of processing was required. For the largest files (the largest being 371 Megabytes, representing one continuous, high-data-rate scan lasting 45 minutes), the first requirement was to split the files into practical chunks. For this I wrote a utility in C to split each file into a series of smaller files of roughly 1 Megabyte each, with all the channels properly aligned. Once the data files were more manageable, the next task was to graph and filter them for the visualization analysis.
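The original C splitter is not reproduced here, but a minimal sketch of the approach is shown below, assuming each line of the ASCII file holds one complete multi-channel sample record (the function name, chunk-naming scheme, and size threshold are illustrative). Breaking only at line boundaries is what keeps the channels aligned across chunks:

```c
#include <stdio.h>
#include <string.h>

/* Split a multi-channel ASCII data file into chunks of at most
 * max_bytes, always breaking at line (i.e. sample-record) boundaries
 * so that the channels in every chunk stay aligned.
 * Returns the number of chunk files written, or -1 on error. */
int split_file(const char *path, long max_bytes)
{
    FILE *in = fopen(path, "r");
    if (!in) return -1;

    char line[4096], outname[256];
    long written = 0;
    int chunks = 0;
    FILE *out = NULL;

    while (fgets(line, sizeof line, in)) {
        /* start a new chunk on the first line, or once the limit is hit */
        if (!out || written >= max_bytes) {
            if (out) fclose(out);
            snprintf(outname, sizeof outname, "%s.%03d", path, chunks++);
            out = fopen(outname, "w");
            if (!out) { fclose(in); return -1; }
            written = 0;
        }
        fputs(line, out);
        written += (long)strlen(line);
    }
    if (out) fclose(out);
    fclose(in);
    return chunks;
}
```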

3.5.1 Non-real-time filters in Labview

One of the first things I wanted to do was to graphically format each data file so that I could set it underneath its corresponding video segment and watch both the data and the video at the same time. To do this, I wrote a routine in Labview to take a previously-acquired data file, graph it in color, and have it scroll past the field of view as if it were being acquired in real time. I got it to work by reading the ASCII-based data file into a while loop, sending it to a shift register, and adding timers into the loop to emulate the sampling rate. This worked reasonably well but took a very long time to precompute. Files larger than 1 Megabyte did not work well with this method and would frequently crash the machine. Since files under 1 Megabyte corresponded to relatively short video segments (less than one minute), I eventually abandoned this approach in favor of printing the data in long paper rolls.
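Although the original routine was a Labview diagram, the shift-register idea can be sketched in C as a fixed-length scrolling window that discards the oldest sample as each new one arrives; a timer delay between pushes (omitted here) would emulate the sampling rate. This is an illustration of the concept, not the Labview implementation:

```c
#include <string.h>

/* Emulate a Labview shift register: a fixed-length scrolling window.
 * Each new sample shifts the window left by one and enters at the end,
 * discarding the oldest value. */
void push_sample(double *window, int len, double sample)
{
    memmove(window, window + 1, (len - 1) * sizeof *window);
    window[len - 1] = sample;
}
```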

I also wrote a utility in Labview to do frequency-domain analyses such as FFTs and spectrograms; these worked as intended, but revealed little interesting structure in the frequency domain of the data I had collected.

3.5.2 Non-real-time filters in Matlab

In order to solve the glitching problem in one subject’s concert data mentioned in section 3.3.3, I wrote a filter in Matlab to reassign the misplaced samples. This solved most of the problem, although a few outliers remained (particularly in channels EMG3 and Heart Rate). I wrote a second filter that removed these outliers and replaced each with the previous value in that channel. The noise had occurred in channels 3-7, corresponding to his EMG3, Respiration, Temp, GSR, and Heart Rate signals. The basic problem was that noise appeared at pseudoregular intervals in each channel, and that the noise in channel x+1 was actually data from channel x. It initially appeared as if this wrapped around at the ends (that is, that the data in channel 7 should have been in channel 3), but this turned out not to be the case. My filter first looked for large outliers (generally any values greater than 2 standard deviations from the mean) in all the channels and compared their mean with the mean of the following channel to confirm that they matched. It then moved the samples to their corresponding indices in the correct channel. Finally, if a few outliers remained without matches in the next channel, it replaced them with the previous sample in the same channel. Here is a segment of the Matlab code that accomplished this:


x = mean(EMG3);
y = std(EMG3);
EMG3index = find(EMG3 > (x + 2*y));
EMG3value = EMG3(EMG3index,:);
EMG3(EMG3index,:) = EKG(EMG3index,:);
EMG3index2 = find(EMG3 < (x - 2*y));
EMG3value2 = EMG3(EMG3index2,:);
EMG3(EMG3index2,:) = EKG(EMG3index2,:);
%% for any remaining outliers, interpolate:
EMG3outlierindex = find(EMG3 > (x + 3*y));
EMG3outliervalue = EMG3(EMG3outlierindex,:);
EMG3(EMG3outlierindex,:) = EMG3((EMG3outlierindex - 1),:);
%% EMG3index returns a list of all the indices of the samples at which
%% there is this characteristic noise phenomenon
%% the scalar that multiplies with the y term could be anything greater
%% than 2 or less than 6, but I figured that 2 would only work for the
%% case where the signal is not very active -- so I set it to the highest
%% possible value.
a = mean(Respiration);
b = std(Respiration);
Respirationindex = find(Respiration < (a - 3*b));
Respirationvalue = Respiration(Respirationindex,:);
Respiration(EMG3index,:) = EMG3value;
Respirationoutlierindex = find(Respiration < (a - 3*b));
Respirationoutliervalue = Respiration(Respirationoutlierindex,:);
%% Respiration(Respirationoutlierindex,:)
%%     = Respiration(max((find(Respiration >= (a - 3*b))) < Respirationoutlierindex));
n = 1;
if Respiration(Respirationoutlierindex - n) < (a - 3*b)
    n = n + 1;
else
    Respiration(Respirationoutlierindex - n);
end
%% b, 2*b, 3*b, 4*b yield the same result
d = mean(GSR);
e = std(GSR);
GSRindex = find(GSR > (d + 2*e));
GSRvalue = GSR(GSRindex,:);
GSR(Respirationindex,:) = Respirationvalue;
GSRoutlierindex = find(GSR > (d + 2*e));
GSRoutliervalue = GSR(GSRoutlierindex,:);
GSR(GSRoutlierindex,:) = GSR((GSRoutlierindex - 1),:);
%% for noise that lies below the baseline
GSRoutlierindex2 = find(GSR < (d - 2*e));
GSRoutliervalue2 = GSR(GSRoutlierindex2,:);
GSR(GSRoutlierindex2,:) = GSR((GSRoutlierindex2 - 1),:);
%% 4*e yields the same result
g = mean(Temperature);
h = std(Temperature);
Temperatureindex = find(Temperature < (g - 2*h));
Temperaturevalue = Temperature(Temperatureindex,:);
Temperature(GSRindex,:) = GSRvalue;
Temperatureoutlierindex = find(Temperature < (g - 2*h));
Temperatureoutliervalue = Temperature(Temperatureoutlierindex,:);
Temperature(Temperatureoutlierindex,:) = Temperature((Temperatureoutlierindex - 1),:);
%% for the noise that lies above the baseline:
Temperatureoutlierindex2 = find(Temperature > (g + 2*h));
Temperatureoutliervalue2 = Temperature(Temperatureoutlierindex2,:);
Temperature(Temperatureoutlierindex2,:) = Temperature((Temperatureoutlierindex2 - 1),:);
j = mean(EKG);
k = std(EKG);
EKGindex = find(EKG < (j - 2*k));
EKGvalue = EKG(EKGindex,:);
EKG(Temperatureindex,:) = Temperaturevalue;
%% for heart rate
EKGoutlierindex = find(EKG < (j - 2*k));
EKGoutliervalue = EKG(EKGoutlierindex,:);
EKG(EKGoutlierindex,:) = EKG((EKGoutlierindex - 1),:);

Running this filter twice on each data file, I successfully removed noise from the Respiration, GSR, and Temperature signals; by carefully determining the mean of the EMG3 signal, I was able to at least remove its noise visually. However, I was not able to correctly remove noise from the Heart Rate signal.

3.5.3 General Issues with this Data Set

In addition to processing, graphing, and noise removal, several issues particular to this data set needed to be addressed. For example, sampling rates were extremely important, since the time-scale resolution for conducting is extremely small. Events had to be timed precisely to the millisecond in order to line them up with the video and extract meaning from them; glitches and inconsistencies in the data could quickly render it useless. There were some unresolved inconsistencies in the timing mechanisms that made it difficult to establish a sampling rate. For example, the final subject’s data was sampled at 4 kHz according to the Labview software; from inspecting the data, however, the effective rate seemed to be closer to 3 kHz. In many cases the data lined up much better with one or two extra seconds per page, which I attribute to some delay in the use of Labview in ‘continuous run’ mode. Since continuous run causes a buffer to be continuously filled and then flushed, I assume that the time taken to flush and then refill the buffer accounts for this timing inconsistency. For his data, nominally sampled at 4 kHz, every 44,000 samples should have lined up with 11 seconds of video; instead they lined up with approximately 12-13 seconds. For the first 12’16" the data lined up well with 12 seconds per 44,000 samples; for the remainder it lined up better with 13 seconds per 44,000 samples.
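To make the arithmetic explicit: 44,000 samples per 12 seconds implies an effective rate of about 3,667 Hz rather than the nominal 4 kHz. A hypothetical helper for mapping a sample index to video time under the piecewise alignment described above might look like the sketch below; the 12 s and 13 s per 44,000-sample block and the 12'16" (736 s) split point come from the text, while the function itself is my illustration:

```c
/* Map a sample index to elapsed video time in seconds, using the
 * empirically determined piecewise alignment for this subject:
 * 12 s per 44,000 samples up to 12'16" (736 s), 13 s per 44,000 after. */
double sample_to_seconds(long sample)
{
    const double block = 44000.0;
    const double split = 736.0;            /* 12'16" in seconds */

    double t = (sample / block) * 12.0;    /* try 12 s per block first */
    if (t <= split)
        return t;

    /* samples beyond the split point run at 13 s per block */
    double split_sample = (split / 12.0) * block;
    return split + ((sample - split_sample) / block) * 13.0;
}
```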

Differing environmental conditions also contributed to differences between subjects’ baselines and noise levels. For example, the noise level in one student subject’s data was much higher than it had been during the previous week, in the same room, for the two other students. Similarly, during one professional subject’s rehearsal, the respiration band was extremely tight and therefore amplified tiny fluctuations in movement. The tightness of the respiration band seems to influence its overall range; the band was extremely tight on him during the rehearsal, but less tight at the concert. A tight band means a high baseline, which also influences the resultant signal: if the band starts tight, it might not expand to its fullest extent.

Another issue had to do with printing the data sets; depending on how the data had been sampled, the print density made a difference in how much structure could be obtained visually. One professional subject’s data, initially sampled at 300 Hz, was unintelligible when I first printed it at a density of 5,000 samples per page; reprinted at 500 samples per page, its structure suddenly became much clearer. The opposite was true for another subject’s data, which was not intelligible when printed at 14,000 samples per page; reprinted at 44,000 samples per page, the features became instantly recognizable. Also, scaling the data too small relative to the height of the signal reduces the visibility of the features. This may sound obvious, but it is tricky to do correctly without having to rescale each successive page of data to the local value of the sensor stream. Finally, it was crucial to maintain consistent vertical scaling across all printouts within one file; a change in scale deeply affects perceptual cues about the prominence of features.
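One way to keep vertical scaling consistent across printouts, rather than rescaling each page to its local extremes, is to compute the signal's range once over the whole file and reuse it for every page. A minimal sketch of that idea (function and parameter names are illustrative, not from the original printing pipeline):

```c
/* Scale a sample into page coordinates using one global range for the
 * whole file, so the vertical scale stays identical on every printed
 * page and feature prominence remains comparable across pages. */
double scale_to_page(double sample, double global_min, double global_max,
                     double page_height)
{
    double range = global_max - global_min;
    if (range == 0.0)
        return 0.0;                 /* flat signal: nothing to scale */
    return (sample - global_min) / range * page_height;
}
```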

There were also cases where noise conditions yielded interesting results. For example, when a sensor is not attached, or is attached but not picking up a signal, the noise is approximately Gaussian; when the sensor is generating a signal, or actively generating noise, the character of the output changes radically. This could be used to detect whether or not a sensor is sitting correctly on the skin. Also, in the EMG signals the signal-to-noise ratio is very high; if there is no noticeable motion, the signal is usually almost completely flat. The signal-to-noise ratio differs for each sensor; some (particularly the GSR) seem especially sensitive to motion artifacts.
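The observation about Gaussian noise suggests a simple attachment check, sketched below: the excess kurtosis of a window of samples is near zero for Gaussian noise and deviates strongly for a flat baseline with bursts or a structured waveform. This is my illustration of the idea, not a procedure used in the study, and the threshold is an untuned assumption:

```c
#include <math.h>

/* Excess kurtosis of n samples: roughly 0 for Gaussian noise,
 * strongly positive for a flat signal with occasional bursts. */
double excess_kurtosis(const double *x, int n)
{
    double mean = 0, m2 = 0, m4 = 0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;
    for (int i = 0; i < n; i++) {
        double d = x[i] - mean;
        m2 += d * d;
        m4 += d * d * d * d;
    }
    m2 /= n;
    m4 /= n;
    return m4 / (m2 * m2) - 3.0;
}

/* Near-Gaussian noise suggests the sensor may not be on the skin.
 * The 0.5 threshold is illustrative only. */
int looks_detached(const double *x, int n)
{
    return fabs(excess_kurtosis(x, n)) < 0.5;
}
```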

Finally, there were several instances when noise conditions caused problems. For example, I found a great deal of apparent motion artifact in the respiration and GSR signals, mostly arising from large movements of the arms and shoulders. I also had difficulty collecting the heart rate signal; it was intermittent at best across most of the subjects, and in general unusable.