In an effort to allow more flexibility in the system, we are now using our own speech recognition engine. Using a recurrent neural network, we detect phonemes (drawn from a 40-character alphabet) in real time. This gives us a more flexible representation of the words being generated, so we can do away with text words that ViaVoice recognizes only poorly. Using phonemes, words like 'music' and 'musical', which share the same phonetics (short of the 'al' suffix), will be classified as similar. Furthermore, proper nouns and foreign words can also be detected.
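As a minimal sketch of this idea, two words can be compared directly on their phoneme sequences rather than their spellings. The transcriptions and the similarity measure below are illustrative assumptions (ARPAbet-style symbols and a standard sequence-matching ratio), not the actual alphabet or metric used by our recognizer:

```python
from difflib import SequenceMatcher

# Hypothetical phoneme transcriptions; the recognizer's actual
# 40-symbol alphabet may use different symbols.
MUSIC = ["M", "Y", "UW", "Z", "IH", "K"]
MUSICAL = ["M", "Y", "UW", "Z", "IH", "K", "AH", "L"]

def phoneme_similarity(a, b):
    """Return a similarity score in [0, 1] between two phoneme sequences,
    based on the length of their matching subsequences."""
    return SequenceMatcher(None, a, b).ratio()
```

Under this measure, 'music' and 'musical' score highly because they differ only in the trailing 'AH L' suffix, whereas orthographically similar but phonetically distinct words would not.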
We are also considering other output and feedback modalities. For instance, we could change the ambient lighting of the room to subtly vary the mood of a situation. If the conversational context indicates that the users are frustrated (e.g. too many harsh words: 'angry', 'bad', etc.), we could switch to bluish or greenish lighting, which is known to create a more relaxed ambiance.
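A keyword-counting trigger of this kind could be sketched as follows. The word list, the threshold, and the lighting labels are all placeholder assumptions for illustration; the real system would draw on the full conversational context:

```python
# Illustrative list of harsh words; a deployed system would use a
# larger lexicon and account for context.
HARSH_WORDS = {"angry", "bad", "stupid", "hate"}

def choose_lighting(transcript_words, threshold=3):
    """Pick an ambient lighting mode from recent transcript words.
    Switches to a calming color once harsh-word count reaches the threshold."""
    harsh_count = sum(1 for w in transcript_words if w.lower() in HARSH_WORDS)
    return "calming-blue" if harsh_count >= threshold else "neutral"
```

The returned label would then drive whatever lighting control interface the room provides.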
Ultimately, we would like such a system to be constructed from real-world training data in which a real human mediator is present and responds to the other participants. From the training sessions, the machine forms a model of the mediator's responses to key words the participants say. The augmentations (i.e. the mediator's responses) are automatically associated with their trigger stimuli and could later be synthesized by the machine (e.g. via audio playback).
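The trigger-response association could be sketched, at its simplest, as a lookup from key words to recorded mediator clips. The class and clip identifiers below are hypothetical; a real model would need to generalize beyond exact keyword matches:

```python
class MediatorModel:
    """Toy sketch: associate trigger words seen in training sessions
    with the mediator's recorded responses (clip identifiers)."""

    def __init__(self):
        self.responses = {}  # trigger word -> recorded response clip id

    def observe(self, trigger_word, response_clip):
        """Record that the mediator responded with this clip to this word."""
        self.responses[trigger_word] = response_clip

    def respond(self, utterance_words):
        """Return the response clip for the first known trigger word,
        or None if no trigger is present (no augmentation needed)."""
        for word in utterance_words:
            if word in self.responses:
                return self.responses[word]
        return None
```

At playback time, a returned clip identifier would be handed to the audio subsystem to synthesize the mediator's intervention.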