Question #9 - Disciplines
9) Which disciplines do we need to (re)consider? Can we avoid complete reunification with AI? If not, are we doomed?
Low-level vision has progressed to a state where it can reliably support some interesting mid- to high-level vision tasks. As a field, how do we begin to think again about reasoning in high-level vision? My sense is that AI and vision have drifted so far apart that reunification is unlikely. I suspect that the field of computer vision will start from scratch with a different set of tools based on our skills in probability, statistics, and physics. We may reinvent the wheel, but we are also somewhat free from the biases of failed AI approaches of the past.
There is already a wealth of visual representations that will be useful for vision-of-action: deformable templates, motion blobs, higher-order moments, flow, temporal textures, texture tracking, eigen-parameter representations, etc. Many of the immediate challenges now lie on the side of interpretation. What do we need? More sophisticated frameworks for dynamical inference and learning over the output of these representations. Better representations of time and different time-scales. Computationally efficient methods for discriminating action from inaction. And, of course, opportunities to field useful systems and thus encounter real problems.
Action perception is interesting for many reasons: at minimum it raises considerable methodological challenges, it certainly suggests new applications that make clear the utility of intelligent visual processing methods, and it potentially forces us to address issues in attention and the conscious processing of signals that are otherwise avoidable. Practical progress on any of these fronts would be a significant contribution.
At present, I am interested in pursuing two directions. The first aims at extracting very detailed representations that encompass a structured spatio-temporal model, one that allows analysis both in a scale space and in a granular representation space. The second direction is the building of Interactive Environments (see www.cc.gatech.edu/fce) that are context-aware and capable of various forms of information capture and integration. We are interested in using all forms of sensory information, and we are working in very specific domains such as the classroom, living room, kitchen, and car.
The most compelling applications are those which are closest to AI. An obvious example is a querying system for a video database. This contains most of the interesting foreseeable problems --- it is necessary to specify a language to describe the most general forms of action, and then match unknown test sequences to sentences in that language. This seems precisely the sort of problem, along with generalised object recognition and natural language understanding, which is waiting for a breakthrough in AI, and on which current methods seem unlikely to have much success.
The range of work which could be labelled "perception of action" is rather daunting. The furthest goals, of intelligent classification of action, seem somewhat outside the remit of computer vision, however --- along with many other domains, we are waiting for AI to deliver. There is clearly much useful short- to medium-term research to be done in the meantime which ties in with other current vision research such as tracking, object recognition, and robotics, so the perception of action should develop into a healthy field even without the holy grail of true machine intelligence.
I don't think mainstream AI people are interested in the topics I described here, but the topics are surely related to the science of intelligence. The key issues related to AI are attention, similarity, intentionality, and meaning. In any possible application of action perception, these issues cannot be avoided, as any application will commit to decision making (and sometimes action generation) based on action observation. Learning how to recognize actions (or, more appropriately, how to relate self actions to what is observed) is also extremely important in the context of autonomous agents.
The role of knowledge representation in vision. We believe that in order to build reliable perceivers, we need to reconsider the role of knowledge representation in vision. In particular, in order to evaluate perceivers, the three components described above (ontology, domain theory, preference policy) should be made explicit in the system. Only in this way can we ensure that the system is sound and complete (i.e., finds all and only the appropriate interpretations) with respect to a given representation.
The role of domain structure in the search for interpretations. In some cases the difficult search problems in AI may be avoided through the use of domain structure. In particular, in our dynamics domain, we discovered that we could partition the domain and search for minimal sets of active objects while freezing the possible contacts and attachments at their maximal settings. This partition works (i.e., finds all interpretations) because of a special structure of the domain which we call monotonicity (Mann, 1997). Although unconfirmed, we believe that this type of structure may be useful in other domains as well.
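The partition-and-search idea can be sketched abstractly. Assuming monotonicity (any superset of an explaining set of active objects also explains the observations), it suffices to enumerate candidate sets in order of increasing size and prune supersets of sets already found. The `explains` predicate stands in for the domain-specific dynamics test, run with contacts and attachments frozen at their maximal settings; the object names and the predicate itself are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def minimal_active_sets(objects, explains):
    """Enumerate minimal sets of 'active' objects whose dynamics
    could account for the observed motion.

    Under the monotonicity assumption, any superset of an explaining
    set also explains the data, so only minimal explaining sets need
    to be reported; supersets of found sets are pruned.
    """
    minimal = []
    for k in range(len(objects) + 1):           # increasing set size
        for subset in combinations(objects, k):
            s = set(subset)
            # Prune: s cannot be minimal if it contains a found set.
            if any(m <= s for m in minimal):
                continue
            if explains(s):
                minimal.append(s)
    return minimal

# Toy scenario: the data are explained whenever the "hand" is active.
objs = ["hand", "block", "table"]
result = minimal_active_sets(objs, lambda s: "hand" in s)  # [{"hand"}]
```

The enumeration is exponential in the worst case; the point of the monotonicity structure is that the pruning keeps the search over *minimal* sets small in practice.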
The role of world structure in knowledge representation. Regularities and structure in the world serve to constrain the forms of knowledge representation schemes we need to consider. In particular, as suggested in (Richards, Jepson, and Feldman, 1996), natural categories of world structure can be described by placing qualitative probabilities over a discrete set of "modal" processes operating at different spatiotemporal scales. For example, in the motion domain, we may describe various qualitative categories such as: resting (stably supported by a ground plane), rolling, falling, and projectile motion (see Jepson, Richards, and Knill, 1996). These categories, based on world regularities, provide a restricted set of possible concepts our system needs to consider.
The role of world structure in learning. Finally, throughout our work we have assumed that the categories used to describe the scene are given in advance. An important research question concerns how these categories are initially determined. In particular, in order to adapt to new domains, a system should be able to learn which regularities exist in the world. (See Mann and Jepson, 1993 for some preliminary attempts to address these issues for a motion domain.)
We need to reconsider data from psychophysics in the area of human movement perception. While the data is often inconclusive, and it certainly does not begin to explain the underlying perceptual processes in great detail, it lends insight and support to interesting and alternative approaches to machine vision.
First, as industrial computer vision researchers we very often need a complete solution to a real-world problem. Therefore, in practice we usually do not draw a clear line between computer vision and AI, or "maintain as much distance as possible from AI/semantics/reasoning". For example, we take a "knowledge-based image analysis" approach, which incorporates reasoning about multiple pieces of visual evidence in time and space, to solve a variety of industrial image analysis tasks in which (a) domain-specific knowledge is available but (b) conventional or deterministic processing methods fail. By reasoning over and systematically utilizing domain-specific knowledge, including physical knowledge about the imaging process, semantic knowledge about the target object and its behavior, and perceptual knowledge about viewing the object, this approach provides better solutions for complex industrial image understanding problems that were previously considered impossible. Similar approaches may be needed for the machine perception of action.
We believe that "intelligence" should be one of the important characteristics of computer vision. I therefore prefer the phrase "enhancing the intelligence aspects of computer vision", rather than the term "reunification with AI", to describe the above-mentioned R&D activity. The reason for emphasizing the intelligence aspects of computer vision rests on the following facts: the eye is just a sensor; the visual cortex of the human brain is our primary organ of vision. Perception is not the simple result of analyzing a set of stimulus patterns, but rather the best interpretation of sensory data based upon prior knowledge. Many researchers believe that one of the important abilities of human vision is handling uncertain information through a process of perceptual grouping, evidence gathering, and reasoning based upon prior knowledge. Human vision can make use of partial and locally ambiguous information to achieve reliable identifications, interpolating across data gaps and extrapolating to new situations for which data are not available. In this way, humans may use knowledge to infer many aspects of visual scenes, including actions, that are not directly supported by the visual data.
Thus, vision involves mobilizing knowledge and expectations about the environment and the actions of objects of interest within it. This relates research in computer vision to areas of AI in which large amounts of problem-specific knowledge are used to obtain constrained solutions. For real-world industrial image analysis and action recognition applications, we also need to take the following factors into account. Because of missing data, occlusion, and many forms of image degradation, the amount of information available in the raw images may be limited. The viewpoints may be unknown. Due to distortion and dissimilar views, objects and their actions do not always look the same. Since none of the currently existing low-level image processing operators is perfect, some important features go unextracted while erroneous features are detected. As a result, visual evidence from the observed data is often incomplete or conflicting, and the rules used in computer vision systems are often only intuitively justified. In addition, uncertainty also arises from statistical information drawn from an inadequate training set. All of these facts require a computer system to have the capability to deal with such uncertainty and vagaries in action recognition.
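The "best interpretation of uncertain evidence given prior knowledge" view above can be illustrated with a minimal Bayesian evidence-combination routine. The action labels, prior, and likelihood numbers are invented for illustration, and the pieces of evidence are assumed conditionally independent.

```python
def posterior(prior, likelihoods):
    """Combine a prior over interpretations with several independent,
    possibly conflicting pieces of evidence via Bayes' rule.
    `prior` maps interpretation labels to probabilities; each element
    of `likelihoods` maps labels to the likelihood of one observation.
    """
    post = dict(prior)
    for lik in likelihoods:              # one dict per observation
        for label in post:
            post[label] *= lik[label]    # accumulate evidence
    z = sum(post.values())               # normalize to sum to 1
    return {label: p / z for label, p in post.items()}

# Hypothetical action-recognition scenario with conflicting cues.
prior = {"chopping": 0.5, "stirring": 0.5}
evidence = [
    {"chopping": 0.8, "stirring": 0.3},  # fast vertical hand motion
    {"chopping": 0.6, "stirring": 0.7},  # noisy, weakly conflicting cue
]
p = posterior(prior, evidence)           # chopping wins overall
```

Note how the weakly conflicting second cue does not overturn the first: partial, locally ambiguous evidence is pooled rather than trusted in isolation, which is the behavior the text asks of a system facing incomplete or conflicting visual evidence.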
I believe that by increasing the "intelligence" of a computer vision system, including utilizing domain-specific knowledge, semantics, and reasoning, computer vision technology and research will thrive with more and more applications. Machine perception of action is a good test-bed for this idea, since it is really difficult to generate high-level labels like 'chopping' or 'shoplifting' without reasoning at different levels of abstraction.