Question #8 - Research Domains
8) What are good domains and example tasks to try to address? How low-level do patterns need to be? Which domains provide interesting contexts to limit the possible actions to be recognized at any single segment of time? Mercenary question: what tasks will drive research funding?
At this stage, I like the idea of attacking highly constrained domains such as Bobick's "closed worlds". While generality is an ultimate goal, the task of modeling and recognizing action is difficult enough that general solutions tend to be "weak". Highly constrained domains allow one to formulate aspects of the problem that can be very rich in their complexity and suggestive of more general domains.
In our own work we have looked at the domain of speakers giving technical talks using overhead transparencies. In this domain, there are two types of events which we call nuisance events and affordances. Nuisance events include actions such as when the speaker moves/adjusts or occludes their slides. Affordances are actions such as pointing or writing which provide information regarding which portion of the slide the speaker is discussing. Information about these actions can be used for video annotation and indexing.
By simplifying the domain in this case, we formulate a constrained vocabulary of actions for which the image processing is tractable. One problem with such an approach is that the domain may be so highly constrained that one does not learn anything that can be generalized to new contexts. One nice property of the technical talk scenario is that it can be incrementally expanded to related domains such as people working on a whiteboard or interacting with documents on a desktop.
In these domains, only simple tasks can be addressed at present. The best way to start is with image sequences of structured scenes: few people (i.e. not a crowd), good image quality (e.g. several sensors), and a well-defined application context.
If we do consider the influence of active perception on the perception of action (and we would need to find more asymmetric terminology!), I think it could have a significant impact on applications in the domain of action perception. In many domains the action of the observer is implicit in the semantic content of the signal: the signal is designed (or potentially co-evolves) to have a particular active observation pattern appropriate to detect/understand/enjoy it. Film and video editors have always understood this, and pay close attention to the dynamics of the active observer when constructing their own dynamic media. (To be honest, this has been true in static imagery as well, but it is usually less explicit in the mind of the static artist, and the observer is more able to control his or her experience with a static signal than with a dynamic one.) The analysis of these types of signals within a machine perception system, whether for automated manipulation or search of video footage, or for interpretation of interpersonal gesture in a human-computer interface context, would benefit considerably from an understanding of how the process of active observation operates on particular dynamic signals.
It is by now almost dogmatic within the computer vision community that spatial vision is context dependent and opportunistic. Knowing with confidence (either from a priori information or from image analysis) the location of some object in an image provides a spatial context for searching for and recognizing related objects. We would never consider a fixed, spatial pattern of image analysis independent of available or acquired context. I believe that the same situation should obtain in temporal vision, and especially in the analysis of human actions and interaction. Our recognition models should predict, from prior and acquired context, actions both forward and backward in time. This idea is classic and commonplace in AI planning systems, which reason both forward and backward and force planning to move through certain states known (or determined) to be part of the solution sought. Such approaches help to control the combinatorics of recognition and naturally allow us to integrate prior and acquired knowledge into our analysis.
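The bidirectional use of temporal context described above can be sketched with the classic forward-backward algorithm over a toy action model. The states, observations, and probabilities below are invented purely for illustration:

```python
# Sketch: combining past and future evidence via forward-backward smoothing
# over a tiny, hypothetical action model. All numbers are illustrative.

STATES = ["reach", "grasp", "place"]

# P(next_state | state): actions tend to follow reach -> grasp -> place.
TRANS = {
    "reach": {"reach": 0.6, "grasp": 0.4, "place": 0.0},
    "grasp": {"reach": 0.0, "grasp": 0.5, "place": 0.5},
    "place": {"reach": 0.3, "grasp": 0.0, "place": 0.7},
}

# P(observation | state) for coarse motion features.
EMIT = {
    "reach": {"moving": 0.8, "still": 0.2},
    "grasp": {"moving": 0.3, "still": 0.7},
    "place": {"moving": 0.6, "still": 0.4},
}

PRIOR = {"reach": 1.0 / 3, "grasp": 1.0 / 3, "place": 1.0 / 3}


def forward_backward(obs):
    """Posterior P(state_t | all observations): each time step is labeled
    using evidence from both directions in time, not just the past."""
    n = len(obs)
    # Forward pass: alpha[t][s] ~ P(obs[0..t], state_t = s)
    alpha = [{s: PRIOR[s] * EMIT[s][obs[0]] for s in STATES}]
    for t in range(1, n):
        alpha.append({
            s: EMIT[s][obs[t]] * sum(alpha[t - 1][r] * TRANS[r][s] for r in STATES)
            for s in STATES
        })
    # Backward pass: beta[t][s] ~ P(obs[t+1..] | state_t = s)
    beta = [None] * n
    beta[n - 1] = {s: 1.0 for s in STATES}
    for t in range(n - 2, -1, -1):
        beta[t] = {
            s: sum(TRANS[s][r] * EMIT[r][obs[t + 1]] * beta[t + 1][r] for r in STATES)
            for s in STATES
        }
    # Combine both directions and normalise at each time step.
    posteriors = []
    for t in range(n):
        unnorm = {s: alpha[t][s] * beta[t][s] for s in STATES}
        z = sum(unnorm.values())
        posteriors.append({s: p / z for s, p in unnorm.items()})
    return posteriors


post = forward_backward(["moving", "still", "still", "moving"])
```

The backward pass is exactly the "reasoning backwards in time" of the text: a later observation can sharpen or overturn the interpretation of an earlier segment.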
The type of representation, whether completely data-driven (inferring actions directly from signals using probabilistic and/or rule-based classifiers) or model-based (mapping signals onto known, modeled actions), also depends on the task at hand.
At present, I am interested in pursuing two directions. The first aims at extracting very detailed representations that encompass a structured spatio-temporal model, one that allows analysis both in a scale space and in a granular representation space. The second direction we are pursuing is the building of Interactive Environments (see www.cc.gatech.edu/fce) that are context aware and capable of various forms of information capture and integration. We are interested in using all forms of sensory information, and we are working on very specific domains such as the classroom, living room, kitchen, car, etc.
Some good application domains are:
It is worth thinking about what useful tools might be built in the next few years, given moderate success in the field. The understanding of the full range of Sign Language cannot be far away; once tracking is sufficiently advanced to follow both hands occluding each other and their relation to the rest of the speaker, the understanding of the gestures should require little advance over current work. An interesting medium-term goal might be a robot helper. The task would be to watch a human carrying out some task, for example assembling an object, and flag mistakes in the assembly process given a description of the desired outcome. A continuation of the problem would be to learn the required sequence of actions by watching training examples. This is related to the teleoperations problem of designing a robot to intelligently mimic the human operator --- a screwing motion of the control device should cause the remote robot to accurately position a screwdriver and insert the desired screw. To begin such a project would require the design jointly of an alphabet and language for scene parts and their motions and a tracking system capable of following objects corresponding to letters from that alphabet. In a specialised domain such as human body-tracking, it should be possible to make rapid progress building on existing work, perhaps using a blob tracker for the articulated units of the body, along with higher resolution information taken from the (already localised) hand to identify intricate activities.
For me, the most important thing is to work on a system which imitates anything you do in normal situations. For one thing, open-endedness is the essential thing to work on today, and specifying a particular target task may not be appropriate in intelligence-oriented research. (Of course it is useful in practical situations.) For another, dealing with the whole variety of complexity and structure is important, ranging from raw, direct mapping between features and limb motions to a structured, purposeful action sequence.
In my current project with my colleagues, I am building a whole-body humanoid robot with a level of mobility similar to that of humans in their everyday lives, and in parallel exploring the principles of action imitation in a bottom-up manner. The two approaches will be integrated as the humanoid robot constantly imitates human actions in everyday situations.
It is crucial to address real tasks that integrate perception and action. We study one such natural task: learning by imitation, i.e., acquiring tasks and skills from visual demonstration. This domain demands a non-trivial integration of movement perception, representation, and movement/action generation. Furthermore, it is naturally grounded and prevents the idealization of any part of the system without having to cheat on the rest.
Another difficult action understanding system worth exploring is the multi-robot domain in which the robots interact using vision. While many single-robot test-beds employ vision, very few interact with moving objects (by that we mean not only avoid but manipulate, follow, chase, purposefully nudge and guide, catch, etc.). Introducing visual processing on multiple moving robots brings the problem of understanding action to a very real, grounded domain in which real-time processing and reaction are crucial. We are currently using three mobile robots with color cameras, and having them interact with several (up to 10) others equipped with only infrared and sonar sensors. This highly dynamic domain has pushed us to rethink our notions of vision and develop new ideas and strategies. It merits a great deal more attention by the vision and robotics communities.
Find, track, and identify moving objects/stuff; tag suspicious activities within a specific context (no-one should go near that door after 9:00 PM, and it REALLY should not be opened). Two people meeting here is suspicious.
All sorts of simple analysis, pattern recognition stuff can feed this.
Gesture recognition, facial expression tagging, lip-reading? A deliberate camera angle helps a lot.
e.g. There is a potential traffic jam shaping up here, change traffic light patterns, reroute traffic. There are people/children/animals in the road around blind curve; light warning sign. There is a lot of activity at this military base, take a look at it.
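Context-dependent rules like the ones above can be sketched as a minimal tagger. The event fields, curfew time, and location names are hypothetical stand-ins for whatever a tracking front-end would supply:

```python
# Minimal sketch of rule-based suspicious-activity tagging. Event kinds,
# the 9:00 PM curfew, and location names are illustrative assumptions.
from dataclasses import dataclass
from datetime import time


@dataclass
class Event:
    kind: str       # e.g. "near_door", "door_opened", "meeting"
    when: time      # time of day the tracker observed the event
    location: str


CURFEW = time(21, 0)  # 9:00 PM


def tag(event):
    """Return a list of alert strings for one tracked event."""
    alerts = []
    if event.kind == "near_door" and event.when >= CURFEW:
        alerts.append("person near restricted door after curfew")
    if event.kind == "door_opened" and event.when >= CURFEW:
        alerts.append("restricted door opened after curfew")
    if event.kind == "meeting" and event.location == "loading_bay":
        alerts.append("two people meeting in a flagged area")
    return alerts


print(tag(Event("door_opened", time(23, 30), "rear_entrance")))
```

Simple tracking and pattern-recognition modules feed Events in; the rules supply the context that makes an otherwise neutral motion "suspicious".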
(As a corporate research type, this question is of intense concern to me. The corporate version of the question is "What action recognition capabilities would satisfy a need of a substantial number of people, at a cost they are willing to pay?")
I believe that security and safety monitoring will provide the first large market for activity recognition capabilities, for the following reasons:
An Example: Predicting Driver's Actions
We are now using this approach to identify automobile drivers' current internal (intentional) state and their most likely subsequent internal state. In the case of driving, the macroscopic actions are events like turning left, stopping, or changing lanes [6, 7]. The internal states are the individual steps that make up the action, and the observed behaviors are changes in heading and acceleration of the car.
The intuition is that even apparently simple driving actions can be broken down into a long chain of simpler subactions. A lane change, for instance, may consist of the following steps: (1) a preparatory centering of the car in the current lane, (2) looking around to make sure the adjacent lane is clear, (3) steering to initiate the lane change, (4) the change itself, (5) steering to terminate the lane change, and (6) a final recentering of the car in the new lane. In our current study we are statistically characterizing the sequence of steps within each action and then using the first few preparatory steps to identify which action is being initiated.
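A minimal sketch of this idea, assuming each macroscopic action is characterized by a (smoothed) Markov chain over its step labels. The step vocabularies and probabilities below are invented for illustration; in the study they would be estimated from driving data:

```python
# Sketch: identify an action from its first few preparatory steps by
# comparing the prefix's likelihood under per-action step models.
# All step names and probabilities are illustrative, not measured.
import math

# P(next_step | step) for each macroscopic action; transitions absent
# from a model receive a small smoothing probability.
MODELS = {
    "lane_change": {
        ("start", "center_in_lane"): 0.9,
        ("center_in_lane", "look_around"): 0.8,
        ("look_around", "steer_out"): 0.9,
    },
    "turn_left": {
        ("start", "decelerate"): 0.8,
        ("decelerate", "look_around"): 0.6,
        ("look_around", "steer_left"): 0.9,
    },
}
SMOOTH = 0.01


def log_likelihood(model, steps):
    """Log-probability of an observed prefix of steps under one model."""
    ll, prev = 0.0, "start"
    for step in steps:
        ll += math.log(model.get((prev, step), SMOOTH))
        prev = step
    return ll


def identify(steps):
    """Return the action whose step model best explains the prefix."""
    return max(MODELS, key=lambda a: log_likelihood(MODELS[a], steps))


print(identify(["center_in_lane", "look_around"]))  # -> lane_change
```

After only two preparatory steps the lane-change model already dominates, which is exactly the early-identification behavior described above.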
Action recognition is a new technology with many potential applications. One very important application is automatic video content extraction, a technology widely expected to play an important role in visual telecommunication and internet access in the next decade. What we would like to argue is that action is the key content linking all other contents in the video. Just imagine trying to describe video content effectively without using a verb; a verb is simply a description (or expression) of an action. Action recognition will provide new methods to generate video sketches in terms of high-level semantics. For each scene-cut at a segment of time, there are hundreds, even thousands, of frames of video describing a complete action process. By recognizing the action, we can probably use only two frames (one for the beginning state and one for the end state of the action) plus a line of description of the recognized action for the same purpose. Thus, the action content can link other video contents in a very compact way. This technique will make automatic video sketching, skimming, clipping, logging, indexing, querying, and browsing possible. Such intelligent content-based video processing methods will pave the way for video-on-demand and digital video library access through the internet.
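The "two frames plus a verb" summary suggested here can be sketched as follows; segment boundaries and action labels are assumed to come from upstream recognition modules, and frames are represented by their indices:

```python
# Sketch of the compact video-sketch representation: each recognized
# action segment collapses to its begin frame, end frame, and a one-line
# caption. Segment tuples and labels here are hypothetical.

def video_sketch(segments):
    """segments: list of (start_frame, end_frame, action_label).
    Returns the compact summary: begin state, end state, description."""
    return [
        {"begin": start, "end": end, "caption": f"action: {label}"}
        for start, end, label in segments
    ]


summary = video_sketch([(0, 214, "serve"), (215, 480, "rally")])
```

Hundreds of frames per segment collapse to two frames and one line each, which is what makes skimming, indexing, and querying tractable.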
Some sports videos may be good application domains that provide interesting contexts to limit the possible actions to be recognized at any single segment of time. Tennis, volleyball, and diving are some examples. In these sports, domain-specific knowledge is available: well-defined geometry for the court or the platform, well-defined game rules for player positions, and a well-defined, limited set of possible actions. Beyond video processing, the action content is also crucial in real-time applications including key event (unusual action) detection in subway platform monitoring, security monitoring, home care monitoring, etc. I believe that a knowledge-based approach may be a promising way to achieve action recognition in these application areas. If we can make action recognition really work for a few important real-world problems, it may draw more research funding from different sources.
User-interfaces represent a particularly promising domain for action interpretation. One example is the Smart Kiosk project that I am involved in at DIGITAL's Cambridge Research Lab. A kiosk is a free-standing computer that dispenses information (and possibly conducts business transactions) in a public space. We are developing a new generation of active, participatory kiosk interfaces that use vision to sense their users' actions and respond appropriately. The kiosk can differentiate, for example, between a user that is approaching it and a user that is passing by, and initiate an interaction with the former.
A key issue in action interpretation for our domain, which we are only beginning to address, is the context provided by the public space and the kiosk design. A particular context, such as a shopping mall, heavily influences the types of actions we should expect from our users and the types of responses they would view as appropriate. Context is also heavily influenced by the kiosk design itself. We are actively exploring these issues.
- Video surveillance and monitoring.
- Visual lipreading.
- Looking at a person in the office.
- Recognition of animal motion.
What vs Why - (Motion vs Action)
It is interesting to ponder why we might want to automate the recognition of actions, as distinguished from the more direct measurement of motions. The distinction is one of assigning labels to events, as opposed to computing trajectories of entities. There are numerous useful (current and potential) reasons to compute trajectories of things in motion, but why bother to recognize actions?
The advent of low-cost video and high performance computing at reasonable cost opens the doors to many possibilities for monitoring activities, both commercially attractive and valuable for defense purposes. Monitoring tasks that are worthwhile to pursue are likely to possess several attributes:
1) Humans require special skills to perform
2) Task must be repeated very frequently
3) Monitoring must be continuous over extended periods
4) Recognition requires fine distinctions
5) Value of successful detection is high
At the workshop we will explore a variety of potential applications for recognition of actions. These can be analysed in terms of the above attributes to identify those that are most likely to lead to viable commercial or military applications.