This paper demonstrates that the automatic framing of TV shows is an interesting and tractable domain for both Computer Vision and Artificial Intelligence. Our basic goal is to build intelligent robotic cameras (SmartCams) able to frame subjects and objects in a TV studio upon verbal request from a TV director. To relate visual imagery to symbolic knowledge about the scene, we propose an architecture based on two levels of representation. High-level approximate world models roughly describe the objects in the scene and the actions in progress. Low-level view representations are obtained by vision routines selected according to the current state of the world, as described by the approximate models. The approximate world models are updated with contextual information extracted from the script of the TV show and by processing imagery gathered by wide-angle, low-resolution cameras monitoring the studio. Our Intelligent Studio is composed of one or more SmartCams that share an approximate world model of the studio. A prototype has been implemented, and we show examples of the cameras' responses to different requests in the domain of a cooking show.
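The two-level architecture described above can be sketched in code. The following is a minimal, hypothetical illustration, not the paper's implementation: all class names, routine names, and the keyword-matching scheme are assumptions introduced for clarity. It shows how a shared high-level approximate world model (updated from the script and from monitoring cameras) could drive a SmartCam's selection of a low-level vision routine in response to a director's request.

```python
# Hypothetical sketch of the two-level architecture: a shared
# approximate world model (high level) drives the selection of a
# vision routine (low level). Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ApproximateWorldModel:
    """Rough symbolic state of the studio, updated from the show's
    script and from wide-angle, low-resolution monitoring cameras."""
    objects: dict = field(default_factory=dict)   # name -> rough position
    current_action: str = "idle"                  # e.g. "chef-mixing-bowl"

    def update_from_script(self, action: str):
        # Contextual update: the script tells us what should be happening.
        self.current_action = action

# Low-level vision routines, selected according to world-model state.
def track_face(pos):   return f"face-tracking near {pos}"
def track_hands(pos):  return f"hand-tracking near {pos}"
def frame_object(pos): return f"static framing of region {pos}"

# Illustrative mapping from a director's verbal request to a routine.
ROUTINE_FOR_REQUEST = {
    "close-up of the chef":  track_face,
    "close-up of the hands": track_hands,
    "shot of the bowl":      frame_object,
}

class SmartCam:
    def __init__(self, name: str, world: ApproximateWorldModel):
        self.name = name
        self.world = world   # the model is shared among all SmartCams

    def respond(self, request: str) -> str:
        # Use the approximate model to roughly localize the subject,
        # then run the vision routine matching the request.
        subject = request.rsplit(" ", 1)[-1]   # crude keyword, e.g. "hands"
        pos = self.world.objects.get(subject, "unknown region")
        routine = ROUTINE_FOR_REQUEST.get(request, frame_object)
        return f"{self.name}: {routine(pos)} during '{self.world.current_action}'"

world = ApproximateWorldModel(
    objects={"chef": "(2, 1)", "hands": "(2, 0.8)", "bowl": "(1, 0.5)"})
world.update_from_script("chef-mixing-bowl")
cam = SmartCam("SmartCam-1", world)
print(cam.respond("close-up of the hands"))
# prints "SmartCam-1: hand-tracking near (2, 0.8) during 'chef-mixing-bowl'"
```

In this sketch, the symbolic model never sees pixels: it only supplies context (rough positions, the current action) that narrows which low-level routine is worth running, mirroring the division of labor between the approximate models and the view representations.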