Submitted to:
Association for Computational Linguistics (ACL97)
This paper addresses rapid indexing, retrieval, and assessment of narratives from databases of stories. We introduce a method for discovering thematic content in texts via lexical dissimilarity statistics. A maximum-likelihood algorithm clusters words into pools of similar meaning, using a thesaurus to obtain rough estimates of word-sense similarity. The highest-ranked clusters indicate thematic motifs in the story. From these clusters we form efficient representations for indexing, keyword summarization, and segmentation of the text into thematically coherent sections. We demonstrate this representation in a system that generates 3D visualizations of stories and their thematic structures.
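
As a rough illustration of the clustering idea, the sketch below groups the words of a story by thesaurus-based similarity and ranks the resulting clusters by how often their words occur. It is not the maximum-likelihood algorithm of this paper: it substitutes a simple greedy pass with a Jaccard overlap score, and the toy thesaurus, the 0.3 threshold, and the names `TOY_THESAURUS`, `similarity`, `cluster_words`, and `rank_clusters` are all hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical toy thesaurus mapping each word to a set of category labels.
# This is an assumption for illustration, not the thesaurus used in the paper.
TOY_THESAURUS = {
    "king": {"royalty", "person"},
    "queen": {"royalty", "person"},
    "throne": {"royalty", "furniture"},
    "sword": {"weapon"},
    "dagger": {"weapon"},
    "battle": {"conflict", "weapon"},
    "river": {"water"},
}


def similarity(a, b, thesaurus):
    """Rough word-sense similarity: Jaccard overlap of thesaurus categories."""
    cats_a = thesaurus.get(a, set())
    cats_b = thesaurus.get(b, set())
    if not cats_a or not cats_b:
        return 0.0
    return len(cats_a & cats_b) / len(cats_a | cats_b)


def cluster_words(words, thesaurus, threshold=0.3):
    """Greedy pass: each distinct word joins the first existing cluster that
    contains a sufficiently similar word, or else starts a new cluster.
    (A stand-in for the paper's maximum-likelihood clustering.)"""
    clusters = []
    for word in dict.fromkeys(words):  # distinct words, first-appearance order
        for cluster in clusters:
            if any(similarity(word, m, thesaurus) >= threshold for m in cluster):
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def rank_clusters(words, clusters):
    """Rank clusters by the total frequency of their words in the text."""
    counts = Counter(words)
    return sorted(clusters, key=lambda c: sum(counts[w] for w in c), reverse=True)


if __name__ == "__main__":
    story = "king queen sword king throne battle dagger river queen king".split()
    ranked = rank_clusters(story, cluster_words(story, TOY_THESAURUS))
    for cluster in ranked:
        print(cluster)
```

In this toy setting, the top-ranked cluster serves as a keyword summary of the dominant motif, and the positions at which words from different clusters occur could in principle suggest segment boundaries.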