TR #406: A fast greedy pairwise distance clustering algorithm and its use in discovering thematic structures in large data sets

Matthew Brand November 1996

Submitted to:

Journal of Artificial Intelligence Research (JAIR)

Abstract

Finding cluster structure in data is a common problem in linguistics, molecular biology, and psychometrics. Often these datasets are characterized only by pairwise dissimilarities between items, i.e., there are no point coordinates in a metric space. This paper presents a fast clustering algorithm that combines greedy merge operations with cluster refinement in an expectation-maximization framework. The algorithm is distinguished from previous approaches in that it does not depend on a combinatorically expensive metric space embedding and/or user hints as to the dimensionality, number, and variance of the clusters. As the algorithm converges it is possible to estimate the data's actual dimensionality from the discovered cluster structure. We show two applications of the algorithm: Discovering protein families given sequence alignments and extracting thematic content from narratives using rough similarity data from an electronic dictionary.


See TR 411 for a detailed description of an application to visualizing the thematic structure of textual stories.