Submitted to:Journal of Artificial Intelligence Research (JAIR)
Finding cluster structure in data is a common problem in linguistics, molecular biology, and psychometrics. Often these datasets are characterized only by pairwise dissimilarities between items, i.e., there are no point coordinates in a metric space. This paper presents a fast clustering algorithm that combines greedy merge operations with cluster refinement in an expectation-maximization framework. The algorithm is distinguished from previous approaches in that it does not depend on a combinatorically expensive metric space embedding and/or user hints as to the dimensionality, number, and variance of the clusters. As the algorithm converges it is possible to estimate the data's actual dimensionality from the discovered cluster structure. We show two applications of the algorithm: Discovering protein families given sequence alignments and extracting thematic content from narratives using rough similarity data from an electronic dictionary.