Title: Understanding How Choices Change Clusterings: A Geometric Comparison of Popular Mixture-Model Distances

Speaker: Scott Mitchell, Sandia National Laboratories

Date/Time: Thursday, October 14, 2010, 4:00 pm Mountain Time (3:00 pm Pacific Time)

Location: CSRI Building/Room 90 (SNL/NM) and 915/S101 (SNL/CA)

Brief Abstract: How do we know we have the "right" clustering? At Sandia we cluster data daily, putting points into groups, for cyber security, non-proliferation, and other applications. The clustering we get depends on the choices we make along the way, and there are many. Do we even know how these choices change the answer? One of the fundamental choices is how to measure how close two data points are to each other. I’ll describe a basic geometric comparison of the most popular distance functions for a certain model, and we’ll see that they’re pretty similar, but with sharp differences in certain areas. I’ll have many pictures, so the talk should be accessible. So far, everyone I’ve talked to who routinely uses these distances has been surprised by these results.

More formally:
For cyber security and text analysis (e.g., non-proliferation), raw data are modeled and then transformed into a feature space using linear algebra (e.g., LSA, tensors) or statistics. Statistical Latent Dirichlet Allocation (LDA) has been gaining use at Sandia for cyber defense. LDA produces mixture-model points in moderate to high dimensions: each point has positive coordinates that sum to 1. Geometrically, these points lie on a regular simplex. Numerous other statistical models and techniques produce data in the same geometric category, even though the meaning of the axes and coordinate values differs significantly. A distance function is used to analyze these points further, for example to cluster them or to build a graph. Several different distance functions are popular among statisticians; the choice is usually driven by the historical preference of the application domain, by information-theoretic considerations, or by the desirability of the resulting clusterings. Relatively little consideration is usually given to how distance functions geometrically transform the data, or to the distances' algebraic properties. Here we examine these issues, in the hope of providing complementary insight and inspiring further geometric thought. Several popular distances (chi-squared, Jensen-Shannon divergence, and the square of the Hellinger distance) are shown to be nearly equivalent, both in terms of their functional forms after transformations, factorizations, and series expansions, and in terms of the shape and proximity of their constant-value contours. This is somewhat surprising, given that their original functional forms look quite different. Cosine similarity is the square of the Euclidean distance; the same relationship holds between another cosine variant and the Hellinger distance. We suggest a geodesic variation of Hellinger. The square-root projection that arises in the Hellinger distance is briefly compared to standard normalization for Euclidean distance. We provide plots and proofs of ratio and difference bounds, along with constructions that nearly achieve the worst-case ratios.
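
To make these quantities concrete, here is a minimal sketch in Python/NumPy of the distances named above, using standard textbook forms; conventions such as constant factors vary by author and may differ from the talk's. The geodesic Hellinger shown is one plausible reading (the great-circle distance on the sphere of square-rooted vectors, i.e., the Bhattacharyya angle) and is an assumption, not necessarily the variation proposed in the talk; all function names are illustrative.

    import numpy as np

    def chi_squared(p, q):
        # Symmetric chi-squared distance: sum_i (p_i - q_i)^2 / (p_i + q_i).
        # (Some authors include a factor of 1/2; omitting it is an assumption.)
        return np.sum((p - q) ** 2 / (p + q))

    def jensen_shannon(p, q):
        # Jensen-Shannon divergence: average KL divergence to the midpoint m.
        m = 0.5 * (p + q)
        def kl(a, b):
            # The where-guard supplies the 0 * log(0) = 0 convention.
            return np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def hellinger_sq(p, q):
        # Squared Hellinger distance: (1/2) * || sqrt(p) - sqrt(q) ||^2.
        return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

    def geodesic_hellinger(p, q):
        # Assumed variant: angle between sqrt(p) and sqrt(q) on the unit sphere.
        return np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))

    # Two random points on the regular simplex: positive coordinates, sum = 1.
    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))

    # The square-root projection maps the simplex onto the unit sphere, so the
    # Euclidean distance between the projected points recovers Hellinger exactly:
    assert np.isclose(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) ** 2,
                      2.0 * hellinger_sq(p, q))

    # Contrast with standard normalization for Euclidean distance: rescaling to
    # unit length is a different map than taking coordinate-wise square roots.
    p_unit = p / np.linalg.norm(p)   # generic L2 normalization
    p_sqrt = np.sqrt(p)              # unit length automatically, since sum(p) = 1

    for name, d in [("chi-squared", chi_squared),
                    ("Jensen-Shannon", jensen_shannon),
                    ("squared Hellinger", hellinger_sq),
                    ("geodesic Hellinger", geodesic_hellinger)]:
        print(f"{name}: {d(p, q):.4f}")

With strictly positive Dirichlet samples the logarithms are always defined; for points on the simplex boundary, the guard inside kl handles zero coordinates.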

CSRI POC: Scott Mitchell, 505-845-7594


