Title: Roles for Information Theory in Natural Language Processing
Speaker: Peter Chew, Senior Manager, Moss Adams LLP (consulting firm)
Date/Time: Tuesday, August 31, 2010, 9:00 to 10:00 am
Location: CSRI Building/Room 279 (Sandia NM), videoconferenced to Sandia/CA 915/S101
Brief Abstract: A key function of language is to encode and convey information. This has been recognized for a number of decades in core areas of (Computational) Linguistics, such as Chomsky's formal grammars, Jakobson's phonological classification, and, more recently, phoneme classification and recognition systems, unsupervised discovery of word collocations, unsupervised morphological analysis, and Statistical Machine Translation (SMT). In all these cases, the role of Information Theory (IT) (and by extension probability theory) is explicitly acknowledged and theoretically demonstrated. In Information Retrieval (IR), this has not been as true. The popular log-entropy term weighting scheme does have connections to IT, but these are mentioned almost as an afterthought by, for example, Dumais (1991). I argue that this has resulted in unnecessary theoretical opacity and complexity for IR. In this presentation, I summarize a number of years of my research, whose results consistently indicate that IT has very important and largely untapped potential in vector-space approaches to Natural Language Processing. I describe how a vision for a truly multilingual document clustering system gave rise to a useful and non-trivial testing framework, which has allowed variant approaches to IR to be empirically tested against one another. The variants I consider involve term weighting, morphological pre-processing, collocation pre-processing, and cross-language term alignment based on SMT. Comparisons between IT-based and "traditional" approaches are made using a variety of vector-space models, including Latent Semantic Analysis and Latent Dirichlet Allocation.
In all cases, it is shown that using IT as a basis for rethinking one's approach to IR results not only in greater theoretical transparency and extensibility (for example to tensor-based approaches to IR), but also in empirical improvements which are extremely unlikely to occur by chance and often surprisingly large.

CSRI POC: William E. Hart, 505-844-2217