Recent world events have prompted the development of detection and assessment systems for chemical, biological, and radiological/nuclear attacks. In addition to the design and deployment of sensors, these threat and attack detection efforts require efficient algorithmic techniques for interpreting the large data sets generated by sensors and for simultaneously considering related information such as intelligence reports or epidemiological data. In the world of informatics, the latter objective is a real-time, interpretive data mining problem, and it is a computational challenge of the first order. Some similar bioinformatics-based issues have been successfully addressed using a new and powerful method known as data pipelining (DP). We propose to extend this technique and make it applicable and reliable for homeland security problems. In particular, we will modify the algorithmic core of DP to extend its effectiveness, evaluate the resulting new process on both synthetic and real data, and attempt to provide a theoretical foundation that defines when the method can and cannot be reliably used.
Data pipelining is a computational approach to managing the exchange of information among various databases in order to maximize both the flexibility for addressing unexpected data types and the potential for introducing human creative insight. Its capabilities include altering searches based on "outside" observations or interpretations, categorizing new data based on prior results, and deciphering incomplete or inaccurate data by considering it in conjunction with related information and human judgment. The advantages of a DP algorithm derive largely from a unique architecture based on the notion of a meta-classifier, which coordinates and optimizes the states determined by base classifiers. Each base classifier has only partial access to the data and may be remotely distributed as sensor controllers, be intermittently or newly available, or represent interactive human interpretation. The function of the meta-classifier is to combine the individual base classifications into one global classification that best uses the information in ALL of the data. One existing DP meta-classifier of particular interest is trained using genetic algorithms. It has been shown to work well for some data sets but not for others, and its creators admit that it would be more useful were the optimization heuristic made more reliable. We propose to improve the performance and extend the reliability of this meta-classifier using Sandia's expertise in optimization and algorithm theory.
A major challenge facing us in this work is that there is currently no underlying convergence theory for meta-classifiers. To manage this risk and produce a successful method, we will leverage Sandia's skills in algorithm development and apply a robust optimization method, such as pattern search. We plan to implement and test our algorithm by producing prototype software. This effort will be connected to Sandia's work for the Department of Homeland Security (DHS) through the Systems Research Group, which will provide sample data and applications. If successful, this project will complement and contribute to Sandia DHS information integration and analysis projects such as BWIC (Biological Warning and Incident Characterization) and NBACC/BKC (National Biodefense Analysis and Countermeasures Center and the Biodefense Knowledge Center). We anticipate that a successful project outcome would also have valuable applications in many other areas of interest to Sandia.
To solve the problem of fusing large, multi-attribute data sets, we propose the development and implementation of an algorithm based on an ensemble classifier technique called stacking. Stacking, also referred to as stacked generalization or mixture of experts, is used to combine classifications obtained from several different learning techniques using a separate trainable meta-learner [4,7]. For our problem, we will first apply an appropriate learning technique to each data type and make a subsequent base classification. Then, the results from the base classification will be sent to a meta-classifier. The meta-classifier will decide how to combine these results using an optimization-based learning technique that will iterate over some discriminant variables until a global classification can be achieved. This idea can be explained in more detail as follows. Each new piece of data will be processed according to its data type. Then, the appropriate base classifier will be applied. The resulting state information is sent to the meta-classifier. It is the job of the meta-classifier to determine the optimal distribution of base classifications in order to make a global classification of the entire data system. Because the meta-classifier is posed as an optimization problem, it will iterate back to the base classifiers and share relevant information about the states of the other data types until its optimal discriminant variables are achieved and a global classification can be made. This algorithm can be defined as a data pipelining technique because it allows the exchange of information between data types without requiring them to be translated into the same format. Moreover, data classification can be improved by using previous results or by introducing additional related information.
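The two-level structure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two data types ("sensor" and "report"), the synthetic Gaussian data, the nearest-centroid base classifiers, and the logistic meta-classifier trained by gradient descent are stand-ins, not the proposed algorithm. In particular, the sketch omits the feedback loop in which the meta-classifier iterates back to the base classifiers.

```python
import math
import random

random.seed(0)
VIEWS = ["sensor", "report"]  # hypothetical data types, one base classifier each

def make_set(n):
    """Synthetic records: each data type is a noisy view of the true label."""
    data = []
    for i in range(n):
        y = i % 2
        record = {"sensor": [random.gauss(y, 0.5), random.gauss(y, 0.5)],
                  "report": [random.gauss(y, 0.5)]}
        data.append((record, y))
    return data

def fit_centroids(rows, labels):
    """Base classifier: store the per-class mean of one data type."""
    sums, counts = {}, {}
    for x, y in zip(rows, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def base_scores(models, record):
    """State sent to the meta-classifier: positive means closer to class 1."""
    return [math.dist(record[v], models[v][0]) - math.dist(record[v], models[v][1])
            for v in VIEWS]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_meta(score_rows, labels, steps=200, lr=0.1):
    """Meta-classifier: logistic combination of the base scores."""
    w, b = [0.0] * len(score_rows[0]), 0.0
    for _ in range(steps):
        for s, y in zip(score_rows, labels):
            g = sigmoid(sum(wi * si for wi, si in zip(w, s)) + b) - y
            w = [wi - lr * g * si for wi, si in zip(w, s)]
            b -= lr * g
    return w, b

# Separate splits so the meta-classifier trains on scores the base models did not fit.
base_train, meta_train, test = make_set(200), make_set(200), make_set(200)
models = {v: fit_centroids([r[v] for r, _ in base_train], [y for _, y in base_train])
          for v in VIEWS}
w, b = fit_meta([base_scores(models, r) for r, _ in meta_train],
                [y for _, y in meta_train])

def classify(record):
    s = base_scores(models, record)
    return int(sigmoid(sum(wi * si for wi, si in zip(w, s)) + b) >= 0.5)

accuracy = sum(classify(r) == y for r, y in test) / len(test)
print(f"stacked accuracy: {accuracy:.2f}")
```

Note that each base classifier sees only its own data type in its native shape (two sensor features, one report feature); only the scalar scores are exchanged, which is the sense in which no common format is required.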
Significant advances in methods of data collection coupled with decreasing storage costs have led to the creation of large data sets in a wide variety of disciplines. In order to extract relevant information from these data sets, knowledge discovery and data mining techniques have become increasingly important. One area of particular interest is data fusion, or the integration of related data from disparate sources. To tackle this problem, pharmaceutical companies are developing an idea known as data pipelining for drug discovery [1]. Loosely defined, data pipelining is a computational tool that manages the exchange of information among various data sources and applications. There are also documented uses of this technique in bioinformatics for high-throughput protein analysis [6]. Despite numerous descriptions of the theory and assets of pipelining, there is no common underlying algorithm described in the literature. Furthermore, it is clear that the biggest obstacle facing data pipelining is finding appropriate algorithms to oversee the process. For example, a group at the University of Pittsburgh is attempting to develop a method for real-time disease assessment (RODS) using medical data. They document their many attempts to find an analysis technique and explain why they settled on a Bayesian network approach [5]. We note that although this report is extensive, it does not offer any optimization-based alternatives.
Many traditional methods of deriving information from databases assume minimal error in the data and can therefore result in incorrect conclusions. To overcome inaccuracies, methods are applied that combine several different classifiers. One of these methods, stacking, combines the predictions of multiple classifiers via a separate, trainable meta-classifier. Experiments have shown that results obtained using stacking are an improvement over those obtained from a single classifier for a single data set [4]. However, the meta-classifier introduced there is based on genetic algorithms, and the authors suggest that their stacking technique would be more useful if the optimization heuristic were made more reliable and the optimization algorithm were scalable to large data sets. Our proposed algorithm is an extension of [4] in that our meta-classifier will combine the classification results from several disparate data types with varied attributes.
Informatics is an area of new and emerging research at Sandia, and some current work is related to our proposed project. For example, some discriminant data analysis research is being carried out, and it may be applicable to our design of the base classifiers. There are also groups at Sandia who specialize in database management. However, their focus is on the design, maintenance, and security of databases, whereas our proposed project concentrates on using the data stored within the databases. Other projects at Sandia address the design, deployment, and networking of sensors that produce the data types that we are interested in examining, as well as virtual and operational WMD attack detection and incident characterization environments such as WMD-DAC or BWIC.
We plan to improve upon the DP approach described in [4] by replacing the genetic algorithm (GA) used there with a more robust optimization method. Initially we will apply APPSPACK [3], since we have had previous success in obtaining improved performance and reliability by replacing GA with APPS in 3D protein modeling applications [2]. Moreover, achieving better results with APPS than with GA is consistent with published theory. Our balanced team gives us unique capabilities to innovate in algorithm theory, statistical classifier design, and algorithm development. Our access to DHS projects and data positions us to solve the issues introduced by real data and to ensure the usability of our algorithm in real situations.
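To give a sense of what a pattern search does when tuning meta-classifier weights, here is a minimal sketch. It is a serial compass search, one of the simplest members of the pattern search family, and not the asynchronous parallel algorithm implemented in APPSPACK; it does not use the APPSPACK interface. The base scores, labels, regularization constant, and function names are all hypothetical.

```python
import math

def compass_search(f, x0, step=1.0, tol=1e-4, max_iter=1000):
    """Poll +/- step along each coordinate; move on improvement, else halve the step."""
    x, fx = list(x0), f(x0)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                trial[i] += delta
                ft = f(trial)
                if ft < fx:  # accept the first improving poll point
                    x, fx, improved = trial, ft, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5  # no poll point improved: refine the mesh
    return x, fx

# Hypothetical outputs of two base classifiers and the true labels.
SCORES = [(1.2, 0.4), (0.8, -0.3), (-0.5, 0.9), (-1.0, -0.6), (0.3, 0.7), (-0.2, -0.8)]
LABELS = [1, 1, 1, 0, 1, 0]

def meta_loss(params):
    """Regularized log-loss of a weighted combination; params = [w1, w2, bias]."""
    w1, w2, bias = params
    total = 0.0
    for (s1, s2), y in zip(SCORES, LABELS):
        z = w1 * s1 + w2 * s2 + bias
        margin = z if y == 1 else -z
        total += math.log(1.0 + math.exp(-margin))
    penalty = 0.01 * (w1 * w1 + w2 * w2 + bias * bias)  # keeps the minimizer finite
    return total / len(SCORES) + penalty

start = [0.0, 0.0, 0.0]
best_params, best_loss = compass_search(meta_loss, start)
print("initial loss:", round(meta_loss(start), 3))
print("final loss:  ", round(best_loss, 3))
```

The search uses only function values, never gradients, so the same loop applies unchanged when the objective is a non-smooth quantity such as a misclassification count.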
Last modified: 20 Jan. 2005.