The Second Sandia National Laboratories Workshop on

Computational Molecular Biology


Abstracts



Plenary Speakers



Ken Dill (email)

Title: Protein Folding: Simple Computational Models

Abstract: We are interested in principles of protein folding. We have developed a very simple computational model based on assuming that proteins are copolymers of hydrophobic (H) and polar (P) monomers that are configured as self-avoiding walks on lattices. This HP model shows many features of the behaviors of real proteins, including some secondary and tertiary structures, some protein-like symmetries, cooperativity in thermodynamics and kinetics of folding, and ligand binding.


Richard Karp (email)

Title:Combinatorial Aspects of Restriction Mapping

Abstract: A restriction map of a DNA molecule gives the locations of the restriction sites for one or more restriction enzymes. Such maps provide information at a finer level of detail than genetic maps, radiation hybrid maps or STS content maps. They are useful for homing in on disease genes, and their construction is an important preliminary step toward sequencing a genome. The construction of a restriction map requires the following steps:

This talk will focus on the conbinatorial problems arising in step (3). We formulate the problem as a chain decomposition problem and attack it with a battery of techniques involving bipartite matching, shortest paths, linear programming and the traveling-salesman problem.The problem leads to novel combinatorial structures and NP-hard problems for which we have developed approximation algorithms and interesting heuristics. This is joint work with Dan Fasulo, Tao Jiang and Ed Thayer of the University of Washington.


Jonathan King (email)

Title:Protein Folding, Protein Misfolding, and Human Disease

Abstract: All of the amino acid sequences expressed in the entire range of organisms found on Earth have evolved to fold up after synthesis on ribosomes within cells. A subset of these proteins can fold up from the fully denatured state into their biologically active conformation in the absence of any cellular components. However for many proteins folding processes represent difficult transitions of the life cycle. Chains fold into incorrect conformations, associate with each other to insoluble states, and display properties absent from simple models. Examples of these will be given for the Phage P22 capsid and tailspike proteins, for which these unpopular properties have been carefully studied (King et al, 1996). The failure of folding processes has emerged as major problems in the biotechnology industry, in biomedical research and as causative agents of the amyloid class of human diseases, which includes Alzheimer's. Cells deal with these problems through the synthesis of a large family of helper proteins, the chaperonins, which have probably evolved to alleviate these folding problems. For computational biology to properly describe the relationship between amino acid sequence and native conformation, will require including these associated and polymerized states.

King, J., Haase-Pettingell, C., Robinson, A. S., Speed, M. and Mitraki, A. (1996) Thermolabile folding intermediates: inclusion body precursors and chaperonin substrates.10, 57-66.


Irwin Kuntz (email)

Title:Mathematical Issues in Molecular Recognition

Abstract: The use of computation in molecular design hinges on our understanding of atom-atom interactions, our knowledge of how to count molecular arrangements, and mathematical descriptions of molecular shape. The first two topics are related to fundamental thermodynamics and statistical mechanics. The mathematical issues involved in describing complex shapes are related to well-known NP complete problems such as the knapsack packing problem and the identification of isomorphic subgraphs. Simple, very approximate, solutions to these problems have proven useful in a computational screening procedure that identifies molecules or molecular fragments from databases that interact with proteins and nucleic acids of known structures.


Michael Waterman (email)

Title:A New Method for Shotgun Sequence Assembly

Abstract: Shotgun sequencing is a standard method for estimating DNA sequences from sequenced fragments. The biological experiments produce a set of fragment sequences, and computational methods infer the sequences from pairwise fragment overlaps. In this talk, a computational method is proposed that applies the mathematical ideas of Eulerian graphs as developed by Pevzner for sequencing by hybridization. The new algorithm combines the data from a shotgun sequencing with the ideas from SBH.



Regular Speakers



Stephen Altschul (email)

Title:Heuristics for Searching Sequence Databases

Abstract:Variations of the BLAST algorithm have been incorporated into several popular programs for searching protein and DNA databases for sequence similarities. These programs require time proportional to the length of both the query sequence and the database searched. With the recent and prospective rapid growth in database sizes, BLAST servers are under increasing pressure to keep pace with the demands placed upon them. However, several new algorithmic ideas hold the prospect of substantially increased speed for the BLAST programs, with equivalent or even improved sensitivity to weak similarities. These ideas are only in the process of being implemented, so the behavior of the new algorithms on real sequences remains to be thoroughly tested. Nevertheless, random simulations based upon theoretical models of sequence similarity allow the sensitivity of the new approach to be analyzed.


Sarina Bromberg (email)

Title: What Makes Protein Folding Cooperative?

Abstract: Many proteins cooperatively fold into unique conformations under physiological conditions. Stable intermediates are not often detectable, making it difficult to determine how a sequence determines a structure. As a consequence, computer algorithms for folding protein sequences perform wide conformational searches. Ideally, algorithms should be guided by the energies and entropies that guide proteins to their native states. Theories of protein folding cooperativity represent different energetic bases for protein folding. The jigsaw puzzle packing model proposes a critical distance at which amino acid sidechains cooperatively lock together. The hydrophobic core model describes the cooperative segregation of hydrophobic monomers into the interior of the protein. We present computer experiments with (1) lattice models for linear chains with specific sequences of hydrophobic and polor monomers and with (2) lattice model chains having simple sidechains that support the view that sidechains need not lock. We show that core formation by some sequences of hydrophobic and polar monomers can give rise to the two state cooperativity of protein folding.

Joint work with Hue Sun Chan and Ken A. Dill, Department of Pharmaceutical Chemistry, UCSF.


Steve Bryant (email)

Title: Protein Structure Prediction by Threading

Abstract: NA


Martin Farach (home page) (email)

Title: Efficient Algorithms for Inverting Evolution

Abstract: Evolution is a stochastic process which operates on the DNA of species. The evolutionary process leaves tell-tale signs in the DNA which can be used to construct phylogenies, or evolutionary trees, for a set of species. So called Maximum Likelyhood Estimations (MLE) methods seek the evolutionary tree which is most likely to have produced the DNA under consideration. While these methods are widely accepted and intellectually satisfying, they are computationally intractable. We offer a simple polynomial time algorithm for finding a tree which approaches the most likely tree, complementing our result with a lower bound on how well any algorithm can do, no matter its computational complexity. We thus show that our simple approach is close to optimal in terms of the amoutn of DNA needed to find a "good" tree.

Joint work with Sampath Kannan.


Nat Goodman (home page) (email)

Title: The Data Management Systems Used to Map the Human and Mouse Genomes at the Whitehead Institute/MIT Center for Genome Research

Abstract: This talk describes the laboratory informatics system that supports several large scale genome mapping projects at the Whitehead Institute/MIT Center for Genome Research. These projects include a genetic map of the mouse with 7,000 microsatellite markers, a physical map of the human with 15,000 STSs, and a physical map of the mouse that will eventually contain 10,000 STSs.

The informatics system knits together a complex web of manual and automated laboratory activities. These include experiment scheduling and setup, robot control, raw-data capture, multiple stages of preliminary analysis and quality control, and release of finished results. The system is able to cope smoothly and quickly with changes in laboratory protocols which occur frequently in our Genome Center.

The system includes Unix-based programs for data analysis, data management, and workflow management. Data management is accomplished with a domain-specific DBMS called LabBase running on top of ObjectStore, a C++-based persistent object manager. Workflow management is accomplished with a collection of Perl programs that process tabular workflow descriptions to generate database queries and updates needed to coordinate workflow activities. The system also includes Macintosh-based programs for data entry and access by laboratory personnel, and Web-based programs for access by personnel within the Genome Center and around the world.


Dan Gusfield (email)

Title: Graph Traversals, Genes and Matroids: A TSP Result Arising from Sequencing by Hybridization

Abstract: In this talk we discuss graph traversal problems that arise from a particular technology for DNA sequencing -- sequencing by hybridization (SBH). We first explain the connection of the graph problems to SBH and then focus on the traversal problems. The results provide a polynomial time solution for the Travelling Salesman Problem in a rich class of directed graphs (including edge weighted binary de Bruijn graphs), and provide bounded-error approximation algorithms for the maximum weight TSP in a superset of those directed graphs. The TSP algorithm we develop has a greedy structure which is establishes the existence of a matroid defined on the set of Euler and Hamilton paths.

Joint work with Lusheng Wang and Paul Stelling, Computer Science Department, University of California, Davis.


William Hart (home page) (email)

Title: Combinatorial Issues in Core Formation for Protein Folding

Abstract: In this talk we discuss algorithmic ideas for designing efficient approximation algorithms for protein folding in the hydrophobic-hydrophilic model. We have previously described fast approximation algorithms that in the worst case construct protein structures whose energy is asymptotically within 3/8 of the optimal protein structure. Motivated by a statistical model of protein sequences, we describe an algorithm that has good average case performance. We present an analysis of this algorithm that can be used to predict its performance. Our experiments with this algorithm indicate that its expected performance is better than the expected performance of a naive folding algorithm. The structure that is generated by this algorithm forms a "sandwich" of two planes of hydrophobic residues. However, the method used to construct these planes is not directly related to our present analysis. We describe several combinatorial problems related to different approaches of constructing these planes.


Tao Jiang (home page) (email)

Title: Mapping Clones with A Given Ordering or Interleaving

Abstract: We study the problem of constructing a most compact physical map for a collection of clones whose ordering or interleaving on a DNA molecule are given. Each clone is a contiguous section of the DNA and is represented by its fingerprint obtained from biochemical experiments. In this paper, the fingerprint of a clone is either a multiset containing the sizes of the restriction fragments occurring in the clone in single complete digest mapping or a multiset containing the short oligonucleotide probes occurring in the clone in mapping by hybridization of probes. Our goal is to position the clones and restriction fragments (or probes) on the DNA consistently with the given ordering or interleaving so that the total number of restriction fragments (or probes, resp.) required on the DNA is minimized.

We first formulate this as a constrained path cover problem on a multistage graph. Using this formulation, it is shown that finding a most compact map for clones with a given ordering is NP-hard. The approximability of the problem is then considered. We present a simple approximation algorithm with ratio $2$. This is in fact the best possible as the above NP-hardness proof actually shows that achieving ratio $2-\epsilon$ is impossible for any constant $\epsilon > 0$, unless P = NP. We also give a polynomial time approximation scheme when the multiplicity is bounded by one (i.e. when the multisets are actually sets). The exact complexity of the problem in this special case is presently unknown. Finally we consider the mapping problem when an interleaving is given which depicts how the clones overlap with each other on the DNA. In the case of restriction fragment data, it is shown that finding a consistent map is NP-complete even if the multiplicity is bounded by $3$. This may suggest that informatio about the interleaving of clones does not necessarily make the problem computationally easier in single complete digest mapping. On the other hand, in the case of hybridization data, there is an efficient algorithm to construct a most compact map when the interleaving of clones is given.

This is a joint work with Richard M. Karp.


John Kececioglu (email)

Title: Computing Optimal Multiple Sequence Alignments

Abstract: NA


Alan Lapedes (email)

Title: Neural Net Representations of Empirical Protein Potentials

Abstract: Recently, there has been considerable interest in deriving and applying knowledge-based, empirical potential functions for proteins. These empirical potentials have been derived from the statistics of interacting, spatially neighboring residues, as may be obtained from databases of known protein crystal structures.

We employ neural networks to redefine empirical potential functions from the point of view of discrimination functions. This approach generalizes previous work, in which simple frequency counting statistics are used on a database of known protein structures. This generalization allows us to avoid restriction to strictly pairwise interactions. Instead of frequency counting to fix adjustable parameters, one now optimizes an objective function involving a neural network parameterized probability distribution.

We show how our method reduces to previous work in special situations, but also allows extensions to include orders of interaction beyond pairwise interaction. Given the close packing of proteins, steric interactions etc., the inclusion of higher order interactions is critical for developing an accurate potential. A key feature in the approach we advocate is the development of a representation to describe the spatial location of interacting residues that exist in a sphere of small fixed radius around each residue. This is a ``shape representation'' problem that has a natural solution for the interaction neighborhoods of protein residues. We demonstrate in a series of numerical experiments that the neural network approach improves discrimination over that obtained by previous methodologies limited to pair-wise interactions.


Pavel Pevzner (email)

Title: Gene Recognition: Combinatorics versus Statistics

Abstract: Gene structure prediction is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics and artificial intelligence and, surprisingly enough, applications of theoretical computer science methods for gene recognition were almost unexplored. Recent advances in large-scale cDNA sequencing open a way towards a new combinatorial approach to gene recognition. This paper describes a spliced alignment algorithm and a software tool which explores all possible exon assemblies in polynomial time and finds the multi-exon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully recognizes genes even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons. On a test sample of human genes with known mammalian relatives the average correlation between the predicted and the actual genes was 99%, which is a very high accuracy as compared with other existing methods. The algorithm correctly reconstructed 87% of genes and the rare discrepancies between the predicted and real exon-intron structures were caused by either (i) extremely short (less than 5 amino acids) initial or terminal exons, or (ii) alternative splicing, or (iii) errors in database feature tables. Moreover, the algorithm predicts human genes reasonably well when the homologous protein is non-vertebrate or even prokaryotic. The surprizingly good performance of the method was confirmed by extensive simulations: in particular, with target proteins showing just 25% similarity, the correlation between the predicted and actual genes still was as high as 95%.

This is a joint work with Mikhail Gelfand and Andrey Mironov.


Cindy Phillips (email)

Title:Constructing Evolutionary Trees in the Presence of Polymorphic Characters

Abstract:Most phylogenetics literature and construction methods based upon characters presume monomorphism (one state per character per species), and yet polymorphism (multiple states per character per species) is well documented in both Biology and Historical Linguistics. In this paper we consider the problem of inferring evolutionary trees for polymorphic characters. We show efficient algorithms for the construction of perfect phylogenies from polymorphic data. These methods have been used to help construct the evolutionary tree proposed by Warnow, Ringe, and Taylor for the Indo-European family of languages, which was presented by invitation at the National Academy of Sciences this November.

Joint Work with Maria Bonet, Tandy Warnow, and Shibu Yooseph, University of Pennsylvania.


R. Ravi (home page) (email)

Title: Lawler's taxonomy of problems in Phylogeny and Alignment.
(Dedicated to the memory of Eugene Lawler)

Abstract: I will present a taxonomy of problems related to computing a phylogenetic tree and a multiple alignment of a given set of sequences, that is due to Eugene Lawler. In particular, we consider three parameters: (1) a tree topology with the given sequences typically labeling the leaves, (2) ancestral sequences that appear at internal nodes in such a tree topology, and (3) a multiple alignment of the given sequences. Depending on whether each of these parameters is known or unknown as input to the problem, we arrive at eight possible problems relating phylogeny and alignment. In casting these problems, we use two principles: the maximum parsimony principle for phylogenetic reconstruction and the sum-of-pairs objective for multiple alignments. We will also briefly review algorithmic strategies for tackling the different problems that arise.


Ron Shamir (home page) (email)

Title: How hard is the integration of physical maps?

Abstract: The mapping efforts of the human genome project will soon be changing emphasis from de-novo generation of maps into the integration of data from various existing maps into new, better and more accurate maps. In this study we address some basic problems which arise in map integration. Specifically, given clone overlap data, we study the algorithmic problems arising by the introduction of additional information on clone sizes or clone order.

We use graph-theoretic tools: Clone overlap data is modeled by an interval graph. When the additional information is order constraints on pairs of disjoint intervals, we give a linear time algorithm. Extant algorithms for this problem required quadratic time. When the constraints are bounds on distances between endpoints, and the graph admits a unique clique order, we show that the problem is linearly equivalent to network flow and hence polynomial. However, for arbitrary interval graphs, even when the lengths of all intervals are precisely predetermined, the problem is NP-complete. Side constraints on unit interval graphs will also be considered.

Joint work with Itsik Pe'er, Tel Aviv University


Tandy Warnow (email)

Title: New Models for Inferring Consensus Trees

Abstract: The construction of evolutionary trees is fundamental to many scientific endeavors. Unfortunately, it is often extremely difficult to determine with complete precision the best tree for a given data set. Although analytical results show that infinitely long biomolecular data generated under narrowly defined conditions will produce a single optimal tree which can be constructed in polynomial time, other analytical results show that when the assumptions are relaxed there can be more than one tree generating the same data. Furthermore, real data often doesn't quite fit the idealized models used to generate these analytical results, and even when it does, data of finite length can have more than one optimal tree. Finally, because most optimization problems in evolutionary tree construction are NP-hard, finding the optimal trees (whether unique or not) can be taxing. To summarize, trying to infer what ``really" happened in evolutionary history for real data sets is often quite difficult, and inferring the consensus of a set of differing hypotheses about the evolutionary history is a standard component of many existing software packages for phylogeny construction.

Unfortunately, the consensus models upon which current methods are based (strict consensus and majority consensus) often lose valuable information and thus are of limited applicability for many real data sets. In this talk, we will present new models for inferring consensus of different evolutionary trees, and show how these consensus trees can be computed efficiently. We compare these new models to existing models and show that these models can provide significantly more information about the evolutionary history than previous models.

This is joint work with Monika Henzinger. Sampath Kannan, Valerie King, Cindy Phillips, and Shibu Yooseph.



Poster Abstracts



Title: Partial Refolding of Ubiquitin: Molecular Dynamics Simulations of Hydrophobic Collapse

Author:
Darwin Alonso and Valerie Daggett
Department of Medicinal Chemistry, H165
University of Washington Seattle, Washington 98195-7610
alonso@torre.mchem.washington.edu

Abstract:Unfolded and partially unfolded conformations of ubiquitin, a globular protein of 76 residues, were simulated under native conditions. The non-native starting structures were generated during two different thermal denaturation trajectories. All simulations included all protein and solvent atoms explicitly, and simulation times ranged from 1-2.4 ns. The seven non-native starting structures for the simulation under native conditions had alpha-carbon root mean square (RMS) deviations between 4-11 angstroms from the crystal structure and radii of gyration as high as 1.3 times that of the native state. The radii of gyration of the protein calculated during these seven simulations converged in all but one case with that of a simulation that started from the crystal structure. In contrast, none of the structures collapsed when simulated in a 60% methanol:water mixture. There was a disproportionately high burial of hydrophobic residues accompanying collapse, but for the most non-native starting structures, intraprotein hydrogen bonding did not increase concomitant with collapse. The RMS deviation to the crystal structure decreased from 5 to 3 angstroms during the course of the simulation of one of the most native-like starting structures. The less native-like starting structures acquired some native secondary structure and tertiary contacts during the the simulation but remained non-native as judged by RMS deviation to the crystal structure. The least native-like starting structures collapsed, expanded and collapsed again, and the later collapsed state had and increased burial of hydrophobic residues relative to the earlier collapsed state. Although these molecular dynamics simulations are orders of magnitude too short in duration to follow an entire folding trajectory; taken as a group representing the changes that occur during simulations of structures ranging from unfolded to native-like, they suggest that the the early events in protein folding involve cyclic collapse and expansion of single protein molecules until approximately native contacts are formed. Then, there is further consolidation of the core, better and more specific packing interactions develop, and specific hydrogen bonds form.


Title: The megaprior heuristic for discovering protein sequence patterns

Author:
Timothy L. Bailey and Michael Gribskov
San Diego Supercomputer Center
P.O. Box 85608
San Diego, California 92186-9784
tbailey@sdsc.edu and gribskov@sdsc.edu

Abstract: Several computer algorithms for discovering patterns in groups of protein sequences are in use that are based on fitting the parameters of a statistical model to a group of related sequences. These include Hidden Markov model (HMM) algorithms for multiple sequence alignment, and MEME and the Gibbs sampler for discovering motifs. These algorithms are sometimes prone to producing models that are incorrect because two or more patterns have been combined. The statistical model produced in this situation is a convex combination (weighted average) of two or more different models. This paper presents a solution to the problem of convex combinations in the form of a heuristic based on using extremely low variance Dirichlet mixture priors as part of the statistical model. This heuristic, which we call the megaprior heuristic, increases the strength (i.e., decreases the variance) of the prior in proportion to the size of the sequence dataset. This causes each column in the final model to strongly resemble a single component of the prior, regardless of the size of the dataset. We describe the cause of the convex combination problem, analyze it mathematically, motivate and describe the implementation of the megaprior heuristic, and show how it can effectively eliminate the problem of convex combinations in protein sequence pattern discovery.


Title: Classifying Random and Real DNA Sequences Using Multi-Dimensional Scaling

Author:
Jonathan D. Blake
Department of Computer Science
Box 5101
Tennessee Technological University
Cookeville, TN 38505
jdblake@csc.tntech.edu

Abstract: Multi-Dimensional Scaling (MDS) refers to a class of techniques that reduces a set of N objects to a set of N points in a smaller space (often 2- or 3-space) in such a fashion that distances between the N points preserve, as best as possible, the original inter-object proximities. Proximities are values that are used to measure the relative similarities or dissimilarities between pairs of objects. Often these values are numeric, and can thus be thought of as Euclidean distance measures. MDS is particularly useful when the number of dimensions of the input objects is prohibitively high, and not easily representable. The technique works by creating a set of N candidate points, and shifts them independently in the candidate space to minimize disagreement between their inter-point distances, and the original inter-object distances.

MDS techniques can be used to distinguish objects within any collection for which proximities can be generated. This includes linear strings of monomeric sequence elements of the bio-informational polymers, the nucleic acids, and the proteins. The measure used to determine proximities between these biomolecules is based on the fractional occurrences of the alphabetic features, bases or amino acids respectively. The particular sequence of the many millions of bases in a DNA reflects the information content that is embedde in it, and therefore could be used to form a metric for distinguishing sequences.

If we consider sub-sequences of K contiguous bases as "words" in a DNA sequence, then there are predetermined vocabularies (depending on the size of K) that are common for all DNAs. The fractional contents of each word of one of these vocabularies can be measured for a collection of DNAs.

The fractional contents of overlapping words (K-tuples) in DNA are measured as real numbers. Each DNA can be represented as a vector of length W (where W is the size of the K-tuple vocabulary. in the case of DNA, W = 4^K) representing a point in W-space.

Since MDS uses proximity measurements between objects to determine the final geometric configuration of points, we look to the Euclidean distance between any two DNA sequences of a collection to determine their proximity. The task of displaying in representable 2- or 3-space the classification of a collection of DNA sequences based on their fractional contents of K-tuples can be achieved by MDS.

This project explores the efficacy of MDS for distinguishing DNA sequences of various types. Initial tests were carried out on a large collection of random DNA sequences where base content was the only distinguishing feature. It is demonstrated that MDS clearly classifies the sequences geometrically based on the four independent factors distinguishing them: %A, %C, %G, and %T.

The analysis has been extended to actual DNA sequences from the NCBI database. Preliminary results demonstrate the problem of distinguishing sequences of different species and function, as well as when attempts are made to normalize for biases in base composition.


Title: On Physical Mapping Algorithms - A Fault-Tolerant Test for the Consecutive Ones Property

Author:
Wen-Lian Hsu
Institute of Information Science
Academia Sinica
Taipei, Taiwan, R.O.C.
hsu@iis2.iis.sinica.edu.tw

Abstract:An important problem in physical mapping is to test the consecutive ones property of a (0,1)-matrix: that is, whether it is possible to permute the columns so that each row of the resulting matrix has the ones occur in a consecutive block. This is useful, for example, in probe hybridization for cosmid clones and in the STS content mapping of YAC library. The linear time algorithm by Booth and Lueker (1975) for this problem has a serious drawback: the data must be error-free. However, laboratory work is never flawless. We devised a new algorithm for this problem, which has the following advantages: 1. conceptually, it is very simple; 2. it produces a matrix satisfying the consecutive ones property in linear time when the data is error-free; 3. with reasonable assumptions, it can accomodate the following three types of errors: false negatives, false positives and chimeric clones; 4. in the rare case that the assumptions in 2 are not satisfied, our algorithm would suggest additional lab work that could reduce the degree of ambiguity.


Title:Visualizing Measures of Genetic Distance

Authors:
Eric M. Jordan
Laboratory for Computer Science
Massachusetts Institute of Technology
545 Technology Square
Cambridge, MA 02139e
emjordan@theory.lcs.mit.edu

Abstract:An intuitive way to try to measure the distance between two permutations is to write the two permutations above each other on a piece of paper, and draw lines between the corresponding elements. The number of times that any two lines cross can be used as a quantitative measure of how different the permutations are. This has the advantage that it is automatically accompanied by an illustration that can be used to qualitatively visualize what is being measured. It was described in (Cedergren and Abel, 1990). However, this measure is inappropriate when the permutations are related by global operations, like transpositions and reversals. For the 1995 DIMACS implementation challenge, we showed that certain quatities used to measure global permutation distance have a topological interpretation that allows them to be qualitatively visualized as well. Rather than having lines cross as they travel from one permutation to the other, we use handles and crosscaps (intuitively, bridges attached to the paper) to avoid crossing. It turns out that the number of crosscaps needed to do this is a lower bound on the signed reversal distance that has been used in branch and bound algorithms and in approximation algorithms (Bafna and Pevzner, 1995), and that the number of handles is a lower bound on the number of transpositions (Kececioglu and Sankoff, 1994; Hannenhalli and Pevzner, 1995; Bafna and Pevzner, 1993). For the implementation challenge, we wrote a program that takes a permutation and displays the associated surface. We hope that our program will reveal intuition that can be used to further the research on the genome rearrangement problem, and will allow researchers to visualize the relationship between any two specific sequences. We are continuing our research on this problem and are improving the implementation. In particular, we hope to have our program connected to a VRML site, so that participants at the workshop can experiment with this new visualization of genetic distance.


Title:STALK: A Molecular Docking System

Authors:
David Levine
Mathematics & Computer Science Division
Argonne National Laboratory
levine@mcs.anl.gov

Fred Stevens
Center for Mechanistic Biology and Biotechnology
Argonne National Laboratory
stevens@anlbem.bim.anl.gov

Abstract:STALK is a system for studying the docking of a ligand molecule into a protein binding site. The problem is posed as an optimization problem where the objective is to maximize the intermolecular interaction energy between the two molecules by minimizing the free energy of the system. Because the search space can be very large, we use a parallel genetic algorithm to explore the possible conformations. The problem we are ultimately interested in is amyloid formation; a form of protein self-assembly that occurs in many diseases including Alzheimers.

STALK consists of two parts; a numerical simulation program and a (optional) visualization program. The numerical simulation program uses a simple energy function that incorporates Coulombic and van der Waals interactions. The degrees of freedom are the translation and rotation of the ligand molecule. Since evaluating this potential is very time consuming for large problems, and because we must evaluate many such potentials each iteration, we use the PGAPack parallel genetic algorithm library on an IBM SP POWERparallel computer. Results to date indicate that by taking advantage of the inherent parallelism in the genetic algorithm hundreds of thousands of possible receptor-ligand interactions can be evaluated within a few minutes.

The visualization program runs in the CAVE virtual-reality environment and allows real-time visualization of the results. Each iteration of the genetic algorithm, the lowest energy conformation found by the GA is displayed. This allows us to watch the progress of the genetic algorithm and to examine the best conformation(s) found. In addition, the user can change the position of the ligand molecule and restart the simulation. This capability allows a scientist to use his or her intuition to assist the algorithm in finding a low-energy conformation, or to specify alternative starting positions from which to study the docking process.

This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Computational and Technology Research, U.S. Department of Energy, under Contract W-31-109-Eng-38 and the Office of Health and Environmental Research, U.S. Department of Energy, under Contract W-31-109-Eng-38.


Title: ParaMEME: Discovering DNA and Protein Motifs with a Scalable Parallel Computer

Authors:
William N. Grundy
Department of Computer Science and Engineering
University of California at San Diego
bgrundy@cs.ucsd.edu

Timothy L. Bailey
San Diego Supercomputer Center

Charles P. Elkan
Department of Computer Science and Engineering
University of California at San Diego

Abstract: Many advanced software tools fail to reach a wide audience because they require specialized hardware, installation expertise, or an abundance of CPU cycles. The worldwide web offers a new method of distributing such systems. A web interface can provide access to a centralized server running the software on a specialized machine. This distribution paradigm is particularly useful to inexperienced or occasional users who may not want or be able to install the software locally. It also facilitates software maintenance and upgrades.

MEME is one such advanced software tool. MEME discovers repeated patterns, called motifs, in DNA or protein sequences. The principal input to MEME is a set of sequences, and its principal output is a series of probabilistic sequence models, each corresponding to one motif, whose parameters have been estimated by expectation-maximization.

When no background knowledge is assumed, MEME obtains increased robustness from a method for determining motif widths automatically, and from probabilistic models that allow motifs to be absent in some input sequences. On the other hand, MEME can exploit prior knowledge about motifs being present in all input sequences, about the length of a motif and whether it is a palindrome, and (using Dirichlet mixtures) about expected patterns in individual motif positions. In experiments involving 75 Prosite protein families, MEME successfully identifies 119 out of 135 known motifs.

MEME is now available to biologists over the worldwide web, using an asynchronous, single-program multiple-data version of the program called ParaMEME that runs on an Intel Paragon XP/S parallel computer at the San Diego Supercomputer Center. ParaMEME scales gracefully to 64 nodes on the Paragon with efficiencies above 72% for large data sets. The worldwide web interface to ParaMEME accepts a set of sequences interactively from a user, submits the sequences to the Paragon for analysis, and emails the results back to the user. ParaMEME is available for free public use through http://www.sdsc.edu/CompSci/Biomed/MEME.

Keywords: intelligent system, supercomputer, worldwide web, sequence analysis, parallel computing.


Title: SEQIO: A Package for Reading and Writing Sequence Files

Authors:
James R. Knight
Dept. of Computer Science
Univ. of California, Davis
Davis, CA 95616
knight@cs.ucdavis.edu

Abstract:Almost anyone who performs sequence analysis or works with computerized databases eventually runs into the problem of wanting to do ``something else.'' Whether it's extracting entries from a database based on a relationship no one has provided software for, or writing new software that can handle different sequence file formats, the question then becomes, ``Are the results gained from doing this `something else' really worth the effort?'' The SEQIO package is a set of C functions that has been designed to reduce that effort, both for people with little or no programming experience and for more experienced software developers.

For non-programmers, the SEQIO package's release also comes with several complete programs. One of them is a file conversion program based on Don Gilbert's ``readseq'' program but with several improvements, the most interesting of which is the ability to parse FASTA output and construct a multiple alignment of some or all of the pairwise local alignments.

For those with a little programming experience, the SEQIO package code has been designed to make reading and writing sequence files as simple to use as the C stdio package. For example, the complete program in the next column scans a database's sequences and outputs the entry text for every entry whose sequence has a match to the given keyword. This package is ideal for writing those ``quick and dirty'' programs to extract entries from a database or perform a new analysis on the sequences of a file or database. (The functions seqfopendb, seqfgetseq and seqfentry are SEQIO functions that open a database, read the next sequence and return the current sequence's entry. strstr is a C library function that finds the first occurrence of its second argument in its first argument.)

int main(int argc, char *argv[])
{
  int len;
  char *seq, *entry;
  SEQFILE *sfp;
 
  if (argc != 3) {
    fprintf(stderr, "Usage:  prog keyword dbase\n");
    exit(1);
  }
 
  if ((sfp = seqfopendb(argv[2])) == NULL)
    exit(1);
 
  while ((seq = seqfgetseq(sfp, &len, 0)) != NULL) {
    if (len > 0 && strstr(seq, argv[1])) {
      entry = seqfentry(sfp, NULL, 0);
      fputs(entry, stdout);
    }
  }
  seqfclose(sfp);
}
For more experienced programmers, the SEQIO package is an general purpose, efficient and soon to be cross-platform module for reading and writing a number of different sequence file formats. It can read and return sequences and entries, as well as extract other information from an entry, such as identifiers, descriptions, organism names, comments and other things. It can handle large sequences and large databases efficiently (the above program took less than 8 minutes on a DEC 5000 to search all of GenBank Release 87.0, about 800MB of text and 250MB of sequence, for a random twenty character sequence).

The package supports the GenBank Flat File, PIR/CODATA, EMBL/Swiss-Prot, FASTA, NBRF, IG/Stanford, Phylip, ASN.1 text file and FASTA output formats. It is compatible with most of the Unix variants, and by conference time should be ported to Windows NT. And, the addition of new formats or the support of other operating systems only requires willing volunteers to provide examples of the format or to help me test and port the package onto a new machine.

The SEQIO package is freely available to anyone by anonymous ftp from ftp.cs.ucdavis.edu (in file /pub/strings/seqio.tar.gz). It's a gzip'ed, tar file containing the package code, documentation files and the example programs.


Title: Rigid Molecular Docking by Surface Registration

Authors:
Doug Ierardi and Seongbin Park
Department of Computer Science
University of Southern California
separk@pollux.usc.edu

Abstract: We consider the rigid molecular docking problem: Given 3 dimensional structures for a ligand and a receptor molecule, find a rigid body transformation that achieves the best fit of the ligand into the receptor, without regard to conformational changes upon binding. Our approach is based on the heuristic that in many instances, the greater the surface area of contact between molecules, the better the likelihood of a good match. Unlike other surface complementarity based method, we do not impose surface features on the molecular surfaces. Instead, we sample points uniformly on the molecular surface and fit approximate "smoothed" molecular surface into these points as a collection of surface patches.

Our procedure then continues as follows. For each pair of surface patches, one taken from a ligand and the other from a receptor, we find their best registrations. The 6 parameters of rigid body transformations that result in these registrations are stored into a hash table with number of occurrences (votes) associated with them. When we finish this pairing match computations, we can find the 6 parameters that determine a good candidate global rigid body transformation. This is the one that has the most votes in the hash table.

There are three parameters that affect our experimental results. First is the radius of surface patches, which affects the amount of overlapping done. Second is the sampling density for points on the molecular surface. Third is the resolution of the discretization of the 6 dimensional parameter space.

We have run the procedure in two ways: First, as described above with a given parameter set. Then, iterating a program with finer resolutions parameters to refine the candidate answers from the previous run. Our implementation used a hash table of 5M entries. All test data were taken from the Brookhaven Protein DataBank. We have tested our procedure with different set of parameter sets and the ones that had good results are shown below.

 
-------------------------------------------------------------------
Complex   Sampling density   Smoothness   Docking   Best RMS   Rank
-------------------------------------------------------------------
1fkf            3.5             4.0       Global    1.571618    36 
                                          Indexed   1.493737    69 
-------------------------------------------------------------------
1hew            1.7             2.0       Global    2.568764   100
                                          Indexed   2.542256    85 
-------------------------------------------------------------------
1hef            1.7             3.5       Global    4.553256    11
                                          Indexed   3.842648    25
-------------------------------------------------------------------
1cho            3.5             1.5       Global    2.888747    18
                                          Indexed   2.364316    13
-------------------------------------------------------------------

Title: Genetic Algorithms for DNA Sequence Assembly

Authors:
Rebecca J. Parsons(1), Stephanie Forrest(2), Mark E. Johnson(3)
(1) University of Central Florida, Department of Computer Science
(2) University of New Mexico, Department of Computer Science
(3) University of Central Florida, Department of Statistics
rebecca@cs.ucf.edu

Abstract:The assembly of DNA sequences is one of the many combinatorially challenging aspects of large-scale sequencing projects. We have been using this problem as a testbed for exploring various aspects of the performance of genetic algorithms on permutation problems. This experimentation has led us to the design of specialized operators for the DNA sequence assembly problem that mirror some of the evolutionary changes occuring in DNA. Additionally, we have been led to non-traditional parameter settings for the genetic algorithm. We are continuing to investigate the issues contributing to the need for these non-traditional setting. An understanding of these issues could simplify the application of genetic algorithms to other permutation problems related to the genome project. Through these experiments, we have significantly improved the performance of the genetic algorithm on realistically-sized data sets. Specifically, we have been able to assemble data sets for a 35KB parent represented by approximately 800 fragments. This poster will report on the current status of the genetic algorithm as it applies to this problem and will describe the steps that we have taken in improving the genetic algorithm's performance. We will briefly describe the analysis of the genetic algorithm and how this is impacting our theorectical understanding of the performance of genetic algorithms for permutation problems.


Title: Functional relations between a collection of transcriptional regulators of sigma70 promoters

Authors:
Ernesto Perez-Rueda
C.I.F.N. U.N.A.M. Lab. de Biologia Computacional

Abstract:We have analyzed a collection of transcriptional regulators of sigma70 promoters in E. coli in terms of amino acid sequences, regulatory properties, DNA binding domains (Amino or carboxyl terminal) and DNA binding sites. When the regulatory proteins were grouped by regulatory activity vs. dna-binding motif position (N- or C-terminal end) we obtained very functional stable families such as GalR/LacI and LysR (repressor, activator or dual activity). We explain possible hypotheses about this functional relation and evolve implications.


Title: Pattern Recognition Using Evolutionary Algorithms Applied to Understanding Water-Mediated Recognition in Proteins

Authors:
Michael L. Ramyer(1,2), William F. Punch(2), Erik D. Goodman(3), Paul C. Sanschagrin(1), and Leslie A. Kuhn(1)

(1) Department of Biochemistry, Michigan State University
(2) Department of Computer Science, Michigan State University
(3) Case Center for Computer-Aided Engineering, Michigan State University

Abstract: (postscript)


Title: 1X Genomic Sequencing and Computational Analysis of a Megabase region of Human Chromosome 16p13 with SAmple SEquencing (SASE) Analysis and Sequence Comparison ANalysis (SCAN)

Authors:
Darrell O. Ricke, Judith M. Buckingham, A. Christine Munk, Rebecca Lobb, Elizabeth H. Saunders, Han-Chang Chi, Jingmei Liu, Ying Shi, David C. Bruce, J.-R. Wu, Norman A. Doggett, Michael R. Altherr, Larry L. Deaven, and Robert K. Moyzis
Center for Human Genome Studies
Los Alamos National Laboratory
Los Alamos, NM 87545
ricke@lanl.gov

Abstract:The human chromosome 16 physical map (Doggett et al., Nature 377:Suppl:335- 365, 1995) provides the ideal framework for sequencing a human chromosome. We are using a SAmple SEquencing (SASE) approach to rapidly generate low redundancy sequences along the chromosome 16 physical map. SASE analysis is a method for rapidly RscanningS large genomic regions with minimal cost, while identifying and localizing most genes. SASE analysis provides the initial step to complete large scale genomic sequencing. Briefly, individual mapped clones are partially digested with Sau3A and 3 kb fragments are recloned into double-strand sequencing vectors. By sequencing both ends of a 1X sampling of these recloned fragments along with end sequences of the clone, 70% sequence coverage is achieved with 98% clone coverage. The majority of this clone coverage is ordered by the relationship between the subclone end sequences. These ordered sequences are ideal substrates for directed sequencing strategies. SASE analysis has been initiated on the 40 Mb short arm of chromosome 16. Sample sequencing of the first megabase region of 16p13 has been completed. We propose to make chromosome 16 SASE sequences, along with feature annotation, publicly available through GSDB. Such data are sufficient to allow PCR amplification of the sequenced region from GSDB submissions alone, eliminating the need for extensive clone archiving and distributing. Therefore, SASE analysis provides the opportunity for numerous laboratories to complete the distributive genomic sequencing of chromosome 16. To identify and annotate regions of biological interest, we are developing the SCAN (Sequence Comparison ANalysis) algorithm to extract and identify significant homologies from database search results of SASE data. Briefly, SCAN integrates the results from Blast, Fasta, and Grail into summary reports and WWW HTML pages with hot-links to GSDB and NCBI sequence database entries. Multiple overlapping homology results are displayed in a multiple sequence alignment format or reduced to a two line summary for repeats, vector, and E.coli homologies. SCAN results on SASE sequences are identifying multiple candidate genes and exons. Thus far, more than 100 candidate exons, tRNAs, snRNAs, etc. have been identified in the megabase region. This method is localizing many EST sequences with no additional mapping overhead. This work is funded by USDOE under contract W-7405-ENG-36.


Title: A combinatorial approach to genotype-phenotype studies

Authors:
Hugh Salamon
c/o Glenys Thomson
Dept. Integrative Biology
3060 VLSB
University of California
Berkeley, CA 94720
salamon@allele5.biol.berkeley.edu

Jorma Tarhio
Dept. Computer Science
P.O. Box 26 (Teollisuuskatu 23)
FIN-000014 University of Helsinki
Finland

Abstract:Biological problems defining the relationships between sequence data and traits influenced by polymorphic genetic regions occur in studies of both diploid and haploid genomes. There is a need for computational methods which can efficiently analyze DNA and amino acid sequence data to isolate the genetic patterns responsible for complex genetic traits. The unique combinations method is a novel method recently developed for investigating phenotypically contrasting sets of aligned polymorphic sequences. The method identifies all minimal sets of sites at which variants distinguish a given sequence from a set of different aligned sequences, without investigation the 2^n - 1 distinct combinations of n polymorphic sites that potentially distinguish a sequence from a set of different sequences. A set covering approach is exploited in the efficient algorithm which has performed well on human nuclear sequence data. An extension to the method permits identification of combinations of variants distinguishing every member of a set of aligned sequences from a second set. A further extension of the method permits analysis of multilocus diploid genotype data. The unique combinations method has proven very useful in defining the influence of human class II MHC loci in the insulin-dependent diabetes mellitus (IDDM) trait, providing much improved risk assessment for IDDM using amino acid variants in the antigen presenting molecules encoded by class II MHC loci (Salamon et al. 1996). Overcoming combinatorial problems is essential in analysis of complex genetic traits influenced by highly polymorphic genetic loci. The algorithm incorporated into the unique combinations method will also solve other combinatorial problems useful in genotype-phenotype studies of biological sequence data. Applications in medical and molecular evolutionary genetic studies will be aided by further development of software to implement these methods.


Title: Integration and analysis of E. coli gene regulatory data

Authors:
Dr Denis Thieffry
Lab. de Biologia Computational
DEM - CIFN - UNAM
A.P. 565-A, Cuernavaca
Morelos 62100, Mexico
denis@cifn.unam.mx

Abstract:


Title: Molecular modeling of v-FPS kinase

Authors:
Igor Tsigelny (1), Venus Oumarova (2), Lynn F. Ten Eyck (1,3), and Joseph A. Adams (2)
(1) University of California, San Diego
(2) San Diego State University
(3) San Diego Supercomputer Center
teneyckl@sdsc.edu

Abstract:A molecular model of the v-FPS protein kinase was buile using the catalytic subunits of Insulin Receptor Kinase (IRK) and the cyclic AMP dependent protein kinase (cAPK). The active form was modeled with an activation loop based on homology with cAPK, and an inactive form based on the activation loop of IRK. The models were relaxed using simulated annealing with MOPAC. The electrostatic potential fields of the models were used to predict interactions with three potential peptide inhibitors.

The electrostatic potential field of v-FPS is mostly positive and quite different from that of either IRK or cAPK. This leads to strong differences in the electrostatic profiles of possible substrates and ihnibitors which should have a net negative charge. Electrostatic docking calculations have given useful starting points for docking model substrates and inhibitors by simulated annealing.

This research is being used to generate hypotheses for the experimental design of new inhibitors for v-FPS. The models by themselves are of course highly speculative, but they provide useful guidance for experiments both in the design of inhibitors and mutational analysis of the v-FPS kinase.


Title: Docking of small molecules to acetylcholinesteras using dot

Authors:
Igor Tsigelny and Lynn F. Ten Eyck
University of California, San Diego
and
San Diego Supercomputer Center
teneyckl@sdsc.edu

Abstract:The new program package DOT was developed for rapid docking of molecules. It uses convolutions for evaluating the interaction energy of one molecule in the potential field of the second. Fast Fourier Transform methods are used to speed up the energy calculations.

We calculated the docking of different crystallographic structures of acetylcholine to mouse and torpedo acetylcholinesterases (AChE) with and without the protein inhibitor fasciculin. The results of these calculations shed light on the precise mechanisms of penetration of the acetylcholine molecules into the active site of AChE.

The calculations are highly parallel and complete in less than 5 hours on a Digital Equipment Corporation alpha-farm with 10 computersprocessors for a 128x128x128 grid sampled at 1 A intervals. The method thus scales well for large problems. The program is parallelized using xpvm and runs on systems ranging from parallel supercomputers to workstation farms and collections of heterogeneous workstations.


wehart@cs.sandia.gov
Thu Jul 27 14:30:08 MDT 1995