BCB 2003
Bertinoro Computational Biology Meeting 7-13 June 2003
University of Bologna Residential Center
Bertinoro (Forlì), Italy
[ What the Meeting is About
| Seminar Schedule
| Important Dates
| Location
| How to Reach Bertinoro
| List of Participants
| Abstracts
| Organization and Sponsorship
Local Weather Forecast]

What the Meeting is About

The meeting will provide an interdisciplinary environment for presentations as well as time for informal interaction. The meeting will focus on the following three areas (1) evolution, including genome evolution, (2) motif finding in the context of gene regulation (and gene regulation more generally), and (3) SNPs.

It is expected that several biologists and geneticists will participate.

Besides the speakers, about 20 PhD students, for whom financial aid is available, are expected to participate. An expression of interest can be sent to Alessandro Panconesi

Seminar Schedule

08.00-09.00 arrivals breakfast
09.45-10.25 Eskin Pachter Sagot Linial   departures
10.25-10.55 coffee coffee coffee coffee coffee
10.55-11.35 Friedman Lagregren Tompa Tramontano Yakhini
11.35-12.05 Kifer El-Mabrouk Csüros Shefi Preparata (till 12:25)
12.30-13.30 lunch! lunch! lunch! lunch! lunch!
14.30-15.10 Pevzner Durand Pupko Sightseeing Chor
15.10-15.40 coffee! coffee! coffee! coffee!
15.40-16.00 Buhler Sankoff Panconesi Gramm
16.10-16.30 Caprara   Hallett Lancia

Important Dates

Arrival: Suturday 7 June 2003
Departure: Friday 13 - Saturday 14 June, 2003


The meeting will be held in the small medieval hilltop town of Bertinoro. This town is in Emilia Romagna about 50km east of Bologna at an elevation of about 230m.  Here is a map putting it in context. It is easily reached by train and taxi from Bologna and is close to many splendid Italian locations such as Ravenna, a treasure trove of byzantine art and history, and the Republic of San Marino (all within 35km) as well as some less well-known locations like the thermal springs of Fratta Terme and the castle and monastic gardens of Monte Maggio.  Bertinoro can also be a base for visiting some of the better-known Italian locations such as Padua, Ferrara, Vicenza, Venice, Florence and Siena.

Bertinoro itself is picturesque, with many narrow streets and walkways winding around the central peak.  The meeting will be held in a redoubtable ex-Episcopal fortress that has been converted by the University of Bologna into a modern conference center with computing facilities and Internet access.  From the fortress you can enjoy a beautiful the vista that stretches from the Tuscan Apennines to the Adriatic coast.

How to Reach Bertinoro

List of participants

Organization and Sponsorship

Scientific Organizing Committee Benny Chor Tel Aviv University
Jens Lagergren, KTH, Stockholm
Pavel Pevzner University of California at San Diego
Local Organization
Andrea Bandini, Elena Della Godenza, Centro Congressi di Bertinoro
Sponsored by BICI   Bertinoro International Center for Informatics


Modeling Transcription Factor Binding Sites and Their Relation to High-throughput Genomic Assays
by Nir Friedman, Hebrew University

The availability of whole genome sequences and high-throughput genomic assays opens the door for in silico analysis of transcription regulation. This includes methods for discovering and characterizing the binding sites of DNA-binding proteins, such as transcription factors.

I will attempt to discuss several questions:

* How to represent the binding site of a transcription factor?

A common approach is to use position specific score matrices (PSSMs) for representing probability distributions over possible sequences at the binding site. The PSSM model makes the strong assumption that binding site positions are independent of each other. I will describe Bayesian network representations of binding sites that provide different tradeoffs between complexity (number of parameters) and the richness of dependencies between positions.

* How to learn such representation from genomic data?

I will discuss how to adapt machine learning tools for learning such richer binding site models from data and for estimating the statistical significance of putative binding sites.

* How to relate the learned representations to other genomic assays?

Can we use these learned representation to "explain" gene expression measurements and other data?

Duration: 40 minutes

Gene expression patterns in blood cells classify post-traumatic stress disorder among trauma survivors
by Noa Shefi, Hebrew University

Joint work with Nir Friedman and Ronnen Segman

Posttraumatic Stress Disorder (PTSD) is an idiopathic anxiety disorder, which affects up to 20% of individuals exposed to extreme events and commonly follows a chronic treatment refractory course. The immediate clinical presentation among trauma survivors has a low predictive value for long term course. I will present a new work, in which, using various computational methods, we show for the first time that four months after trauma exposure, microarray based large scale gene-expression profiles of peripheral mononuclear blood cells (PMBC), classify survivors of trauma into those remaining healthy and those developing chronic PTSD. In addition, expression patterns of further subsets harvested hours or four months after trauma exposure, correlate with prospectively observed decrease in symptoms among a subset of trauma survivors showing high immediate stress reactivity.

Duration: 20 min.

Algorithmic approach for detecting functional regions
by Tal Pupko, Florida State University

We present a new algorithm to identify functionally important regions (patches) on the surface of proteins. We identify positions with low evolutionary rates that correspond to conserved regions, and thus are likely to be functionally important. To detect functional sites we develop an algorithm that searches for a "patch" of several amino acids that are both conserved and physically close to each other on the surface of the protein. Our algorithm combines the evolutionary information (for assigning a rate for each site) and the 3D information (for searching the functional regions). The algorithm is a probabilistic one: We assign a probability to each possible "patch" and find the patch with the highest probability.

Duration: 40 minutes.

Detecting non-adjoining dependencies in molecular sequence data
by Mike Hallett, McGill University

Molecular sequence data contain many evolutionarily conserved stretches of nucleotides or amino acids that act as {\em signals}. This paper extends a graph theoretic framework from Agarwal and Bafna that detects and models adjoining and non-adjoining dependencies in these signals. In their method, a vertex corresponds to a random variable modelling a position in the signal and the weight of each arc $\arc{u,v}$ is the information content of random variable $v$ given $u$. Whereas previous work allows each position in the signal to be dependent on only one other position ($l=1$), we focus on the case where a position may be dependent on a number of other positions ($l \geq 1$). We offer several computational strategies (approximation, fixed parameter algorithms, heuristics) to overcome two significant obstacles in this extended framework. Firstly, the computational problems for $l > 1$ are intractable. Secondly, a solution to their graph theoretic problem does not necessarily imply the existence of a Markov model, when pairwise information content weights are used. We give some experimental comparisons of the methods using synthetic data.

Joint with N. Dutil

Duration: 20 minutes

Selecting Targets for Structural Determination Based on Clustering the Protein Space
by Ilona Kifer, Hebrew University

Joint work with Nathan Linial and Michal Linial.

Rational target selection is a major ingredient in structural genomic projects. We propose a method for selecting small sets of sequences whose 3D structure is still unknown and that represent proteins belong to new, currently unknown superfamilies. This study is in continuation of our earlier work which yielded ProTarget, an interactive web site that ranks each unsolved protein according to its likelihood of belonging to a new superfamily. ProTarget is based on ProtoMap clustering method ().

For this work, we use ProtoNet, a hierarchical clustering of the protein space. We clustered together all proteins from SwissProt and from PDB databases (total of ~150,800 proteins). This combined database of sequences and 3D structures contains a substantial redundancy. A preliminary process was performed to reduce the number of proteins to only representative once. Following this step, we ended up with ~98000 seeds of representative proteins. We define an intrinsic measure for a protein, called First-Solved-Ancestor (FSA). The FSA of a protein is the first (lowest) ancestor containing a solved 3D structure that is encountered by going up the clustering tree.

Our hypothesis is that the higher in the tree the FSA of a protein is located, the chance of it belonging to a new superfamily increases. We validated this hypothesis for already solved PDB structures and their FSAs in the ProtoNet tree. Finally, we define the best possible threshold that separates with high accuracy - the proteins that belong to new superfamilies from the ones that belong to already solved ones.

Duration: 20 minutes

Gene regulation and its evolution in fly development
by Eric Siggia, Rockefeller University

The early patterning of the fly embryo has been studied well enough genetically that most of the transcription factors are known, but the majority of targets are not. I will how computations using different sources of prior information, can lead to the prediction of many regulatory interactions, some of which have been confirmed experimentally. The utility of related genomes for motif discovery will be discussed along with some description of how regulatory modules evolve. Related data from yeast and bacteria will be used for comparison

Duration: 40 minutes

Probabilistic orthology analysis and reconciliation of gene and species trees
by Jens Lagergren, KTH Stockholm

Two genes are orthologs if there least common ancestor in the gene tree is a duplication. A reconciliation of a gene tree and a species tree explains how the gene tree evolved with respect to the species tree; it, moreover, gives a partition of the internal vertices of the gene tree into duplications and speciations. Previously, only parsimony based reconciliation methods where known.

We will describe a probabilistic model for gene duplications, where a gene tree, describing the evolution of a gene family, evolves inside a species tree. A reconciliation is a mathematical description of such an evolution. Given a gene and a species tree, the likelihood of a reconciliation is the probability that it is the correct reconciliation. It turns out that the likelihood can be computed using dynamic programming. Markov Chain Monte Carlo techniques are then used to compute the posterior probability distribution for reconciliations. The latter also gives the posterior probability distribution for the orthology of any given pair of genes.

Duration: 40 minutes

On Exact and Approximation Algorithms for Consensus String Problems
by Jens Gramm, University of Tübingen

We present results on the parameterized complexity of consensus string problems occurring in the analysis of genomic data. The example problems considered in this talk are the NP-complete Closest String problem and generalizations thereof, having applications, e.g., in primer design and motif search. Closest String asks, for a given set of strings, for a solution string which is, with respect to Hamming metric, ``close'' to each of the given strings. Using these examples, we show, on the one hand, how fixed-parameter algorithms offer a constructive and powerful approach to obtaining practical solutions despite the (seemingly inevitable) intractability of NP-hard problems. On the other hand, we present hardness results showing that fixed-parameter algorithms are unlikely for certain generalizations of Closest String. Such hardness results can also have implications with respect to approximation algorithms: for example, we show that a recently given polynomial-time approximation scheme (PTAS) for the Distinguishing Substring Selection problem cannot be replaced by a so-called efficient polynomial-time approximation scheme (EPTAS) unless an unlikely collapse of parameterized complexity classes occurs.

Duration: 20 minutes

Exploiting structure information for protein function discovery
by Anna Tramontano, University of Rome La Sapienza

In the post-genomica era, we have the possibility of understanding life at a molecular level, of modelling and simulating the behavior of whole cells and of understanding the molecular basis of diseases that we might cure or diagnose at an earlier stage. We might be able to predict the propensity of an individual to a certain pathology and minimize its probability of occurrence or devise methods to delay its onset, we might catalogue individuals according to the probability that they respond to a specific pharmacological treatment, and so on.

All this should be possible using the information contained in a genome sequence and that is the linear sequence of four letters, the four nucleotides. How can we exploit this information effectively? How can we make sense of this large amount of mono-dimensional information and translate them into the three dimensional complex interconnected picture of a living organism? The first step must be the discover of the function of the biological macromolecules encoded by the genome and this can be achieved, by and large, following three routes: experimental methods, evolutionary relationships with molecules of known function and elucidation or prediction of the three-dimensional structure of the molecule and subsequent structure analysis. I will discuss the state of the art of the methods for the prediction of protein structures and the difficulties involved in using them for the discovery of biological function.

Duration: 40 minutes

The Design and Application of Microarray Hybridization Assays, the Computational Viewpoint
by Zohar Yakhini, Agilent

Microarray hybridization assays have become a routine tool for measuring gene expression and are now expanding into other application areas. Different applications entail different performance criteria and thus require different algorithmic and statistical approaches to the computer-aided design phase. The same is true for the data analysis phase. We will decribe applications such as expression profiling, CGH (measuring DNA copy number and aberrations) and SNP-typing and discuss some of the related information aspects.

Duration: 40 minutes

Hardness and approximation results for the Parsimony SNP Haplotyping problem
by Giusseppe Lancia, Università di Udine

Given n genotypes (vectors of {0,1,2}) the Parsimony Haplotyping (PH) problem consists in determining K haplotypes (vectors of {0,1}) such that (i) for each genotype g there are two haplotypes h and q for which g = h + q; (ii) the total number K of haplotypes is minimum. The sum of two haplotypes is defined, component-wise, as 0+0=0, 1+1=1, 0+1=1+0=2.

The PH objective follows naturally from other famous haplotyping objectives (such as Clark's rule) and is today regarded as very appropriate, expecially for blocks of SNPs where no recombination has occurred.

We discuss the computational complexity of the problem (showing it is APX-hard) and an Integer Programming formulation which leads to an approximation algorithm whose performance guarantee depends on the number of "2"s in the genotypes. We also describe an approximation algorithm of different nature (probabilistic) which achieves a similar performance-guarantee.

Duration: 20 min

Sequencing by hybridization: a progress report and a cautious forecast
by Franco Preparata, Brown University

A decade of research on SBH has produced the consensus that its ultimate success is predicated on the adoption of gapped probing patterns. Unfortunately, ideal universal bases remain an elusive goal. While the synthesis of such component is being pursued, efficiency can be gained by tightly coupling the biochemical and combinatorial steps of the technology.

Duration: 40 minutes

Phylogenetic Shadowing
by Lior Pachter, University of California at Berkeley

I will describe phylogenetic shadowing, which is a new approach to comparative genomics that seeks to utilize the comparison of many closely related organisms to infer functional regions in genomes. In particular, I'll describe the mathematical problems associated with shadowing, and the application of phylogenetic shadowing to finding functional regions in the human genome using primate sequences. I will also contrast this approach with traditional approaches which rely on comparisons of more distant vertebrates.

Duration: TBA

Haplotype histories as pathways of recombination and gene conversion
by Nadia El-Mabrouk, Universitè de Montreal

Retracing the trajectories of past genetic events is crucial to understand the structure of the genome, both in individuals and across populations. A haplotype describes a string of polymorphic sites in a DNA segment. The diversity of a haplotype is due to mutations creating new variants, and to recombinations and gene conversions that mix and redistribute these variants among individual chromosomes in populations. A number of studies have revealed a relatively simple pattern of haplotype diversity in the human genome, dominated by a few common haplotypes representing founder ancestral ones. New haplotypes are usually rare and have a limited geographic distribution. We propose a method to derive a new haplotype from a set of putative ancestral haplotypes, once mutations in place, through minimal recombination and gene conversion pathways.

Duration: 20 minutes

A Unified Framework for Motif Finding: Finding Position Specific Scoring Matrices using Patterns with Mismatches
by Eleazar Eskin, Hebrew University

An important part of deciphering gene regulatory mechanisms is discovering transcription factor binding sites. These sites can be detected because they are typically over represented in genomic sequences. The detection of the over represented signals in sequences, or motif-finding has become a central problem in computational biology with vast literature. There are two major computational frameworks for attacking the motif finding problem which differ in their representation of the signals. The most widely used is the PSSM (Position Specific Scoring Matrix) representation. The goal of these algorithms is to obtain probabilistic representations of the over represented signals. The second is the pattern with mismatches representation which represents a signal as discrete consensus pattern and allows some mismatches to occur in the instances of the pattern. The advantage of PSSMs is the expressiveness of their representation while the advantage of the pattern with mismatches approach is the existence of efficient algorithms that guarantee to discover the best patterns.

In this work, we present a unified framework for motif finding which represents signals as weight matrices. We will shown how both PSSMs and patterns with mismatches can be represented via these weight matrices. The main advantage of our framework is that it motivates an algorithm, MITRA-PSSM, which uses algorithms designed for patterns with mismatches to discover PSSMs. MITRA-PSSM provides the guarantees of finding the best signals while preserving the expressiveness of the representation of PSSMs.

Here is the relevant paper: [pdf]   [ps]

Duration: 40 minutes.

Quantitative Detection of Splice Variants: Designing PCR Colony Experiments
by Jeremy Buhler, Washington University in St.Louis

PCR colonies are an emerging high-throughput technology that can sample a complex mixture of DNA sequences. Millions of molecules from the mixture are isolated and individually amplified into discrete, homogeneous spots, or "colonies," on a microscope slide. PCR colonies can therefore form the basis of assays that survey the content of a mixture of sequences without requiring the design of problem-specific microarrays.

One projected use of PCR colony assays is the detection and quantitation of alternative splicing, a major "post-genome" source of diversity and perhaps regulation of expressed gene products. Sampling a cDNA mixture formed from total cellular mRNA results in millions of colonies, each containing a single mRNA. The gene present in each colony can be identified by sequencing a small part of of it in place on the slide. Given this information, we would like to determine, for each colony, which splice variant of its mRNA it contains.

Our approach to distinguishing splice variants is to design a collection of olignucleotide probes whose presence or absence in a colony determines which exons of a gene are present. I will discuss the problem of designing a small set of probes that can efficiently distinguish among splice variants of multiple genes at once. This design problem is constrained in at least two (conflicting) ways. First, the set of of probes used should be small, so as to minimize costs associated with their synthesis. Second, it should be possible to aggregate the probes into just a few pools, such that every probe in a pool can be checked against a slide in a single hybridization experiment while still yielding unambiguous identification of each colony's splice variant.

Duration: 20 minutes

Interdisciplinary Collaborations on Discovering Regulatory Elements
by Martin Tompa, University of Washington

I will talk about two interdisciplinary collaborations focused on the discovery of regulatory elements in specific biological systems. The first arises in the regulation of various genes during the hypoxic response of M. tuberculosis. The second arises in the regulation of the interferon gamma gene, central in the human immune system. Duration: 40 minutes

There's a monkey in the pool!
by Miklos Csuros, Universitè de Montreal

I will discuss novel methods for sequence assembly and physical mapping that rely solely on pooling and shotgun sequencing of genomic clones. The talk will focus on Pooled Genomic Indexing (PGI), a novel method for constructing comparative physical maps. PGI is carried out by pooling arrayed clones, collecting random shotgun fragment sequences from pools, and by comparing the shotgun sequences against known macromolecular sequences. I will talk about a probabilistic model for PGI, and discuss a number of pooling designs. The probabilistic model and the pooling schemes were validated in simulated experiments where 625 rat BAC clones and 207 mouse BAC clones were mapped onto homologous human sequences. I will also describe a recent application of PGI that constructs a comparative physical map of Rhesus macaque and Human involving 2304 macaque clones covering about 10% of the genome.

This is a joint work with Aleksandar Milosavljevic and other members of the Bioinformatics Research Laboratory and the Human Genome Sequencing Center, Baylor College of Medicine.

Duration: 20 minutes

Can the exemplar method really help in inferring rearrangement history under extensive paralogy?
by David Sankoff, Universitè de Montreal

Duration: 20 minutes.

Maximum likelihood analysis of phylogenetic trees
by Benny Chor, Tel Aviv University

Among various methods for constructing phylogenetic trees from sequence data, maximum likelihood (ML) is considered the most reliable, and it is widely used. Yet, there are many issues that are not very well understood, including the number of local maximum points on the likelihood surface, and the computational complexity of ML.

I will describe several results and open problems in this area.

Duration: 40 minutes

The metric space of proteins, from sequence to functional inference
by Michal Linial, Hebrew University

A large fraction of biological research concentrates on individual proteins and on small families of proteins. One of the major challenges in bioinformatics is to extend our knowledge also to very large sets of proteins. Several major projects have tackled this problem. Such undertakings usually start with a process that clusters all known proteins or large subsets of this space. Some work in this area is carried out automatically, while other attempts incorporate expert advice and annotation. We propose a novel technique that automatically clusters protein sequences. The method is based on an all-against-all BLAST similarity test among all SwissProt proteins proceeded by a continuous bottom-up clustering process. We compare the clusters that result from alternative clustering rules, and validate the results against InterPro, SCOP, GO and other annotation sources [1]. To extend our view on the quality of ProtoNet clusters as well on other sets of proteins derived from computational and experimental settings we developed GRAPA (GRaphical Analysis of Protein Annotations), a web-based tool that provides an automatic representation of the biological knowledge associated with any set of proteins. We will present GRAPA and discuss how the use of ProtoNet and GRAPA enhances the biological understanding for structural [2] and functional inference of large sets of proteins. The outcome of these investigations can be viewed at here and here.

[1] Sasson, O., Vaaknin, A., Fleischer, H., Portugaly, E., Bilu, Y., Linial, N. and Linial M. (2003) ProtoNet: hierarchical classification of the protein space. Nucleic Acids Research 31, 348-352.

[2] Portugaly E Kifer I and Linial M. (2002) Selecting targets for structural determination by navigating in a graph of protein families. Bioinformatics 18, 899-907.

Duration: 40 minutes

Fast and accurate heuristics for SNP haplotype reconstruction
by Alessandro Panconesi, University La Sapienza of Rome

Single Nucleotide Polymorphism, or SNPs for short, are the smallest form of genetic variation and seem to be responsible for much of the diversity we observe among individuals. They have therefore attracted much attention and are a good source of interesting algorithmic problems. One of these problems was introduced and studied by Lancia et al. in a couple of papers and boils down to the following. We are given two aligned sequences H1 and H2 (haplotypes) of As and Bs. Many copies are made of both H1 and H2 and broken at random into fragments. The fragments are sequenced, i.e. for each fragment F we know the set of positions it spans and their value (A or B). In this process however we lose the information concerning whether F comes from H1 or H2. Given a set of fragments, can we reconstruct the two original (unknown) haplotypes H1 and H2?

This problem is easily solved if we assume that there are no sequencing errors, but it becomes much more challenging if we realistically assume that a certain fraction of errors are present. Lancia et al. studied the computational complexity of the problem and proposed a dynamic programming algorithm. In this talk we introduce three other approaches to the problem: (a) a clustering approach, (b) an approach based on the theory of approximation algorithms, and (c) some simple, common sense based heuristics. We present computational results concerning all the approaches, including dynamic programming.

As it is often the case, common sense wins hands down.

This is joint work with Stefano Causarano and Mauro Sozio.

Duration: 20 minutes.

Reversals on genomes and 2-element transpositions on permutations
by Alberto Caprara, University of Bologna

We point out the analogy between the reversal distance between two genomes (represented by signed permutations) and the minimum number of 2-element transpositions (i.e., exchanges in the positions of two elements) that transform an (unsigned) permutation into another, called the transposition distance. The nice property of the transposition distance is that it is exactly equal to the number of elements in the permutations minus the number of cycles in a suitable graph, which is true for the reversal distance only in case there are no hurdles around.

The result is that many things that are complicated for the reversal case turn out to be accessible for the 2-element transposition case, like computing the expected distance after a given number of random operations, and characterizing all the permutations on the shortest path between two given ones. We will discuss the impact of these results on the reversal case.

Duration: 20 minutes

Transforming Men into Mice and the random breakage theory
by Pavel Pevzner, University of California at San Diego

Duration: 40 minutes

Gene clusters in comparative genomics: accident or design?
by Dannie Durand, CMU

Large scale gene duplication, the duplication of whole genomes and subchromosomal regions, is a major force driving the evolution of genetic functional innovation. Whole genome duplications are widely believed to have played an important role in the evolution of the maize, yeast and vertebrate genomes. Two or more linked clusters of similar genes found in distinct regions on the same genome are often presented as evidence of large scale duplication. However, as the gene order and the gene complement of duplicated regions diverge progressively due to insertions, deletions and rearrangements, it becomes increasingly difficult to distinguish remnants of common ancestral gene order from coincidental similarities in genomic organization. In this talk, I present computational approaches to validating gene clusters in comparative genomics.

Duration: 40 minutes.

Tiny steps towards addressing a small part of the DNA rearrangement puzzle
by Marie-France Sagot, Inria Rhône-Alpes and University Claude Bernard, Lyon

Joint work with Estela Maris Rodrigues and Yoshiko Wakabayashi from the Institute of Mathematics and Statistics, University of São Paulo, Brazil

Phylogenetic trees are a standard model for representing evolutionary processes, mostly involving biological entities such as species or genes. By a phylogenetic tree, we mean in this talk a rooted unordered tree whose leaves are uniquely labeled with elements of some set S, and whose internal nodes are unlabeled and have exactly two children. The elements of S stand for the contemporary taxa whose evolutionary relationships one intends to model. These taxa correspond to the leaves of the tree, whereas the ancestral taxa correspond to its internal nodes, so that for each ancestral taxon, all of its nearest derived taxa are depicted as its children in the tree.

When there has been genetic transfer, that is, exchange of genetic material between the taxa, a taxon may have more than one ancestor. This poses a formidable problem for the reconstruction of a phylogeny from the genetic information present in such taxa: the phylogeny can not anymore be represented by a single tree. It is then important to be able to detect the presence and exact position of all genetic transfers that may have happened between the taxa of interest. This is a difficult problem which contains as sub-problem that of determining how distant two phylogenetic trees are from each other in terms of genetic transfer. In this talk, we shall just mention an initial pragmatic approach to the first problem and concentrate instead on the second sub-problem.

Various metrics for measuring a distance between two trees have been defined, among which is SPR for Subtree Prune and Regraft. This was initially thought to be directly related to another way of comparing two trees T and U, which consists in computing their so-called Maximum Agreement Forest (MAF). Informally, the number of components of an agreement forest tells how many edges need to be cut from each of T and U so that the resulting forests agree, after performing some forced edge contractions. The relation between SPR and MAF is not so direct as was initially thought but there is a conjecture that the two are close. Being able to calculate the MAF is therefore an interesting issue for our purposes.

This problem is known to be NP-hard. It was introduced by Hein et al. in 1997, who presented an approximation algorithm for it. We show that the performance ratio of Hein's algorithm is 4 and present three approximation algorithms for this problem whose ratio is 3. All three algorithms are very simple. Obtaining a proof of the ratio for one of them is however tricky. We show that proving it was nevertheless worth the effort as indicated by simulations: the corresponding algorithm presents a ratio that in practice is much better than for the other two 3-approximation algorithms, and is between 1.5 and 2 for trees that present a moderate to high level of rearrangement. The less well performing among the three new approximation algorithms is interesting for indirectly providing a lower bound for the MAF between two trees that is good, specially for trees with a moderate level of rearrangement.


Maintained by Alessandro Panconesi