BCB 2003 Bertinoro Computational Biology Meeting 7-13 June 2003 University of Bologna Residential Center Bertinoro (Forlì), Italy |
---|
It is expected that several biologists and geneticists will participate.
Besides the speakers, about 20 PhD students, for whom financial aid is available, are expected to participate. An expression of interest can be sent to Alessandro Panconesi
08.00-09.00 | arrivals | breakfast | |||||
---|---|---|---|---|---|---|---|
09.45-10.25 | Eskin | Pachter | Sagot | Linial | departures | ||
10.25-10.55 | coffee | coffee | coffee | coffee | coffee | ||
10.55-11.35 | Friedman | Lagregren | Tompa | Tramontano | Yakhini | ||
11.35-12.05 | Kifer | El-Mabrouk | Csüros | Shefi | Preparata (till 12:25) | ||
12.30-13.30 | lunch! | lunch! | lunch! | lunch! | lunch! | ||
14.30-15.10 | Pevzner | Durand | Pupko | Sightseeing | Chor | ||
15.10-15.40 | coffee! | coffee! | coffee! | coffee! | |||
15.40-16.00 | Buhler | Sankoff | Panconesi | Gramm | |||
16.10-16.30 | Caprara | Hallett | Lancia |
Arrival: | Suturday 7 June 2003 |
---|---|
Departure: | Friday 13 - Saturday 14 June, 2003 |
Bertinoro itself is picturesque, with many narrow streets and walkways winding around the central peak. The meeting will be held in a redoubtable ex-Episcopal fortress that has been converted by the University of Bologna into a modern conference center with computing facilities and Internet access. From the fortress you can enjoy a beautiful the vista that stretches from the Tuscan Apennines to the Adriatic coast.
Scientific Organizing Committee | Benny Chor Tel Aviv University |
---|---|
Jens Lagergren, KTH, Stockholm | |
Pavel Pevzner University of California at San Diego | |
Local Organization | |
Andrea Bandini, Elena Della Godenza, Centro Congressi di Bertinoro | |
Sponsored by | BICI Bertinoro International Center for Informatics |
Modeling Transcription Factor Binding Sites and Their Relation to
High-throughput Genomic Assays
The availability of whole genome sequences and high-throughput genomic
assays opens the door for in silico analysis of transcription
regulation. This includes methods for discovering and characterizing
the binding sites of DNA-binding proteins, such as transcription
factors.
I will attempt to discuss several questions:
* How to represent the binding site of a transcription factor?
A common approach is to use position specific score matrices (PSSMs)
for representing probability distributions over possible sequences
at the binding site. The PSSM model makes the strong assumption that
binding site positions are independent of each other. I will
describe Bayesian network representations of binding sites that
provide different tradeoffs between complexity (number of
parameters) and the richness of dependencies between positions.
* How to learn such representation from genomic data?
I will discuss how to adapt machine learning tools for learning such
richer binding site models from data and for estimating the
statistical significance of putative binding sites.
* How to relate the learned representations to other genomic assays?
Can we use these learned representation to "explain" gene expression
measurements and other data?
Duration: 40 minutes
Gene expression patterns in blood cells classify post-traumatic stress disorder among trauma survivors
Joint work with Nir Friedman and Ronnen Segman
Posttraumatic Stress Disorder (PTSD) is an idiopathic anxiety disorder, which affects up to
20% of individuals exposed to extreme events and commonly follows a chronic treatment
refractory course. The immediate clinical presentation among trauma survivors has a low
predictive value for long term course.
I will present a new work, in which, using various computational methods, we show for the
first time that four months after trauma exposure, microarray based large scale gene-expression
profiles of peripheral mononuclear blood cells (PMBC), classify survivors of trauma into those
remaining healthy and those developing chronic PTSD.
In addition, expression patterns of further subsets harvested hours or four months after
trauma exposure, correlate with prospectively observed decrease in symptoms among a subset
of trauma survivors showing high immediate stress reactivity.
Duration: 20 min.
Algorithmic approach for detecting functional regions
We present a new algorithm to identify functionally important regions
(patches) on the surface of proteins. We identify positions with low
evolutionary rates that correspond to conserved regions, and thus are
likely to be functionally important. To detect functional sites we
develop an algorithm that searches for a "patch" of several amino acids
that are both conserved and physically close to each other on the
surface of the protein. Our algorithm combines the evolutionary
information (for assigning a rate for each site) and the 3D information
(for searching the functional regions). The algorithm is a probabilistic
one: We assign a probability to each possible "patch" and find the patch
with the highest probability.
Duration: 40 minutes.
Detecting non-adjoining dependencies in molecular sequence data
Molecular sequence data contain
many evolutionarily conserved stretches of nucleotides or amino acids
that act as {\em signals}.
This paper extends a graph theoretic framework from Agarwal and Bafna that
detects and models adjoining and non-adjoining dependencies in these
signals.
In their method, a vertex corresponds to a random variable modelling a
position
in the signal and the weight of each arc $\arc{u,v}$
is the information content of random variable $v$ given $u$.
Whereas previous work allows each position in the signal to be
dependent on
only one other position ($l=1$),
we focus on the case where a position may be dependent
on a number of other positions ($l \geq 1$).
We offer several computational strategies (approximation, fixed
parameter
algorithms, heuristics) to overcome two significant obstacles
in this extended framework.
Firstly, the computational problems for $l > 1$ are intractable.
Secondly, a solution to their graph theoretic problem does not
necessarily
imply the existence of a Markov model, when pairwise information content
weights are used.
We give some experimental comparisons of the methods using synthetic
data.
Joint with N. Dutil
Duration: 20 minutes
Selecting Targets for Structural Determination Based on Clustering the Protein Space
Joint work with Nathan Linial and Michal Linial.
Rational target selection is a major ingredient in structural genomic projects.
We propose a method for selecting small sets of sequences whose 3D structure
is still unknown and that represent proteins belong to new, currently unknown
superfamilies.
This study is in continuation of our earlier work which yielded ProTarget,
an interactive web site that ranks each unsolved protein according to its
likelihood of belonging to a new superfamily. ProTarget is based on ProtoMap
clustering method ().
For this work, we use ProtoNet, a hierarchical clustering of the protein space.
We clustered together all proteins from SwissProt and from PDB databases
(total of ~150,800 proteins). This combined database of sequences and 3D structures
contains a substantial redundancy. A preliminary process was performed to reduce
the number of proteins to only representative once. Following this step, we ended
up with ~98000 seeds of representative proteins. We define an intrinsic measure
for a protein, called First-Solved-Ancestor (FSA). The FSA of a protein is the
first (lowest) ancestor containing a solved 3D structure that is encountered
by going up the clustering tree.
Our hypothesis is that the higher in the tree the FSA of a protein is located, the
chance of it belonging to a new superfamily increases. We validated this hypothesis
for already solved PDB structures and their FSAs in the ProtoNet tree. Finally, we
define the best possible threshold that separates with high accuracy - the proteins
that belong to new superfamilies from the ones that belong to already solved ones.
Duration: 20 minutes
Gene regulation and its evolution in fly development
The early patterning of the fly embryo has been studied well
enough genetically that most of the transcription factors are known, but
the majority of targets are not. I will how computations using different
sources of prior information, can lead to the prediction of many regulatory
interactions, some of which have been confirmed experimentally. The
utility of related genomes for motif discovery will be discussed along
with some description of how regulatory modules evolve. Related data
from yeast and bacteria will be used for comparison
Duration: 40 minutes
Probabilistic orthology analysis and reconciliation of gene and species trees
Two genes are orthologs if there least common ancestor in the gene tree
is a duplication. A reconciliation of a gene tree and a species tree
explains how the gene tree evolved with respect to the species tree;
it, moreover, gives a partition of the internal vertices of the gene tree
into duplications and speciations. Previously, only parsimony
based reconciliation methods where known.
We will describe a probabilistic model for gene duplications,
where a gene tree, describing the evolution of a gene family,
evolves inside a species tree. A reconciliation is a mathematical
description of such an evolution. Given a gene and a species tree,
the likelihood of a reconciliation is the probability that it is the
correct reconciliation. It turns out that the likelihood can be computed
using dynamic programming. Markov Chain Monte Carlo techniques are then
used to compute the posterior probability distribution for reconciliations.
The latter also gives the posterior probability distribution for the
orthology of any given pair of genes.
Duration: 40 minutes
On Exact and Approximation Algorithms for Consensus String Problems
We present results on the parameterized complexity of consensus string
problems occurring in the analysis of genomic data. The example
problems considered in this talk are the NP-complete Closest String
problem and generalizations thereof, having applications, e.g., in
primer design and motif search. Closest String asks, for a given set
of strings, for a solution string which is, with respect to Hamming
metric, ``close'' to each of the given strings. Using these examples,
we show, on the one hand, how fixed-parameter algorithms offer a
constructive and powerful approach to obtaining practical solutions
despite the (seemingly inevitable) intractability of NP-hard problems. On
the other hand, we present hardness results showing that
fixed-parameter algorithms are unlikely for certain generalizations of
Closest String. Such hardness results can also have implications
with respect to approximation algorithms: for example, we show that a
recently given polynomial-time approximation scheme (PTAS) for the
Distinguishing Substring Selection problem cannot be replaced by a
so-called efficient polynomial-time approximation scheme (EPTAS)
unless an unlikely collapse of parameterized complexity classes
occurs.
Duration: 20 minutes
Exploiting structure information for protein function discovery
In the post-genomica era, we have the possibility of understanding life at
a molecular level, of modelling and simulating the behavior of whole
cells and of understanding the molecular basis of diseases that we might
cure or diagnose at an earlier stage. We might be able to predict the
propensity of an individual to a certain pathology and minimize its
probability of occurrence or devise methods to delay its onset, we might
catalogue individuals according to the probability that they respond to a
specific pharmacological treatment, and so on.
All this should be possible using the information contained in a genome
sequence and that is the linear sequence of four letters, the four
nucleotides. How can we exploit this information effectively? How can we
make sense of this large amount of mono-dimensional information and
translate them into the three dimensional complex interconnected picture
of a living organism?
The first step must be the discover of the function of the biological
macromolecules encoded by the genome and this can be achieved, by and
large, following three routes: experimental methods, evolutionary
relationships with molecules of known function and elucidation or
prediction of the three-dimensional structure of the molecule and
subsequent structure analysis.
I will discuss the state of the art of the methods for the prediction of
protein structures and the difficulties involved in using them for the
discovery of biological function.
Duration: 40 minutes
The Design and Application of Microarray Hybridization Assays, the Computational Viewpoint
Microarray hybridization assays have become a routine tool for measuring gene expression and
are now expanding into other application areas. Different applications entail different
performance criteria and thus require different algorithmic and statistical approaches to the
computer-aided design phase. The same is true for the data analysis phase. We will decribe
applications such as expression profiling, CGH (measuring DNA copy number and aberrations) and
SNP-typing and discuss some of the related information aspects.
Duration: 40 minutes
Hardness and approximation results for the Parsimony SNP Haplotyping problem
Given n genotypes (vectors of {0,1,2}) the Parsimony Haplotyping
(PH) problem consists in determining K haplotypes (vectors of {0,1}) such
that (i) for each genotype g there are two haplotypes h and q for which g
= h + q; (ii) the total number K of haplotypes is minimum.
The sum of two haplotypes is defined, component-wise, as 0+0=0, 1+1=1,
0+1=1+0=2.
The PH objective follows naturally from other famous haplotyping
objectives (such as Clark's rule) and is today regarded as very
appropriate, expecially for blocks of SNPs where no recombination has
occurred.
We discuss the computational complexity of the problem (showing it is
APX-hard) and an Integer Programming formulation which leads to an
approximation algorithm whose performance guarantee depends on the number
of "2"s in the genotypes. We also describe an approximation algorithm of
different nature (probabilistic) which achieves a similar
performance-guarantee.
Duration: 20 min
Sequencing by hybridization: a progress report and a cautious forecast
A decade of research on SBH has produced the consensus that
its ultimate success is predicated on the adoption of gapped probing
patterns.
Unfortunately, ideal universal bases remain an elusive goal. While the
synthesis of such component is being pursued, efficiency can be gained
by tightly coupling the biochemical and combinatorial steps of the
technology.
Duration: 40 minutes
Phylogenetic Shadowing
I will describe phylogenetic shadowing, which is a new approach to
comparative genomics that seeks to utilize the comparison of many closely
related organisms to infer functional regions in genomes. In particular,
I'll describe the mathematical problems associated with shadowing, and the
application of phylogenetic shadowing to finding functional regions in the
human genome using primate sequences. I will also contrast this approach
with traditional approaches which rely on comparisons of more distant
vertebrates.
Duration: TBA
Haplotype histories as pathways of recombination and gene conversion
Retracing the trajectories of past genetic events is crucial to
understand the structure of the genome, both in individuals and
across populations. A haplotype describes a string of polymorphic
sites in a DNA segment. The diversity of a haplotype is due to
mutations creating new variants, and to recombinations and gene
conversions that mix and redistribute these variants among
individual chromosomes in populations. A number of studies have
revealed a relatively simple pattern of haplotype diversity in the
human genome, dominated by a few common haplotypes representing
founder ancestral ones. New haplotypes are usually rare and have a
limited geographic distribution. We propose a method to derive a new
haplotype from a set of putative ancestral haplotypes, once
mutations in place, through minimal recombination and gene
conversion pathways.
Duration: 20 minutes
A Unified Framework for Motif Finding: Finding Position
Specific Scoring Matrices using Patterns with Mismatches
An important part of deciphering gene regulatory mechanisms is
discovering transcription factor binding sites. These sites can be
detected because they are typically over represented in genomic
sequences. The detection of the over represented signals in
sequences, or motif-finding has become a central problem in
computational biology with vast literature. There are two major
computational frameworks for attacking the motif finding problem which
differ in their representation of the signals. The most widely used
is the PSSM (Position Specific Scoring Matrix) representation. The
goal of these algorithms is to obtain probabilistic representations of
the over represented signals. The second is the pattern with
mismatches representation which represents a signal as discrete
consensus pattern and allows some mismatches to occur in the instances
of the pattern. The advantage of PSSMs is the expressiveness of their
representation while the advantage of the pattern with mismatches
approach is the existence of efficient algorithms that guarantee to
discover the best patterns.
In this work, we present a unified framework for motif finding which
represents signals as weight matrices. We will shown how both PSSMs
and patterns with mismatches can be represented via these weight
matrices. The main advantage of our framework is that it motivates an
algorithm, MITRA-PSSM, which uses algorithms designed for patterns
with mismatches to discover PSSMs. MITRA-PSSM provides the guarantees
of finding the best signals while preserving the expressiveness of the
representation of PSSMs.
Here is the relevant paper:
[pdf]
[ps]
Duration: 40 minutes.
Quantitative Detection of Splice Variants: Designing PCR Colony Experiments
PCR colonies are an emerging high-throughput technology that can
sample a complex mixture of DNA sequences. Millions of molecules from
the mixture are isolated and individually amplified into discrete,
homogeneous spots, or "colonies," on a microscope slide. PCR colonies
can therefore form the basis of assays that survey the content of a
mixture of sequences without requiring the design of problem-specific
microarrays.
One projected use of PCR colony assays is the detection and
quantitation of alternative splicing, a major "post-genome" source of
diversity and perhaps regulation of expressed gene products. Sampling
a cDNA mixture formed from total cellular mRNA results in millions of
colonies, each containing a single mRNA. The gene present in each
colony can be identified by sequencing a small part of of it in place
on the slide. Given this information, we would like to determine, for
each colony, which splice variant of its mRNA it contains.
Our approach to distinguishing splice variants is to design a collection
of olignucleotide probes whose presence or absence in a colony determines
which exons of a gene are present. I will discuss the problem of
designing a small set of probes that can efficiently distinguish among
splice variants of multiple genes at once. This design problem is
constrained in at least two (conflicting) ways. First, the set of
of probes used should be small, so as to minimize costs associated with
their synthesis. Second, it should be possible to aggregate the probes
into just a few pools, such that every probe in a pool can be checked
against a slide in a single hybridization experiment while still yielding
unambiguous identification of each colony's splice variant.
Duration: 20 minutes
Interdisciplinary Collaborations on Discovering Regulatory Elements
I will talk about two interdisciplinary collaborations focused on the
discovery of regulatory elements in specific biological systems. The
first arises in the regulation of various genes during the hypoxic
response of M. tuberculosis. The second arises in the regulation of
the interferon gamma gene, central in the human immune system.
Duration: 40 minutes
There's a monkey in the pool!
I will discuss novel methods for sequence assembly and
physical mapping that rely solely on pooling and shotgun
sequencing of genomic clones. The talk will focus on Pooled
Genomic Indexing (PGI), a novel method for constructing
comparative physical maps. PGI is carried out by pooling
arrayed clones, collecting random shotgun fragment sequences
from pools, and by comparing the shotgun sequences against
known macromolecular sequences. I will talk about a
probabilistic model for PGI, and discuss a number of pooling
designs. The probabilistic model and the pooling schemes
were validated in simulated experiments where 625 rat BAC
clones and 207 mouse BAC clones were mapped onto homologous
human sequences. I will also describe a recent application
of PGI that constructs a comparative physical map of Rhesus
macaque and Human involving 2304 macaque clones covering
about 10% of the genome.
This is a joint work with Aleksandar Milosavljevic and other
members of the Bioinformatics Research Laboratory and the
Human Genome Sequencing Center, Baylor College of Medicine.
Duration: 20 minutes
Can the exemplar method really help in
inferring rearrangement history under extensive paralogy?
Duration: 20 minutes.
Maximum likelihood analysis of phylogenetic trees
Among various methods for constructing phylogenetic trees
from sequence data, maximum likelihood (ML) is considered the
most reliable, and it is widely used. Yet, there are many
issues that are not very well understood, including the number
of local maximum points on the likelihood surface, and the
computational complexity of ML.
I will describe several results and open problems in this area.
Duration: 40 minutes
The metric space of proteins, from sequence to functional inference
A large fraction of biological research concentrates on individual proteins and on small
families of proteins. One of the major challenges in bioinformatics is to extend our
knowledge also to very large sets of proteins. Several major projects have tackled this
problem. Such undertakings usually start with a process that clusters all known proteins
or large subsets of this space. Some work in this area is carried out automatically,
while other attempts incorporate expert advice and annotation. We propose a novel
technique that automatically clusters protein sequences. The method is based on an
all-against-all BLAST similarity test among all SwissProt proteins proceeded by a
continuous bottom-up clustering process. We compare the clusters that result from
alternative clustering rules, and validate the results against InterPro, SCOP, GO and
other annotation sources [1]. To extend our view on the quality of ProtoNet clusters as
well on other sets of proteins derived from computational and experimental settings we
developed GRAPA (GRaphical Analysis of Protein Annotations), a web-based tool that
provides an automatic representation of the biological knowledge associated with any set
of proteins. We will present GRAPA and discuss how the use of ProtoNet and GRAPA enhances
the biological understanding for structural [2] and functional inference of large sets of
proteins. The outcome of these investigations can be viewed at
here and
here.
[1] Sasson, O., Vaaknin, A., Fleischer, H., Portugaly, E., Bilu, Y., Linial, N. and
Linial M. (2003) ProtoNet: hierarchical classification of the protein space. Nucleic
Acids Research 31, 348-352.
[2] Portugaly E Kifer I and Linial M. (2002) Selecting targets for structural
determination by navigating in a graph of protein families. Bioinformatics 18, 899-907.
Duration: 40 minutes
Fast and accurate heuristics for SNP haplotype reconstruction
Single Nucleotide Polymorphism, or SNPs for short, are the smallest form of genetic variation
and seem to be responsible for much of the diversity we observe among individuals.
They have therefore attracted much attention and are a good source of interesting
algorithmic problems. One of these problems was introduced and studied by Lancia et al.
in a couple of papers and boils down to the following. We are given two aligned sequences H1 and H2
(haplotypes) of As and Bs. Many copies are made of both H1 and H2 and broken at random
into fragments. The fragments are sequenced, i.e.
for each fragment F we know the set of positions it spans and their value (A or B).
In this process however we lose the information concerning whether F comes from H1 or H2.
Given a set of fragments, can we reconstruct the two original (unknown) haplotypes H1 and H2?
This problem is easily solved if we assume that there are no sequencing errors, but it
becomes much more challenging if we realistically assume that a certain fraction of errors are present.
Lancia et al. studied the computational complexity of the problem and proposed a dynamic
programming algorithm. In this talk we introduce three other approaches to the problem:
(a) a clustering approach, (b) an approach based on the theory of
approximation algorithms, and (c) some simple, common sense based heuristics.
We present computational results concerning all the approaches, including dynamic programming.
As it is often the case, common sense wins hands down.
This is joint work with Stefano Causarano and Mauro Sozio.
Duration: 20 minutes.
Reversals on genomes and 2-element transpositions on permutations
We point out the analogy between the reversal distance between two genomes
(represented by signed permutations) and the minimum number of 2-element
transpositions (i.e., exchanges in the positions of two elements) that
transform an (unsigned) permutation into another, called the transposition
distance. The nice property of the transposition distance is that it is
exactly equal to the number of elements in the permutations minus the
number of cycles in a suitable graph, which is true for the reversal
distance only in case there are no hurdles around.
The result is that many things that are complicated for the reversal case
turn out to be accessible for the 2-element transposition case, like
computing the expected distance after a given number of random operations,
and characterizing all the permutations on the shortest path between two
given ones. We will discuss the impact of these results on the reversal
case.
Duration: 20 minutes
Transforming Men into Mice and the random breakage theory
Duration: 40 minutes
ABSTRACTS
by Nir Friedman, Hebrew University
by Noa Shefi, Hebrew University
by Tal Pupko, Florida State University
by Mike Hallett, McGill University
by Ilona Kifer, Hebrew University
by Eric Siggia, Rockefeller University
by Jens Lagergren, KTH Stockholm
by Jens Gramm, University of Tübingen
by Anna Tramontano, University of Rome La Sapienza
by Zohar Yakhini, Agilent
by Giusseppe Lancia, Università di Udine
by Franco Preparata, Brown University
by Lior Pachter, University of California at Berkeley
by Nadia El-Mabrouk, Universitè de Montreal
by Eleazar Eskin, Hebrew University
by Jeremy Buhler, Washington University in St.Louis
by Martin Tompa, University of Washington
by Miklos Csuros, Universitè de Montreal
by David Sankoff, Universitè de Montreal
by Benny Chor, Tel Aviv University
by Michal Linial, Hebrew University
by Alessandro Panconesi, University La Sapienza of Rome
by Alberto Caprara, University of Bologna
by Pavel Pevzner, University of California at San Diego