INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed


Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa


Probabilistic retrieval and visualization of biologically relevant microarray experiments

04/15/2009 - 16:00
04/15/2009 - 17:00

As ArrayExpress and other repositories of genome-wide experiments are reaching a mature size, it is becoming more meaningful to search for related experiments, given a particular study. We introduce methods that allow for the search to be based upon measurement data, instead of the more customary annotation data. The goal is to retrieve experiments in which the same biological processes are activated. This can be due either to experiments targeting the same biological question, or to as-yet unknown relationships. We use a combination of existing and new probabilistic machine learning techniques to extract information about the biological processes differentially activated in each experiment, to retrieve earlier experiments where the same processes are activated, and to visualize and interpret the retrieval results. Case studies on a subset of ArrayExpress show that, with a sufficient amount of data, our method indeed finds experiments relevant to particular biological questions. Results can be interpreted in terms of biological processes using the visualization techniques. The code is available from

Beyond Edman Degradation: Algorithmic De novo Protein Sequencing of Monoclonal Antibodies

04/07/2009 - 14:00
04/07/2009 - 15:00

The characterization and engineering of monoclonal antibodies is usually preceded by time-consuming Edman/cDNA sequencing steps for determination of the heavy and light chain sequences – a low-throughput pipeline that does not address post-translational modifications. In a departure from these platforms, we have developed the Comparative Shotgun Protein Sequencing (CSPS) suite of algorithms – a mass spectrometry based protein sequencing approach resulting in over 95% sequence coverage and automatic discovery of unexpected post-translational modifications. In contrast with the current multiple-week duration of typical sequencing projects, CSPS delivers additional functionality while reducing the time required to sequence an antibody to under 72 hours, a dramatic reduction as compared to the average 2-4 months for classical Edman sequencing of an entire antibody. While we demonstrate CSPS on monoclonal antibodies, the underlying techniques are not antibody-specific and the results indicate that CSPS has the potential to be a disruptive technology for all protein sequencing applications.

Modelling HIV-1 Evolution under Drug Selective Pressure

01/16/2009 - 16:00
01/16/2009 - 17:00

This talk will address methods for the analysis and modeling of HIV evolution, including phylogenetics and the relationship between genotype and phenotype of the HIV virus.

Kernel methods for the prioritization of candidate genes

12/19/2008 - 11:00
12/19/2008 - 12:00

Hunting disease genes is a problem of primary importance in biomedical research. Biologists usually approach this problem in two steps: first, a set of candidate genes is identified using traditional positional cloning or high-throughput genomics techniques; second, these genes are further investigated and validated in the wet lab, one by one. To speed up discovery and limit the number of costly wet lab experiments, biologists must test the candidate genes starting with the most probable ones. So far, biologists have relied on literature studies, extensive queries to multiple databases and hunches about expected properties of the disease gene to determine such an ordering. Recently, the data mining tool ENDEAVOUR has been introduced, which performs this task automatically by relying on different genome-wide data sources, such as Gene Ontology, literature, microarray, sequence and more. A novel kernel method that operates in the same setting is presented: based on a number of different views on a set of training genes, a prioritization of test genes is obtained. A thorough theoretical analysis of the guaranteed performance of the method will also be presented. Finally, the application of the method to the disease data sets on which ENDEAVOUR has been benchmarked will be reported, showing a considerable improvement in empirical performance.
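The general idea of ranking test genes from several views of a training set can be illustrated with a minimal sketch. This is not the kernel method presented in the talk, whose details the abstract does not give. It assumes, purely for illustration, that each view is a precomputed kernel (similarity) matrix, that views are combined by unweighted averaging, and that a candidate is scored by its mean similarity to the training genes. The function names and the toy matrices are hypothetical.

```python
def average_kernels(kernels):
    """Combine per-view kernel matrices by unweighted averaging (an assumed, simple choice)."""
    n = len(kernels[0])
    m = len(kernels)
    return [[sum(K[i][j] for K in kernels) / m for j in range(n)] for i in range(n)]


def prioritize(kernel, training, candidates):
    """Rank candidate gene indices by mean similarity to the training genes, best first."""
    scores = {
        c: sum(kernel[c][t] for t in training) / len(training)
        for c in candidates
    }
    return sorted(candidates, key=lambda c: scores[c], reverse=True)


# Toy data (made up): 4 genes, two views; genes 0 and 1 are known disease genes.
view_a = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.5, 0.2],
    [0.6, 0.5, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
view_b = [
    [1.0, 0.7, 0.4, 0.3],
    [0.7, 1.0, 0.7, 0.2],
    [0.4, 0.7, 1.0, 0.1],
    [0.3, 0.2, 0.1, 1.0],
]

combined = average_kernels([view_a, view_b])
ranking = prioritize(combined, training=[0, 1], candidates=[2, 3])
print(ranking)  # gene 2 is more similar to the training genes than gene 3
```

In this toy example gene 2 would be sent to the wet lab before gene 3. A real method would typically learn how to weight the views rather than average them equally.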

Faithful modeling of transient expression and its application to elucidating negative feedback regulation

10/30/2008 - 15:00

Modeling and analysis of genetic regulatory networks is essential both for better understanding their dynamic behavior and for elucidating and refining open issues. We hereby present a discrete computational model that effectively describes the transient and sequential expression of a network of genes in a representative developmental pathway. Our model system is a transcriptional cascade that includes positive and negative feedback loops directing the initiation and progression through meiosis in budding yeast. The computational model allows qualitative analysis of the transcription of early meiosis-specific genes, specifically, Ime2 and their master activator, Ime1. The simulations demonstrate a robust transcriptional behavior with respect to the initial levels of Ime1 and Ime2. The computational results were verified experimentally by deleting various genes and by changing initial conditions. The model has a strong predictive aspect, and it provides insights into how to distinguish among and reason about alternative hypotheses concerning the mode by which negative regulation through Ime1 and Ime2 is accomplished. Some predictions were validated experimentally, for instance, showing that the decline in the transcription of IME1 depends on Rpd3, which is recruited by Ime1 to its promoter. Finally, this general model promotes the analysis of systems that are devoid of consistent quantitative data, as is often the case, and it can be easily adapted to other developmental pathways.

Local Properties of Biological Networks

10/29/2008 - 11:00
10/29/2008 - 12:00

The study of biological networks has led to the development of a variety of measures for characterizing network properties at different levels. Global analysis provides summary measures such as diameter, clustering coefficients, and degree distribution that describe the network as a whole, whereas local properties, such as the occurrences of motifs and graphlets, allow us to focus on specific phenomena within the network. Local characteristics are suitable for studying networks that are incompletely explored; in particular, they faithfully capture the neighborhoods of those parts of the networks that are better studied. In this talk I will describe several methods to analyze both protein-protein interaction networks (which are undirected graphs) and regulation networks (which are directed), along with the biological insights that they have yielded.
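As a rough illustration of the measures named above, the following sketch computes a degree distribution, a per-node clustering coefficient, and a triangle count (the simplest motif) on a tiny undirected graph. The graph and the function names are made up for the example, and the code is not taken from the talk.

```python
from collections import Counter
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree to the number of nodes that have it (a global summary)."""
    return Counter(len(neigh) for neigh in adj.values())


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves connected."""
    neigh = adj[node]
    k = len(neigh)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def triangle_count(adj):
    """Count triangles; each one is found once from each of its three nodes."""
    seen = sum(
        1 for n in adj for u, v in combinations(adj[n], 2) if v in adj[u]
    )
    return seen // 3


# Toy interaction network: a triangle A-B-C with a pendant node D attached to C.
toy = build_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

print(degree_distribution(toy))
print(local_clustering(toy, "C"))
print(triangle_count(toy))
```

The local measures depend only on a node's immediate neighbourhood, which is why they remain usable when distant parts of the network are still unexplored.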


07/24/2008 - 16:30
07/24/2008 - 17:30

RNA binding proteins (RBPs) are emerging as multifunctional entities that act on the mRNA biogenesis pathway from transcription initiation through translation and decay. Association of RBPs with mRNAs through untranslated sequence elements has been proposed to constitute a mechanism that allows for the coordination of gene expression at the post-transcriptional level, defining post-transcriptional operons (Keene, 2002). We have recently characterized the mRNA interactome of two human mRNA binding proteins (Gama-Carvalho, 2006). Classification of the target mRNAs into Gene Ontology (GO) groups suggests that each protein associates with functionally coherent mRNA populations, supporting a coordinating role in gene expression. To understand whether these RNA populations contain distinctive sequence elements, we have performed sequence motif searches for consensus binding sites in the whole transcript, coding sequence and UTRs, and compared the results with a non-associated mRNA population. The results support the model of differential interaction between functionally related mRNA populations and specific regulatory RNA binding proteins through the presence of untranslated sequence elements for regulation (USER) codes. Identification of potential gene networks in the population of target mRNAs using the Ingenuity Pathways KnowledgeBase suggests that these proteins may be involved in the coordination of key cellular functions and signaling pathways, with potential antagonistic effects. We have obtained preliminary evidence for regulatory functions of both proteins on their target mRNAs and we now aim to model these RNA-protein interaction networks and their effects on gene expression, as well as to develop methods to identify USER codes involved in the post-transcriptional coordination of gene expression.

Fully compressed Suffix Trees

05/08/2008 - 17:00

Suffix trees are by far the most important data structure in
stringology, with myriads of applications in fields like
bioinformatics and information retrieval. Classical representations of
suffix trees require O(n \log n) bits of space, for a string of size
n. This is considerably more than the n \log_2\sigma bits needed for
the string itself, where \sigma is the alphabet size. The size of
suffix trees has been a barrier to their wider adoption in practice.
Recent compressed suffix tree representations require just the space
of the compressed string plus \Theta(n) extra bits. This is already
spectacular, but still unsatisfactory when \sigma is small as in DNA

In this talk we introduce the first compressed suffix tree representation that breaks this linear-space barrier. Our representation requires sublinear extra space and supports a large set of navigational operations in logarithmic time. An essential ingredient of our representation is the lowest common ancestor (LCA) query. We reveal important connections between LCA queries and suffix tree navigation.
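The space figures quoted in this abstract can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: it takes the hidden constants in O(n \log n) and \Theta(n) to be 1, which is an assumption and not a property of any particular data structure, and the function names are hypothetical.

```python
import math


def plain_text_bits(n, sigma):
    """Bits needed for the string itself: n * log2(sigma)."""
    return n * math.log2(sigma)


def classical_order_bits(n):
    """Order of magnitude of a classical suffix tree: n * log2(n), constant assumed to be 1."""
    return n * math.log2(n)


def linear_extra_bits(n, c=1.0):
    """A Theta(n) overhead with an assumed constant c (bits per symbol)."""
    return c * n


n = 2 ** 20  # about one million symbols
sigma = 4    # DNA alphabet {A, C, G, T}

text = plain_text_bits(n, sigma)
classical = classical_order_bits(n)
extra = linear_extra_bits(n)

print(f"text:        {text:.0f} bits")
print(f"classical:   {classical:.0f} bits ({classical / text:.0f}x the text)")
print(f"Theta(n):    {extra:.0f} bits ({100 * extra / text:.0f}% of the text)")
```

With only 2 bits per DNA symbol, even a single extra bit per symbol is a large fraction of the text size, which is the motivation given above for seeking sublinear extra space.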

Dynamic Energy Budget Theory: A General Mathematical Theory in Biology, Empirically Tested for the Major Groups of Organisms

04/03/2008 - 16:00
04/03/2008 - 17:00

Dynamic Energy Budget (DEB) theory, developed by Bas Kooijman at the Department of Theoretical Biology in the Free University of Amsterdam, is the first general biological theory at the organism level since the theory of evolution. It is a mathematical theory, comprising all taxonomic groups, with extensive empirical testing and already several practical applications, namely in toxicology (where its use is recommended by ISO and OECD), environmental engineering and biological engineering. It is based on simple mechanistic rules for the uptake of energy and nutrients and the consequences of these rules for physiological organization along the life cycles of organisms.
The broad generality of DEB theory opens very significant research opportunities. In particular, these include the use of data mining and text mining techniques to obtain parameters for the widest set of organisms possible, and the use of DEB as a macroscopic theory to establish a framework for systems biology approaches.

Knowledge discovery in environmental microbiology and physiology: problems, tools and protocols

03/13/2008 - 16:00

The present talk deals with dynamical processes observed at the organismal level in conditions close to real-world environments. The relatively small amount of data and replicates available in such experiments poses specific challenges to the design, deployment and application of integrated computational tools for data management and analysis. These challenges are exemplified by microcosm studies of phototrophic biofilms and in-vivo circadian rhythms of body temperature in mammals. On the basis of these experiences, I will discuss potential alterations to common protocols of interdisciplinary collaboration, which might be useful in enhancing the efficiency of computational tools in knowledge discovery.