INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed


Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa


What are functional modules in biological networks?

02/28/2008 - 16:00
02/28/2008 - 17:00

Modularity has become in recent years a widely accepted feature of biological networks. However, it seems to mean different things in different networks and even within the same type of network. This poses a challenge to the development of methods to partition networks into functionally meaning entities. I will discuss in my talk modularity in the context of protein interaction networks, from method development to evolutionary studies.

Identification of Transcription Factor Binding Sites in Promoter Regions by Modularity Analysis of the Motif Co-Occurrence Graph

01/31/2008 - 16:00
01/31/2008 - 17:00

Many algorithms have been proposed to date for the problem of finding biologically significant motifs in promoter regions. They can be classified into two large families: combinatorial methods and probabilistic methods. Probabilistic methods have been used more extensively, since they require less input from the user, and their output is easier to interpret. Combinatorial methods have the potential to identify hard to detect motifs, but their output is much harder to interpret, since it may consist of hundreds or thousands of motifs. In this work, we propose a method that processes the output of combinatorial motif finders in order to find groups of motifs that represent variations of the same motif, thus reducing the output to a manageable size. This processing is done by building a graph that represents the co-occurrences of motifs, and finding communities in this graph. We show that this innovative approach leads to a method that is as easy to use as a probabilistic motif finder, and as sensitive to low quorum motifs as a combinatorial motif finder. The method was integrated with two combinatorial motif finders, and made available on the Web, integrated in an application that can be used to analyze promoter regions in S. cerevisiae. Experiments performed using this system show that the method is effective in the identification of relevant binding sites

Using sequence and expression data to predict microRNA targets in animals

12/20/2007 - 17:00

MicroRNAs are short (20-22) nucleotide non-coding RNAs involved in post-transcriptional regulation of gene expression. One of the essential requirements to understand the function of a microRNA is to know the genes it regulates, the so-called targets of the microRNA. It is believed that in plants most target sites present a near perfect complementarity to the sequence of the microRNA. However, in animals most target sites present lower number of complementary nucleotides. It is therefore difficult to design target prediction methods that simultaneously have high specificity and sensitivity. It has recently been discovered that in animals microRNAs not only suppress translation but also induce the destabilization of mRNA transcripts. This discovery has opened up the possibility of using data from microarray assays, performed on cells where microRNA expression is modified, to predict its targets. In the first part of the talk we will review current sequence-based microRNA target prediction tools. In the second part of the talk we will show how microarray data can be used to improve microRNA target prediction.

Location Proteomics Using Machine Learning Techniques

12/18/2007 - 17:00
12/18/2007 - 18:00

Fluorescent microscopy is a method by which a labeled protein can be imaged inside a cell. Such images can be used to determine the subcellular location of the protein. I will show how machine learning techniques have been used to automate this task. I will present the basic methods used as well as some more recent work.

Design of a web based typing tool

12/06/2007 - 17:00
12/06/2007 - 18:00

ccrB typing, based on the DNA sequencing of an internal fragment of ccrB, was developed as a potential first-line SCCmec typing strategy for methicillin resistant Staphylococcus aureus, since the ccrB sequence is part of the ccrAB locus, whose allotypes are used for the definition of SCCmec types. Clustering of ccrB sequences has been shown to properly discriminate between different SCCmec types (Oliveira et al, J. Antimicrob Chemother 58(1):23-30). To make this approach available to the general public, a web based typing tool interface was developed so that users can submit ccrB sequences and obtain an automatic assignment to a ccrB allele and presumptive SCCmec type classification.

The website serves as a front-end for a database and user management system, coupled with two sequence alignment programs (MUSCLE and Clustal) and an alignment and tree viewer software (Jalview). For each SCCmec type, a set of known sequences are defined as a prototypes. The user submitted ccrB sequences are then, after an automatic trimming, compared by multiple sequence alignment with the defined prototypes, and the most similar prototype defines the ccrB allele, allowing the inference of ccrAB allotype and SCCmec type.

The website ( is being used since late October 2007. At the website launch date the database had 98 ccrB sequences, including 17 prototype sequences, with references and with other important isolate information. All the public information can be easily extracted to csv format and an online tutorial explains the users how to work with the web tool. The data can be saved privately or it can be submitted to the public database after being reviewed by a system curator.

Web-based applications have proved to be the best approach for the comparison of sequence based typing methodologies results. Anyone with internet connection can contribute with their data to the public database and help to validate and improve the typing methodology. The implemented website has demonstrated its usefulness for ccrB typing giving good results for several staphylococci species, allowing for an expanding database of curated ccrB sequences. Additionally the web based typing tool interface was implemented to be rapidly configurable, and can be used as a framework for other sequence based typing methods based either on direct sequence similarity or on the comparison of repeated patterns.

Speculations on a New Approach to Modeling Biological Systems

11/29/2007 - 17:00
11/29/2007 - 18:00

Computational systems biology complements experimental biology in unique ways that are hoped to reveal insights and a depth of understanding not achievable without systems approaches. A major challenge of systems biology continues to be the determination of parameter values for mathematical models. While some models can be analyzed in symbolic form, these are few and far between, and the lack of parameter values is a true obstacle for most computational analyses of realistic biological phenomena. As a consequence, computational modelers tend to take on a problem only if there is a relatively solid database for parameter estimation. Interestingly, biologists very often have very detailed mental models of the phenomenon they are investigating and are not really interested in absolutely precise numerical results, as long as they can test relevant, semi-quantitative hypotheses. However, neither they nor their modeling colleagues have the means of translating the mental models into numerical mathematical structures that would allow advanced diagnosis and testing. I will speculate in this presentation on a possible way to bridge the cleft between mental and numerical models, using modern methods from Biochemcial Systems Theory. The envisioned technique is tentatively called "concept map modeling" and seems quite reasonable, but I do not have proof yet that it will actually work in real-world applications.

An evaluation of the impact of side chain positioning on the accuracy of discrete models of protein structures

11/15/2007 - 17:00
11/15/2007 - 18:00

Discrete models are important to reduce the complexity of the protein folding problem. However, a compromise must be made between the model complexity and the accuracy of the model. Previous work by Park and Levitt has shown that the protein backbone can be modeled with good accuracy by four state discrete models. Nonetheless, for abinitio protein folding, the side chains are important to determine if the structure is physically possible and well packed. We extend the work of Park and Levitt by taking into account the positioning of the side chain in the evaluation of the accuracy. We show that the problem becomes much harder and more dependent on the type of protein being modeled. In fact, the structure fitting method used in their work is no longer adequate to this extended version of the problem. We propose a new method to test the model accuracy. The presented results show that, for some proteins, the discrete models with side chains cannot achieve the accuracy of the backbone discrete models. Nevertheless, for the majority of the proteins an RMSD of four angstrom or less is obtained, and, for many of those, we reach an accuracy near the two angstrom limit. These results prove that discrete models can be used in protein folding for obtaining low resolution models. Since the side chains are already present in the models, the refinement of these solutions is simpler and more effective.

Mining Queries

11/06/2007 - 16:00

User queries in search engines and Websites give valuable information on the interests of people. In addition, clicks after queries relate those interests to actual content. Even queries without clicks or answers imply important missing synonyms or content. In this talk we show several examples on how to use this information to improve the performance of search engines, to recommend better queries, to improve the information scent of the content of a Website and ultimately to capture knowledge, as Web queries are the largest wisdom of crowds in Internet.

Efficient learning of Bayesian network classifiers: An extension to the TAN classifier

10/25/2007 - 17:00

We introduce a Bayesian network classifier less restrictive than Naive Bayes (NB) and Tree Augmented Naive Bayes (TAN)classifiers. Considering that learning an unrestricted network is unfeasible the proposed classifier is confined to be consistent with the breadth-first search order of an optimal TAN. We propose an efficient algorithm
to learn such classifiers for any score that decompose over the network structure, including the well known scores based on information theory and Bayesian scoring functions. We show that the induced classifier always scores better than or the same as the NB and TAN classifiers. Experiments on modeling transcription factor binding sites show that, in many cases, the improved scores translate into increased classification accuracy.

Compressing Web Graphs as Texts

09/28/2007 - 16:30
09/28/2007 - 17:30

The need to run different kinds of algorithms over large Web graphs motivates the research for compressed graph representations that permit accessing without decompressing them. At this point there exist a few such compression proposals, some of them very effective in practice. In this talk we introduce a novel approach to graph compression, based on regarding the graph as a text and using existing techniques for text compression/indexing. This permits accessing the graph efficiently without decompressing it, and in addition brings in new functionalities over the compressed graph. Our experimental results show that our technique has the potential of being competitive with the best alternative techniques, yet not fully satisfactory. Then we introduce a second approach, where we go back to pure compression. By far the best current result is the technique by Boldi and Vigna, which takes advantage of several particular properties of Web graphs. We show that the same properties can be exploited with a different and elegant technique, built on Re-Pair compression, which achieves about the same space but much faster navigation of the graph. Moreover, the technique has the potential of adapting well to secondary memory. Finally, we comment on ongoing work to combine those approaches. The successful scheme can be enriched with succinct data structures so as to permit further graph traversal operations.