INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed


Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa


A Estatística ao Encontro da Biologia Molecular

01/25/2006 - 14:00
01/25/2006 - 15:00

A associação entre a explosão de dados relacionados com genética molecular e o avanço tecnológico a nível de meios informáticos é, presentemente, um desafio para o estatístico na medida em que é requerido um melhoramento dos métodos existentes e o desenvolvimento de métodos de inferência mais eficientes para lidar com dados de natureza tão complexa.
Pretende-se mostrar como a estatística é fundamental na abordagem de sistemas tão diversos como a estrutura de proteínas e microarrays. Estudos concretos em Biologia Molecular, com objectivos muito específicos, servem de base à apresentação das metodologias estatísticas mais aplicadas nesta área. Salienta-se também a importância do software disponível, nomeadamente alguns packages do R e programas com interface web.

Syntons, metabolons and interactons: an exact graph-theoretical approach for exploring neighbourhood between genomic and functio

01/20/2006 - 11:00
01/20/2006 - 12:00

Modern comparative genomics does not restrict to sequence but involves the comparison of metabolic pathways or protein-protein interactions as well. Central in this approach is the concept of neighbourhood between entities (genes, proteins, chemical compounds). Therefore there is a growing need for new methods aiming at merging the connectivity information from different biological sources in order to infer functional coupling. We present a generic approach to merge the information from two or more graphs representing biological data. The method is based on two concepts. The first one, the correspondence multigraph, precisely defines how correspondence is performed between the primary data-graphs. The second one, the common connected components, defines which property of the multigraph is searched for. Although this problem has already been informally stated in the past few years, we give here a formal and general statement together with an exact algorithm to solve it.

Improved Indexing of Text using the Ziv-Lempel Trie

01/06/2006 - 14:30
01/06/2006 - 15:30

A full-text index provides fast search for any pattern in a text. Tradicional full-text indexes, however, have a serious problem with space usage. A recent trend is to develop indexes that exploit the compressibility of the text so that their size is a function of the size of the compressed text. This field has introduced the concept of self-indexes, which, in addition to providing index capabilities, are capable of replacing the original text. The two most successful lines of research are the ones exploring compressed suffix arrays and the Burrows-Wheeler transform. In this talk we will describe an algorithm that builds an index based on the well known Ziv-Lempel data compression technique. This algorithm represents a significant improvement over previous methods. We expect that it will be very competitive against the two mainstream approaches.

Networks from Genomes and Metabolomes

11/25/2005 - 15:00
11/25/2005 - 16:00

In this talk we will present two unrelated methods for constructing networks from molecular sequence data
and from metabolic profiles.

Phylogenetic networks are a generalization of phylogenetic trees that permit the representation of conflicting signal or alternative phylogenetic histories. Networks can provide a useful tool for phylogenetic analysis when the underlying evolutionary history is non treelike. For example, recombination, hybridization, and lateral gene transfer can all lead to histories that are not adequately represented by a single tree. In the first part
of the talk, we present some techniques for the whole-genome scale construction of phylogenetic networks.

The second part of the talk is concerned with reconstruction of metabolic networks. A well-studied problem in biochemical system analysis is the reconstruction of model parameters from metabolic profiles. Due to the high dimensionality of this problem, most current methods for its solution are based on heuristic algorithms. However, such methods often produce point estimates for parameters, with no guarantees of accuracy. Here, we present an alternative way to solve the reconstruction problem that is based on the mathematical theory of interval analysis that has the advantage of providing reliable estimates of parameters.

Dotted Suffix Trees: A Structure for Approximate Text Indexing

11/18/2005 - 14:30
11/18/2005 - 15:30

The problem we address is text indexing for approximate matching. We consider that we are given a text T which undergoes some preprocessing to generate an index. We can later query this index to identify the places where a string occurs up to a certain number of allowed substitutions k. We present a structure for indexing which occupies space O(n log^k n) in the average case, independent of alphabet size, n being the text size. This structure support searching in O(m^{k+1}) time, for patterns of size m, again independent of alphabet size. The construction of the structure has time bound by O(kN|S|), where N is the number of nodes in the index and |S| the alphabet size.

Biclustering Time-Series Expression Data: from Gene Expression to Biological Processes and Regulatory Networks

09/30/2005 - 11:30
09/30/2005 - 12:30

In this talk we will focus on biclustering, a technique that has recently shown to be remarkably effective in a variety of applications in biological data analysis and other data mining tasks. The importance of biclustering in the identification of groups of genes with coherent expression patterns (in a subset of the experimental conditions), and its advantages (when compared to clustering) in the discovery of local expression patterns has been extensively studied and documented. The use of these techniques is therefore critical to identify the dynamics of biological systems as well as the different groups of genes involved in each biological process. This talk is organized in three main parts. First, we will talk about gene expression data analysis and its general goals and challenges. We will then talk about biclustering and its applications to expression data analysis. Finally, we will focus on biclustering in time-series expression data and its applications to the discovery of gene regulatory modules and networks.

Scheduling algorithms for large DSP problems

07/14/2005 - 16:30
07/14/2005 - 17:30

Over the past years, several scheduling techniques have been proposed for high level synthesis. However, the main focus of the research in this area has been in the development of optimal algorithms for small problems. Since the scheduling problem as been shown to be NP-complete, simpler algorithms must be investigated for scheduling large problems. In this sense, it will be shown that by using simple algorithms it is possible to acheive solutions close to optimal in a very short time.

Emparelhamento em texto comprimido

06/16/2005 - 16:30

A apresentação centra-se em emparelhamento directamente no texto comprimido. Será abordada a técnica de compressão "códigos de redundância mínima" (Códigos de Huffman), dando especial interesse à codificação de "Tagged Huffman Coding". Baseado nesta codificação, serão apresentadas duas formas de pesquisa no texto comprimido.

Fast Detection of Common Sequence Structure Patterns in RNAs

06/02/2005 - 16:30
06/02/2005 - 17:30

RNA is an essential element in every living organism where it plays crucial roles. Since RNA structure is strictly related to its function, it is important to find common RNA motifs in order to find functional similarities. We will present the method proposed by Backofen and Siebert to solve the exact common sequence/pattern problem, which was unsolved until 2004.

Advances in Logistic Regression

05/12/2005 - 16:30
05/12/2005 - 17:30

Logistic regression is one of the workhorses of statistical learning and it has been used and studied for several decades. In this talk I will focuss on two recent advances in logistic regression:

1) Logistic regression under sparseness priors (of the LASSO type) can not be approached using the standard algorithms, due to the non-differentiability of these priors. I will describe bound-optimization algorithms for sparse logistic regression (with parallel or sequential updates). Experimental results on real benchmark data show that sparse logistic regression outperforms both support vector machines and relevance vector machines. Performance bounds on this type of classifiers will be briefly mentioned.

2) In the second part of the talk, I will present a semi-supervised extension of logistic regression, which is able to use both labelled and unlabelled data. A Bayesian formulation and an EM algorithm allow for the automatic adjustment of the tradeoff between the contributions of the labelled and unlabelled data. Encouraging results are presented on benchmark data.