INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed


Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa


O M(in)istério da Educação: ou o Problema da Colocação dos Docentes 2004/2005

04/28/2005 - 16:30
04/28/2005 - 17:30

Todos nós, portugueses, seguimos com atenção o concurso de colocação dos docentes dos ensinos pré-escolar, básico e secundário, ano lectivo de 2004/2005. Pelo que na altura veio a público, ficou claro que no âmago da questão encontrava-se um problema de natureza técnica, matemática, que não fora devidamente valorizado pelo Ministério da Educação. Nesta palestra, pretendemos identificar as questões de índole matemática que se levantam num concurso para colocação de docentes. Em primeiro lugar, iremos perceber como é que da legislação vigente se passa para uma especificação formal. Identificaremos duas especificações possíveis, ambas conformes com a lei, chamando óptima localmente verificável a uma e lexicograficamente óptima à outra. Estas duas especificações têm propriedades diferentes e resultam em colocações diferentes. Em segundo lugar, indicaremos algoritmos para cumprir cada uma das duas especificações. E, em terceiro lugar, enquadraremos o problema da colocação de docentes no contexto dos emparelhamentos estáveis e de peso mínimo, tão queridos dos matemáticos.

Sistema de gestão da informação dos mecanismos de regulação genómica do organismo Saccharomyces cerevisiae

04/21/2005 - 16:30
04/21/2005 - 17:30

Após a sequenciação dos genomas de diversos organismos, passou-se à fase de anotação dos genes, tendo muita desta informação ficado disponível para ser processada e transformada em conhecimento. Neste contexto, é especialmente importante o estudo dos mecanismos de interacções entre genes, ou seja, o estudo das redes de regulação genética. Este trabalho tem como objectivo desenvolver uma plataforma que suporte o estudo destas redes, esperando-se que venha a constituir um repositório integrado de toda a informação relevante para os fenómenos de regulação genómica no organismo Saccharomyces cerevisiae. A plataforma inclui uma base de dados de suporte à informação já adquirida, interfaces para utilizadores finais e para administradores e curadores da informação. Inclui também conectores para implementações de algoritmos de análise e suporte à descoberta de conhecimento.

Indexing email for Fast Search and Filtering

03/31/2005 - 16:30
03/31/2005 - 17:30

The purpose of this project was to provide email indexing for kmail, an open-source email client. As underlying indexing structure, inverted files were used. The project is broadly divided into a core library and its use by kmail. This talk will focus mostly on the core library, its design and implementation. The kmail integration will be presented as a demonstration.

Finding Common motifs in DNA sequences: a survey

03/10/2005 - 16:30
03/10/2005 - 17:30

In recent years, especially after the completion of genome sequencing projects for various organisms, there has been a growing interest in the study of regulation and gene expression mechanisms. The amount of data available makes it unfeasible to pursue a manual analysis calling for some sort of automatic processing. In this context, bioinformatics tools have become more and more central to the activity of biologists. Despite the remarkable success of these tools in some areas of application like gene finding, sequence alignment, etc, there are still problems for which no significant results have been achieved. Notably, the identification of biologically meaningful motifs in cis-regulatory regions remains an open problem. However, many approaches have been proposed and one can find a panoply of published papers describing novel algorithms to address the problem. These algorithms can be roughly classified in two main groups: combinatorial and statistical. In this survey we will concentrate on a specific sub-group of combinatorial algorithms: algorithms based on graph theory. This interest is chiefly motivated by the fact that some recent proposals using this approach have claimed to gain efficiency by avoiding unnecessary explorations of the search space. Furthermore, we will compare their efficiency and operation with other combinatorial algorithms which use other data structures.

A Linear Time Biclustering Algorithm for Time Series Genomic Expression Data

02/24/2005 - 16:30
02/24/2005 - 17:30

Recent developments in DNA chips now enable the simultaneous measure of the expression level of a large number of genes (sometimes all the genes of an organism) for a given experimental condition. Most commonly, gene expression data is arranged in a data matrix, where each gene corresponds to one row and each condition to one column. The conditions may correspond to different time points, different environmental conditions, different organs or different individuals. Simply visualizing this kind of data is challenging. Using it to extract biologically relevant knowledge is even harder.

Several non-supervised machine learning methods have been used in the analysis of gene expression data obtained from microarray experiments. Recently, biclustering, a non-supervised approach that performs simultaneous clustering on the row and column dimensions of the data matrix, has been shown to be remarkably effective in a variety of applications. The goal of biclustering is to find subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated behaviors. In the most common settings, biclustering is an NP-complete problem, and heuristic approaches are used to obtain sub-optimal solutions using reasonable computational resources.

In this talk, we describe a particular setting of the problem, where we are concerned with finding biclusters in time series expression data, and present a linear time biclustering algorithm to achieve this goal.

When analyzing time series expression data, with the goal of isolating coherent activity between genes in a subset of conditions, it is reasonable to restrict the attention to biclusters with contiguous columns. We support this view by assuming that the activation of a set of genes under specific conditions corresponds to the activation of a particular biological process. As time goes on, biological processes start and finish, leading to increased (or decreased) activity of sets of genes that can be identified because they form biclusters with contiguous columns. In this setting, we are interested in finding biclusters where the columns are consecutive in time. For this particular version of the problem, we propose an algorithm that finds and reports all relevant biclusters in time linear on the size of the data matrix. This impressive reduction in complexity is obtained by manipulating a discretized version of the data matrix and by using advanced string manipulation techniques based on suffix trees.

The talk will give a short introduction to biclustering and suffix trees, present the biclustering algorithm and show results in synthetic data and preliminary results on a real biological data set from Yeast, that show the effectiveness of the approach.

An Efficient Algorithm for Generating Super Condensed Neighborhoods

02/10/2005 - 16:30
02/10/2005 - 17:30

Approximate string matching is the problem of finding a pattern in a text allowing for some errors to occur.
Indexing methods for the approximate string matching problem spend a considerable effort generating condensed neighborhoods. In this session we point out that condensed neighborhoods are not a minimal representation of a pattern neighborhood. We show that we can restrict our attention to super condensed neighborhoods which are minimal. We then present a basic algorithm for generating Super Condensed Neighborhoods. The algorithm runs in O(m s), where m is the pattern size and s is the size of the super condensed neighborhood. Previous algorithms took O(m c) time, where c is the size of the condensed neighborhood. We present an implementation of the algorithm using modern Bit-Parallelism and Increased Bit-Parallelism techniques.

2004 - A good year for grammatical inference

01/10/2005 - 15:00
01/10/2005 - 16:00

This talk will address recent developments in grammatical inference, with emphasis on the ADIOS approach (Automatic DIstillation Of Structure), a system for the inference of natural language grammars that has been shown to perform well in language proficiency tests. The methods achieves this remarkable result by using better statistical techniques for distinguishing constituents from non-constituents.

Pieter Adriaans, a reference name in machine learning and computer science, is the author of numerous articles and a number of books on topics related to both computer science and philosophy, which include books on machine learning, systems analysis, and, more recently, on client/server and distributed databases as well as data mining. He has been involved in numerous projects, of which the most famous may be the RobotSail project, that had as one result the Syllogic Sailing Lab., a sailing boat that learned how to sail from experienced skippers and participated in the Singlehanded TransAtlantic Race 2000. He is currently a professor at the University of Amsterdam and senior research advisor of Perot Systems Corporation.

Learning and Recommending Shortcuts in Semantic Peer-to-Peer Networks

12/09/2004 - 15:00
12/09/2004 - 17:00

A major problem within peer-to-peer systems is to find the best peer given a certain query. Inspired by the work in the area of social networks we present a novel peer-to-peer system called INGA (Interest-based Node Grouping Architecture). Peers cooperate to efficiently route queries along adaptive shortcuts based overlays using only local knowledge. We propose active and passive shortcut creation strategies and a new routing algorithm that combines a greedy, high degree and flooding based search depending on one's knowledge. We quantify the benefit of the overlay network by comparing the performance of INGA in the SWAP simulation infrastructure against a simple Gnutella style network, against recently proposed shortcut networks without similarity metrics and against a relaxation based approach. While obtaining the same recall we show in our experiments that with INGA we half the messages for a query.

Construção semi-automática de uma ontologia sobre a estrutura de artigos técnicos

11/18/2004 - 16:00
11/18/2004 - 17:00

Apesar do sucesso da pesquisa de informação na "Internet" baseada em palavras chave, uma parte significativa das tarefas de pesquisa requer informação semântica sobre os dados. Propõe-se uma abordagem que combina uma pequena ontologia, construída manualmente, com um método de associação automática dos conceitos a partes de artigos técnicos disponíveis na "Web". Esta abordagem, combinada com uma linguagem de procura especializada, permite realizar pesquisas avançadas que não são possíveis usando as ferramentas actuais. São apresentados resultados preliminares sobre a precisão da associação dos conceitos aos dados, e sobre a correcção dos resultados das pesquisas.

Utilização da Estrutura de Ligações da Web em Problemas de Recuperação de Informação

11/04/2004 - 16:00
11/04/2004 - 17:00

Entre as muitas novas técnicas de Recuperação de Informação (RI) criadas no contexto da Web, análise de ligações é uma que tem atraído grande atenção. Neste trabalho, estudamos como ligações entre páginas Web podem ser aplicadas na resolução de dois problemas distintos: (a) ordenação de respostas a uma consulta e (b) classificação de documentos da Web. Para isso, modelos formais baseados em redes Bayesianas são propostos e validados através de testes executados numa colecção extraída da Web brasileira. Os resultados mostram que, efectivamente, ligações entre páginas Web são uma fonte de evidência importante, tanto para ordenar como para classificar documentos. Em ambos os casos, combinação de informação de ligações entre páginas Web com informação sobre o conteúdo das páginas produz resultados melhores do que aqueles obtidos com o uso de cada fonte de evidência isoladamente.