INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed


Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa


Functional organization of chromosomes in the mammalian cell nucleus

11/27/2006 - 14:00
11/27/2006 - 15:00

Chromosomes are not randomly folded in a spaghetti-like state in the mammalian cell nucleus, as initially thought, but occupy distinct territories. Recent studies show that these chromosome territories have preferential arrangements in different cell types, which correlate with the kinds of chromosome rearrangements that occur preferentially in each cell type. Evidence for a growing number of long-range interactions between DNA segments in the same or different chromosomes has raised the possibility of a three-dimensional network of genome interactions. As the long-range interactions described so far correlate with gene activity states, they are likely to influence and be influenced by the transcriptome of each cell type. We propose that this interchromosomal network of interactions contains epigenetic information and determines cell-type-specific chromosome conformations and rearrangements.

Network Inference From Co-occurrences

11/16/2006 - 14:00
11/16/2006 - 15:00

We consider the problem of inferring the structure of a network from co-occurrence data: observations that indicate which nodes occur in a signaling pathway but do not directly reveal node order within the pathway. This problem is motivated by network inference problems arising in computational biology and communication systems, in which it is difficult or impossible to obtain precise time-ordering information. Without order information, every permutation of the activated nodes leads to a different feasible solution, resulting in a combinatorial explosion of the feasible set. However, physical principles underlying most networked systems suggest that not all feasible solutions are equally likely. Intuitively, nodes that co-occur more frequently are probably more closely connected. Building on this intuition, we model path co-activations as randomly shuffled samples of a random walk on the network. We derive a computationally efficient network inference algorithm and, via novel concentration inequalities for importance sampling estimators, prove that a polynomial-complexity Monte Carlo version of the algorithm converges with high probability.
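The intuition that frequently co-occurring nodes are probably closely connected can be illustrated with a simple pair count. This is only a minimal sketch of that intuition, not the random-walk inference algorithm of the talk; the function name `cooccurrence_counts` and the four sample observations are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(observations):
    """Count how often each unordered pair of nodes appears in the same observation."""
    counts = Counter()
    for nodes in observations:
        # Each observation is an unordered set of activated nodes;
        # sorting gives every pair a canonical (a, b) key with a < b.
        for a, b in combinations(sorted(set(nodes)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical co-occurrence data: which nodes were active together,
# with no information about their order along the path.
observations = [
    {"A", "B", "C"},
    {"A", "B"},
    {"B", "C", "D"},
    {"A", "B", "D"},
]

counts = cooccurrence_counts(observations)
# Pairs ranked by co-occurrence frequency, most frequent first.
ranked = counts.most_common()
```

Under these assumptions the pair with the highest count is the most plausible direct link; the approach described in the talk goes further by weighting whole orderings of each observation.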

Knowledge Discovery in Genomics and BioIntelligence Research

10/30/2006 - 11:00
10/30/2006 - 12:00

Knowledge discovery is the process of developing strategies to discover useful and ideally all previously unknown knowledge from historical or real-time data. Applied to high throughput genomics applications, knowledge discovery processes will help in various research and development activities, such as (i) studying data quality for possible anomalous or questionable expressions of certain genes or experiments, (ii) identifying relationships between genes and their functions based on time-series or other high throughput genomics profiles, (iii) investigating gene responses to treatments under various conditions such as in-vitro or in-vivo studies, and (iv) discovering models for clinical diagnosis/classifications based on expression profiles among two or more classes.
This presentation consists of three parts. In part one, we provide an overview of knowledge discovery in genomics and the BioMine project. In part two, we describe some of our case studies using the BioMiner data mining software that we have built in this project. These are all cases in which real genomics data sets (obtained from public or private sources) have been used for tasks such as gene function identification and gene response analysis. We will describe a few examples explaining the complexities and challenges in dealing with real data. In the last part, we share the experience we have gained over the last 6 years and describe our current activities and future plans in the BioIntelligence research direction.

Dynamic Entropy-Compressed Sequences and Applications

10/09/2006 - 16:00
10/09/2006 - 17:00

Data structures are called succinct when they take little space (usually meaning of lower order) compared to the data they give access to. A more ambitious challenge is that of compressed data structures, which aim at operating within space proportional to that of the compressed data they give access to. Designing compressed data structures goes beyond compression in the sense that the data must be manageable in compressed form without first decompressing it. This is a trend that has gained much attention in recent years. In this talk we will introduce a simple data structure for managing bit sequences, so that the space required is essentially that of the zero-order entropy of the sequence, and the operations of inserting/deleting bits, accessing a bit position, and computing rank/select over the sequence can all be done in logarithmic time. The rank operation gives the number of 1 (or 0) bits up to a given position, whereas select gives the position of the j-th 1 (or 0) bit in the sequence. This basic result has a surprising number of consequences. We show how it permits obtaining novel solutions to the dynamic partial sums with indels problem, dynamic wavelet trees, and dynamic compressed full-text indexes.
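The semantics of rank and select can be pinned down with a naive reference version. This is a sketch of what the operations return, not of the compressed structure from the talk: it scans a plain Python list in linear time and uses no compression. Positions are assumed to be 0-based and inclusive for rank, and j is 1-based for select; both conventions are choices made here.

```python
def rank(bits, bit, pos):
    """Number of occurrences of `bit` in bits[0..pos], inclusive (0-based pos)."""
    return sum(1 for b in bits[:pos + 1] if b == bit)


def select(bits, bit, j):
    """0-based position of the j-th occurrence of `bit` (j counts from 1)."""
    seen = 0
    for pos, b in enumerate(bits):
        if b == bit:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences of bit")


bits = [1, 0, 1, 1, 0, 0, 1]

ones_up_to_3 = rank(bits, 1, 3)     # counts the 1s among 1, 0, 1, 1
third_one = select(bits, 1, 3)      # the 1s sit at positions 0, 2, 3, 6
second_zero = select(bits, 0, 2)    # the 0s sit at positions 1, 4, 5

# A dynamic sequence must also support insertions and deletions of bits.
bits.insert(2, 0)                   # sequence is now 1, 0, 0, 1, 1, 0, 0, 1
ones_after_insert = rank(bits, 1, 3)
```

The structure described in the talk supports the same queries and updates in logarithmic time while storing the sequence in space close to its zero-order entropy.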

Formats and services for data and algorithm interoperation in Bioinformatics

09/25/2006 - 14:00
09/25/2006 - 15:00

Data integration in the life sciences presently faces a conundrum. On the one hand, the diversity of data is increasing as explosively as its volume; on the other hand, the value of individual data sets can only be appreciated when enough of those distinct pieces of the systemic puzzle are put together. Consequently, it is just as imperative to have agreed-upon standard formats as it is that they are not enforced so strictly as to be an obstacle to reporting the very novel data that brings value to systemic integration. In this presentation the emerging use of semantic web technologies is highlighted as regards its practical implications for experimental biology and translational biomedical research. The new integrative technologies create tremendous opportunities for wider participation by both individual and national initiatives in large-scale international research efforts. They also create the challenge of locally developing fluid multidisciplinary capabilities, which are still not the norm in the life sciences. A prototypic integrative infrastructure will be demonstrated to illustrate the obstacles and potential of ontology-driven data processing that can be freely downloaded open source from

Simple discrete-time models of genetic regulatory circuits

04/28/2006 - 14:00
04/28/2006 - 15:00

We describe the modelling of genetic regulatory networks by means of discrete-time piecewise affine dynamical systems. We present the results of applying this modelling analysis to simple circuits, namely positive and negative circuits with one and two genes.
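A discrete-time piecewise affine model of a one-gene circuit can be sketched as below. The update rule x(t+1) = a·x(t) + (1 − a)·H, with H a threshold (step) function of x(t), is an assumed illustrative form and is not taken from the talk; the names `step` and `simulate` and the parameter values are hypothetical.

```python
def step(x, a, threshold, sign):
    """One update of an assumed one-gene piecewise affine circuit.

    x         : current expression level, in [0, 1]
    a         : decay factor, 0 <= a < 1
    threshold : level at which the gene's self-regulation switches
    sign      : +1 for a positive (self-activating) circuit,
                -1 for a negative (self-repressing) circuit
    """
    if sign > 0:
        active = x >= threshold   # activation: production on when x is high
    else:
        active = x < threshold    # repression: production on when x is low
    # Affine on each side of the threshold: decay plus constant production.
    return a * x + (1 - a) * (1.0 if active else 0.0)


def simulate(x0, a, threshold, sign, steps):
    """Iterate the map from x0 and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(step(xs[-1], a, threshold, sign))
    return xs


# Positive circuit: the final state depends on the initial condition.
pos_high = simulate(0.8, a=0.5, threshold=0.5, sign=+1, steps=50)
pos_low = simulate(0.2, a=0.5, threshold=0.5, sign=+1, steps=50)

# Negative circuit: the level keeps crossing the threshold.
neg = simulate(0.2, a=0.5, threshold=0.5, sign=-1, steps=50)
```

With these assumed parameters the positive circuit settles at one of two fixed points depending on where it starts, while the negative circuit oscillates around the threshold instead of settling.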

Hierarchical linear subspace indexing method

03/10/2006 - 14:00
03/10/2006 - 15:00

Traditional multimedia indexing methods are based on the principle of hierarchical clustering of the data space, in which metric properties are used to build a tree that can then be used to prune branches while processing queries. However, the performance of these methods deteriorates rapidly as the dimensionality of the data space increases.

Based on the generic multimedia indexing (GEMINI) approach and lower-bounding methods, a hierarchical linear subspace indexing method will be described that does not suffer from the dimensionality problem. The hierarchical subspace approach offers a fast search method for large content-based multimedia databases.

The approach will be demonstrated on image indexing, in which the subspaces correspond to different resolutions of the images. During content-based image retrieval the search starts in the subspace with the lowest resolution of the images. In this subspace the set of all possible similar images is determined. In the next subspace, additional information corresponding to a higher resolution is used to reduce this set. This procedure is repeated until the similar images can be determined, eliminating the false candidates.

The developed methods of analysis can be generalized to all content-based access methods that are based on information-loss techniques, such as hierarchical clustering, which relies on stepwise digitalization of the space rather than on the reduction of its dimension.
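The coarse-to-fine search can be sketched with a lower-bounding filter in the spirit of GEMINI. This is a minimal sketch under stated assumptions, not the method of the talk: feature vectors are plain lists whose length is divisible by every block size, a lower resolution is obtained by averaging blocks of values, and the names `downsample`, `lower_bound` and `range_query` are hypothetical. The filter relies on the fact that sqrt(block) times the distance between block averages never exceeds the full Euclidean distance, so no truly similar item is discarded.

```python
import math

EPS = 1e-9  # tolerance so rounding errors cannot cause a false dismissal


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def downsample(x, block):
    """Lower-resolution view: average each run of `block` consecutive values."""
    return [sum(x[i:i + block]) / block for i in range(0, len(x), block)]


def lower_bound(x, y, block):
    """Distance in the low-resolution subspace, scaled so that it
    never exceeds the full-resolution Euclidean distance."""
    return math.sqrt(block) * euclidean(downsample(x, block), downsample(y, block))


def range_query(database, query, radius, blocks=(4, 2)):
    """Indices of all vectors within `radius` of `query`.

    Candidates are filtered from the coarsest resolution to the finest;
    the exact distance is computed only for the survivors.
    """
    candidates = list(range(len(database)))
    for block in blocks:
        candidates = [i for i in candidates
                      if lower_bound(database[i], query, block) <= radius + EPS]
    return [i for i in candidates if euclidean(database[i], query) <= radius + EPS]


# Hypothetical 8-dimensional feature vectors.
database = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 2, 1, 2, 1, 2, 1, 2],
    [9, 9, 9, 9, 9, 9, 9, 9],
    [1, 1, 1, 1, 5, 5, 5, 5],
]
query = [1, 1, 1, 1, 1, 1, 1, 1]
result = range_query(database, query, radius=2.5)
```

In this example the two distant vectors are already rejected at the coarsest level, so the full-resolution distance is evaluated only for the two remaining candidates.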

Individual-based modelling of multispecies biofilms - algorithms and applications

02/17/2006 - 15:30

Bacterial activity in nature is predominantly associated with surface-bound microbial communities that form complex and heterogeneous assemblages consisting of single- or multiple-species biofilms. These aggregates are involved in several human activities, ranging from the detrimental effects of unwanted biofilms in human health and industry to beneficial uses in environmental treatment processes. At present, sophisticated 2D/3D biofilm models include first-principles-based descriptions of several processes involved in biofilm formation: attachment of cells to a surface, growth, diffusion and reaction of solutes throughout the biofilm matrix, production of extracellular polymeric substances (EPS), biomass detachment resulting from shear stress, etc. The latest generation of multidimensional models allows the description of multispecies biofilms while including any number of reacting solute species (e.g. carbon source, dissolved oxygen, soluble metabolites, etc.). In this talk, the latest advances in biofilm modelling carried out at the Technical University of Delft will be presented, with emphasis on the individual-based modelling of multispecies biofilms. Case studies will be shown, illustrating how modelling provides surprising insight into many aspects of biofilms, including:
- The relationship between biofilm structure and its activity
- The occurrence of sloughing, i.e. the loss of large amounts of biomass from a biofilm
- Reproducibility of the biofilm structure
- Competition and cooperation between microbial species in multispecies biofilms
- The role of the EPS matrix and of internal storage compounds

Understanding Bio-Complexity with Signal Processing

02/10/2006 - 15:00
02/10/2006 - 16:00

Using signal processing, we wish to gain knowledge about biological complexity and to use this knowledge to engineer better technology. Three areas are identified as critical to understanding bio-complexity: 1) understanding DNA, 2) understanding protein pathways, and 3) evaluating overall biological function subject to external conditions. First, DNA is investigated for coding structure and redundancy, and a new tandem repeat region, an indicator of a neurodegenerative disease, is discovered. Second, the way a single cell moves in response to a chemical gradient, known as chemotaxis, is examined. Inspiration from chemotaxis receptor clustering is shown to improve the sensor array performance of a gradient-source (chemical/thermal) localization algorithm. Implementation of the array is evaluated in diffusive and turbulent environments. We also show how to improve sensor array localization in turbulence by using the cross-correlation method. The work illustrates how signal processing is a tool to reverse engineer complex biological systems, and how a better understanding of biology can improve sensor network localization.

Using a More Powerful Teacher to Reduce the Number of Queries of the L* Algorithm in Practical Applications

02/01/2006 - 16:30
02/01/2006 - 17:30

We propose to use a more powerful teacher to effectively apply query learning algorithms for regular languages in practical, real-world problems. More specifically, we define a more powerful set of replies to the membership queries posed by the L* algorithm that reduces the number of such queries by several orders of magnitude. The basic idea is to avoid the needless repetition of membership queries in cases where the reply will be negative as long as a particular condition is met by the string in the membership query. We present an example of the application of this method to a real problem, that of inferring a grammar for the structure of technical articles.
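The effect of a richer negative reply can be sketched with a small query cache. This is a sketch under stated assumptions, not the authors' teacher: the condition attached to a negative answer is assumed here to be "every string starting with this prefix is also rejected", and the names `CachingOracle`, `teacher` and the toy language a*b* are all hypothetical. The learner (for example, L*) would call `member` instead of asking the teacher directly.

```python
class CachingOracle:
    """Answers membership queries, asking the teacher only when needed."""

    def __init__(self, teacher):
        self.teacher = teacher
        self.rejecting_prefixes = []  # conditions learned from negative replies
        self.teacher_calls = 0

    def member(self, word):
        # If a stored condition already covers this string, the reply is
        # known to be negative and the teacher is not consulted.
        for prefix in self.rejecting_prefixes:
            if word.startswith(prefix):
                return False
        self.teacher_calls += 1
        accepted, dead_prefix = self.teacher(word)
        if not accepted and dead_prefix is not None:
            self.rejecting_prefixes.append(dead_prefix)
        return accepted


def teacher(word):
    """Hypothetical teacher for the toy language a*b* over {a, b}.

    Returns (accepted, dead_prefix). On rejection, dead_prefix is the
    shortest prefix ending in "ba"; no extension of it can be accepted.
    """
    idx = word.find("ba")
    if idx == -1:
        return True, None
    return False, word[:idx + 2]


oracle = CachingOracle(teacher)
queries = ["aab", "aba", "abab", "abaa", "ba", "bab"]
answers = [oracle.member(w) for w in queries]
```

Of the six queries in this toy run, only three reach the teacher; the other three are answered from the conditions returned with earlier negative replies.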