Data standardization is fundamentally prescriptive: no information system can solve the data integration problem without enforcing certain rules. The question, therefore, is where the rules should be prescribed. Most existing data standards prescribe rules over the data itself. However, excessive use of this approach can easily lead to inefficient data representation. An alternative approach enforces conformance rules over the description of the data. Under such a standard, data producers are free to choose the representation of their data, but must describe that representation in a standard manner. With software libraries that can understand the data description, this descriptive approach gives maximal flexibility in data representation while still ensuring data interoperability.
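As a minimal sketch of the descriptive approach (the descriptor vocabulary, field names, and layout below are hypothetical, not taken from any particular standard), a producer ships a machine-readable description of its own layout alongside the data, and a generic library reads any record from that description:

```python
import struct

# Hypothetical descriptor: the producer describes its own binary layout
# in a shared vocabulary instead of conforming to one fixed layout.
descriptor = {
    "byte_order": "<",                    # little-endian
    "fields": [
        {"name": "time", "type": "d"},    # float64
        {"name": "flux", "type": "f"},    # float32
        {"name": "flag", "type": "B"},    # uint8
    ],
}

def read_record(blob, desc):
    """Generic reader: interprets any record from its description."""
    fmt = desc["byte_order"] + "".join(f["type"] for f in desc["fields"])
    values = struct.unpack(fmt, blob[:struct.calcsize(fmt)])
    return dict(zip((f["name"] for f in desc["fields"]), values))

# A producer is free to choose this layout, as long as the descriptor ships too.
blob = struct.pack("<dfB", 12.5, 0.75, 1)
print(read_record(blob, descriptor))  # {'time': 12.5, 'flux': 0.75, 'flag': 1}
```

The interoperability burden moves from the data format to the descriptor vocabulary, which is the only part both sides must agree on.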
There is currently growing interest in the development of computational methods for discovering the mechanisms of gene regulation. In this context, the identification of motifs in genomes is of particular relevance. Likewise, Grid computing has emerged as the technology of choice for problems requiring high-performance parallel computing. This report addresses the application of Grid technologies to the problem of identifying motifs in gene promoter regions. We begin with a description of the state of the art in computational motif discovery methods, analyzing some algorithms and their classification in detail. We then examine the main existing Grid technologies and present references to projects that aimed to apply Grid technologies to biological problems similar to the one described.
The advent of genomics in malaria research is significantly accelerating the discovery of control strategies. Dynamical global gene expression measurements of the parasite's intraerythrocytic developmental cycle (IDC) at 1-hour resolution were recently reported. Moreover, using techniques based on the Discrete Fourier Transform, it was demonstrated that many genes are regulated in a single periodic manner, which made it possible to order genes according to their phase of expression. In this work we present a framework to construct genetic networks from dynamical expression signals. The model adopted to represent these networks is the Probabilistic Genetic Network (PGN), a Markov chain with some additional properties. This model mimics the behavior of a gene as a non-linear stochastic gate, and systems are built by coupling these gates. PGN estimation minimizes the mean conditional entropy to discover the subsets of genes that best predict a target gene at the following time instant. Moreover, we have developed a tool that integrates the mining of dynamical expression signals by PGN design techniques with different databases and biological knowledge. The applicability of this tool for discovering gene networks of the malaria expression regulation system has been validated on simulated data and on real microarray data, using the glycolytic pathway as a gold standard and by constructing a PGN network for the apicoplast. A negative control between these two modules was also confirmed by constructing PGN networks using four genes from glycolysis and four from the apicoplast organelle as seed genes. Together, these results demonstrate the value of the PGN model in generating biologically meaningful networks, including genes not captured by the Fourier approach. We are currently applying the same technique to three malaria strains (3D7, Dd2, HB3) in order to analyze their similarities and differences and to determine whether the three data sets may be combined, which would improve the PGN estimation.
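The predictor-selection step can be illustrated with a small sketch (the gene names and binary toy series are invented; a real PGN design works on quantized microarray signals and richer stochastic gates): for each candidate subset, estimate the mean conditional entropy of the target at the next time instant given the subset at the current instant, and keep the minimizing subset.

```python
from itertools import combinations
from collections import Counter
from math import log2

def mean_conditional_entropy(target, predictors, series):
    """Estimate H(target_{t+1} | predictors_t) from a quantized time series.
    `series` maps gene -> list of discrete expression values over time."""
    joint, cond = Counter(), Counter()
    T = len(series[target])
    for t in range(T - 1):
        x = tuple(series[g][t] for g in predictors)
        joint[(x, series[target][t + 1])] += 1
        cond[x] += 1
    n = T - 1
    return -sum(c / n * log2(c / cond[x]) for (x, _), c in joint.items())

def best_predictors(target, genes, series, k=2):
    """Exhaustively pick the k-gene subset minimizing the conditional entropy."""
    candidates = [g for g in genes if g != target]
    return min(combinations(candidates, k),
               key=lambda s: mean_conditional_entropy(target, s, series))

# Toy data: g3 at time t+1 is the XOR of g1 and g2 at time t; g4 is noise.
series = {"g1": [0, 1, 0, 1, 1, 0, 1, 0],
          "g2": [1, 0, 0, 1, 0, 1, 1, 0],
          "g3": [0, 1, 1, 0, 0, 1, 1, 0],
          "g4": [1, 1, 0, 0, 1, 0, 1, 0]}
print(best_predictors("g3", list(series), series))  # ('g1', 'g2')
```

The real estimation must also cope with few time samples per predictor configuration, which is why the abstract's tool brings in external databases and biological knowledge to constrain the search.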
The combination of high-throughput methods of molecular biology with advanced mathematical and computational techniques has propelled the emergent field of systems biology into a position of prominence. Unthinkable only a decade ago, it has become possible to screen and analyze the expression of entire genomes, simultaneously assess large numbers of proteins and their prevalence, and characterize in detail the metabolic state of a cell population. While very important, the focus on comprehensive networks of biological components is only one side of systems biology. Complementing large-scale assessments, and sometimes at risk of being forgotten, are more subtle analyses that rationalize the design and functioning of biological modules in exquisite detail. This intricate side of systems biology aims at identifying the specific roles of processes and signals in smaller, fully regulated systems by computing what would happen if these signals were lacking or organized in a different fashion. We exemplify this type of approach with a detailed analysis of the regulation of glucose utilization in Lactococcus lactis. This organism is exposed to alternating periods of glucose availability and starvation. During starvation, it accumulates an intermediate of glycolysis, which allows it to take up glucose immediately upon availability. This notable accumulation poses a non-trivial control task that is solved with an unusual, yet ingeniously designed and timed feedforward activation system. The elucidation of this control system required high-precision in vivo data on the dynamics of intracellular metabolite pools, combined with methods of nonlinear systems analysis, and may serve as a paradigm for multidisciplinary approaches to fine-scaled systems biology.
The adaptation of living organisms to their environment is controlled at the molecular level by large and complex networks of genes, mRNAs, proteins, metabolites, and their mutual interactions. In order to understand the overall behavior of an organism, we must complement molecular biology with the dynamic analysis of cellular interaction networks, by constructing mathematical models derived from experimental data, and using simulation tools to predict the behavior of the system under a variety of conditions. Following this methodology, we have started the analysis of the network of global transcription regulators controlling the adaptation of the bacterium Escherichia coli to environmental stress conditions. Even though E. coli is one of the best-studied organisms, it is still poorly understood how a stress signal is sensed and propagated throughout the network of global regulators, so as to enable the cell to respond in an adequate way. Using a qualitative method that is able to overcome the current lack of quantitative data on kinetic parameters and molecular concentrations, we have modeled the carbon starvation response network and simulated the response of E. coli cells to carbon deprivation. This has allowed us to identify essential features of the transition between exponential and stationary phase and to make new predictions on the qualitative system behavior following a carbon upshift. The model predictions have been tested experimentally by means of gene reporter systems.
Genetic Programming (GP) is the automated learning of computer programs. Basically a search process, it is capable of solving complex problems by evolving populations of computer programs, using Darwinian evolution and Mendelian genetics as inspiration. GPLAB is a Genetic Programming toolbox for MATLAB. Besides most of the traditional functionalities used in GP, it also implements two additional features: (1) a method for automatically adapting the genetic operator probabilities at runtime, allowing the use of the toolbox as a test bench for new genetic operators; (2) several of the best state-of-the-art techniques for controlling the well-known bloat problem, including some that automatically resize the population at runtime to save computational resources. Combining a highly modular and adaptable structure with the concern for automatic setting of most parameters, GPLAB suits all kinds of users, from the layman who wants to use it as a "black box", to the advanced researcher who intends to build and test new functionalities. The toolbox and its documentation are freely available for download at http://gplab.sourceforge.net. The latest version ensures minimal compatibility with Octave.
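GPLAB itself is a MATLAB toolbox; as a language-neutral illustration of the evolutionary search GP performs, the following Python sketch evolves expression trees against a toy regression target. All names, operators, and settings here are invented for the example and do not reflect GPLAB's API.

```python
import random

random.seed(0)  # reproducible toy run

# Function and terminal sets for the expression trees.
FUNCS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
         "*": lambda a, b: a * b}
TERMS = ["x", 1.0]

def random_tree(depth=3):
    """Grow a random expression tree of bounded depth."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMS)
    op = random.choice(list(FUNCS))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if not isinstance(tree, tuple):
        return tree                       # numeric constant
    op, a, b = tree
    return FUNCS[op](evaluate(a, x), evaluate(b, x))

def fitness(tree):
    """Squared error against the toy target x**2 + x (lower is better)."""
    return sum((evaluate(tree, x) - (x * x + x)) ** 2 for x in range(-5, 6))

def nodes(tree, path=()):
    """Yield the path of every node, for picking crossover points."""
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], 1):
            yield from nodes(child, path + (i,))

def subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replaced(tree, path, new):
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replaced(parts[path[0]], path[1:], new)
    return tuple(parts)

def crossover(a, b):
    """Graft a random subtree of `b` onto a random point of `a`."""
    return replaced(a, random.choice(list(nodes(a))),
                    subtree(b, random.choice(list(nodes(b)))))

pop = [random_tree() for _ in range(60)]
initial_best = min(map(fitness, pop))
for _ in range(20):
    pop.sort(key=fitness)
    elite = pop[:20]                      # elitist truncation selection
    pop = elite + [crossover(random.choice(elite), random.choice(elite))
                   for _ in range(40)]
best = min(pop, key=fitness)              # never worse than initial_best
```

Note how unconstrained crossover lets trees grow over the generations: this is exactly the bloat problem the abstract mentions, which GPLAB's control techniques are designed to keep in check.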
Typing methods are major tools for the epidemiological characterization of bacterial pathogens, allowing the determination of the clonal relationships between isolates based on their genotypic or phenotypic characteristics. Recent technological advances have resulted in a shift from classical phenotypic typing methods, such as serotyping, biotyping and antibiotic resistance typing, to molecular methods such as restriction fragment length polymorphisms (RFLP), pulsed-field gel electrophoresis (PFGE), and PCR serotyping. With the availability of affordable sequencing methods, another shift occurred towards sequence-based typing methods such as multilocus sequence typing (MLST) and emm sequence typing. Sequence-based methods have a large appeal since they provide unambiguous data and are intrinsically portable, allowing the creation of databases that, if publicly available through the internet, enable the comparison of local data with that of previous studies in different geographical locations. Ideally, an analysis of each typing method, in terms of discriminatory power, reproducibility, typeability, feasibility, and other characteristics, should be performed to better determine which method is appropriate in a given setting. Several molecular epidemiology studies of clinically relevant microorganisms provide a characterization of isolates based on different typing methods. Frequently these studies focus on a comparison between the assigned types of different typing methods from a qualitative point of view, i.e., indicating correspondences between the types of the different methods. Although this may be useful for comparing the genetic backgrounds of the particular set of isolates under study, it does not allow for a broader view of how the results of the different typing methods are related.
In this seminar we present recent work on an online database for a new sequence-based typing method for Staphylococcus aureus, and an online tool that implements a framework of measures allowing the quantitative assessment of the congruence between the results of different typing methods.
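Measures of this kind can be computed from pair counts over the two partitions of the isolates. As a sketch (the isolates and type labels below are invented, and this is not the seminar tool's code), here are two standard partition-agreement measures, the Wallace coefficient and the Adjusted Rand index:

```python
from collections import Counter

def _same_type_pairs(labels):
    """Number of isolate pairs assigned the same type."""
    return sum(c * (c - 1) // 2 for c in Counter(labels).values())

def wallace(a, b):
    """W(a -> b): probability that a pair of isolates sharing a type under
    method `a` also shares a type under method `b` (directional)."""
    return _same_type_pairs(list(zip(a, b))) / _same_type_pairs(a)

def adjusted_rand(a, b):
    """Chance-corrected agreement between the two partitions (symmetric)."""
    n = len(a)
    sum_ab = _same_type_pairs(list(zip(a, b)))
    sum_a, sum_b = _same_type_pairs(a), _same_type_pairs(b)
    expected = sum_a * sum_b / (n * (n - 1) // 2)
    return (sum_ab - expected) / ((sum_a + sum_b) / 2 - expected)

# Invented typing results for six isolates under two hypothetical methods:
mlst = ["ST1", "ST1", "ST1", "ST2", "ST2", "ST3"]
pfge = ["A", "A", "B", "C", "C", "D"]
print(wallace(mlst, pfge))  # -> 0.5: half the MLST-concordant pairs agree in PFGE
```

The directional character of the Wallace coefficient is what gives a broader view than qualitative type correspondences: it answers "if two isolates share an MLST type, how often do they share a PFGE type?" separately from the converse question.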
The BiGGEsTS tool (Biclustering Gene Expression Time-Series) aims to integrate biclustering algorithms for the analysis of gene expression time series. These algorithms address the biclustering problem in time-series gene expression data directly, that is, they identify biclusters formed by a set of genes with coherent expression over a contiguous subset of the time points under analysis. The identified biclusters can then be visualized and studied in the application, along several dimensions of analysis, in order to single out those that are biologically relevant and may subsequently help identify regulatory modules. Although tools already exist that integrate biclustering algorithms for gene expression data in general, the development of a tool for the specific case of time series is novel, given the particular nature of the biclustering algorithms integrated and of the results obtained. In this seminar, the current version of the tool will be presented and directions for future work discussed.
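The core idea, genes with coherent discretized behaviour over a contiguous window of time points, can be sketched as follows (a naive enumeration for illustration only; the algorithms integrated in BiGGEsTS are far more efficient and do not literally work this way):

```python
from collections import defaultdict

def discretize(row, eps=0.1):
    """Map consecutive differences to U(p)/D(own)/N(o-change) symbols."""
    return "".join("U" if b - a > eps else "D" if a - b > eps else "N"
                   for a, b in zip(row, row[1:]))

def contiguous_biclusters(expr, min_genes=2, min_cols=2):
    """Group genes whose discretized patterns coincide on a contiguous
    window of time-point transitions (maximality is not enforced here)."""
    patterns = {g: discretize(r) for g, r in expr.items()}
    T = len(next(iter(patterns.values())))
    found = []
    for start in range(T):
        for end in range(start + min_cols, T + 1):
            groups = defaultdict(list)
            for g, p in patterns.items():
                groups[p[start:end]].append(g)
            for window, genes in groups.items():
                if len(genes) >= min_genes:
                    found.append((sorted(genes), start, end, window))
    return found

# Toy expression matrix: gA and gB rise, rise, then fall together.
expr = {"gA": [0, 1, 2, 1], "gB": [5, 6, 7, 3], "gC": [2, 1, 0, 4]}
print(contiguous_biclusters(expr))
```

The contiguity constraint is what distinguishes the time-series setting: arbitrary column subsets, as in general biclustering, would not correspond to temporally coherent regulatory behaviour.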
Ab Initio Protein Structure Prediction using Conformational Search and Information from Known Protein Structures
Submitted by aml on Sun, 02/10/2008 - 13:00.
Most protein folding methods use information from known proteins to predict protein structure. Homology and fold recognition methods use this information directly, and good results can be obtained if a sufficiently similar protein with known structure is found. However, if no such protein is available, or for large unmatched regions, ab initio methods can be of great help (especially for small proteins). Our method uses a fragment library and a search technique to create possible structures, from which a high-scoring set can then be analysed. The search alternates between testing for possible fragments and stochastically choosing one of them using a score based on current and previous search information. Backtracking is performed if no fragments are available. When a structure is completed, a score is calculated using frequencies of contacts and buried states derived from known proteins. The score information is saved for use in subsequent structure searches, and a new point in the search tree is stochastically chosen for constructing a new structure. The algorithm chooses points in previously constructed structures that had lower scores, trying to improve those structures.
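The search loop described above can be caricatured as follows; the fragment library, compatibility test, and scoring function are toy stand-ins for the real geometric checks and contact/burial statistics, and the exact score bookkeeping of the method is simplified:

```python
import random

random.seed(1)

# Toy fragment library: three candidate "fragments" per position.
LIB = {pos: ["a", "b", "c"] for pos in range(6)}

def compatible(partial, frag):
    # Toy constraint standing in for steric/geometry checks:
    return not (partial and partial[-1] == frag)

def score(structure):
    # Toy score standing in for contact/burial statistics (higher is better):
    return structure.count("a")

prefs = {}  # (position, fragment) -> preference learned from completed builds

def build():
    """Depth-first construction with stochastic, score-biased fragment
    choice and backtracking at dead ends."""
    partial, stack = [], [list(LIB[0])]
    while len(partial) < len(LIB):
        options = [f for f in stack[-1] if compatible(partial, f)]
        if not options:                 # dead end: undo the last choice
            stack.pop()
            partial.pop()
            continue
        weights = [1.0 + prefs.get((len(partial), f), 0.0) for f in options]
        frag = random.choices(options, weights)[0]
        stack[-1].remove(frag)          # mark as tried at this level
        partial.append(frag)
        stack.append(list(LIB.get(len(partial), [])))
    return partial

best = None
for _ in range(30):
    s = build()
    for pos, f in enumerate(s):         # bias later builds toward good choices
        prefs[(pos, f)] = prefs.get((pos, f), 0.0) + score(s)
    if best is None or score(s) > score(best):
        best = s
```

The essential ingredients the abstract describes are all present in miniature: stochastic choice weighted by accumulated search information, backtracking on dead ends, and scoring of completed structures that feeds back into the next search.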
This presentation will show the application of data mining techniques, in particular machine learning, for knowledge discovery in a protein database. The main problem we address is determining whether an amino acid is exposed or buried in a protein, for five exposure levels: 2%, 10%, 20%, 25% and 30%. First we introduce the baseline classifier for this problem which, although very simple (it only takes the amino acid type into account), already achieves good prediction results. Then we explain how, by building a local PDB database and retrieving DSSP and SCOP data, we construct our classifier to improve on the baseline prediction. Finally we test and compare several classifiers (Neural Networks, C5.0, CART and CHAID) and the parameters that might influence prediction accuracy, namely the level of information per amino acid, the SCOP class of the protein, and the neighbourhood of the current amino acid (i.e., the sliding window size).
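The baseline classifier can be sketched in a few lines (the residues and labels below are made up; the actual study trains on a local PDB database with DSSP-derived exposure labels): for each amino acid type, predict the majority label observed in training.

```python
from collections import Counter, defaultdict

def train_baseline(residues, labels):
    """Predict exposed/buried from the amino acid type alone,
    by majority vote over the training set."""
    votes = defaultdict(Counter)
    for aa, label in zip(residues, labels):
        votes[aa][label] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in votes.items()}

def accuracy(model, residues, labels, default="exposed"):
    hits = sum(model.get(aa, default) == y for aa, y in zip(residues, labels))
    return hits / len(labels)

# Toy training data: hydrophobic residues tend to be buried.
train_res = ["LEU", "LEU", "VAL", "LYS", "LYS", "SER", "LEU", "SER"]
train_lab = ["buried", "buried", "buried", "exposed", "exposed",
             "exposed", "exposed", "exposed"]
model = train_baseline(train_res, train_lab)
print(model["LEU"])  # -> 'buried' (majority label for leucine)
```

The richer classifiers compared in the talk extend exactly this setup: instead of the amino acid type alone, each example carries additional features such as the SCOP class and a sliding window of neighbouring residues.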