INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa


Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa

A Data Mining Approach for the detection of High-Risk Breast Cancer Groups

04/09/2010, 14:00 - 15:00

It is widely agreed that complex diseases are typically caused by the joint effects of multiple genetic variations rather than a single one. These variations may show very little effect individually but a strong effect when they occur jointly, a phenomenon known as epistasis or multilocus interaction. In this seminar, we explore the applicability of decision trees to this problem. A case-control study was performed with 164 controls and 94 cases, using 32 SNPs from the BRCA1, BRCA2 and TP53 genes together with information on tobacco and alcohol consumption. We used a decision tree to find a group with high susceptibility to breast cancer; the goal was to find one or more leaves with a high percentage of cases and a small percentage of controls. Permutation tests were used to statistically validate the association found. We found a high-risk breast cancer group composed of 13 cases and only 1 control, with a Fisher exact test p-value of 9.7 × 10^-6. After running 10,000 permutations, we obtained a p-value of 0.017. These results show that it is possible to find statistically significant associations with breast cancer by deriving a decision tree and selecting the best leaf.
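The two validation steps in the abstract can be illustrated with a minimal Python sketch. It is not the speakers' implementation. It assumes a one-sided Fisher exact test (the abstract does not say whether the test was one- or two-sided) and that every subject outside the leaf falls in the rest of the tree, so the 2x2 table is built from the reported totals. The function names, the `best_leaf_stat` callback, and the `(hits + 1) / (n_perm + 1)` estimator are illustrative choices, not details from the seminar, so the output need not match the reported values exactly.

```python
import random
from math import comb


def fisher_one_sided(cases_in, controls_in, cases_total, controls_total):
    """One-sided Fisher exact p-value: probability of seeing at least
    `cases_in` cases in a leaf of this size under the hypergeometric null."""
    n_total = cases_total + controls_total
    leaf_size = cases_in + controls_in
    upper = min(leaf_size, cases_total)
    tail = sum(
        comb(cases_total, k) * comb(controls_total, leaf_size - k)
        for k in range(cases_in, upper + 1)
    )
    return tail / comb(n_total, leaf_size)


def permutation_p_value(labels, best_leaf_stat, n_perm=10000, seed=0):
    """Permutation p-value for a 'best leaf' statistic.

    `best_leaf_stat` is a hypothetical callback: given a list of
    case/control labels, it should re-derive the tree and return the
    statistic of its best leaf (smaller = more extreme, e.g. a p-value).
    """
    rng = random.Random(seed)
    observed = best_leaf_stat(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any label/genotype association
        if best_leaf_stat(shuffled) <= observed:
            hits += 1
    # Add-one estimator (an assumed convention) so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# Leaf reported in the abstract: 13 cases and 1 control,
# out of 94 cases and 164 controls overall.
p_leaf = fisher_one_sided(13, 1, 94, 164)
print(f"one-sided Fisher p-value for the leaf: {p_leaf:.2e}")
```

Re-deriving the tree inside each permutation is what makes the second p-value larger than the first: it accounts for the fact that the best leaf was selected after looking at the data.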