INESC-ID   Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed


Knowledge Discovery and Bioinformatics
Inesc-ID Lisboa

Dotted Suffix Trees: A Structure for Approximate Text Indexing

11/18/2005 - 14:30
11/18/2005 - 15:30

The problem we address is text indexing for approximate matching. We consider that we are given a text T which undergoes some preprocessing to generate an index. We can later query this index to identify the places where a string occurs up to a certain number of allowed substitutions k. We present a structure for indexing which occupies space O(n log^k n) in the average case, independent of alphabet size, n being the text size. This structure support searching in O(m^{k+1}) time, for patterns of size m, again independent of alphabet size. The construction of the structure has time bound by O(kN|S|), where N is the number of nodes in the index and |S| the alphabet size.