Journal of Software, Vol 6, No 3 (2011), 386-394, Mar 2011
doi:10.4304/jsw.6.3.386-394

A New Model for Finding Approximate Tandem Repeats in DNA Sequences

Qingshan Jiang, Sheng Li, Shun Guo, Dan Wei

Abstract


In gene analysis, finding approximate tandem repeats in DNA sequence is an important issue. SUA_SATR is one of the latest methods for finding those repetitions, which suffers deficiencies of runtime cost and poor result quality. In order to detect approximate tandem repeats in genomic sequences more efficiently, we propose a new model based on a novel algorithm MSATR and an optimized algorithm mMSATR in this paper. The model uses the Motif-Divide method to improve the performance, which results in the proposal of algorithm MSATR. By introducing the definition of CASM to reduce the searching scope and optimizing the original mechanism adopted by MSATR, the mMSATR algorithm makes the detecting process more efficient and improves the result quality. The theoretical analysis and experiment results indicate that MSATR and mMSATR is able to get more results within less runtime. These algorithms are superior to other methods in finding results, and it greatly reduces the runtime cost, which is of benefit when the gene data becomes larger.


Keywords


DNA sequence mining; approximate tandem repeat; motif-similarity

References


[1] NM. Luscombe, D. Greenbaum, M. Gerste. What is bioinformatics? A proposed definition and overview of the field. Methods Information in Medicine, 2001, 40(4):346−358.
PMid:11552348

[2] Guo Shun, Guan Heshan, Jiang Qingshan. A novel algorithm for finding approximate tandem repeats in DNA sequences. Journal of Computer Research and Development. 2008, 45(Suppl.):175-179 (in Chinese).

[3] Zhu Yangyong, Xiong Yun. DNA Sequence Data Mining Technique. Journal of Software, 2007, 18(11): 2766-2781(in Chinese).
doi:10.1360/jos182766

[4] Y. Li, A. Korol, T. Fahima, A. Beiles, and E. Nevo. Microsatellites: genomic distribution, putative functions and mutational mechanisms. Molecular Ecology, 2002, 11(12): 2453 -2465.
doi:10.1046/j.1365-294X.2002.01643.x
PMid:12453231

[5] Y. Kashi, D. King, and M. Soller. Simple sequence repeats as a source of quantitative genetic variation. Trends in Genetics, 1997, 13(2): 74–78.
doi:10.1016/S0168-9525(97)01008-1

[6] S.Beleza, C.Alves,A. Gonzalez-Neira. Extending STR markers in Y chromosome haplotypes. International Journal of Legal Medicine, 2003, 117(1):27-33.
PMid:12592592

[7] S.Gilmore, R. Peakall, J.Robertson, Short tandem repeat (STR) DNA markers are hypervariable and informative in Cannabis sativa: implications for forensic investigations. Forensic science international, 2003, 131(1): 65-74.
doi:10.1016/S0379-0738(02)00397-3

[8] A. Apostolico and F. Prefarata. Optimal off-line detection of repetitions in a string. Theoretical Computer Science, 1983, 22(3):297–315.
doi:10.1016/0304-3975(83)90109-3

[9] Kolpakov R, Kucherov G. Finding maximal repetitions in a word in linear time. In: Proc. of the 1999 Symposium On Foundations of Computer Science. Washington: IEEE Computer Society, 1999. 596−604.

[10] M. Crochemore. An optimal algorithm for computing the repetitions in a word. Information Processing Letters, 1981, 12(5):244–250.
doi:10.1016/0020-0190(81)90024-7

[11] M. Main and R. Lorentz. An O(nlogn) algorithm for finding all repetitions in a string. Journal of Algorithms, 1984, 5(3):422–432.
doi:10.1016/0196-6774(84)90021-X

[12] Delgrange O, Rivals E. STAR: An algorithm to search for tandem approximate repeats. Bioinformatics, 2004, 20(16):2812−2820.
doi:10.1093/bioinformatics/bth335
PMid:15180940

[13] G. Benson. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acid Research, 1999, 27(2):573–580.
doi:10.1093/nar/27.2.573
PMid:9862982    PMCid:148217

[14] S. Kurtz, JV. Choudhuri, E. Ohleb usch, C.Schleiermacher, J. Stoye, R. Giegerich. REPuter: The manifold applications of repeat analysis on a genomic scale. Nucleic Acid Research, 2001, 29(22): 4633–4642.
doi:10.1093/nar/29.22.4633
PMid:11713313    PMCid:92531

[15] Wang D, Wang G, Wu QQ, Chen BC. Finding LPRs in DNA sequence based on a new index SUA [C]. In: Proc. of the IEEE 5th Symp. On Bioinformatics and Bioenginerering (BIBE 2005). Washington: IEEE Computer Science, 2005. 281−284.

[16] Li Sheng, Jiang Qingshan, Wei Dan. An optimized algorithm for finding approximate tandem repeats in DNA sequences. Second International Workshop on Education Technology and Computer Science. Wuhan, 2010. 68-71.

[17] Wang D, Zhao Y, Chen BC, Wang GR. SUA-Based algorithm for finding SATRs in DNA sequence. Journal of Northeastern University (Natural Science), 2007, 28(2):209−212 (in Chinese).


Full Text: PDF


Journal of Software (JSW, ISSN 1796-217X)

Copyright @ 2006-2012 by ACADEMY PUBLISHER – All rights reserved.