A Library for Computational Biology Programs

    Contents

      A Program for Approximate Tandem Repeat 

      The Matcher

      The Linguistic Complexity Program


    A Program for Approximate Tandem Repeat

    Introduction

    In recent years much work has been invested into developing computer algorithms to facilitate researches in the field of molecular biology. A large focus has been on string algorithms, since the data for molecular genetics is naturally represented as sequences of characters.
    Multiple repeat occur frequently in both DNA and protein sequences. Some multiple repeats have been associated with human genetic diseases.
    Another important application for finding multiple repeats in biological sequences is related to the multiple sequence alignment. Producing multiple alignments becomes very complicated when the sequences to be aligned contain multiple repeat, because matches may be present in numerous places. As a precursor to multiple alignment, it is helpful to recognize all multiple repeat within the set of strings that must be aligned.

    The Program For Approximate Tandem Repeat finds all similar substrings within a sequence S with up to k errors.

    References

      Gad M.Landau ,Jeanette P.Schmidt & Dina Sokol,An Algorithm for Approximate Tandem Repeat.

      Related papers.



    back to top

    The Matcher

    Introduction

    The program Matcher was developed to analyze whether a protein (or DNA sequence) contains a periodically repeated pattern of amino acids (or nucleotides). It has been adapted to be used to determine whether a given protein contains the 7 residue periodicity (a,b,c,d,e,f,g)n associated with coiled-coil proteins.
    The program can be used to give the periodic alignment of a given protein, or be used to analyze an entire file of amino acid sequences. In the latter case, the program will first rank the sequences with respect to their potential alignment to (a,b,c,d,e,f,g)n , without giving the alignment.


    References

      V.A. Fischetti,Gad M.Landau ,Jeanette P.Schmidt & P. Sellers, Identifying periodic occurrences of a template with applications to protein structure, Information Processing Letters 45, pp. 11-18, 1993.

      V.A. Fischetti, V. Pancholi, P. Sellers, J. Schmidt,Gad M.Landau, X. Xu & O. Schneewind, Streptococcal M protein: A common structural motif used by Gram-positive bacteria for biological active surface molecules, Molecular Recognition in Host-Parasite Interactions: Mechanisms in viral, bacterial and parasite infections, Published by Plenum Publishing, in print.


    back to top

    Linguistic Complexity

    Introduction

    Let 'S' be a finite string over a finite alphabet 'Σ', (the most typical case is a DNA sequence, where the alphabet is {A, C, G, T}). We define the linguistic complexity of 'S' as ratio between the number of substrings (of lengths 1 to |S|) that actually appear in 'S' and the number of all possible substrings (of length 1 to |S|) over 'Σ'. I.e., as the number of unique substrings in 'S' rises (which also means that the number of substring repetitions in 'S' diminishes) the complexity of 'S' rises. 

    The linguistic complexity of an entire string (DNA sequence) is usually meaningless. More interesting is the linguistic complexity of certain areas of a sequence in comparison with others. 

    The program uses Larsson's (see [2]) sliding window procedures and calculates the linguistic complexity of each window. The size of the window is determined by the user and so is the alphabet size, thus, the program isn't bound to analyzing DNA sequences alone. The program's running time is linear in the length of the input sequence (independant of window size). 

    Motivation (taken from [1]): Genomic DNA contains a variety of repeats: high-copy-number short tandem repeats, low-copy-number tandem repeats and interspersed repetitive DNA sequences. These repeats often represent important control regions, such as upstream promoter sequences, and are often non-consecutive and sometimes not exact. The above program detects practically all such sequences.

     

    References

    [1] Olga G. Troyanskaya, Gad M.Landau, Alexander Bolshoy,Ora Arbell, Yair Koren, Sequence Complexity Profiles of Prokaryotic Genomic Sequences: A fast Algorithm for Calculating Linguistic Complexity.

    [2] N. Jesper Larsson (doctoral dissertation) - Structures of String Matching and Data Compression, (Department of Computer Science, Lund University). home page: larsson.dogma.net

      The program (which uses larsson's (see [2]) sliding window scheme) was written by Yair Koren (yairk99@netvision.net.il).

      The user interface of  the linguistic complexity program was created by Yossi Atas and Michael Gorin.


    back to top



      created by
    Mika Shoshani