The Linguistic
Complexity Program
In recent years much work has been invested into developing computer algorithms to facilitate researches in the field of molecular biology.
A large focus has been on string algorithms, since the data for molecular genetics is naturally represented as sequences of characters.
Multiple repeat occur frequently in both DNA and protein sequences. Some multiple repeats have been associated with human genetic diseases.
Another important application for finding multiple repeats in biological sequences
is related to the multiple sequence alignment.
Producing multiple alignments becomes very complicated when the sequences to be aligned contain multiple repeat, because matches may be
present in numerous places. As a precursor to multiple alignment, it is helpful to recognize all multiple repeat
within the set of strings that must be aligned.
The program Matcher was developed to analyze whether a protein
(or DNA sequence) contains a periodically repeated pattern of amino acids (or nucleotides).
It has been adapted to be used to determine whether a given protein contains the
7 residue periodicity
associated with coiled-coil proteins.
The program can be used to give the periodic alignment of a given protein,
or be used to analyze an entire file of amino acid sequences. In the
latter case, the program will first rank the sequences with respect to
their potential alignment to , without giving
the alignment.
Let 'S' be a finite string over a finite alphabet 'Σ', (the most typical case is a DNA sequence, where the alphabet is {A, C, G, T}). We define the linguistic complexity of 'S' as ratio between the number of substrings (of lengths 1 to |S|) that actually appear in 'S' and the number of all possible substrings (of length 1 to |S|) over 'Σ'. I.e., as the number of unique substrings in 'S' rises (which also means that the number of substring repetitions in 'S' diminishes) the complexity of 'S' rises.
The linguistic complexity of an entire string (DNA sequence) is usually meaningless. More interesting is the linguistic complexity of certain areas of a sequence in comparison with others.
The program uses Larsson's (see [2]) sliding window procedures and calculates the linguistic complexity of each window. The size of the window is determined by the user and so is the alphabet size, thus, the program isn't bound to analyzing DNA sequences alone. The program's running time is linear in the length of the input sequence (independant of window size).
Motivation (taken from [1]): Genomic DNA contains a variety of repeats: high-copy-number short tandem repeats, low-copy-number tandem repeats and interspersed repetitive DNA sequences. These repeats often represent important control regions, such as upstream promoter sequences, and are often non-consecutive and sometimes not exact. The above program detects practically all such sequences.
[2] N. Jesper Larsson (doctoral dissertation)
- Structures of String Matching and Data Compression, (Department of Computer
Science, Lund University). home page: larsson.dogma.net
The program (which uses larsson's (see [2])
sliding window scheme) was written by Yair Koren (yairk99@netvision.net.il).
The user interface of the linguistic complexity program was
created by Yossi Atas and Michael
Gorin.
back to top