Our approach
-
Classic motif finding algorithms only consider sequences that are known (or believed) to contain the common patterns (call it the set T). This may lead to error in finding patterns that appear repeatedly in all the sequences, therefore leading to poor or inaccurate results. Our approach overcomes this problem by introducing an extra input to the problem -- the background sequences (call it the set set F). Our algorithm search for patterns that frequently appears in the set T but rarely appear in the set F.
- Sometimes, the pattern that we are searching for may appear more than once in a DNA sequence. Our algorithm consider that the sequences with repeated patterns.
Problem
Definition
Given two sets of DNA sequences T and F, where T contain sequences that
are believed to contain the motifs and F as a set of background
sequences. Find those short patterns (motif)
occur frequently in set T but
rarely occur in set F.
To
be precise, given T and F as defined previously,
we want to find a probability matrix M (presenting a motif
pattern) and a threshold α, such that the p-value
is minimized.
The EM-like Approach
ALSE finds
out those probability matrices and thresholds
which result in the lowest p-value
by applying an EM-like iterations:
- Find a set of patterns (represented as
a probability matrix) with the lowest p-values,
known as the set Seeds (say
100).
- For each M0 (with
corresponding threshold α0) in
the set Seeds,
refine it to a matrix M1 (with α1)
such that M1 yieids
a lower probabilty p-value(M1,α1)
- Continue Step 2 until there is no improvement
in the p-value, or the maximum number of iterations
is reached.
- Output the motifs with smallest p-values
(say 50).
|