ALSE - Motif Finding Tool

Our approach

Classic motif finding algorithms only consider sequences that are known (or believed) to contain the common patterns (call it the set T). This may lead to error in finding patterns that appear repeatedly in all the sequences, therefore leading to poor or inaccurate results. Our approach overcomes this problem by introducing an extra input to the problem -- the background sequences (call it the set set F). Our algorithm search for patterns that frequently appears in the set T but rarely appear in the set F.
Sometimes, the pattern that we are searching for may appear more than once in a DNA sequence. Our algorithm consider that the sequences with repeated patterns.

Problem Definition

Given two sets of DNA sequences T and F, where T contain sequences that are believed to contain the motifs and F as a set of background sequences. Find those short patterns (motif) occur frequently in set T but rarely occur in set F.

To be precise, given T and F as defined previously, we want to find a probability matrix M (presenting a motif pattern) and a threshold α, such that the p-value is minimized.

The EM-like Approach

ALSE finds out those probability matrices and thresholds which result in the lowest p-value by applying an EM-like iterations:

Find a set of patterns (represented as a probability matrix) with the lowest p-values, known as the set Seeds (say 100).
For each M₀ (with corresponding threshold α₀) in the set Seeds, refine it to a matrix M₁ (with α₁) such that M₁ yieids a lower probabilty p-value(M₁,α₁)
Continue Step 2 until there is no improvement in the p-value, or the maximum number of iterations is reached.
Output the motifs with smallest p-values (say 50).

ALSE Algorithm - An EM-like approach

Finding Common Patterns

Motif Finding Problem

Our Algorithm

Our approach

Problem Definition

The EM-like Approach