Expectation-Maximization for Motif Finding

C# Implementation December 2007 by Richard H. Simon, M.D.

The program has 2 phases:

Phase 1 - Set up the test model

Overview: You will set up a test model of up to 20 random DNA sequences, each 80 bases long. Somewhere in each sequence you will embed a motif 4 to 12 bases long. You start by specifying the base sequence of your canonical motif, then choose how many copies of that motif (up to 20) you wish to test the algorithm against. Only one motif per 80-base DNA sequence is allowed, with NO GAPS, since we would otherwise need a Cray or other supercomputer.

Using a grid with drop-downs, you can then alter any of the bases in each motif to make it vary from the canonical motif. The short sequence of bases, representing either the canonical motif or some variant that you have created, together with its starting location within the random DNA sequence, is called a 'site'.

You must also determine the starting position of each site within its own larger DNA sequence. The set-up program assigns these positions randomly, but you can then alter the starting position of any site if you wish.
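
As an illustration of the set-up phase, here is a minimal Python sketch (not the program's C# source). It assumes 0-based start positions and a uniform random background; the names build_test_model and motif_variants are hypothetical and chosen only for this example.

```python
import random

BASES = "ACGT"

def build_test_model(motif_variants, seq_len=80, rng=None):
    """Embed each motif variant at a random start in its own random DNA sequence.

    Returns a list of (sequence, start) pairs, one per variant.
    Start positions are 0-based (an assumption of this sketch).
    """
    rng = rng or random.Random()
    model = []
    for variant in motif_variants:
        # Random background of seq_len bases.
        background = [rng.choice(BASES) for _ in range(seq_len)]
        # Pick a start so that the whole site fits, with no gaps.
        start = rng.randrange(seq_len - len(variant) + 1)
        background[start:start + len(variant)] = variant
        model.append(("".join(background), start))
    return model

# Three sites: the canonical motif twice, plus one hand-altered variant.
variants = ["GATTACA", "GATTACA", "GACTACA"]
for sequence, start in build_test_model(variants, rng=random.Random(1)):
    print(start, sequence)
```

Passing a seeded random.Random makes the model reproducible, which is convenient when comparing runs of the algorithm.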

Phase 2 - Run the EM algorithm

The job of the algorithm is to discover the motif that you have just set up. To do that, it must start from an assumption about the base composition of the site in each sequence. The model will assume that the correct composition can be represented by the sites that start at position 35 in every sequence (you can change that start position). Alternatively, you can ask the program to choose random sites for your initial composition assumption.
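
One way to picture the initial assumption is as a table of base frequencies, one column per motif position, counted from the window that starts at position 35 in every sequence. The Python sketch below is illustrative only: the function name, the 0-based indexing, and the pseudocount (added so that no base starts with probability zero) are assumptions of this example, not details taken from the program.

```python
def initial_composition(sequences, width, start=35, pseudocount=1.0):
    """Estimate per-position base frequencies from the window at `start`.

    Returns a list of `width` dicts mapping each base to its frequency.
    """
    counts = [{base: pseudocount for base in "ACGT"} for _ in range(width)]
    for seq in sequences:
        window = seq[start:start + width]
        for i, base in enumerate(window):
            counts[i][base] += 1
    # Normalize each column so that its four frequencies sum to 1.
    return [
        {base: c / sum(col.values()) for base, c in col.items()}
        for col in counts
    ]
```

To start from random sites instead, the same counting could be applied to a randomly chosen window in each sequence.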

Now, obviously, your starting base composition will not look anything like the true composition, but the power of the EM algorithm is that it will refine the estimate until it converges on the true base composition of the sites.

Ideally it will find the motif that you created in the model set-up.

You will run the algorithm in alternating steps:

1. You will step through the computation of the expectation of the motif composition
2. You will maximize your estimate, then start over with a new expectation

The procedure fails unless there is some overlap of at least 2 sites.
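
The two alternating steps can be sketched in Python as follows. This is a common textbook formulation (exactly one site per sequence, uniform background of 0.25 per base, a small pseudocount in the maximization step) and is offered as an assumption about the general technique, not as a transcription of the program's C# code. The names e_step, m_step, run_em and pwm are hypothetical.

```python
BASES = "ACGT"

def e_step(sequences, pwm, background=0.25):
    """For each sequence, return the probability of every possible site start."""
    width = len(pwm)
    posteriors = []
    for seq in sequences:
        scores = []
        for start in range(len(seq) - width + 1):
            # Likelihood ratio of this window: motif model versus background.
            ratio = 1.0
            for i in range(width):
                ratio *= pwm[i][seq[start + i]] / background
            scores.append(ratio)
        total = sum(scores)
        posteriors.append([s / total for s in scores])
    return posteriors

def m_step(sequences, posteriors, width, pseudocount=0.1):
    """Re-estimate the composition from windows weighted by their probability."""
    counts = [{base: pseudocount for base in BASES} for _ in range(width)]
    for seq, post in zip(sequences, posteriors):
        for start, weight in enumerate(post):
            for i in range(width):
                counts[i][seq[start + i]] += weight
    return [
        {base: c / sum(col.values()) for base, c in col.items()}
        for col in counts
    ]

def run_em(sequences, pwm, iterations=50):
    """Alternate expectation and maximization for a fixed number of rounds."""
    for _ in range(iterations):
        posteriors = e_step(sequences, pwm)
        pwm = m_step(sequences, posteriors, len(pwm))
    return pwm, e_step(sequences, pwm)
```

Here pwm is a list with one dict per motif position, mapping each base to its current estimated frequency. After the final round, the most probable start in each sequence is the algorithm's guess at where the site lies.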

SETUP

Enter the base sequence (4 to 12 bases) for the first site, the canonical motif (Upper case A,C,G,T only)
GARBAGE ALERT! Can only be A,C,G,T (upper case only)
How many sites do you wish to enter? 1 to 20
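
The checks behind these prompts can be sketched in Python as below. The function names are hypothetical; the limits (4 to 12 upper-case bases, 1 to 20 sites) are the ones stated in the prompts above.

```python
import re

# Exactly 4 to 12 characters, each an upper-case A, C, G or T.
MOTIF_PATTERN = re.compile(r"[ACGT]{4,12}")

def is_valid_motif(text):
    """True if text is an acceptable canonical motif."""
    return MOTIF_PATTERN.fullmatch(text) is not None

def is_valid_site_count(text):
    """True if text is a whole number from 1 to 20."""
    return text.isascii() and text.isdigit() and 1 <= int(text) <= 20
```

For example, is_valid_motif("GATTACA") is accepted, while "gattaca", "GAT" and "GATTACAN" would each trigger the garbage alert.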