Sample input files for Ron Shamir's CS workshop 0368-3500-07 (Fall 2007-8)
Input #1: input1.fasta
The first input file is small - 100 sequences of length 200 (the real
input files will be much larger, of course).
It contains a pair of simple (string) motifs of length 6 with a significant order bias:
motif A: ACCTTT , motif B: GGGAAG
There should be 15 occurrences of the pair A->B (i.e., A upstream of B)
with a gap of 10-20 bases between them, plus
5 additional occurrences on the reverse-complement strand.
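As a sanity check on this description, ordered pairs can be counted with a few
lines of Python. This is only a sketch, not the workshop's reference method. It
assumes the gap is the number of bases between the last base of A and the first
base of B, and that a reverse-complement occurrence is an A->B pair found in the
reverse complement of the sequence. The function names and the toy sequence are
made up for illustration; no FASTA parsing is shown.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(seq, motif):
    """Return the start of every (possibly overlapping) occurrence of motif."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def count_ordered_pairs(seq, motif_a, motif_b, min_gap, max_gap):
    """Count A->B pairs in seq whose gap lies in [min_gap, max_gap].

    Assumption: the gap is measured from the end of A to the start of B.
    """
    b_starts = find_all(seq, motif_b)
    count = 0
    for a_start in find_all(seq, motif_a):
        a_end = a_start + len(motif_a)
        count += sum(1 for b in b_starts if min_gap <= b - a_end <= max_gap)
    return count


# Toy sequence (not taken from input1.fasta): one A->B pair with a gap of 12.
toy = "TT" + "ACCTTT" + "C" * 12 + "GGGAAG" + "TT"
plus = count_ordered_pairs(toy, "ACCTTT", "GGGAAG", 10, 20)
minus = count_ordered_pairs(reverse_complement(toy), "ACCTTT", "GGGAAG", 10, 20)
```

Summing `plus` and `minus` over all sequences of a file gives per-strand totals
that can be compared with the counts quoted above, under the stated assumptions.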
Input #2: input2.fasta
The second sample is larger - 2,000 sequences of length 1,000.
It contains several pairs of motifs of length 8 with an order bias:
- A pair of string motifs:
motif A: TAAAAAAT , motif B: CCCCGGGG
This pair appears on both strands with a gap of length 20-50.
There are 54 occurrences of the pair A->B with the above gap lengths,
and 13 occurrences of the reverse order B->A.
- A pair of consensus motifs:
motif A: TA[ACT]AA[AG]AT , motif B: GGAA[AT]TTT
This pair appears only on the "+" (i.e., original) strand with a gap of length 10-30.
There are 160 occurrences of the pair A->B.
This pair is also localized (i.e., its hits aren't distributed uniformly along the sequences).
- Another pair of consensus motifs:
motif A: GAGA[CG][AT]CC , motif B: CTATACC[CG]
This pair appears on both strands with a gap of length 40-45 (the sequence of the gap is quite conserved).
There are roughly 370 occurrences of the pair A->B.
A third motif (AACGTTCC) appears 60-80 bases upstream of this pair (only on the "+" strand).
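The consensus motifs above use bracket notation that is also valid regular
expression syntax, so they can be searched for directly with Python's re module.
The sketch below is illustrative only. It assumes the gap is measured from the
end of motif A to the start of motif B, and that the "60-80 bases upstream"
distance runs from the end of the third motif to the start of motif A (the text
above does not pin either down). The function names and the toy sequence are
hypothetical.

```python
import re


def pair_hits(seq, motif_a, motif_b, min_gap, max_gap):
    """Return (a_start, b_start) for every A->B pair with an allowed gap."""
    # A lookahead with a capturing group also reports overlapping matches.
    a_re = re.compile("(?=(" + motif_a + "))")
    b_re = re.compile("(?=(" + motif_b + "))")
    b_starts = [m.start() for m in b_re.finditer(seq)]
    hits = []
    for m in a_re.finditer(seq):
        a_end = m.start() + len(m.group(1))
        for b in b_starts:
            if min_gap <= b - a_end <= max_gap:
                hits.append((m.start(), b))
    return hits


def with_upstream_motif(seq, motif_c, pairs, min_dist, max_dist):
    """Keep the pairs that have motif_c ending min_dist..max_dist bases before A."""
    c_re = re.compile("(?=(" + motif_c + "))")
    c_ends = [m.start() + len(m.group(1)) for m in c_re.finditer(seq)]
    return [(a, b) for a, b in pairs
            if any(min_dist <= a - c_end <= max_dist for c_end in c_ends)]


# Toy sequence (not taken from input2.fasta): third motif, 70 bases,
# motif A, a gap of 42 bases, motif B.
toy = "AACGTTCC" + "T" * 70 + "GAGACACC" + "T" * 42 + "CTATACCG" + "TT"
pairs = pair_hits(toy, "GAGA[CG][AT]CC", "CTATACC[CG]", 40, 45)
supported = with_upstream_motif(toy, "AACGTTCC", pairs, 60, 80)
```

For the pairs that appear on both strands, the same functions can be run on the
reverse complement of each sequence, leaving the patterns unchanged.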