Some genes in DNA sequences are very short, for example 15 bases, which is too short to sequence directly. So we add adapters at the 5' and 3' ends to extend them to 34 bases. We also have a huge pool of sequences from different sources, which we mix together for sequencing, so the adapter doubles as a tag that marks the source of each gene.
Here I want to briefly explain my algorithm. I will use this sequence as an example: CATTCATGGACGTTGATAAGATCTTTCGTATGC. It comes from a big pool of 5 million gene segments, which cost us $3,000 to sequence.
The 5' adapter can be one of the four combinations: AC, CA, GT or TG. The 3' adapter is TCGTATGCCGTCTTCTGCTTG.
Step 1: Classifying sequences by the first 2 bases at the 5' end
Let CA be class 1. The example sequence starts with CA, so it belongs to class 1, and it occurs 901 times in the sequence pool.
Sequences that don't start with any of the 5' adapters are put into Trash 1.
Step 2: Truncating the first 2 bases at the 5' end
So now the sequence becomes: TTCATGGACGTTGATAAGATCTTTCGTATGC
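Steps 1 and 2 can be sketched in a few lines of Python. This is only a rough sketch, not my production code: the function name `split_five_prime` is made up for illustration, and the adapter itself is used as the class label because only "CA = class 1" is defined above.

```python
# The four 5' adapters listed above.
FIVE_PRIME_ADAPTERS = ("AC", "CA", "GT", "TG")


def split_five_prime(read):
    """Return (adapter, remainder) for a read.

    If the read does not start with a known 5' adapter, return
    (None, read), meaning the read belongs in Trash 1.
    """
    prefix = read[:2]
    if prefix in FIVE_PRIME_ADAPTERS:
        return prefix, read[2:]  # Step 2: drop the first 2 bases
    return None, read


adapter, rest = split_five_prime("CATTCATGGACGTTGATAAGATCTTTCGTATGC")
print(adapter, rest)  # CA TTCATGGACGTTGATAAGATCTTTCGTATGC
```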
Step 3: Searching for and truncating the 3' adapter at the 3' end of the sequence
We set a sliding window of length i, between 3 and 15. The initial value of i is 3.
We compare the last i bases of the sequence with the first i bases of the 3' adapter. If they are identical, we truncate the last i bases of the sequence. If they are not identical, we increase i by 1 and repeat the comparison.
3.1 If they are identical before i increases to 15, then truncate the last i bases of the sequence.
3.2 If they only become identical after i reaches 15, then dump this sequence to Trash 2.
3.3 If they are still not identical after the window length has grown to the full length of the 3' adapter (21 bases), dump this sequence to Trash 3.
In the example, the first match is at i = 8 (TCGTATGC), so the last 8 bases are truncated and the sequence becomes: TTCATGGACGTTGATAAGATCTT
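Here is a minimal sketch of the Step 3 search. The name `trim_three_prime` and the status strings are hypothetical. The boundary is my reading of 3.1 and 3.2: a match at i <= 15 is trimmed, and a match at 15 < i <= 21 goes to Trash 2. Treat that cutoff as an assumption.

```python
THREE_PRIME_ADAPTER = "TCGTATGCCGTCTTCTGCTTG"  # 21 bases
MIN_WINDOW = 3
MAX_TRIM_WINDOW = 15  # assumed cutoff between 3.1 and 3.2


def trim_three_prime(seq, adapter=THREE_PRIME_ADAPTER):
    """Return (status, sequence); status is 'trimmed', 'trash2' or 'trash3'."""
    # The window cannot be longer than the adapter or the sequence itself.
    upper = min(len(adapter), len(seq))
    for i in range(MIN_WINDOW, upper + 1):
        # Compare the last i bases of the read with the first i of the adapter.
        if seq[-i:] == adapter[:i]:
            if i <= MAX_TRIM_WINDOW:
                return "trimmed", seq[:-i]  # case 3.1
            return "trash2", seq            # case 3.2
    return "trash3", seq                    # case 3.3: no match at any length


print(trim_three_prime("TTCATGGACGTTGATAAGATCTTTCGTATGC"))
# ('trimmed', 'TTCATGGACGTTGATAAGATCTT')
```

Because the window grows from 3 upwards, the shortest matching suffix wins. For the example read that is the 8-base match TCGTATGC.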
Step 4: Making sure no segment of the 3' adapter remains elsewhere in the sequence
Search the remaining sequence for the first 9 bases of the 3' adapter (TCGTATGCC). If they are not found, leave the sequence alone. Otherwise, dump the sequence to Trash 4.
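The Step 4 check is a plain substring search. A small sketch, with `has_internal_adapter` as a made-up helper name:

```python
THREE_PRIME_ADAPTER = "TCGTATGCCGTCTTCTGCTTG"
ADAPTER_SEED = THREE_PRIME_ADAPTER[:9]  # first 9 bases: TCGTATGCC


def has_internal_adapter(seq):
    """True if the 9-base adapter seed occurs anywhere in seq (Trash 4)."""
    return ADAPTER_SEED in seq


print(has_internal_adapter("TTCATGGACGTTGATAAGATCTT"))  # False
```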
Step 5: Wrapping up
If a sequence hasn't been put into any of the trash pools, we call it a "useful sequence".
Sum up the occurrences of the sequences in each pool (Trash 1, Trash 2, Trash 3, Trash 4 and useful sequences) separately. Note that different sequences occur different numbers of times.
- There are 256,209 sequences that are useful, accounting for 76%.
- 30,576 sequences don't match any 5' adapter (Trash 1), accounting for 9%.
- 34,900 sequences are in Trash 2 and 12,032 sequences are in Trash 4, accounting for 10% and 3% respectively.
- There are no sequences in Trash 3.
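Putting the steps together, the per-pool sums could be computed roughly as below. This is a sketch under assumptions: the input is taken to be a dict mapping each distinct read to its occurrence count, `classify` and `tally` are hypothetical names, and i <= 15 is assumed to be the trimming cutoff. Only the first toy read comes from this post; the other two are made up to exercise Trash 1 and Trash 4.

```python
from collections import Counter

FIVE_PRIME_ADAPTERS = ("AC", "CA", "GT", "TG")
THREE_PRIME_ADAPTER = "TCGTATGCCGTCTTCTGCTTG"


def classify(read):
    """Return the pool name for one read, following Steps 1 to 4."""
    if read[:2] not in FIVE_PRIME_ADAPTERS:
        return "trash1"
    seq = read[2:]
    upper = min(len(THREE_PRIME_ADAPTER), len(seq))
    for i in range(3, upper + 1):
        if seq[-i:] == THREE_PRIME_ADAPTER[:i]:
            if i > 15:          # assumed cutoff for case 3.2
                return "trash2"
            seq = seq[:-i]      # case 3.1: trim the adapter
            break
    else:
        return "trash3"         # loop ended without any match
    if THREE_PRIME_ADAPTER[:9] in seq:
        return "trash4"
    return "useful"


def tally(read_counts):
    """Sum occurrence counts per pool."""
    totals = Counter()
    for read, count in read_counts.items():
        totals[classify(read)] += count
    return totals


toy = {
    "CATTCATGGACGTTGATAAGATCTTTCGTATGC": 901,  # example from this post
    "AATTCATGGACGTTGATAAGATCTTTCGTATGC": 5,    # made up: no 5' adapter
    "GTTCGTATGCCAAAATCGTATGC": 2,              # made up: internal adapter
}
totals = tally(toy)
grand_total = sum(totals.values())
for pool, n in totals.items():
    print(pool, n, f"{100 * n / grand_total:.1f}%")
```

Summing the counts rather than counting distinct reads matters here, since a single read such as the example can stand for hundreds of occurrences.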