2008-07-05

Searching phased siRNAs from mapping result over genomes in linear (O(n)) time complexity

by Forrest Sheng Bao http://fsbao.net

In bioinformatics, we prefer constant or linear algorithms. The reason is obvious. But, sometimes, something strange just happened in some papers.

In 2006, Ho-Ming Chen, Yi-Hang Li and Shu-Hsing Wu published a paper on PNAS, "Bioinformatic prediction and experimental validation of a microRNA-directed tandem trans-acting siRNA cascade in Arabidopsis", http://www.pnas.org/content/104/9/3318.abstract.

They used a p-test to find potential areas with phased siRNAs, small RNAs that are successive to each other. I think this is rather an inefficient way. First, the genome of Arabidopsis is not quite long, 100 million nucleotides, not 1 billion nts. If you could develop an O(n) algorithm, considering the high speed of contemporary computers, it's a snap. Second, I am not quite sure about the accuracy of their algorithm.

I think the precise search would also work and very easy. For example, if you wanna know whether the 21-nt siRNA starting from i-th nt succeeds 21-nt siRNA starting from i-21-th nt , just search whether the later siRNA can be mapped on to the Arabidopsis genome, while the mapping data is already available. The time complexity of worst case is O(n), where n is the length of the genome. Plus, my algorithm has nothing to do how many successive siRNAs you wanna find. I can tell you all cases in an O(n) run.

I think their algorithm is inaccuracy, coz it's just a prediction, and not of linear time complexity. I really got confused why they develop an sophisticated and inaccurate algorithm with higher time complexity. Just simple search is ok. Take a look at their code. It's not simple, right? http://www.pnas.org/content/104/9/3318/suppl/DC1#SC.

Of course, the mapping data format is the same as mentioned in that PNAS paper. The start position of each small RNA signature in terms of coordinate, and strand (1: Watson; -1: Crick) which are separated by Tab as shown below.

# 11927 -1 TGGCGATGATGATCAAT
# 28152 1 CATCCTTCGATGTTGTG
# 42408 1 CTCTTAGCTAAGAGCCA

Here comes my code. You can save it as phase.py and run it as

python phase.py test.csv 21
after downloading the test later.
#This software is a free software.
#Thus, it is licensed under GNU General Public License.
#Forrest Bao, Dept. of Computer Science/Electrical Engineering, Texas Tech University
#Jul. 5, 2008 <> <>

# to find phased small RNAs
# run it in "phase.siRNA" folder as python phase.siRNA sense_WT_TMV_21.csv 21
# last modified Jul. 5

import string,os
from os.path import *
from numpy import *
from pylab import *
import cPickle

f1=open(sys.argv[1], 'r')
pool=[]
totalseq = 0;
for line in f1.readlines():
pool.append(line.split());
f1.close()

count = list(pool)
for i in xrange(0,len(count)):
count[i].append(0);

for i in xrange(len(pool)-1,-1,-1):
k = int(pool[i][1])
phase = int(sys.argv[2]);
for j in xrange(i-1,i-phase-1,-1): # fixed time loop, $phase times
if k-phase == int(pool[j][1]):
count[j][3]+=count[i][3]+1

output='siRNA\t\tlocation\tstrand\tphase\n'
for i in xrange(0,len(count)):
if count[i][3]>0:
output+=pool[i][0]+'\t'+pool[i][1]+'\t'+pool[i][2]+'\t'+str(count[i][3])+'\n'

print output

And this is my test data, please save the data as test.csv.

ACGTCACCCTTACGGATTTAC 6134 -1
TTTACGTCACCCTTACGGATT 6137 -1
TAATTTGGTTTACGTCACCCT 6145 -1
TTACGTAATTTGGTTTACGTC 6150 -1
ACATTACGTAATTTGGTTTAC 6153 -1
AACATTACGTAATTTGGTTTA 6154 -1
AAACATTACGTAATTTGGTTT 6155 -1
TCGATTTAAATGGAACCTAAA 6174 -1
TTCGATTTAAATGGAACCTAA 6175 -1
TTTCGATTTAAATGGAACCTA 6176 -1
AGGTTTCGATTTAAATGGAAC 6179 -1
ACAGGTTTCGATTTAAATGGA 6181 -1
TCCAGGAAATAACAGGTTTCG 6192 -1
GTTAACAGGTGATCCAGGAAA 6204 -1
CGTTAACAGGTGATCCAGGAA 6205 -1
CGCGTACGTTAACAGGTGATC 6211 -1
ACGCCACGCGTACGTTAACAG 6217 -1
TACGCCACGCGTACGTTAACA 6218 -1
TGTAATATACGCCACGCGTAC 6225 -1
CCCACTGTAATATACGCCACG 6230 -1
AGTTATTCCCACTGTAATATA 6237 -1
TAGTTATTCCCACTGTAATAT 6238 -1
TTAGTTATTCCCACTGTAATA 6239 -1
CACTTTTAGTTATTCCCACTG 6244 -1
AGGATTCGAACCTCTCACTTT 6259 -1
GAGGATTCGAACCTCTCACTT 6260 -1
Comments are welcomed. I will write a short article to arXiv later.

No comments: