Think, Forrest! Think!: bioinformatics

2015-11-22

Bowtie index building problem for U's in sequences

I am having a weird problem with the Bowtie index builder (bowtie-build).

I have two entries in my FASTA file:

forrest@narnia:/bioinfo/out-house/miRBase$ cat 157.fa
>ath-miR157a-5p-U
UUGACAGAAGAUAGAGAGCAC
>ath-miR157a-5p_T
TTGACAGAAGATAGAGAGCAC

After building the index, I use bowtie-inspect to check it.

forrest@narnia:/bioinfo/out-house/miRBase$ bowtie-build 157.fa  157 -q
forrest@narnia:/bioinfo/out-house/miRBase$ bowtie-inspect -s 157
Colorspace 0
SA-Sample 1 in 32
FTab-Chars 10
Sequence-1 ath-miR157a-5p-U 18
Sequence-2 ath-miR157a-5p_T 21

Strangely, the length of ath-miR157a-5p-U becomes 18 instead of 21: its three U's have been dropped.

Even more strangely, this does not happen to every sequence that contains U's. Some sequences lose their U's and others do not.
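A possible workaround, as a minimal sketch: if the index builder only keeps A, C, G and T (an assumption I have not verified against the Bowtie source), converting U to T before running bowtie-build should keep both entries at 21 nt. The helper names `count_u` and `rna_to_dna` are made up for illustration; header lines are left untouched so that names such as ath-miR157a-5p-U survive.

```python
def count_u(seq):
    """Count U/u characters in one sequence line."""
    return sum(1 for ch in seq if ch in "Uu")


def rna_to_dna(fasta_text):
    """Return FASTA text with U/u replaced by T/t in sequence lines only."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header line: keep as-is, even if the name contains a U.
            out.append(line)
        else:
            out.append(line.replace("U", "T").replace("u", "t"))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    record = ">ath-miR157a-5p-U\nUUGACAGAAGAUAGAGAGCAC\n"
    seq = record.splitlines()[1]
    # 21 characters, 3 of them U, which matches the 18 reported by bowtie-inspect.
    print(len(seq), count_u(seq), len(seq) - count_u(seq))
    print(rna_to_dna(record), end="")
```

The converted text can then be written to a new file and passed to bowtie-build in place of the original.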

2014-12-28

FASTA header parsing problem from FASTA36.3.5d to FASTA36.3.7

I just upgraded the FASTA program on my computer from FASTA36.3.5d to FASTA36.3.7 via http://faculty.virginia.edu/wrpearson/fasta/fasta36/

Then I noticed a very strange problem with the FASTA headers from my library file.

This is one entry in my library file:

>gi|54118649|gb|CO494158.1|CO494158 G.h.fbr-sw03548 G.h.fbr-sw Gossypium hirsutum cDNA, mRNA sequence
GCTTAGCGTGGTCGCGGCCGAGGTACTTTTTTTTTTTTTTTTTTTTTTTGGGAAACTTTCACAGTCTTGC
CATTTCCATAGTATTTAAATGATGACAAATTGGAGCAGGAATAACATTACAGTGCATGATACAAACAATT
AAGCTATAGGACTCTATTAAGTTATTCATTCTATGAAGATGATGCTAGTTTCCAATAGCAAATAAAGGCT

I simply run a Smith-Waterman search.

With FASTA36.3.7, the header of this entry in the results looks like this:

The best scores are:                                                                              s-w bits E(1)
gb|CO494158.1|CO494158 G.h.fbr-sw03548 G.h.fbr-sw Gossypium hirsutum cDNA, mRNA sequen ( 569) [r]  315 30.8 6.4e-06

>>gb|CO494158.1|CO494158 G.h.fbr-sw03548 G.h.fbr-sw Gossypium hirsutum cDNA, mRNA se              (569 nt)
rev-comp s-w opt: 315  Z-score: 138.7  bits: 30.8 E(1): 6.4e-06
Smith-Waterman score: 315; 100.0% identity (100.0% similar) in 21 nt overlap (21-1:534-554)

Note that the "gi" part of the header (gi|54118649|) is missing.

But if I revert to FASTA36.3.5d, the header looks fine:

The best scores are:                                                                              s-w bits E(1)
gi|54118649|gb|CO494158.1|CO494158 G.h.fbr-sw03548 G.h.fbr-sw Gossypium hirsutum cDNA, ( 569) [r]  315 32.8 1.6e-06

>>gi|54118649|gb|CO494158.1|CO494158 G.h.fbr-sw03548 G.h.fbr-sw Gossypium hirsutum cDNA, mRNA se  (569 nt)
rev-comp s-w opt: 315  Z-score: 149.3  bits: 32.8 E(1): 1.6e-06
Smith-Waterman score: 315; 100.0% identity (100.0% similar) in 21 nt overlap (21-1:534-554)

This makes my downstream parsing program misbehave, because it expects the header to start with "gi". I am not sure whether anyone has already reported this issue.
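Until the cause is clear, one way to make a downstream parser tolerate both outputs is to treat the "gi|number|" prefix as optional and key on the accession instead. The sketch below is only an illustration: the function name `parse_id` and the regular expression are my own assumptions, and it only covers headers shaped like the one shown above (an optional gi field followed by a database tag and an accession).

```python
import re

# Optional "gi|<digits>|" prefix, then "<db>|<accession>".
# Leading ">" characters are skipped so both ">" and ">>" lines work.
_ID = re.compile(
    r"^>*(?:gi\|(?P<gi>\d+)\|)?(?P<db>[A-Za-z]+)\|(?P<acc>[^|\s]+)"
)


def parse_id(line):
    """Return a dict with 'gi' (or None), 'db' and 'acc', or None if no match."""
    m = _ID.match(line.strip())
    if m is None:
        return None
    return m.groupdict()


if __name__ == "__main__":
    old = ">>gi|54118649|gb|CO494158.1|CO494158 G.h.fbr-sw03548 G.h.fbr-sw"
    new = ">>gb|CO494158.1|CO494158 G.h.fbr-sw03548 G.h.fbr-sw"
    # Both forms yield the same accession; only the gi field differs.
    print(parse_id(old))
    print(parse_id(new))
```

Keying on the accession rather than the gi number means the same lookup works with the output of either version.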