2012-11-04

A possible data bug in Arabidopsis thaliana's transposable element sequences

Arabidopsis thaliana (rockcress) is a plant intensively studied by biologists around the world. Its genome and genome-related information are released at TAIR (www.arabidopsis.org) and used again and again by researchers, so it shouldn't contain any obvious mistakes. But I might have just found one.

I obtained the transposable element (TE) sequences from TAIR at ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_transposable_elements/TAIR10_TE.fas (Do not click unless needed. It is a huge FASTA-format file and may freeze your computer for a while.)

If you search the file for the TE AT5TE28285, you will get this record:

>AT5TE28285|-|7818034|7819050|SIMPLEGUY1|DNA/Harbinger|54 bp
AATTATTGTAATGTATTTTCAAATTTGACAATGAATTTAGAAGAAACACGAGAT

The first line (the FASTA header line) says this TE is at 7818034-7819050 bp on the reverse strand (of chromosome 5), and that its length is 54 bp. The sequence below it is indeed 54 bp long.

However, 7819050 - 7818034 + 1 = 1017, which is not 54.
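The same arithmetic can be sketched in a few lines of Python. The field layout (name, strand, start, end, family, superfamily, length) is only my assumption from the one header shown above, and check_header is a made-up helper name, not anything TAIR provides.

```python
def check_header(header, sequence):
    """Compare the coordinate span, the declared length and the real length."""
    # Assumed layout: name|strand|start|end|family|superfamily|"<n> bp"
    fields = header.lstrip(">").strip().split("|")
    start, end = int(fields[2]), int(fields[3])
    declared = int(fields[-1].split()[0])  # "54 bp" -> 54
    return {
        "name": fields[0],
        "span": end - start + 1,       # length implied by the coordinates
        "declared": declared,          # length written in the header
        "actual": len(sequence),       # length of the sequence itself
    }


header = ">AT5TE28285|-|7818034|7819050|SIMPLEGUY1|DNA/Harbinger|54 bp"
seq = "AATTATTGTAATGTATTTTCAAATTTGACAATGAATTTAGAAGAAACACGAGAT"
result = check_header(header, seq)
print(result)
```

For this record the declared and actual lengths agree with each other, and only the span from the coordinates is different.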

I tried to confirm the coordinates using another table at TAIR: ftp://ftp.arabidopsis.org/Genes/TAIR10_genome_release/TAIR10_transposable_elements/TAIR10_Transposable_Elements.txt

There, the coordinates of TE AT5TE28285 are also 7818034 bp to 7819050 bp.

I tried to find more information on arabidopsis.org. GBrowse gives a result consistent with the coordinates (http://www.arabidopsis.org/servlets/TairObject?type=sequence&id=2503988071), while the sequence details page says the TE is 54 bp (http://www.arabidopsis.org/servlets/TairObject?type=sequence&id=2503988071).

I hope I am not drunk from intensively studying the crazy post-storm NYC transit schedule.

Does anyone have any idea?

PS: I have decided to write a program to verify the coordinates and sequence lengths of all the sequences now.
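A rough sketch of what such a program might look like is below. It assumes every header in the file follows the same pipe-separated layout as the AT5TE28285 header, which I have not verified for the whole file. find_length_mismatches is a hypothetical name, and the EXAMPLE1 record in the sample is made up purely to show a consistent entry.

```python
import io


def find_length_mismatches(lines):
    """Yield (name, span, declared, actual) for every FASTA record whose
    coordinate span, declared length and real sequence length disagree."""

    def check(header, chunks):
        # Assumed layout: name|strand|start|end|family|superfamily|"<n> bp"
        fields = header[1:].split("|")
        start, end = int(fields[2]), int(fields[3])
        span = abs(end - start) + 1
        declared = int(fields[-1].split()[0])
        actual = sum(len(chunk) for chunk in chunks)
        if span == declared == actual:
            return None
        return (fields[0], span, declared, actual)

    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, so check that one now.
            if header is not None:
                bad = check(header, chunks)
                if bad is not None:
                    yield bad
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # The last record has no following header, so check it here.
    if header is not None:
        bad = check(header, chunks)
        if bad is not None:
            yield bad


# Small in-memory sample: the record from this post plus one made-up record.
sample = io.StringIO(
    ">AT5TE28285|-|7818034|7819050|SIMPLEGUY1|DNA/Harbinger|54 bp\n"
    "AATTATTGTAATGTATTTTCAAATTTGACAATGAATTTAGAAGAAACACGAGAT\n"
    ">EXAMPLE1|+|1|8|FAM|CLASS|8 bp\n"
    "ACGTACGT\n"
)
mismatches = list(find_length_mismatches(sample))
print(mismatches)

# To scan a downloaded copy of the real file instead (path is an assumption):
# with open("TAIR10_TE.fas") as fh:
#     for bad in find_length_mismatches(fh):
#         print(bad)
```

The function reads line by line instead of loading the whole file, since the real FASTA file is huge.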
