Re: [Gmod-gbrowse] Can I load GFF3 to DB but keep DNA sequence as file?


CHAN, KENNETH 1 [AG/7721]

Thanks Scott for your quick and easy solution.

Thanks James for sharing such valuable experience.

In my case, I am using GBrowse 1.7. (The reason for not using GBrowse 2.0 is that our IT team only supports IE6, so everyone here is still using it. IE6 doesn't like the JavaScript in GBrowse 2.0, and some functions do not work properly there; e.g. the File and Help menus do not work.) It therefore seems my only choices are to put all data (including the reference sequence) either in the database or in files, or to leave all tracks that require sequence information out of the current release. Following James's suggestion, I will try putting everything in files and see whether that works. Thanks.

 

Regards,

Kenneth

 

 

 

From: James M. Ward [mailto:[hidden email]]
Sent: Thursday, May 13, 2010 11:32 PM
To: [hidden email]
Cc: Scott Cain; CHAN, KENNETH 1 [AG/7721]
Subject: Re: Can I load GFF3 to DB but keep DNA sequence as file?

 

Kenneth,

In case it helps you decide a path forward, I discovered the same requirement Scott described.  The requirement is for the reference DNA sequence (chromosomes) to be in the database.  But upon thinking about it just now, I think the only reason we have the full genome sequences loaded into a database is for GBrowse.  Hmm...

The sequences for GBrowse tracks do not have to be in the database, provided you have the coordinate data outside the database too.  For that reason, we put almost any datasource that has a reasonably high content of sequence data (e.g. numerous sequencing projects' reads) into BAM format, with sequence embedded into it.  For rendering speed, I don't think anyone questions the efficiency gains.  And I haven't even tried the BigBed format which Lincoln says is even faster! (I can't wait.) But I don't think BigBed can hold a sequence. The exceptions are tracks where the sequences or the spans of the features are extremely large (>1 MB) in which case I suspect the BAM format falters, and our server's memory fills up with whipped cream.  Not as fun as it sounds.  (We haven't had that problem in a while.)

As Scott said, the methods which fetch and assemble the reference sequence work fine for GBrowse's purposes.  But in my [imperfect] hands, for almost any other batch DNA segment fetching routine, there are other methods which are hugely more efficient.  I think in log scale, so when I say 2-3 times faster, I mean 10-100 times faster.  But it may make my life easier to be wrong, so someone interject if you have seen different.  :-)

For example, if you wanted to export sequence representing all exon spans in a whole genome, Dr. Aaron Quinlan's BEDtools will accomplish that task in several seconds.  (I just tested it again and created a FASTA file of 167,000 zebra finch exons in just under 10 seconds.)  I could be missing an obvious trick that others are using to do the same from a database, but I can't touch that.  Similarly, if you wanted to get the intersections of all exons with any transcriptome sequencing reads (using the coordinates of the overlaps), BEDtools can do that in a couple seconds too.  Again, I can't compete if using a relational database there either.  No knock on the tools, because I actually love it when several toolsets can complement each other this much!
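To make the file-based approach concrete, here is a minimal pure-Python sketch of the operation described above: pulling the sequence for each interval out of a FASTA file using BED-style coordinates (0-based start, exclusive end). It is not how BEDtools is implemented and will be far slower on a real genome. The function names and the toy data are hypothetical and chosen only for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into a dict of {sequence name: sequence string}."""
    sequences = {}
    name = None
    parts = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if name is not None:
                sequences[name] = "".join(parts)
            name = line[1:].split()[0]
            parts = []
        else:
            parts.append(line)
    if name is not None:
        sequences[name] = "".join(parts)
    return sequences


def extract_intervals(sequences, bed_lines):
    """Yield (label, subsequence) for each BED line (chrom, start, end)."""
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # BED intervals are half-open, which matches Python slicing.
        yield f"{chrom}:{start}-{end}", sequences[chrom][start:end]


# Toy data, made up for illustration only.
fasta_text = ">chr1 toy\nACGTACGTAC\nGGGTTTAAAC\n"
bed_text = "chr1\t0\t4\nchr1\t8\t13\n"

genome = read_fasta(StringIO(fasta_text))
exons = list(extract_intervals(genome, bed_text.splitlines()))
for label, seq in exons:
    print(f">{label}\n{seq}")
```

Because BED coordinates are half-open, the second interval spans the line break in the FASTA record without any special handling once the lines have been joined.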

In theory, a database stores data in files, and someone could write an API which is exactly as fast, but I couldn't find it.  Closest I found were blazing-fast prototype functions for time-duration overlap queries, which I think can be adapted for coordinate-segmented overlaps.  But the easiest thing may be writing database functions which secretly call commandline tools in the background.  (Shh!)  It may not be pretty on an architecture slide, but think of how much more time you'd have to try!
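The idea of a function that quietly hands the work to a command-line tool can be sketched as below. This assumes BEDtools is installed and on the PATH, and uses the `bedtools getfasta` subcommand form with its `-fi`, `-bed` and `-fo` options as found in current releases; the wrapper function names and the file names are hypothetical.

```python
import subprocess


def build_getfasta_command(fasta_path, bed_path, out_path):
    """Return the argument list for a `bedtools getfasta` call."""
    return [
        "bedtools", "getfasta",
        "-fi", fasta_path,   # input FASTA (reference sequence)
        "-bed", bed_path,    # intervals to extract
        "-fo", out_path,     # output FASTA
    ]


def export_sequences(fasta_path, bed_path, out_path):
    """Run bedtools; raises CalledProcessError if it exits non-zero."""
    subprocess.run(
        build_getfasta_command(fasta_path, bed_path, out_path),
        check=True,
    )
    return out_path


# Hypothetical file names; only the command is built here, nothing is run.
command = build_getfasta_command("genome.fa", "exons.bed", "exons.fa")
print(" ".join(command))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names, and keeping the command construction separate from the call makes the wrapper easy to check without BEDtools present.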

All that to say, I think you're still better off storing feature track sequences (and coordinates) in files than the database.  And probably better off performing coordinate-based queries on those files as well.

Best regards,

James

On Thu, 2010-05-13 at 02:10 +0000, [hidden email] wrote:

Hi Kenneth,

If you are using GBrowse2, you can have separate data sources for
different tracks, so if you want DNA and translation tracks, you could
define a database track that is a FASTA file (probably--I haven't tested
this).  If you did do that, you couldn't use the DNA for anything else
though (like showing translations in CDS tracks or mismatches in
hits).

BUT--I really don't think you want to do that anyway.  The DNA data
are stored "shredded" in the database, and the algorithm for fetching
pieces and reassembling it is pretty good.  Have you had problems with
the responsiveness related to DNA fetching?
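As a rough illustration of what "shredded" storage with reassembly can look like, here is a small sketch. It is not the actual GBrowse or database adaptor code: the dictionary stands in for a database table, the chunk size is an arbitrary assumption, and the function names are hypothetical.

```python
def shred(sequence, chunk_size):
    """Split a sequence into fixed-size chunks keyed by 0-based start offset."""
    return {
        offset: sequence[offset:offset + chunk_size]
        for offset in range(0, len(sequence), chunk_size)
    }


def fetch(chunks, chunk_size, start, end):
    """Reassemble the half-open range [start, end) from the stored chunks."""
    if start >= end:
        return ""
    # Offset of the chunk that contains the first requested base.
    first = (start // chunk_size) * chunk_size
    pieces = []
    for offset in range(first, end, chunk_size):
        chunk = chunks.get(offset)
        if chunk is None:
            break  # request runs past the end of the stored sequence
        pieces.append(chunk)
    joined = "".join(pieces)
    # Trim the leading and trailing bases that were not requested.
    return joined[start - first:end - first]


# Toy data, made up for illustration; a real chunk size would be much larger.
reference = "ACGTACGTACGGGTTTAAAC"
chunk_size = 6
table = shred(reference, chunk_size)

region = fetch(table, chunk_size, 4, 15)
assert region == reference[4:15]
print(region)
```

Only the chunks that overlap the requested range are touched, which is why fetching a small window stays cheap regardless of chromosome length.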

Scott


On Wed, May 12, 2010 at 10:05 PM, CHAN, KENNETH 1 [AG/7721]
<[hidden email]> wrote:
> Hello all,
>
> From what I know, storing DNA sequence in a database is not recommended
> due to its large size. Is it possible to store only the GFF3 data in the
> database and keep the DNA sequence in a FASTA file, but still view the
> DNA sequence when zoomed in?
>
> My question is actually the same as James' post
> (http://old.nabble.com/FASTA-files-alongside-relational-databases--td26647514.html#a26647514)
> but it seems there has been no reply so far.
>
> Thanks in advance.
>
>
>
> Regards,
>
> Kenneth
>
>
>
> This e-mail message may contain privileged and/or confidential information,
> and is intended to be received only by persons entitled to receive such
> information. If you have received this e-mail in error, please notify the
> sender immediately. Please delete it and all attachments from any servers,
> hard drives or any other media. Other use of this e-mail by you is strictly
> prohibited.
>
> All e-mails and attachments sent and received are subject to monitoring,
> reading and archival by Monsanto, including its subsidiaries. The recipient
> of this e-mail is solely responsible for checking for the presence of
> "Viruses" or other "Malware". Monsanto, along with its subsidiaries, accepts
> no liability for any damage caused by any such code transmitted by or
> accompanying this e-mail or any attachment.
>
> ------------------------------------------------------------------------------
>
>
> _______________________________________________
> Gmod-gbrowse mailing list
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse
>
>



--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research

James M. Ward
Bioinformatics and Computational Biology
Department of Neurobiology
Duke University Medical Center
(919) 699-3631
[hidden email]
[hidden email]

 
