[Gmod-ajax] GTF parsing challenge


[Gmod-ajax] GTF parsing challenge

Andrew Warren
Hi all,

I have started work on adding a GTF parser. Since GTF is very similar to (but annoyingly different from) the GFF format, I started off by modifying Robert's existing GFF parser.

Everything seems to parse OK and the features are created, but when I click on a feature or do enough panning and zooming, things seem to lock up (making it difficult to debug). I am not sure what's going on here.

The changes I have so far can be found here:

https://github.com/aswarren/jbrowse/commit/fe1251c50e855fb1562a716b53c7701669c12543


GTF format here: http://mblab.wustl.edu/GTF22.html

I also try to account for the Cufflinks variation (with actual transcript lines): http://cufflinks.cbcb.umd.edu/manual.html#gtfout


An example of the Cufflinks variation: http://rnaseq.pathogenportal.org/dataset/display?dataset_id=5c1eed587ff05ae7&to_ext=gtf


Any suggestions or ideas on what could be going wrong would be greatly appreciated.


Thanks,

Andrew




_______________________________________________
Gmod-ajax mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-ajax

Re: GTF parsing challenge

Andrew Warren
Update: I wrote too soon. I had some basic bugs that needed fixing.

The new parsing is slow, however, really slow. I'm not sure why at this point, so I am trying to do some profiling.

GTF features have to be represented a little differently from GFF features. Every feature line has a gene_id and a transcript_id, but no ID uniquely identifies an individual feature, and there is no parent relation. The only "unique IDs" are gene_id and transcript_id, and those are unique only in aggregate: multiple lines refer to the same conceptual gene and transcript, but there is no line [or coordinates] for the gene or transcript itself. So I create each transcript feature from its first 'child' feature and treat gene_ids (i.e. genes) as attributes rather than as features in their own right. As more "children" are added to the parent transcript (matched through the transcript_id), I expand its coordinates.

The reason I am leaving out genes is that in eukaryotes a gene can have multiple transcripts, while in prokaryotes a transcript can have multiple genes. Since there is no parent relation to resolve the matter, one top-level attribute had to be chosen.
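
The aggregation described above can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the linked commit: the function name `groupByTranscript`, the record shape, and the `geneIds`/`children` field names are all hypothetical. Explicit `transcript` lines (as in the Cufflinks variation) are used here only to widen the span.

```javascript
// Hypothetical sketch: build one transcript feature per transcript_id.
// Each input record is assumed to look like
//   { seqId, type, start, end, strand, attributes: { gene_id, transcript_id } }
function groupByTranscript(records) {
  const transcripts = new Map(); // transcript_id -> synthesized transcript

  for (const rec of records) {
    const id = rec.attributes.transcript_id;
    let tx = transcripts.get(id);

    if (tx === undefined) {
      // First child seen: seed the transcript with this child's coordinates.
      tx = {
        type: "transcript",
        transcriptId: id,
        seqId: rec.seqId,
        strand: rec.strand,
        start: rec.start,
        end: rec.end,
        geneIds: [],   // genes are kept as attributes, not as features
        children: [],
      };
      transcripts.set(id, tx);
    }

    // Expand the transcript span to cover every line that names it.
    tx.start = Math.min(tx.start, rec.start);
    tx.end = Math.max(tx.end, rec.end);

    // A transcript may carry several gene_ids (e.g. an operon).
    const geneId = rec.attributes.gene_id;
    if (!tx.geneIds.includes(geneId)) tx.geneIds.push(geneId);

    // Explicit transcript lines only contribute their span.
    if (rec.type !== "transcript") tx.children.push(rec);
  }

  return Array.from(transcripts.values());
}

const example = groupByTranscript([
  { seqId: "chr1", type: "exon", start: 100, end: 200, strand: "+",
    attributes: { gene_id: "geneA", transcript_id: "tx1" } },
  { seqId: "chr1", type: "exon", start: 300, end: 400, strand: "+",
    attributes: { gene_id: "geneB", transcript_id: "tx1" } },
]);
console.log(example[0].start, example[0].end, example[0].geneIds);
```

Keying the lookup on transcript_id with a `Map` keeps each line's update constant-time; whether lookup cost has anything to do with the slowness mentioned above is not something this sketch can show.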


On Mon, Feb 10, 2014 at 1:03 PM, Andrew Warren <[hidden email]> wrote:
> [...]

Re: GTF parsing challenge

Robert Buels-2
I'm not all that familiar with GTF right now, but it seems like the GFF3
parser might be a lot more complex than a GTF parser would need to be.
It might actually be easier and faster to write one from scratch.

Also, for performance profiling, give Devel::NYTProf a try if you
haven't already.  It's swell.
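
To give a sense of how small a from-scratch parser for a single GTF line could be, here is a minimal sketch. It is not JBrowse code: `parseGtfLine` and the field names of the returned object are hypothetical, and it assumes the nine tab-separated columns and the `key "value";` attribute syntax described in the GTF 2.2 document linked earlier in the thread. Coordinates are left exactly as they appear in the file.

```javascript
// Hypothetical helper: parse one GTF line into a plain object.
// Returns null for blank lines and comment lines.
function parseGtfLine(line) {
  const text = line.trim();
  if (text === "" || text.startsWith("#")) return null;

  const cols = text.split("\t");
  if (cols.length < 9) {
    throw new Error("expected 9 tab-separated columns, got " + cols.length);
  }
  const [seqId, source, type, start, end, score, strand, frame] = cols;

  // Attributes look like: gene_id "g1"; transcript_id "t1"; exon_number 2;
  // Values may be quoted text or bare tokens; both are kept as strings.
  const attributes = {};
  const attrPattern = /(\S+)\s+(?:"([^"]*)"|([^\s;]+))\s*;?/g;
  const attrText = cols.slice(8).join("\t");
  let match;
  while ((match = attrPattern.exec(attrText)) !== null) {
    attributes[match[1]] = match[2] !== undefined ? match[2] : match[3];
  }

  return {
    seqId,
    source,
    type,
    start: parseInt(start, 10),
    end: parseInt(end, 10),
    score: score === "." ? null : parseFloat(score),
    strand: strand === "." ? null : strand,
    frame: frame === "." ? null : parseInt(frame, 10),
    attributes,
  };
}

const sample = parseGtfLine(
  'chr1\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";'
);
console.log(sample.type, sample.attributes.transcript_id);
```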


Robert Buels
Lead Developer
JBrowse - http://jbrowse.org

On 02/10/2014 07:06 PM, Andrew Warren wrote:

> [...]

Re: GTF parsing challenge

Andrew Warren
Ah, maybe you have given me a clue to what I'm doing wrong.
Devel::NYTProf is a Perl profiler (which I will check out next time I get a chance).

So far I have been concentrating on building out/replicating/modifying GTF equivalents for the following GFF parsing files (which are all JS):
JBrowse/Util/GFF3.js
JBrowse/Store/SeqFeature/GFF/Parser.js
JBrowse/Store/SeqFeature/GFF.js
JBrowse/View/FileDialog/TrackList/GFFDriver.js

Is there some low-level Perl that gets called in the course of loading a GFF from a URL or local storage? Also, does it look like I am missing any major files in the processing?

Thanks!
Andrew



On Mon, Feb 10, 2014 at 7:15 PM, Robert Buels <[hidden email]> wrote:
> [...]

Re: GTF parsing challenge

Robert Buels-2
Oooh, sorry I misunderstood.  Thought you were talking about adding GTF
support to flatfile-to-json.pl.  But doing it on the client side is good
too.

Well anyway, if you have a good understanding of the JS GFF3 parser, I'm
sure you'll figure it out.

Are you sure that in prokaryotes a transcript can belong to multiple
genes?  This state of things would probably cause trouble with a lot of
tools.


Robert Buels
Lead Developer
JBrowse - http://jbrowse.org

On 02/12/2014 12:21 PM, Andrew Warren wrote:

> [...]

Re: GTF parsing challenge

Scott Cain
Hi Rob,

Polycistronic genes do occur in Euks (I know for sure they happen in fruit flies--the weird stuff always happens in fruit flies), but we just like to pretend they don't happen :-)

Scott



On Wed, Feb 12, 2014 at 9:59 AM, Robert Buels <[hidden email]> wrote:
> [...]



--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: GTF parsing challenge

Andrew Warren
In reply to this post by Robert Buels-2
OK, good to know I haven't left something out completely. Most of my changes to the existing GFF parsing are encapsulated in one squashed commit, though there were some follow-on bug fixes and streamlining. Since I copied all the GFF files in an initial commit and then changed them from there, it's pretty easy to look at. I'm still working on it but have yet to get speed comparable to the GFF parser (which is strange, since the changes are not that drastic).
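
For a rough speed comparison between two client-side parsers, a small timing harness is one option. The sketch below is only an assumption about how such a comparison might be set up: `timeParser` is a hypothetical helper, and the workload shown is a stand-in rather than the actual GFF3 or GTF parse calls. It reports wall-clock time only; finding the hot spots themselves is a job for the browser's built-in JavaScript profiler.

```javascript
// Hypothetical helper: call parseFn(input) several times and return
// the best wall-clock time in milliseconds.
function timeParser(label, parseFn, input, repeats = 5) {
  let best = Infinity;
  for (let i = 0; i < repeats; i++) {
    const t0 = Date.now();
    parseFn(input);
    best = Math.min(best, Date.now() - t0);
  }
  console.log(label + ": best of " + repeats + " runs = " + best + " ms");
  return best;
}

// Stand-in workload: split 10,000 identical GTF-like lines on tabs.
// In practice the two parsers under comparison would be passed in instead.
const sampleLines = new Array(10000).fill(
  'chr1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "g"; transcript_id "t";'
);
timeParser("split only", (ls) => ls.map((l) => l.split("\t")), sampleLines);
```

Running both parsers on the same input through the same harness at least puts the "comparable speed" question on equal footing.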

Transcripts with multiple genes are the norm in bacteria/prokaryotes. They use operons (http://en.wikipedia.org/wiki/Operon), which place a bunch of genes next to each other on the same transcript.

Thanks,
Andrew



On Wed, Feb 12, 2014 at 12:59 PM, Robert Buels <[hidden email]> wrote:
Oooh, sorry I misunderstood.  Thought you were talking about adding GTF support to flatfile-to-json.pl.  But doing it on the client side is good too.

Well anyway, if you have a good understanding of the JS GFF3 parser, I'm sure you'll figure it out.

Are you sure that in prokaryotes a transcript can belong to multiple genes?  This state of things would probably cause trouble with a lot of tools.



Robert Buels
Lead Developer
JBrowse - http://jbrowse.org

On 02/12/2014 12:21 PM, Andrew Warren wrote:
Ah maybe you have given me a clue to what I'm doing wrong.
Devel::NYTProf is a perl profiler (which I will checkout next time I get
a chance).

So far I have been concentrating on building out/replicating/modifying
GTF equivalents for the following GFF parsing files (which are all JS):
JBrowse/Util/GFF3.js
JBrowse/Store/SeqFeature/GFF/Parser.js
JBrowse/Store/SeqFeature/GFF.js
JBrowse/View/FileDialog/TrackList/GFFDriver.js

Is there some low-level Perl that gets called in the course of loading a
GFF from URL or local storage? Also does it look like I am missing any
major files in the processing?

Thanks!
Andrew



On Mon, Feb 10, 2014 at 7:15 PM, Robert Buels <[hidden email]
<mailto:[hidden email]>> wrote:

    I'm not all that familiar with GTF right now, but it seems like the
    GFF3 parser might be a lot more complex than a GTF parser would need
    to be. It might actually be easier and faster to write one from scratch.

    Also, for performance profiling, give Devel::NYTProf a try if you
    haven't already.  It's swell.


    Robert Buels
    Lead Developer
    JBrowse - http://jbrowse.org


    On 02/10/2014 07:06 PM, Andrew Warren wrote:

        Update: I wrote to soon. I had some basic bugs that needed fixing.

        The new parsing is slow however, really slow. I'm not sure why
        at this
        point. I am trying to do some profiling.

        For GTF, every feature line has a gene_id and a transcript_id
        but there
        are no ids that uniquely id each feature. Representing GTF
        features is a
        little different because: there is no unique id per feature, no
        parent
        relation, and the only "unique IDs" are gene_id and
        transcript_id which
        are only unique in aggregate -- meaning multiple lines all refer
        to the
        same conceptual unique gene and transcript but there is no line [or
        coordinates] for them specifically. Here I am creating transcript
        features from the first 'child' feature and let 'gene_ids' aka genes
        simply be attributes not a feature in themselves. As more
        "children" are
        added to the parent transcript (through the transcript_id) I
        expand its
        coordinates.

        The reason I am leaving out genes is that in eukaryotes a gene
        can have
        multiple transcripts. In prokaryotes a transcript can have multiple
        genes. Since there is no parent relation to resolve the matter, one
        top-level attribute needed to be chosen.



Re: GTF parsing challenge

Robert Buels-2
Oh, right, operons.  In the SO, they are a subclass of gene_group
(http://www.sequenceontology.org/browser/current_svn/term/SO:0000178),
so no problem there.

Make any progress since my last mail?  Any questions?

Robert Buels
Lead Developer
JBrowse - http://jbrowse.org

On 02/12/2014 02:51 PM, Andrew Warren wrote:

> OK, good to know I haven't left something out completely. Most of my
> mutations to the existing GFF parsing are encapsulated in one squashed
> commit, though there were some follow-on bug fixes and streamlining.
> Since I copied all the GFF files in an initial commit and then changed
> them from there, it's pretty easy to look at.  I'm still working on it
> but have yet to get speed comparable to the GFF parser (which is strange,
> since the changes are not that drastic).
> https://github.com/aswarren/jbrowse/commit/fe1251c50e855fb1562a716b53c7701669c12543
>
> Transcripts with multiple genes are the norm in bacteria/prokaryotes.
> They use operons (http://en.wikipedia.org/wiki/Operon), which place a
> bunch of genes next to each other on the same transcript.
>
> Thanks,
> Andrew
>
>
>
> On Wed, Feb 12, 2014 at 12:59 PM, Robert Buels <[hidden email]> wrote:
>
>     Oooh, sorry I misunderstood.  Thought you were talking about adding
>     GTF support to flatfile-to-json.pl.
>     But doing it on the client side is good too.
>
>     Well anyway, if you have a good understanding of the JS GFF3 parser,
>     I'm sure you'll figure it out.
>
>     Are you sure that in prokaryotes a transcript can belong to multiple
>     genes?  This state of things would probably cause trouble with a lot
>     of tools.
>
>
>
>     Robert Buels
>     Lead Developer
>     JBrowse - http://jbrowse.org
>
>     On 02/12/2014 12:21 PM, Andrew Warren wrote:
>
>         Ah, maybe you have given me a clue to what I'm doing wrong.
>         Devel::NYTProf is a Perl profiler (which I will check out next
>         time I get a chance).
>
>         So far I have been concentrating on building out/replicating/modifying
>         GTF equivalents for the following GFF parsing files (which are all JS):
>         JBrowse/Util/GFF3.js
>         JBrowse/Store/SeqFeature/GFF/Parser.js
>         JBrowse/Store/SeqFeature/GFF.js
>         JBrowse/View/FileDialog/TrackList/GFFDriver.js
>
>         Is there some low-level Perl that gets called in the course of
>         loading a GFF from URL or local storage? Also, does it look like I am
>         missing any major files in the processing?
>
>         Thanks!
>         Andrew
>
>
>
>         On Mon, Feb 10, 2014 at 7:15 PM, Robert Buels <[hidden email]> wrote:
>
>              I'm not all that familiar with GTF right now, but it seems like the
>              GFF3 parser might be a lot more complex than a GTF parser would need
>              to be. It might actually be easier and faster to write one from scratch.
>
>              Also, for performance profiling, give Devel::NYTProf a try if you
>              haven't already.  It's swell.
>
>
>              Robert Buels
>              Lead Developer
>              JBrowse - http://jbrowse.org
>
>

Re: GTF parsing challenge

Andrew Warren
Hey Robert,

I fixed the performance issue I had with the GTF parser and issued a pull request.

One thing I noticed when testing with:

ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/Mus_musculus.GRCm38.75.dna.chromosome.1.fa.gz
ftp://ftp.ensembl.org/pub/current_gtf/mus_musculus/Mus_musculus.GRCm38.75.gtf.gz (chromosome 1 portion)

was that, with the full GTF file (large at 434Mb), addLine in GFF3/Parser (and now its GTF equivalent) is called on every line in the file.

I was wondering if it would make sense to store some sort of reference to the file path/URL of the GFF file, so that only the features for the current sequence need to be parsed and the file/URL can be reparsed to load other features if the sequence is changed. As you might expect, I got timeouts when trying to load such a big GTF file, but it worked well when grepping it down to just one sequence's worth of features.

Cheers,
Andrew
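
The "grep it down to one sequence" step above could look something like this. It is only a sketch in Python, not JBrowse code, and `open_gtf` and `lines_for_sequence` are hypothetical names. It still reads every line of the file, but it does nothing more than a prefix check on lines for other sequences, so the expensive per-line parsing is skipped for them.

```python
import gzip
import io


def open_gtf(path):
    """Open a plain or gzip-compressed GTF file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def lines_for_sequence(handle, seqname):
    """Yield only the feature lines whose first column is seqname.

    Matching on seqname plus a tab keeps 'chr1' from also matching 'chr10'.
    """
    prefix = seqname + "\t"
    for line in handle:
        if line.startswith(prefix):
            yield line


# Small in-memory stand-in for a real file.
demo = io.StringIO(
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    'chr10\tdemo\texon\t10\t20\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n'
    'chr1\tdemo\texon\t300\t450\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
)
chr1_lines = list(lines_for_sequence(demo, "chr1"))
```

For a real file the same generator would be fed from `open_gtf(path)`; switching to another sequence means reading the file again, which is the trade-off described above.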



Re: GTF parsing challenge

Robert Buels-2
Well, when you start talking about very large files like this, it makes
more sense to start using either Tabix or maybe BigBed.

You might look into doing a Tabix-indexed GTF backend, similar to the
Tabix-indexed VCF backend that is already in place.  Have a look at
JBrowse/Store/SeqFeature/VCFTabix and see how it's constructed.  If you
have a GTF parser already, I would think you could use the components
that are already in place to make a Tabix-indexed GTF backend without
too much trouble.

Robert Buels
Lead Developer
JBrowse - http://jbrowse.org
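
To illustrate why an index helps with files of this size, here is a toy sketch in Python. It is not Tabix and not the JBrowse VCFTabix code. Tabix works on position-sorted, bgzip-compressed files and indexes by coordinate, while this example only records the byte range of each sequence's lines in an uncompressed file. It assumes all lines for a sequence are contiguous, and `build_offset_index` and `read_sequence` are hypothetical names.

```python
import io


def build_offset_index(handle):
    """Map each seqname to the (start, end) byte offsets of its block of lines.

    The handle must be opened in binary mode so offsets are byte positions.
    Assumption: lines for the same seqname are contiguous in the file.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#") or not line.strip():
            continue
        seqname = line.split(b"\t", 1)[0].decode()
        first, _ = index.get(seqname, (offset, None))
        index[seqname] = (first, handle.tell())
    return index


def read_sequence(handle, index, seqname):
    """Seek straight to one sequence's block and return its feature lines."""
    start, end = index[seqname]
    handle.seek(start)
    block = handle.read(end - start).decode()
    return [ln for ln in block.splitlines() if ln and not ln.startswith("#")]


data = (
    b"# demo header\n"
    b'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    b'chr1\tdemo\texon\t300\t450\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    b'chr2\tdemo\texon\t10\t20\t.\t-\t.\tgene_id "g2"; transcript_id "t2";\n'
)
handle = io.BytesIO(data)
index = build_offset_index(handle)
chr2_lines = read_sequence(handle, index, "chr2")
```

Once the index exists, loading another sequence is one seek and one read instead of a pass over the whole file. A real Tabix index goes further by also narrowing the read to a coordinate range within the sequence.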

On 03/03/2014 05:04 PM, Andrew Warren wrote:

> Hey Robert,
>
> I fixed the performance issue I had with the GTF parser and issued a
> pull request.
>
> One thing I noticed when testing with:
>
> ftp://ftp.ensembl.org/pub/current_fasta/mus_musculus/dna/Mus_musculus.GRCm38.75.dna.chromosome.1.fa.gz
> ftp://ftp.ensembl.org/pub/current_gtf/mus_musculus/Mus_musculus.GRCm38.75.gtf.gz (chromosome 1 portion)
>
> was that, with the full GTF file (large at 434Mb), addLine in
> GFF3/Parser (and now its GTF equivalent) is called on every line in
> the file.
>
> I was wondering if it would make sense to store some sort of reference
> to the file path/URL of the GFF file, so that only the features for the
> current sequence need to be parsed and the file/URL can be reparsed to
> load other features if the sequence is changed. As you might expect, I
> got timeouts when trying to load such a big GTF file, but it worked well
> when grepping it down to just one sequence's worth of features.
>
> Cheers,
> Andrew
>
>
> On Thu, Feb 27, 2014 at 8:44 AM, Robert Buels <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     Oh, right, operons.  In the SO, they are a subclass of gene_group
>     (http://www.sequenceontology.__org/browser/current_svn/term/__SO:0000178
>     <http://www.sequenceontology.org/browser/current_svn/term/SO:0000178>),
>     so no problem there.
>
>     Make any progress since my last mail?  Any questions?
>
>
>     Robert Buels
>     Lead Developer
>     JBrowse - http://jbrowse.org
>
>     On 02/12/2014 02:51 PM, Andrew Warren wrote:
>
>         Ok good to know I haven't left something out completely. Most of my
>         mutations to the existing GFF parsing are encapsulated in one
>         squashed
>         commit though there were some follow on bug fixes and streamlining.
>         Since I copied all the GFF files in an initial commit then
>         changed them
>         from there its pretty easy to look at.  I'm still working on it
>         but have
>         yet to get speed comparable to the GFF parser (which is strange
>         since it
>         is not that drastic).
>         https://github.com/aswarren/__jbrowse/commit/__fe1251c50e855fb1562a716b53c770__1669c12543
>         <https://github.com/aswarren/jbrowse/commit/fe1251c50e855fb1562a716b53c7701669c12543>
>
>         Transcripts with multiple genes is the norm in bacteria/prokaryotes.
>         They use operons http://en.wikipedia.org/wiki/__Operon
>         <http://en.wikipedia.org/wiki/Operon>
>
>         Stuff a bunch of genes next to each other on the same transcript.
>
>         Thanks,
>         Andrew
>
>
>
>         On Wed, Feb 12, 2014 at 12:59 PM, Robert Buels <[hidden email]
>         <mailto:[hidden email]>
>         <mailto:[hidden email] <mailto:[hidden email]>>> wrote:
>
>              Oooh, sorry I misunderstood.  Thought you were talking
>         about adding
>              GTF support to flatfile-to-json.pl
>         <http://flatfile-to-json.pl> <http://flatfile-to-json.pl>.
>
>                But doing it on the client side is good too.
>
>              Well anyway, if you have a good understanding of the JS
>         GFF3 parser,
>              I'm sure you'll figure it out.
>
>              Are you sure that in prokaryotes a transcript can belong to
>         multiple
>              genes?  This state of things would probably cause trouble
>         with a lot
>              of tools.
>
>
>
>              Robert Buels
>              Lead Developer
>              JBrowse - http://jbrowse.org
>
>              On 02/12/2014 12:21 PM, Andrew Warren wrote:
>
>                  Ah maybe you have given me a clue to what I'm doing wrong.
>                  Devel::NYTProf is a perl profiler (which I will
>         checkout next
>                  time I get
>                  a chance).
>
>                  So far I have been concentrating on building
>                  out/replicating/modifying
>                  GTF equivalents for the following GFF parsing files
>         (which are
>                  all JS):
>                  JBrowse/Util/GFF3.js
>                  JBrowse/Store/SeqFeature/GFF/____Parser.js
>                  JBrowse/Store/SeqFeature/GFF.____js
>                  JBrowse/View/FileDialog/____TrackList/GFFDriver.js
>
>
>                  Is there some low-level Perl that gets called in the
>         course of
>                  loading a
>                  GFF from URL or local storage? Also does it look like I am
>                  missing any
>                  major files in the processing?
>
>                  Thanks!
>                  Andrew
>
>
>
>                  On Mon, Feb 10, 2014 at 7:15 PM, Robert Buels
>         <[hidden email] <mailto:[hidden email]>
>                  <mailto:[hidden email] <mailto:[hidden email]>>
>                  <mailto:[hidden email] <mailto:[hidden email]>
>         <mailto:[hidden email] <mailto:[hidden email]>>>> wrote:
>
>                       I'm not all that familiar with GTF right now, but
>         it seems
>                  like the
>                       GFF3 parser might be a lot more complex than a GTF
>         parser
>                  would need
>                       to be. It might actually be easier and faster to
>         write one
>                  from scratch.
>
>                       Also, for performance profiling, give
>         Devel::NYTProf a try
>                  if you
>                       haven't already.  It's swell.
>
>
>                       Robert Buels
>                       Lead Developer
>                       JBrowse - http://jbrowse.org
>
>
>                       On 02/10/2014 07:06 PM, Andrew Warren wrote:
>
>                           Update: I wrote to soon. I had some basic bugs
>         that
>                  needed fixing.
>
>                           The new parsing is slow however, really slow.
>         I'm not
>                  sure why
>                           at this
>                           point. I am trying to do some profiling.
>
>                           For GTF, every feature line has a gene_id and a
>                  transcript_id
>                           but there
>                           are no ids that uniquely id each feature.
>         Representing GTF
>                           features is a
>                           little different because: there is no unique
>         id per
>                  feature, no
>                           parent
>                           relation, and the only "unique IDs" are
>         gene_id and
>                           transcript_id which
>                           are only unique in aggregate -- meaning
>         multiple lines
>                  all refer
>                           to the
>                           same conceptual unique gene and transcript but
>         there is
>                  no line [or
>                           coordinates] for them specifically. Here I am
>         creating
>                  transcript
>                           features from the first 'child' feature and let
>                  'gene_ids' aka genes
>                           simply be attributes not a feature in
>         themselves. As more
>                           "children" are
>                           added to the parent transcript (through the
>                  transcript_id) I
>                           expand its
>                           coordinates.
>
>                           The reason I am leaving out genes is that in
>         eukaryotes
>                  a gene
>                           can have
>                           multiple transcripts. In prokaryotes a
>         transcript can
>                  have multiple
>                           genes. Since there is no parent relation to
>         resolve the
>                  matter, one
>                           top-level attribute needed to be chosen.
>
>
>                           On Mon, Feb 10, 2014 at 1:03 PM, Andrew Warren
>                  <[hidden email] <mailto:[hidden email]>
>         <mailto:[hidden email] <mailto:[hidden email]>>
>                           <mailto:[hidden email]
>         <mailto:[hidden email]> <mailto:[hidden email]
>         <mailto:[hidden email]>>>
>                           <mailto:[hidden email]
>         <mailto:[hidden email]> <mailto:[hidden email]
>         <mailto:[hidden email]>>
>                  <mailto:[hidden email] <mailto:[hidden email]>
>         <mailto:[hidden email] <mailto:[hidden email]>>>>> wrote:
>
>                                Hi all,
>
>                                So I have started work on trying to add a GTF
>                                parser. Since it is very similar (but
>                                annoyingly different) from the GFF format I
>                                have started off by trying to modify Robert's
>                                existing GFF parser.
>
>                                Everything seems to parse ok, and the
>                                features are created, but then when I try and
>                                click on a feature or do enough panning and
>                                zooming things seem to lock up (making it
>                                difficult to debug). I am not sure what's
>                                going on here.
>
>                                The changes I have so far can be found here
>
>         https://github.com/aswarren/jbrowse/commit/fe1251c50e855fb1562a716b53c7701669c12543
>
>                                GTF format here:
>         http://mblab.wustl.edu/GTF22.html
>
>                                I also try to account for the Cufflinks
>                                variation (with actual transcript lines):
>         http://cufflinks.cbcb.umd.edu/manual.html#gtfout
>
>                                An example of cufflinks variation:
>         http://rnaseq.pathogenportal.org/dataset/display?dataset_id=5c1eed587ff05ae7&to_ext=gtf
>
>                                Any suggestions or ideas on what could be
>                                going wrong would be greatly appreciated.
>
>                                Thanks,
>
>                                Andrew
