GFF3 Bulk Loading Time

GFF3 Bulk Loading Time

Travis Wrightsman
Hello,

I've been waiting on Tripal to bulk load a GFF3 file from MAKER (after tidying up with genometools) of approximately 510,000 lines of 32,000 genes for about 24 hours now at full CPU usage. Is this normal for a dataset of that size?

-Travis

_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal
Re: GFF3 Bulk Loading Time

Stephen Ficklin-2

Hi Travis,

Apologies for the slow reply. I'm out of the office and checking email irregularly. Yes, depending on the size of the GFF file, it may take quite a while to load. For an entire genome I think your experience is typical. The loader is slow because it must perform many checks for each entry.

Stephen


On 7/7/2016 2:20 PM, Travis Wrightsman wrote:
Hello,

I've been waiting on Tripal to bulk load a GFF3 file from MAKER (after tidying up with genometools) of approximately 510,000 lines of 32,000 genes for about 24 hours now at full CPU usage. Is this normal for a dataset of that size?

-Travis


Re: GFF3 Bulk Loading Time

Travis Wrightsman
Hey Stephen,

It took about 36 hours to get all the features loaded into Chado. Is there a way to parallelize this process across many cores to decrease the execution time? Or is Postgres the bottleneck? Ideally it could be executed faster on a compute cluster because I have one available to run the web server on.

-Travis
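
For a sense of scale, the figures quoted in this thread (roughly 510,000 GFF3 lines loaded in about 36 hours) imply an average of only a few lines per second. A quick back-of-the-envelope check in Python, using only those two approximate numbers; the function names are illustrative:

```python
# Rough throughput implied by the approximate numbers quoted in this thread.
GFF3_LINES = 510_000   # approximate line count of the MAKER GFF3 file
LOAD_HOURS = 36        # approximate wall-clock time of the load


def lines_per_second(lines: int, hours: float) -> float:
    """Average number of GFF3 lines processed per second."""
    return lines / (hours * 3600)


def seconds_per_line(lines: int, hours: float) -> float:
    """Average time spent on each GFF3 line, in seconds."""
    return (hours * 3600) / lines


rate = lines_per_second(GFF3_LINES, LOAD_HOURS)
cost = seconds_per_line(GFF3_LINES, LOAD_HOURS)
print(f"{rate:.2f} lines/s, {cost:.3f} s per line")
# prints: 3.94 lines/s, 0.254 s per line
```

At about a quarter of a second per line, the time is more likely dominated by per-entry work in the loader and database than by reading the file itself.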

On Sat, Jul 9, 2016 at 4:51 PM, Stephen Ficklin <[hidden email]> wrote:

Hi Travis,

Apologies for the slow reply. I'm out of the office and checking email irregularly. Yes, depending on the size of the GFF file, it may take quite a while to load. For an entire genome I think your experience is typical. The loader is slow because it must perform many checks for each entry.

Stephen


Re: GFF3 Bulk Loading Time

Stephen Ficklin-2

Hi Travis,

Thanks for the update!

I believe the bottleneck with parallel loading is the number of parallel transactions that the PostgreSQL server can handle at a time, which depends on table/row locking and on the hardware available to PostgreSQL (memory, disk speed, and CPUs on the database server). But in any event, making these loaders support parallel loading would help. We have not explored loading these files in parallel, and that is a great idea. I wish we could jump into looking into that but, alas, we've got some other items on the immediate agenda. I've added a feature request to our issue queue so we can remember:

https://www.drupal.org/node/2764463

Stephen
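
As a rough illustration of what a parallel-friendly input might look like, here is a minimal Python sketch that groups GFF3 feature lines by their seqid column, so that each group could in principle be handed to a separate worker. This is not something the Tripal loader supports according to this thread; the function name `split_gff3_by_seqid` and the example records are hypothetical, and the sketch assumes that parent and child features share a seqid, which is typical for MAKER output but should be verified for a given file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_gff3_by_seqid(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group GFF3 feature lines by their first column (seqid).

    Blank lines and lines starting with '#' are skipped, and parsing stops
    at an embedded '##FASTA' section. A real splitter would also need to
    copy the '##gff-version 3' header into every chunk it writes out.
    """
    chunks: Dict[str, List[str]] = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence data
        if not line or line.startswith("#"):
            continue
        seqid = line.split("\t", 1)[0]
        chunks[seqid].append(line)
    return dict(chunks)


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3",
    "chr1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr2\tmaker\tgene\t50\t400\t.\t-\t.\tID=gene2",
    "##FASTA",
    ">chr1",
    "ACGT",
]
chunks = split_gff3_by_seqid(example)
for seqid, features in chunks.items():
    print(seqid, len(features))
```

Whether loading such chunks concurrently would actually be faster depends on the locking and hardware limits described above, so this only covers the file-splitting half of the idea.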


On 7/11/2016 3:00 PM, Travis Wrightsman wrote:
Hey Stephen,

It took about 36 hours to get all the features loaded into Chado. Is there a way to parallelize this process across many cores to decrease the execution time? Or is Postgres the bottleneck? Ideally it could be executed faster on a compute cluster because I have one available to run the web server on.

-Travis

Re: GFF3 Bulk Loading Time

Sofia Robb
In reply to this post by Travis Wrightsman
Hi Travis,

I don't think there is a way to parallelize this, but I have figured out how to make the GFF load faster. I ALWAYS begin by making a dump of my db; then, when I load, I unselect the "Use a transaction" option. This will greatly improve the load speed. I went from 4 days to 24 hours.

Sofia
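
Because a load without a transaction cannot be rolled back, the database dump is the safety net. A minimal Python sketch of assembling a `pg_dump` command for that backup follows; the database name `drupal`, the host, the user, and the output file name are assumptions for illustration and need to match the actual site. The sketch only builds and prints the command rather than running it.

```python
import shlex
from typing import List


def build_pg_dump_command(dbname: str, outfile: str,
                          host: str = "localhost",
                          user: str = "postgres") -> List[str]:
    """Build a pg_dump invocation that writes a custom-format archive."""
    return [
        "pg_dump",
        f"--host={host}",
        f"--username={user}",
        "--format=custom",      # compressed archive readable by pg_restore
        f"--file={outfile}",
        dbname,
    ]


# Hypothetical database and file names.
cmd = build_pg_dump_command("drupal", "before_gff3_load.dump")
print(shlex.join(cmd))
# To run it for real: import subprocess; subprocess.run(cmd, check=True)
```

If the load fails partway through, the archive can be restored with `pg_restore` before trying again.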

On Mon, Jul 11, 2016 at 4:00 PM, Travis Wrightsman <[hidden email]> wrote:
Hey Stephen,

It took about 36 hours to get all the features loaded into Chado. Is there a way to parallelize this process across many cores to decrease the execution time? Or is Postgres the bottleneck? Ideally it could be executed faster on a compute cluster because I have one available to run the web server on.

-Travis

Re: GFF3 Bulk Loading Time

Stephen Ficklin-2

Thanks, Sofia, for your input. It seems there may be a few other tricks we can do to speed things up. We could potentially add an option to remove indexes before loading and recreate them afterwards, and to turn off foreign key constraints (we check them with the loader anyway).
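
The drop-then-recreate index idea can be sketched as follows. This is only an illustration of the pattern, using Python's built-in `sqlite3` with an in-memory database as a stand-in for PostgreSQL; the two-column `feature` table and the index name are simplified assumptions, not the real Chado schema, and the foreign key part of the suggestion is not shown.

```python
import sqlite3
from typing import Iterable, Tuple


def bulk_load_without_index(rows: Iterable[Tuple[str, str]]) -> sqlite3.Connection:
    """Load rows with the index dropped, then rebuild the index once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE feature (uniquename TEXT, type TEXT)")
    conn.execute("CREATE INDEX feature_uniquename_idx ON feature (uniquename)")

    # 1. Drop the index so it is not updated on every single insert.
    conn.execute("DROP INDEX feature_uniquename_idx")
    # 2. Bulk insert all rows.
    conn.executemany(
        "INSERT INTO feature (uniquename, type) VALUES (?, ?)", rows
    )
    # 3. Recreate the index in one pass over the loaded data.
    conn.execute("CREATE INDEX feature_uniquename_idx ON feature (uniquename)")
    conn.commit()
    return conn


rows = [(f"gene{i}", "gene") for i in range(1000)]
conn = bulk_load_without_index(rows)
count = conn.execute("SELECT COUNT(*) FROM feature").fetchone()[0]
indexes = [row[1] for row in conn.execute("PRAGMA index_list('feature')")]
print(count, indexes)
# prints: 1000 ['feature_uniquename_idx']
```

One caveat: a loader that looks up existing rows while it inserts would lose the benefit of the index during the load, so whether this helps depends on which lookups the loader performs.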


On 7/12/2016 10:56 AM, Sofia Robb wrote:
Hi Travis,

I don't think there is a way to parallelize this, but I have figured out how to make the GFF load faster. I ALWAYS begin by making a dump of my db; then, when I load, I unselect the "Use a transaction" option. This will greatly improve the load speed. I went from 4 days to 24 hours.

Sofia

Re: GFF3 Bulk Loading Time

Sofia Robb
That sounds great! Speeding up the GFF load any way possible would be a great help.

On Tue, Jul 12, 2016 at 2:16 PM, Stephen Ficklin <[hidden email]> wrote:

Thanks, Sofia, for your input. It seems there may be a few other tricks we can do to speed things up. We could potentially add an option to remove indexes before loading and recreate them afterwards, and to turn off foreign key constraints (we check them with the loader anyway).

