GBrowse is slow

GBrowse is slow

Samy Jeevan Karloss Antony
Hi there,

I have installed GBrowse. I am using Chado and PostgreSQL.

It became very slow after I loaded a ~80 MB GFF3 file, which contains gene, mRNA, CDS, 3' UTR, and 5' UTR features.

It's running on 6 CPUs with 12 GB RAM (a test server).

I have a few questions.

1. Are the CPU and RAM enough for this data size?

2. The feature table is ~1.2 GB and the whole database is 6 GB. I suspect some of my old test data was not deleted properly. Is there a way to check that, or to delete all instances of test data loaded into the database?

3. What is the actual background Postgres query when you search for a gene in the browser? I'd like to check whether my database is slow.

Thanks,
Jeevan
------------------------------------------------------------------------------
"Accelerate Dev Cycles with Automated Cross-Browser Testing - For FREE
Instantly run your Selenium tests across 300+ browser/OS combos.
Get unparalleled scalability from the best Selenium testing platform available
Simple to use. Nothing to install. Get started now for free."
http://p.sf.net/sfu/SauceLabs
_______________________________________________
Gmod-gbrowse mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse
Re: GBrowse is slow

Scott Cain
Hi Jeevan,

Is there a reason you're using Chado?  It is a much slower option for driving GBrowse than Bio::DB::SeqFeature::Store.  Even if you want "live" data for one track, I'd suggest setting up a SeqFeature::Store database for the rest of the tracks and getting just the live-data track from Chado.
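As a sketch of what that split could look like in GBrowse.conf (GBrowse 2-style configuration; the stanza names, track names, and connection strings here are made up, so adjust them to your installation):

```
# Fast SeqFeature::Store database for most tracks (names are examples)
[featurestore:database]
db_adaptor = Bio::DB::SeqFeature::Store
db_args    = -adaptor DBI::mysql
             -dsn dbi:mysql:database=gbrowse_features

# Chado database kept only for the "live" track
[chado:database]
db_adaptor = Bio::DB::Das::Chado
db_args    = -dsn dbi:Pg:dbname=chado

[Genes]
database = featurestore
feature  = gene

[LiveAnnotations]
database = chado
feature  = gene
```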

Also, using FastCGI will make a difference in speed for GBrowse regardless of which database adaptor you're using, but it makes an especially big difference for Chado: there is some start-up overhead that FastCGI caches for you, so after the first query it's noticeably faster.

That said, there are options for tuning a PostgreSQL database to get better performance, since it isn't particularly well tuned out of the box.  Have you looked at this, for instance:

https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

In the past (i.e., pre-Postgres 9), I've had a lot of success adjusting the shared_buffers parameter, but there are other tuning parameters as well.
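For illustration, the relevant postgresql.conf lines might look like the sketch below; the numbers are only rough guesses for a 12 GB machine, so use the tuning page above for real guidance:

```
# postgresql.conf -- illustrative values, not a recommendation
shared_buffers       = 2GB    # often suggested around 25% of RAM
work_mem             = 32MB   # memory per sort/hash operation
effective_cache_size = 8GB    # planner's estimate of available OS cache
```

You can see what a running server currently uses with SHOW shared_buffers; in psql.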

Now about your questions:

1. Are the CPU and RAM enough?  Probably.

2. While it's possible to delete feature data via GFF3, it is slow and not perfect.  Typically, what I do in a situation like that is identify the first feature_id of the "real" data and then delete everything below it:

  DELETE FROM feature WHERE feature_id < ?

That will cause a cascading delete, which removes not only the feature rows but all rows in linked tables as well.  That said, I'm not sure the increase in database size is out of the ordinary: several indexes get built when you load data, so they multiply the on-disk size by a fair bit (though I don't know offhand by how much).  Additionally, the number and size of the ontologies you have loaded have an effect as well.
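The mechanism behind that cascade can be demonstrated in miniature. The sketch below uses SQLite and two toy tables standing in for Chado's feature and featureloc (real Chado runs on PostgreSQL and has many more linked tables, all with ON DELETE CASCADE foreign keys):

```python
import sqlite3

# Toy, simplified schema -- illustrative only
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default
conn.execute("CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE featureloc (
        featureloc_id INTEGER PRIMARY KEY,
        feature_id INTEGER NOT NULL
            REFERENCES feature(feature_id) ON DELETE CASCADE,
        fmin INTEGER, fmax INTEGER)
""")

# feature_ids 1-2 are leftover test data; 3 is the first "real" feature
conn.executemany("INSERT INTO feature VALUES (?, ?)",
                 [(1, "test_gene"), (2, "test_mRNA"), (3, "real_gene")])
conn.executemany("INSERT INTO featureloc VALUES (?, ?, ?, ?)",
                 [(10, 1, 0, 100), (11, 2, 0, 100), (12, 3, 500, 900)])

# The delete from the post: everything below the first real feature_id
conn.execute("DELETE FROM feature WHERE feature_id < 3")

features = conn.execute("SELECT feature_id FROM feature").fetchall()
locs = conn.execute("SELECT feature_id FROM featureloc").fetchall()
print(features)  # [(3,)]
print(locs)      # the linked featureloc rows 10 and 11 are gone too: [(3,)]
```

Rows in the child table vanish without being named in the DELETE, which is exactly why this cleanup is fast but also why you want to double-check the cutoff feature_id first.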

3. The long-running query will look something like this:

    select child.feature_id, child.name, child.type_id, child.uniquename, parent.name as pname, child.is_obsolete, childloc.fmin, childloc.fmax, childloc.strand, childloc.locgroup, childloc.phase, COALESCE(af.significance, af.identity, af.normscore, af.rawscore) as score, childloc.srcfeature_id
    from feature as parent
    inner join
      feature_relationship as fr0 on
        (parent.feature_id = fr0.object_id)
    inner join
      feature as child on
        (child.feature_id = fr0.subject_id)
    inner join
      featureloc as childloc on
        (child.feature_id = childloc.feature_id)
     left join
        analysisfeature as af on
        (child.feature_id = af.feature_id)
    inner join featureloc as parentloc on (parent.feature_id = parentloc.feature_id)
    where parent.feature_id = $parent_id
          and childloc.rank = 0
          and fr0.type_id in ($partof)
          and childloc.srcfeature_id = $refclass_feature_id 
          and parentloc.srcfeature_id = $refclass_feature_id 
          and child.type_id in ($typelist)

where you'd need to substitute these values:

$partof: a comma-separated list of cvterm_ids of the part_of and derived_from terms in the relationship ontology.
$typelist: a comma-separated list of cvterm_ids of the Sequence Ontology terms to be searched for.
$parent_id: the feature_id of the parent feature (typically a gene or match feature).
$refclass_feature_id: the feature_id of the reference feature (i.e., the chromosome).

This query has to run for each feature that might have subparts, so for a given view in GBrowse, it could run a hundred times or more.  This is the main reason the Chado adaptor doesn't scale well.
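If you want to verify that the database, rather than GBrowse, is the slow part, you could time a filled-in version of that query with EXPLAIN ANALYZE in psql. The numeric IDs below are placeholders; look the real ones up in your cvterm and feature tables first:

```sql
-- Placeholders only: substitute IDs from your own database
EXPLAIN ANALYZE
SELECT child.feature_id, child.name, childloc.fmin, childloc.fmax
FROM feature AS parent
INNER JOIN feature_relationship AS fr0 ON (parent.feature_id = fr0.object_id)
INNER JOIN feature AS child ON (child.feature_id = fr0.subject_id)
INNER JOIN featureloc AS childloc ON (child.feature_id = childloc.feature_id)
WHERE parent.feature_id = 12345        -- $parent_id
  AND childloc.rank = 0
  AND fr0.type_id IN (42, 43)          -- $partof
  AND childloc.srcfeature_id = 1       -- $refclass_feature_id
  AND child.type_id IN (98, 99);       -- $typelist
```

EXPLAIN ANALYZE prints per-node timings, so a sequential scan over feature or featureloc in the output would point to a missing index or mis-tuned planner settings.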

Scott



On Tue, May 13, 2014 at 4:07 AM, Samy Jeevan Karloss Antony <[hidden email]> wrote:




--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: GBrowse is slow

Samy Jeevan Karloss Antony
Hi Scott, thanks.

I installed FastCGI.

I also moved from Chado+Postgres to MySQL. The result is amazing: it's very fast now. Thanks again.

Jeevan

On 13 May 2014, at 16:00, Scott Cain <[hidden email]> wrote:


