multi-chado schema feature request - who would be interested?

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

multi-chado schema feature request - who would be interested?

valentin.guignon
Hi Stephen and the list,

first of all, thank you for Tripal which is simply awesome! :)

Recently, we had a discussion with colleagues about how to store several
genomes within CHADO. Most people recommend 1 CHADO instance per genome.
You can achieve that in 2 ways:
1) having separate CHADO databases;
2) having separate CHADO schema inside a same database.

The second way (ie. separate schema) offers several advantages, one of
them being cross-schema querying. It would be useful for us to be able
to compare genome directly in one query. Let me give you a simple use
case...

Let's say you have 2 genomes "A" and "B", A being stored in the
"chado_a" schema and B in "chado_b". You have loaded BLAST matches for
both genomes with the genome "C". Now, if you want to get the all the
sequences of genome A and B that matched the same sequences of genome C,
you could issue the following query:

SELECT
   fma.uniquename AS "C match",
   fa.feature_id,
   fa.uniquename AS "A-C match",
   fb.feature_id,
   fb.uniquename AS "B-C match"
FROM
   chado_a.feature fa
     JOIN chado_a.featureloc fla ON fa.feature_id = fla.srcfeature_id
     JOIN chado_a.feature fma ON fla.feature_id = fma.feature_id,
   chado_b.feature fb
     JOIN chado_b.featureloc flb ON fb.feature_id = flb.srcfeature_id
     JOIN chado_b.feature fmb ON flb.feature_id = fmb.feature_id
WHERE
   fma.uniquename = fmb.uniquename
;

I didn't try that query and of course the one I could use for real would
be more complicated (ie. with is_obsolete, type_id, etc.) but that's
just to give you a simple idea of what it could look-like. And that's a
very simple use case but you can imagine many others where you'd like to
be able to compare multiple genome data.

If you store your genomes in several databases (ie. "way 1)") then, you
would have to "programmatically" compare many many records of each
database against each other. If your database is stored on another
server than the web one, then all those records would have to be
transferred from the database server to the web server while if you used
the "way 2)", only relevant records would have been transferred.




So, here is my proposal/feature request to achieve that with Tripal:

"there should be a field in Tripal setup that allows us to specify the
name of the (main) CHADO schema that is used by current tripal instance.
By default (as currently), it would be 'chado'."




It doesn't sound very complicated since it would basically mean to
replace the hard-coded 'chado' string in the source code by a global
"$main_chado_schema" variable... but I know it usually is more
complicated than it seems! ;-)

That would be the first step. Then you can imagine many other
improvements on Tripal interface and API that would deal with multiple
CHADO instances on a same site but so far, we just need to be able to
specify the name of the chado schema to use. Then we can setup multiple
Drupal instances in a same database in the same public schema using the
native Drupal feature that allows us to prefix Drupal tables. We don't
need Tripal to do the cross-schema querying (but I think it could do it
already though...) but we just need to have all our CHADO instances in a
same database in order to run our other analysis tools (and we don't
want to maintain several copies since each copy could evolve on its side
and that would mean sync issues...).

To sum up, 1 postgresql database, 2 genomes A and B stored in 2 distinct
shcema chado_a and chado_b, 1 public schema for 2 drupal/tripal website
instances, one using the 'a_' table prefix and the other one using the
'b_' table prefix --> 2 distinct websites, one for genome A with default
tripal chado schema name option set to "chado_a" and another one for
genome B with the default tripal chado schema name option set to
"chado_b". Cross-schema querying achieved outside Drupal/Tripal (but
could be, in a far future, integrated to Tripal ;-) of course!).



-Who would be interested by such a feature?
-Stephen, how hard/long do you think it would be to implement that?
-Who would implement that?*
-Should I post this feature request on the Drupal project site?

Please give you opinions! Regards,


    Valentin Guignon



*: I could do it but unfortunately not for 2014 and we need this feature
ASAP for a 3rd year PhD student project... :-s

------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal
Reply | Threaded
Open this post in threaded view
|

Re: multi-chado schema feature request - who would be interested?

Kucheran, Lacey Sanderson
Hi Valentin,

First off, thanks for the compliment on Tripal :)

I'm curious as to why you don't feel storing all these genomes within the same chado instance will work for you. I myself work on a Tripal site that houses 4+ legume species including full genomes for 3 of them and I know of a Monsanto group that has 100+ organisms in a single chado (although they're not using Tripal).  Not only does storing your data this way allow Tripal to work for you out-of-the-box but it also allows you to easily store relationships between the genomes using chado (i.e. feature_relationship to link homologous genes) and it can greatly simplify your queries resulting in faster load times. Even if you have multiple genomes for the same genus/species you can always differentiate the genomes by attaching each to an analysis record stating the differences. Furthermore, by adding a single non-standard chado table (feature_stock) you could link each feature record to the specific germplasm that was sequenced. Anyway, that's how I'm handling this particular problem ;-)

As far as extending Tripal to handle multiple chado instances, I think this would take more changes than you think since we have an entire API (Chado Query API) that interacts with chado and is used extensively throughout Tripal that would need quite a few changes. Although this depends on how much interaction you want between the chado instances.

If even after my convincing argument above, you still feel you will need multiple chado instances to handle your data, I will think a little more in-depth on how one might accomplish such a thing. That said, it's going to take a lot more work than changing the built-in "chado" schema name to not loose a lot of the built-in Tripal functionality and to tell you the truth I'm not sure it's even possible without a major refactoring of code :( 

Just looking at one part of Tripal: pages for chado content. You would need to add a schema column to each chado_ table and update the sync'ing code to allow specifying of the schema. You would also need to change chado_generate_var(), chado_query() and really the whole Chado Query API to accept schema specification and change the templates for the pages so they know to look for the schema when retrieving the content.

Furthermore, I know for a fact that Drupal Views would not handle this gracefully (it was never meant for more than the single Drupal schema and a patch, as well as, an entire Tripal module has been needed to get it to handle a single second schema… More than that sounds like crazy talk :p And unfortunately a lot of Tripal functionality including administration and searching is now tied to Drupal Views to allow us to give you the kind of flexibility that we've been able to. I'm not saying its impossible but it's certainly non-trivial even with Tripal Views which will at least give you a single (large) place to make the changes.

Another, potentially easier way to go the multiple chado route would be to make a Tripal site per genome and then use web-services and links to make connections between them. You could actually do this while still having them in the same postgresql database by just pointing each site to the same database and using Drupal's built-in table prefixing with a different prefix per site -then we'd just need to change the hard-coded name of the schema which wouldn't be too much work and should likely be done anyway. Tripal doesn't yet have it's own web services although it will in the future (not soon enough for your PhD student though :s) but the built-in Drupal web services might be enough…

Anyway, please consider my proposal of using a single chado instance since it would greatly simplify things on the Tripal end and I think it would provide you with more functionality in the long run…

~Lacey

PS. I'd be very interested to hear what organisms you're working on and if the "multiple genomes" are of the same species since we are going to need to handle this in the near future… Chickpea already has 2 genomes for the same species just different market classes and although we are currently only housing one, I would love to host both and compare!)

PSS. I would be better able to quantify the number of changes needed if you could elaborate a little more on how much interaction you need between chado instances. Additionally, are you going to need to make pages for data from different schema? Views Data listings that span chado instances?

PSSS. I'm the other main developer of Tripal and have done most of the work with Drupal Views integration -just for informations sake :)

------------------------------------------------------
Lacey-Anne Sanderson
Bioinformaticist
Pulse Crop Breeding and Genetics
Phone: (306) 966-3208
Room 2C33 Agriculture
Department of Plant Sciences
University of Saskatchewan

On Sep 17, 2014, at 8:52 AM, valentin.guignon <[hidden email]> wrote:

Hi Stephen and the list,

first of all, thank you for Tripal which is simply awesome! :)

Recently, we had a discussion with colleagues about how to store several
genomes within CHADO. Most people recommend 1 CHADO instance per genome.
You can achieve that in 2 ways:
1) having separate CHADO databases;
2) having separate CHADO schema inside a same database.

The second way (ie. separate schema) offers several advantages, one of
them being cross-schema querying. It would be useful for us to be able
to compare genome directly in one query. Let me give you a simple use
case...

Let's say you have 2 genomes "A" and "B", A being stored in the
"chado_a" schema and B in "chado_b". You have loaded BLAST matches for
both genomes with the genome "C". Now, if you want to get the all the
sequences of genome A and B that matched the same sequences of genome C,
you could issue the following query:

SELECT
  fma.uniquename AS "C match",
  fa.feature_id,
  fa.uniquename AS "A-C match",
  fb.feature_id,
  fb.uniquename AS "B-C match"
FROM
  chado_a.feature fa
    JOIN chado_a.featureloc fla ON fa.feature_id = fla.srcfeature_id
    JOIN chado_a.feature fma ON fla.feature_id = fma.feature_id,
  chado_b.feature fb
    JOIN chado_b.featureloc flb ON fb.feature_id = flb.srcfeature_id
    JOIN chado_b.feature fmb ON flb.feature_id = fmb.feature_id
WHERE
  fma.uniquename = fmb.uniquename
;

I didn't try that query and of course the one I could use for real would
be more complicated (ie. with is_obsolete, type_id, etc.) but that's
just to give you a simple idea of what it could look-like. And that's a
very simple use case but you can imagine many others where you'd like to
be able to compare multiple genome data.

If you store your genomes in several databases (ie. "way 1)") then, you
would have to "programmatically" compare many many records of each
database against each other. If your database is stored on another
server than the web one, then all those records would have to be
transferred from the database server to the web server while if you used
the "way 2)", only relevant records would have been transferred.




So, here is my proposal/feature request to achieve that with Tripal:

"there should be a field in Tripal setup that allows us to specify the
name of the (main) CHADO schema that is used by current tripal instance.
By default (as currently), it would be 'chado'."




It doesn't sound very complicated since it would basically mean to
replace the hard-coded 'chado' string in the source code by a global
"$main_chado_schema" variable... but I know it usually is more
complicated than it seems! ;-)

That would be the first step. Then you can imagine many other
improvements on Tripal interface and API that would deal with multiple
CHADO instances on a same site but so far, we just need to be able to
specify the name of the chado schema to use. Then we can setup multiple
Drupal instances in a same database in the same public schema using the
native Drupal feature that allows us to prefix Drupal tables. We don't
need Tripal to do the cross-schema querying (but I think it could do it
already though...) but we just need to have all our CHADO instances in a
same database in order to run our other analysis tools (and we don't
want to maintain several copies since each copy could evolve on its side
and that would mean sync issues...).

To sum up, 1 postgresql database, 2 genomes A and B stored in 2 distinct
shcema chado_a and chado_b, 1 public schema for 2 drupal/tripal website
instances, one using the 'a_' table prefix and the other one using the
'b_' table prefix --> 2 distinct websites, one for genome A with default
tripal chado schema name option set to "chado_a" and another one for
genome B with the default tripal chado schema name option set to
"chado_b". Cross-schema querying achieved outside Drupal/Tripal (but
could be, in a far future, integrated to Tripal ;-) of course!).



-Who would be interested by such a feature?
-Stephen, how hard/long do you think it would be to implement that?
-Who would implement that?*
-Should I post this feature request on the Drupal project site?

Please give you opinions! Regards,


   Valentin Guignon



*: I could do it but unfortunately not for 2014 and we need this feature
ASAP for a 3rd year PhD student project... :-s

------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal


------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal
Reply | Threaded
Open this post in threaded view
|

Re: multi-chado schema feature request - who would be interested?

Mara Kim
I agree with Lacey about using multiple schemas.  It creates a lot of problems, and doesn't scale well.  Here at Vanderbilt we have a single Chado instance which contains over 200 eukaryotic genomes, and performance is still fine.  This is done by using the organism table to identify each genome separately.  As for the problem of multiple genomes for a single organism, I have modified the organism table to include a column for strain and version, which roughly correspond to biological and computational replicates respectively (ie. different samples and different assemblies).  Currently these columns will not show up in the Tripal organisms view (although supposedly this should be an easy fix with custom views).  I am hoping that this gains more traction and makes it into the standard Chado schema.

To modify the organism table in an existing Chado instance:

SET search_path TO 'chado';
ALTER TABLE organism
  ADD COLUMN strain character varying(255) DEFAULT ''::character varying,
  ADD COLUMN version character varying(255) DEFAULT ''::character varying,
  DROP CONSTRAINT organism_c1;
ALTER TABLE organism ADD CONSTRAINT organism_c1 UNIQUE (genus, species, strain, version);

The resulting organism table will look like this:

                                         Table "chado.organism"
    Column    |          Type          |                           Modifiers                            
--------------+------------------------+----------------------------------------------------------------
 organism_id  | integer                | not null default nextval('organism_organism_id_seq'::regclass)
 abbreviation | character varying(255) | 
 genus        | character varying(255) | not null
 species      | character varying(255) | not null
 common_name  | character varying(255) | 
 comment      | text                   | 
 strain       | character varying(255) | default ''::character varying
 version      | character varying(255) | default ''::character varying
Indexes:
    "organism_pkey" PRIMARY KEY, btree (organism_id)
    "organism_c1" UNIQUE CONSTRAINT, btree (genus, species, strain, version)


The nice thing about this design is that PostgreSQL supports rudimentary table partitioning (http://www.postgresql.org/docs/current/static/ddl-partitioning.html) and eventually automatic partitioning (https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress).  That will mean that you could partition the feature table by organism_id and get the benefits of separate tables for each organism without breaking current Chado semantics.


On Wed, Sep 17, 2014 at 12:30 PM, Lacey-Anne Sanderson <[hidden email]> wrote:
Hi Valentin,

First off, thanks for the compliment on Tripal :)

I'm curious as to why you don't feel storing all these genomes within the same chado instance will work for you. I myself work on a Tripal site that houses 4+ legume species including full genomes for 3 of them and I know of a Monsanto group that has 100+ organisms in a single chado (although they're not using Tripal).  Not only does storing your data this way allow Tripal to work for you out-of-the-box but it also allows you to easily store relationships between the genomes using chado (i.e. feature_relationship to link homologous genes) and it can greatly simplify your queries resulting in faster load times. Even if you have multiple genomes for the same genus/species you can always differentiate the genomes by attaching each to an analysis record stating the differences. Furthermore, by adding a single non-standard chado table (feature_stock) you could link each feature record to the specific germplasm that was sequenced. Anyway, that's how I'm handling this particular problem ;-)

As far as extending Tripal to handle multiple chado instances, I think this would take more changes than you think since we have an entire API (Chado Query API) that interacts with chado and is used extensively throughout Tripal that would need quite a few changes. Although this depends on how much interaction you want between the chado instances.

If even after my convincing argument above, you still feel you will need multiple chado instances to handle your data, I will think a little more in-depth on how one might accomplish such a thing. That said, it's going to take a lot more work than changing the built-in "chado" schema name to not loose a lot of the built-in Tripal functionality and to tell you the truth I'm not sure it's even possible without a major refactoring of code :( 

Just looking at one part of Tripal: pages for chado content. You would need to add a schema column to each chado_ table and update the sync'ing code to allow specifying of the schema. You would also need to change chado_generate_var(), chado_query() and really the whole Chado Query API to accept schema specification and change the templates for the pages so they know to look for the schema when retrieving the content.

Furthermore, I know for a fact that Drupal Views would not handle this gracefully (it was never meant for more than the single Drupal schema and a patch, as well as, an entire Tripal module has been needed to get it to handle a single second schema… More than that sounds like crazy talk :p And unfortunately a lot of Tripal functionality including administration and searching is now tied to Drupal Views to allow us to give you the kind of flexibility that we've been able to. I'm not saying its impossible but it's certainly non-trivial even with Tripal Views which will at least give you a single (large) place to make the changes.

Another, potentially easier way to go the multiple chado route would be to make a Tripal site per genome and then use web-services and links to make connections between them. You could actually do this while still having them in the same postgresql database by just pointing each site to the same database and using Drupal's built-in table prefixing with a different prefix per site -then we'd just need to change the hard-coded name of the schema which wouldn't be too much work and should likely be done anyway. Tripal doesn't yet have it's own web services although it will in the future (not soon enough for your PhD student though :s) but the built-in Drupal web services might be enough…

Anyway, please consider my proposal of using a single chado instance since it would greatly simplify things on the Tripal end and I think it would provide you with more functionality in the long run…

~Lacey

PS. I'd be very interested to hear what organisms you're working on and if the "multiple genomes" are of the same species since we are going to need to handle this in the near future… Chickpea already has 2 genomes for the same species just different market classes and although we are currently only housing one, I would love to host both and compare!)

PSS. I would be better able to quantify the number of changes needed if you could elaborate a little more on how much interaction you need between chado instances. Additionally, are you going to need to make pages for data from different schema? Views Data listings that span chado instances?

PSSS. I'm the other main developer of Tripal and have done most of the work with Drupal Views integration -just for informations sake :)

------------------------------------------------------
Lacey-Anne Sanderson
Bioinformaticist
Pulse Crop Breeding and Genetics
Phone: <a href="tel:%28306%29%20966-3208" value="+13069663208" target="_blank">(306) 966-3208
Room 2C33 Agriculture
Department of Plant Sciences
University of Saskatchewan

On Sep 17, 2014, at 8:52 AM, valentin.guignon <[hidden email]> wrote:

Hi Stephen and the list,

first of all, thank you for Tripal which is simply awesome! :)

Recently, we had a discussion with colleagues about how to store several
genomes within CHADO. Most people recommend 1 CHADO instance per genome.
You can achieve that in 2 ways:
1) having separate CHADO databases;
2) having separate CHADO schema inside a same database.

The second way (ie. separate schema) offers several advantages, one of
them being cross-schema querying. It would be useful for us to be able
to compare genome directly in one query. Let me give you a simple use
case...

Let's say you have 2 genomes "A" and "B", A being stored in the
"chado_a" schema and B in "chado_b". You have loaded BLAST matches for
both genomes with the genome "C". Now, if you want to get the all the
sequences of genome A and B that matched the same sequences of genome C,
you could issue the following query:

SELECT
  fma.uniquename AS "C match",
  fa.feature_id,
  fa.uniquename AS "A-C match",
  fb.feature_id,
  fb.uniquename AS "B-C match"
FROM
  chado_a.feature fa
    JOIN chado_a.featureloc fla ON fa.feature_id = fla.srcfeature_id
    JOIN chado_a.feature fma ON fla.feature_id = fma.feature_id,
  chado_b.feature fb
    JOIN chado_b.featureloc flb ON fb.feature_id = flb.srcfeature_id
    JOIN chado_b.feature fmb ON flb.feature_id = fmb.feature_id
WHERE
  fma.uniquename = fmb.uniquename
;

I didn't try that query and of course the one I could use for real would
be more complicated (ie. with is_obsolete, type_id, etc.) but that's
just to give you a simple idea of what it could look-like. And that's a
very simple use case but you can imagine many others where you'd like to
be able to compare multiple genome data.

If you store your genomes in several databases (ie. "way 1)") then, you
would have to "programmatically" compare many many records of each
database against each other. If your database is stored on another
server than the web one, then all those records would have to be
transferred from the database server to the web server while if you used
the "way 2)", only relevant records would have been transferred.




So, here is my proposal/feature request to achieve that with Tripal:

"there should be a field in Tripal setup that allows us to specify the
name of the (main) CHADO schema that is used by current tripal instance.
By default (as currently), it would be 'chado'."




It doesn't sound very complicated since it would basically mean to
replace the hard-coded 'chado' string in the source code by a global
"$main_chado_schema" variable... but I know it usually is more
complicated than it seems! ;-)

That would be the first step. Then you can imagine many other
improvements on Tripal interface and API that would deal with multiple
CHADO instances on a same site but so far, we just need to be able to
specify the name of the chado schema to use. Then we can setup multiple
Drupal instances in a same database in the same public schema using the
native Drupal feature that allows us to prefix Drupal tables. We don't
need Tripal to do the cross-schema querying (but I think it could do it
already though...) but we just need to have all our CHADO instances in a
same database in order to run our other analysis tools (and we don't
want to maintain several copies since each copy could evolve on its side
and that would mean sync issues...).

To sum up, 1 postgresql database, 2 genomes A and B stored in 2 distinct
shcema chado_a and chado_b, 1 public schema for 2 drupal/tripal website
instances, one using the 'a_' table prefix and the other one using the
'b_' table prefix --> 2 distinct websites, one for genome A with default
tripal chado schema name option set to "chado_a" and another one for
genome B with the default tripal chado schema name option set to
"chado_b". Cross-schema querying achieved outside Drupal/Tripal (but
could be, in a far future, integrated to Tripal ;-) of course!).



-Who would be interested by such a feature?
-Stephen, how hard/long do you think it would be to implement that?
-Who would implement that?*
-Should I post this feature request on the Drupal project site?

Please give you opinions! Regards,


   Valentin Guignon



*: I could do it but unfortunately not for 2014 and we need this feature
ASAP for a 3rd year PhD student project... :-s

------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal


------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal




--
M

------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal
Reply | Threaded
Open this post in threaded view
|

Re: multi-chado schema feature request - who would be interested?

Mara Kim-2
In reply to this post by Kucheran, Lacey Sanderson
I agree with Lacey about using multiple schemas.  It creates a lot of problems, and doesn't scale well.  Here at Vanderbilt we have a single Chado instance which contains over 200 eukaryotic genomes, and performance is still fine.  This is done by using the organism table to identify each genome separately.  As for the problem of multiple genomes for a single organism, I have modified the organism table to include a column for strain and version, which roughly correspond to biological and computational replicates respectively (ie. different samples and different assemblies).  Currently these columns will not show up in the Tripal organisms view (although supposedly this should be an easy fix with custom views).  I am hoping that this gains more traction and makes it into the standard Chado schema.

To modify the organism table in an existing Chado instance:

SET search_path TO 'chado';
ALTER TABLE organism
  ADD COLUMN strain character varying(255) DEFAULT ''::character varying,
  ADD COLUMN version character varying(255) DEFAULT ''::character varying,
  DROP CONSTRAINT organism_c1;
ALTER TABLE organism ADD CONSTRAINT organism_c1 UNIQUE (genus, species, strain, version);

The resulting organism table will look like this:

                                         Table "chado.organism"
    Column    |          Type          |                           Modifiers                            
--------------+------------------------+----------------------------------------------------------------
 organism_id  | integer                | not null default nextval('organism_organism_id_seq'::regclass)
 abbreviation | character varying(255) | 
 genus        | character varying(255) | not null
 species      | character varying(255) | not null
 common_name  | character varying(255) | 
 comment      | text                   | 
 strain       | character varying(255) | default ''::character varying
 version      | character varying(255) | default ''::character varying
Indexes:
    "organism_pkey" PRIMARY KEY, btree (organism_id)
    "organism_c1" UNIQUE CONSTRAINT, btree (genus, species, strain, version)


The nice thing about this design is that PostgreSQL supports rudimentary table partitioning (http://www.postgresql.org/docs/current/static/ddl-partitioning.html) and eventually automatic partitioning (https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress).  That will mean that you could partition the feature table by organism_id and get the benefits of separate tables for each organism without breaking current Chado semantics.


On Wed, Sep 17, 2014 at 12:30 PM, Lacey-Anne Sanderson <[hidden email]> wrote:
Hi Valentin,

First off, thanks for the compliment on Tripal :)

I'm curious as to why you don't feel storing all these genomes within the same chado instance will work for you. I myself work on a Tripal site that houses 4+ legume species including full genomes for 3 of them and I know of a Monsanto group that has 100+ organisms in a single chado (although they're not using Tripal).  Not only does storing your data this way allow Tripal to work for you out-of-the-box but it also allows you to easily store relationships between the genomes using chado (i.e. feature_relationship to link homologous genes) and it can greatly simplify your queries resulting in faster load times. Even if you have multiple genomes for the same genus/species you can always differentiate the genomes by attaching each to an analysis record stating the differences. Furthermore, by adding a single non-standard chado table (feature_stock) you could link each feature record to the specific germplasm that was sequenced. Anyway, that's how I'm handling this particular problem ;-)

As far as extending Tripal to handle multiple chado instances, I think this would take more changes than you think since we have an entire API (Chado Query API) that interacts with chado and is used extensively throughout Tripal that would need quite a few changes. Although this depends on how much interaction you want between the chado instances.

If even after my convincing argument above, you still feel you will need multiple chado instances to handle your data, I will think a little more in-depth on how one might accomplish such a thing. That said, it's going to take a lot more work than changing the built-in "chado" schema name to not loose a lot of the built-in Tripal functionality and to tell you the truth I'm not sure it's even possible without a major refactoring of code :( 

Just looking at one part of Tripal: pages for chado content. You would need to add a schema column to each chado_ table and update the sync'ing code to allow specifying of the schema. You would also need to change chado_generate_var(), chado_query() and really the whole Chado Query API to accept schema specification and change the templates for the pages so they know to look for the schema when retrieving the content.

Furthermore, I know for a fact that Drupal Views would not handle this gracefully (it was never meant for more than the single Drupal schema and a patch, as well as, an entire Tripal module has been needed to get it to handle a single second schema… More than that sounds like crazy talk :p And unfortunately a lot of Tripal functionality including administration and searching is now tied to Drupal Views to allow us to give you the kind of flexibility that we've been able to. I'm not saying its impossible but it's certainly non-trivial even with Tripal Views which will at least give you a single (large) place to make the changes.

Another, potentially easier way to go the multiple chado route would be to make a Tripal site per genome and then use web-services and links to make connections between them. You could actually do this while still having them in the same postgresql database by just pointing each site to the same database and using Drupal's built-in table prefixing with a different prefix per site -then we'd just need to change the hard-coded name of the schema which wouldn't be too much work and should likely be done anyway. Tripal doesn't yet have it's own web services although it will in the future (not soon enough for your PhD student though :s) but the built-in Drupal web services might be enough…

Anyway, please consider my proposal of using a single chado instance since it would greatly simplify things on the Tripal end and I think it would provide you with more functionality in the long run…

~Lacey

PS. I'd be very interested to hear what organisms you're working on and if the "multiple genomes" are of the same species since we are going to need to handle this in the near future… Chickpea already has 2 genomes for the same species just different market classes and although we are currently only housing one, I would love to host both and compare!)

PSS. I would be better able to quantify the number of changes needed if you could elaborate a little more on how much interaction you need between chado instances. Additionally, are you going to need to make pages for data from different schema? Views Data listings that span chado instances?

PSSS. I'm the other main developer of Tripal and have done most of the work with Drupal Views integration -just for informations sake :)

------------------------------------------------------
Lacey-Anne Sanderson
Bioinformaticist
Pulse Crop Breeding and Genetics
Phone: <a href="tel:%28306%29%20966-3208" value="+13069663208" target="_blank">(306) 966-3208
Room 2C33 Agriculture
Department of Plant Sciences
University of Saskatchewan

On Sep 17, 2014, at 8:52 AM, valentin.guignon <[hidden email]> wrote:

Hi Stephen and the list,

first of all, thank you for Tripal which is simply awesome! :)

Recently, we had a discussion with colleagues about how to store several
genomes within CHADO. Most people recommend 1 CHADO instance per genome.
You can achieve that in 2 ways:
1) having separate CHADO databases;
2) having separate CHADO schema inside a same database.

The second way (ie. separate schema) offers several advantages, one of
them being cross-schema querying. It would be useful for us to be able
to compare genome directly in one query. Let me give you a simple use
case...

Let's say you have 2 genomes "A" and "B", A being stored in the
"chado_a" schema and B in "chado_b". You have loaded BLAST matches for
both genomes with the genome "C". Now, if you want to get the all the
sequences of genome A and B that matched the same sequences of genome C,
you could issue the following query:

SELECT
  fma.uniquename AS "C match",
  fa.feature_id,
  fa.uniquename AS "A-C match",
  fb.feature_id,
  fb.uniquename AS "B-C match"
FROM
  chado_a.feature fa
    JOIN chado_a.featureloc fla ON fa.feature_id = fla.srcfeature_id
    JOIN chado_a.feature fma ON fla.feature_id = fma.feature_id,
  chado_b.feature fb
    JOIN chado_b.featureloc flb ON fb.feature_id = flb.srcfeature_id
    JOIN chado_b.feature fmb ON flb.feature_id = fmb.feature_id
WHERE
  fma.uniquename = fmb.uniquename
;

I didn't try that query and of course the one I could use for real would
be more complicated (ie. with is_obsolete, type_id, etc.) but that's
just to give you a simple idea of what it could look-like. And that's a
very simple use case but you can imagine many others where you'd like to
be able to compare multiple genome data.

If you store your genomes in several databases (ie. "way 1)") then, you
would have to "programmatically" compare many many records of each
database against each other. If your database is stored on another
server than the web one, then all those records would have to be
transferred from the database server to the web server while if you used
the "way 2)", only relevant records would have been transferred.




So, here is my proposal/feature request to achieve that with Tripal:

"there should be a field in Tripal setup that allows us to specify the
name of the (main) CHADO schema that is used by current tripal instance.
By default (as currently), it would be 'chado'."




It doesn't sound very complicated since it would basically mean to
replace the hard-coded 'chado' string in the source code by a global
"$main_chado_schema" variable... but I know it usually is more
complicated than it seems! ;-)

That would be the first step. Then you can imagine many other
improvements on Tripal interface and API that would deal with multiple
CHADO instances on a same site but so far, we just need to be able to
specify the name of the chado schema to use. Then we can setup multiple
Drupal instances in a same database in the same public schema using the
native Drupal feature that allows us to prefix Drupal tables. We don't
need Tripal to do the cross-schema querying (but I think it could do it
already though...) but we just need to have all our CHADO instances in a
same database in order to run our other analysis tools (and we don't
want to maintain several copies since each copy could evolve on its side
and that would mean sync issues...).

To sum up, 1 postgresql database, 2 genomes A and B stored in 2 distinct
shcema chado_a and chado_b, 1 public schema for 2 drupal/tripal website
instances, one using the 'a_' table prefix and the other one using the
'b_' table prefix --> 2 distinct websites, one for genome A with default
tripal chado schema name option set to "chado_a" and another one for
genome B with the default tripal chado schema name option set to
"chado_b". Cross-schema querying achieved outside Drupal/Tripal (but
could be, in a far future, integrated to Tripal ;-) of course!).



-Who would be interested by such a feature?
-Stephen, how hard/long do you think it would be to implement that?
-Who would implement that?*
-Should I post this feature request on the Drupal project site?

Please give you opinions! Regards,


   Valentin Guignon



*: I could do it but unfortunately not for 2014 and we need this feature
ASAP for a 3rd year PhD student project... :-s

------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal


------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal




--
Mara Kim

Ph.D. Candidate
Computational Biology
Vanderbilt University
Nashville, TN

------------------------------------------------------------------------------
Want excitement?
Manually upgrade your production database.
When you want reliability, choose Perforce
Perforce version control. Predictably reliable.
http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal
Reply | Threaded
Open this post in threaded view
|

Re: multi-chado schema feature request - who would be interested?

valentin.guignon
Hi!

Thank you for your interesting answers and sorry for my delayed answer.
Be fore answering, we had another discussion with my colleagues
regarding your points.

To answer your first question about why we use separate instances, there
a several reasons. I'd like to point out that having multiple instances
of CHADO or just a single one is a real (separate) debate and I don't
think tripal list is the place for that. However, I understand that's
still a key question before asking the feature I described.

Here are our reasons and their associated points:
1) performances: as you pointed out, in our case (3 complete genomes of
banana, cocoa and coffee and maybe some more in a near future), that
might not be a good reason. We heard from other partners that we could
have performance issue and didn't take the time to really check that. We
also don't have any postgresql config expert and didn't know about the
table partitioning. Thanks for the trick, we are currently having a look
into that! ;)

2) historical reasons: each instance belongs to a specific project lead
by different people and sometimes, different institutes or partners. For
instance, the banana genome is managed by CIRAD while the coffee genome
is managed by IRD. We are a bioinformatics platform and collaborate so
we have all our databases stored on a common server but in fact, they
belong to different people. (nb.: I'm neither CIRAD nor IRD, I'm
Bioversity International! ...and I just give a hand :-p )

3) security/privacy reasons: before the banana genome was published we
had the cocoa genome published. Same kind of story with the coffee
genome. We can't have a public database hosting private data (nb. in
fact, we could but see my next point).
We also have manual annotators on both instances and we don't want the
coffee annotators to interfere with the banana annotators for instance.

4) CHADO Controller config: a few years ago, I create a CHADO module
called CHADO Controller that enables 3 main features: annotation history
(based on a slightly modified version of CHADO audit module), annotation
inspector (reports mistakes to the annotators), and feature access
restriction. Depending on the genomes, the needs were different and then
CHADO Controller had to be installed differently on each instance and
that couldn't be technically achieved on a single CHADO instance.

5) Maintenance: as I said before, each instance may be managed by
different people. When updating a genome data, we don't want to break
another genome at the same time. Of course, there are many ways to avoid
that (for instance, a nice development environment with
"dev-staging-prod" databases and sites, unit tests and such) but
unfortunately, "we" don't have that here in our "real" world. :-/
(...and it's really hard to change some people behaviors and bad habits...)

I may have forgotten some other few reasons explaining why we have
several CHADO instances in separate databases. Your feedback gave us the
opportunity to think again about our system and how it will evolve but
we don't think we will fusion our CHADO instances in 1.


Now, next question is "why we would need to aggregate several CHADO
schema in one database?". The original purpose of that was the project
of our PhD student. Again, we discussed the needs and came to a new
solution for her: no CHADO database at all. :) That eliminates the
problem. Instead, she will work with JSON files generated from GFF3
files and used by JBrowse and the REST API. I'd just like to add that I
am not supervising that student and I'm just consulted from times to
times for some technical solutions and advices...


However, we still think that this feature (ie. being able to choose the
name of the schema) would be a good and useful feature for Tripal. The
only thing that changed is that it's not a priority and the use cases
are different from the one I explained.


And some thoughts about Lacey-Anne anwser:

 >     Just looking at one part of Tripal: pages for chado content. You
 >     would need to add a schema column to each chado_ table and update
 >     the sync'ing code to allow specifying of the schema.

Couldn't it be simply managed using "SET search_path to ...;" (just like
in tripal_core.chado_general.api.inc in chado_set_active() function)
since there shouldn't be table name collisions between CHADO and Drupal?
(...well, there could be, I agree)

 >     You would also
 >     need to change chado_generate_var(), chado_query() and really the
 >     whole Chado Query API to accept schema specification and change
 >     the templates for the pages so they know to look for the schema
 >     when retrieving the content.
 > (...)
 >     are you going to need to
 >     make pages for data from different schema? Views Data listings
 >     that span chado instances?

For starter, I don't think I'd need those. If I needed to do
cross-schema queries, I would do it with my own PHP code using the
Tripal CHADO API (chado_query) and I would prefix myself the tables with
the appropriate schema. But anyway, I should clearly state one or more
new use cases which would show real useful use cases that we could
discuss. So far, I have some in mind but for sure, those could be solved
without that feature. The feature would just be a more convenient way
that renaming schema, using SQL dumps or...

To conclude: we are at least 2 people (! :-p ) definitely convinced that
this feature would be very useful but we have to provide you at least a
complete use case as a basis of discussion. So far, this feature is not
as "urgent" as we though at first glance. ;-)

Thank you again for your pertinent comments and advices. We'll try to
write some use cases when time (/priorities...) will allow us to do so. ;)
Regards,


    Val



On 17/09/2014 20:06, Mara Kim wrote:

> I agree with Lacey about using multiple schemas.  It creates a lot of
> problems, and doesn't scale well.  Here at Vanderbilt we have a single
> Chado instance which contains over 200 eukaryotic genomes, and
> performance is still fine.  This is done by using the organism table to
> identify each genome separately.  As for the problem of multiple genomes
> for a single organism, I have modified the organism table to include a
> column for strain and version, which roughly correspond to biological
> and computational replicates respectively (ie. different samples and
> different assemblies).  Currently these columns will not show up in the
> Tripal organisms view (although supposedly this should be an easy fix
> with custom views).  I am hoping that this gains more traction and makes
> it into the standard Chado schema.
>
> To modify the organism table in an existing Chado instance:
>
> SET search_path TO 'chado';
> ALTER TABLE organism
>    ADD COLUMN strain character varying(255) DEFAULT ''::character varying,
>    ADD COLUMN version character varying(255) DEFAULT ''::character varying,
>    DROP CONSTRAINT organism_c1;
> ALTER TABLE organism ADD CONSTRAINT organism_c1 UNIQUE (genus, species,
> strain, version);
>
> The resulting organism table will look like this:
>
>                                           Table "chado.organism"
>      Column    |          Type          |
> Modifiers
> --------------+------------------------+----------------------------------------------------------------
>   organism_id  | integer                | not null default
> nextval('organism_organism_id_seq'::regclass)
>   abbreviation | character varying(255) |
>   genus        | character varying(255) | not null
>   species      | character varying(255) | not null
>   common_name  | character varying(255) |
>   comment      | text                   |
>   strain       | character varying(255) | default ''::character varying
>   version      | character varying(255) | default ''::character varying
> Indexes:
>      "organism_pkey" PRIMARY KEY, btree (organism_id)
>      "organism_c1" UNIQUE CONSTRAINT, btree (genus, species, strain,
> version)
>
>
> The nice thing about this design is that PostgreSQL supports rudimentary
> table partitioning
> (http://www.postgresql.org/docs/current/static/ddl-partitioning.html)
> and eventually automatic partitioning
> (https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress).
>   That will mean that you could partition the feature table by
> organism_id and get the benefits of separate tables for each organism
> without breaking current Chado semantics.
>
>
> On Wed, Sep 17, 2014 at 12:30 PM, Lacey-Anne Sanderson
> <[hidden email] <mailto:[hidden email]>> wrote:
>
>     Hi Valentin,
>
>     First off, thanks for the compliment on Tripal :)
>
>     I'm curious as to why you don't feel storing all these genomes
>     within the same chado instance will work for you. I myself work on a
>     Tripal site that houses 4+ legume species including full genomes for
>     3 of them and I know of a Monsanto group that has 100+ organisms in
>     a single chado (although they're not using Tripal).  Not only does
>     storing your data this way allow Tripal to work for you
>     out-of-the-box but it also allows you to easily store relationships
>     between the genomes using chado (i.e. feature_relationship to link
>     homologous genes) and it can greatly simplify your queries resulting
>     in faster load times. Even if you have multiple genomes for the same
>     genus/species you can always differentiate the genomes by attaching
>     each to an analysis record stating the differences. Furthermore, by
>     adding a single non-standard chado table (feature_stock) you could
>     link each feature record to the specific germplasm that was
>     sequenced. Anyway, that's how I'm handling this particular problem ;-)
>
>     As far as extending Tripal to handle multiple chado instances, I
>     think this would take more changes than you think since we have an
>     entire API (Chado Query API
>     <http://api.tripal.info/api/tripal/tripal_core!api!tripal_core.chado_query.api.inc/group/tripal_chado_query_api/2.x>)
>     that interacts with chado and is used extensively throughout Tripal
>     that would need quite a few changes. Although this depends on how
>     much interaction you want between the chado instances.
>
>     If even after my convincing argument above, you still feel you will
>     need multiple chado instances to handle your data, I will think a
>     little more in-depth on how one might accomplish such a thing. That
>     said, it's going to take a lot more work than changing the built-in
>     "chado" schema name to not loose a lot of the built-in Tripal
>     functionality and to tell you the truth I'm not sure it's even
>     possible without a major refactoring of code :(
>
>     Just looking at one part of Tripal: pages for chado content. You
>     would need to add a schema column to each chado_ table and update
>     the sync'ing code to allow specifying of the schema. You would also
>     need to change chado_generate_var(), chado_query() and really the
>     whole Chado Query API to accept schema specification and change the
>     templates for the pages so they know to look for the schema when
>     retrieving the content.
>
>     Furthermore, I know for a fact that Drupal Views would not handle
>     this gracefully (it was never meant for more than the single Drupal
>     schema and a patch, as well as, an entire Tripal module has been
>     needed to get it to handle a single second schema… More than that
>     sounds like crazy talk :p And unfortunately a lot of Tripal
>     functionality including administration and searching is now tied to
>     Drupal Views to allow us to give you the kind of flexibility that
>     we've been able to. I'm not saying its impossible but it's certainly
>     non-trivial even with Tripal Views which will at least give you a
>     single (large) place to make the changes.
>
>     Another, potentially easier way to go the multiple chado route would
>     be to make a Tripal site per genome and then use web-services and
>     links to make connections between them. You could actually do this
>     while still having them in the same postgresql database by just
>     pointing each site to the same database and using Drupal's built-in
>     table prefixing with a different prefix per site -then we'd just
>     need to change the hard-coded name of the schema which wouldn't be
>     too much work and should likely be done anyway. Tripal doesn't yet
>     have it's own web services although it will in the future (not soon
>     enough for your PhD student though :s) but the built-in Drupal web
>     services might be enough…
>
>     Anyway, please consider my proposal of using a single chado instance
>     since it would greatly simplify things on the Tripal end and I think
>     it would provide you with more functionality in the long run…
>
>     ~Lacey
>
>     PS. I'd be very interested to hear what organisms you're working on
>     and if the "multiple genomes" are of the same species since we are
>     going to need to handle this in the near future… Chickpea already
>     has 2 genomes for the same species just different market classes and
>     although we are currently only housing one, I would love to host
>     both and compare!)
>
>     PSS. I would be better able to quantify the number of changes needed
>     if you could elaborate a little more on how much interaction you
>     need between chado instances. Additionally, are you going to need to
>     make pages for data from different schema? Views Data listings that
>     span chado instances?
>
>     PSSS. I'm the other main developer of Tripal and have done most of
>     the work with Drupal Views integration -just for informations sake :)
>
>     ------------------------------------------------------
>     *Lacey-Anne Sanderson*
>     Bioinformaticist
>     Pulse Crop Breeding and Genetics
>     Phone: (306) 966-3208 <tel:%28306%29%20966-3208>
>     Email: [hidden email] <mailto:[hidden email]>
>     Room 2C33 Agriculture
>     Department of Plant Sciences
>     University of Saskatchewan
>
>     On Sep 17, 2014, at 8:52 AM, valentin.guignon
>     <[hidden email] <mailto:[hidden email]>> wrote:
>
>>     Hi Stephen and the list,
>>
>>     first of all, thank you for Tripal which is simply awesome! :)
>>
>>     Recently, we had a discussion with colleagues about how to store
>>     several
>>     genomes within CHADO. Most people recommend 1 CHADO instance per
>>     genome.
>>     You can achieve that in 2 ways:
>>     1) having separate CHADO databases;
>>     2) having separate CHADO schema inside a same database.
>>
>>     The second way (ie. separate schema) offers several advantages,
>>     one of
>>     them being cross-schema querying. It would be useful for us to be
>>     able
>>     to compare genome directly in one query. Let me give you a simple use
>>     case...
>>
>>     Let's say you have 2 genomes "A" and "B", A being stored in the
>>     "chado_a" schema and B in "chado_b". You have loaded BLAST matches
>>     for
>>     both genomes with the genome "C". Now, if you want to get the all the
>>     sequences of genome A and B that matched the same sequences of
>>     genome C,
>>     you could issue the following query:
>>
>>     SELECT
>>       fma.uniquename AS "C match",
>>       fa.feature_id,
>>       fa.uniquename AS "A-C match",
>>       fb.feature_id,
>>       fb.uniquename AS "B-C match"
>>     FROM
>>       chado_a.feature fa
>>         JOIN chado_a.featureloc fla ON fa.feature_id = fla.srcfeature_id
>>         JOIN chado_a.feature fma ON fla.feature_id = fma.feature_id,
>>       chado_b.feature fb
>>         JOIN chado_b.featureloc flb ON fb.feature_id = flb.srcfeature_id
>>         JOIN chado_b.feature fmb ON flb.feature_id = fmb.feature_id
>>     WHERE
>>       fma.uniquename = fmb.uniquename
>>     ;
>>
>>     I didn't try that query and of course the one I could use for real
>>     would
>>     be more complicated (ie. with is_obsolete, type_id, etc.) but that's
>>     just to give you a simple idea of what it could look-like. And
>>     that's a
>>     very simple use case but you can imagine many others where you'd
>>     like to
>>     be able to compare multiple genome data.
>>
>>     If you store your genomes in several databases (ie. "way 1)")
>>     then, you
>>     would have to "programmatically" compare many many records of each
>>     database against each other. If your database is stored on another
>>     server than the web one, then all those records would have to be
>>     transferred from the database server to the web server while if
>>     you used
>>     the "way 2)", only relevant records would have been transferred.
>>
>>
>>
>>
>>     So, here is my proposal/feature request to achieve that with Tripal:
>>
>>     "there should be a field in Tripal setup that allows us to specify
>>     the
>>     name of the (main) CHADO schema that is used by current tripal
>>     instance.
>>     By default (as currently), it would be 'chado'."
>>
>>
>>
>>
>>     It doesn't sound very complicated since it would basically mean to
>>     replace the hard-coded 'chado' string in the source code by a global
>>     "$main_chado_schema" variable... but I know it usually is more
>>     complicated than it seems! ;-)
>>
>>     That would be the first step. Then you can imagine many other
>>     improvements on Tripal interface and API that would deal with
>>     multiple
>>     CHADO instances on a same site but so far, we just need to be able to
>>     specify the name of the chado schema to use. Then we can setup
>>     multiple
>>     Drupal instances in a same database in the same public schema
>>     using the
>>     native Drupal feature that allows us to prefix Drupal tables. We
>>     don't
>>     need Tripal to do the cross-schema querying (but I think it could
>>     do it
>>     already though...) but we just need to have all our CHADO
>>     instances in a
>>     same database in order to run our other analysis tools (and we don't
>>     want to maintain several copies since each copy could evolve on
>>     its side
>>     and that would mean sync issues...).
>>
>>     To sum up, 1 postgresql database, 2 genomes A and B stored in 2
>>     distinct
>>     shcema chado_a and chado_b, 1 public schema for 2 drupal/tripal
>>     website
>>     instances, one using the 'a_' table prefix and the other one using
>>     the
>>     'b_' table prefix --> 2 distinct websites, one for genome A with
>>     default
>>     tripal chado schema name option set to "chado_a" and another one for
>>     genome B with the default tripal chado schema name option set to
>>     "chado_b". Cross-schema querying achieved outside Drupal/Tripal (but
>>     could be, in a far future, integrated to Tripal ;-) of course!).
>>
>>
>>
>>     -Who would be interested by such a feature?
>>     -Stephen, how hard/long do you think it would be to implement that?
>>     -Who would implement that?*
>>     -Should I post this feature request on the Drupal project site?
>>
>>     Please give you opinions! Regards,
>>
>>
>>        Valentin Guignon
>>
>>
>>
>>     *: I could do it but unfortunately not for 2014 and we need this
>>     feature
>>     ASAP for a 3rd year PhD student project... :-s
>>
>>     ------------------------------------------------------------------------------
>>     Want excitement?
>>     Manually upgrade your production database.
>>     When you want reliability, choose Perforce
>>     Perforce version control. Predictably reliable.
>>     http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>>     _______________________________________________
>>     Gmod-tripal mailing list
>>     [hidden email]
>>     <mailto:[hidden email]>
>>     https://lists.sourceforge.net/lists/listinfo/gmod-tripal
>
>
>     ------------------------------------------------------------------------------
>     Want excitement?
>     Manually upgrade your production database.
>     When you want reliability, choose Perforce
>     Perforce version control. Predictably reliable.
>     http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>     _______________________________________________
>     Gmod-tripal mailing list
>     [hidden email]
>     <mailto:[hidden email]>
>     https://lists.sourceforge.net/lists/listinfo/gmod-tripal
>
>
>
>
> --
> Mara Kim
>
> Ph.D. Candidate
> Computational Biology
> Vanderbilt University
> Nashville, TN

------------------------------------------------------------------------------
Slashdot TV.  Video for Nerds.  Stuff that Matters.
http://pubads.g.doubleclick.net/gampad/clk?id=160591471&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal
Reply | Threaded
Open this post in threaded view
|

Re: multi-chado schema feature request - who would be interested?

Stephen Ficklin-2
Hi Valentin,

We will be having a Tripal Developer's Meeting on October 7th at 6pm
Central European Summer time.   We would be happy to discuss these
questions in more depth if you would like to attend.  It's a bit hard by
email.   Or if this is an urgent need for you we can chat perhaps sooner
and setup an open meeting for discussion with anyone who is interested.
Just let me know.

At WSU we manage multiple Tripal sites for various communities and each
one has it's own Chado instance.  They are all separate Drupal
instances. Is this the same as your cacao, bannana and coffee sites?    
We do have some work that will be started in the next few months to
expand Tripal to support web services for the purpose of letting
separate sites cross-query each other.  I suspect this may help and you
are welcome to join us in our design discussions if you like.

As per multiple Chado schemas in the same PostgreSQL database, there are
some challenges with that, but I think it can be done. Lacey is right
about Drupal Views.  It's a shame to not be able to use it across
schemas, but if you are willing to program your own custom modules you
can perform the queries you want.  In some cases we do code our own
search pages when they become too complex for Drupal Views, so I think
that's a perfectly reasonable solution.

I think though, that we can make Tripal support multiple Chado schemas.
You might not have the full functionality of Drupal (i.e. Views) but I
think we can make Tripal handle it.   I have created a new Issue on
Tripal's issues list (https://www.drupal.org/node/2341657), and I've
created a new development branch named '7.x-2.x-multiple_chado-2341657',
so that there is a place to explore it.

Stephen



On 9/19/2014 11:03 AM, valentin.guignon wrote:

> Hi!
>
> Thank you for your interesting answers and sorry for my delayed answer.
> Be fore answering, we had another discussion with my colleagues
> regarding your points.
>
> To answer your first question about why we use separate instances, there
> a several reasons. I'd like to point out that having multiple instances
> of CHADO or just a single one is a real (separate) debate and I don't
> think tripal list is the place for that. However, I understand that's
> still a key question before asking the feature I described.
>
> Here are our reasons and their associated points:
> 1) performances: as you pointed out, in our case (3 complete genomes of
> banana, cocoa and coffee and maybe some more in a near future), that
> might not be a good reason. We heard from other partners that we could
> have performance issue and didn't take the time to really check that. We
> also don't have any postgresql config expert and didn't know about the
> table partitioning. Thanks for the trick, we are currently having a look
> into that! ;)
>
> 2) historical reasons: each instance belongs to a specific project lead
> by different people and sometimes, different institutes or partners. For
> instance, the banana genome is managed by CIRAD while the coffee genome
> is managed by IRD. We are a bioinformatics platform and collaborate so
> we have all our databases stored on a common server but in fact, they
> belong to different people. (nb.: I'm neither CIRAD nor IRD, I'm
> Bioversity International! ...and I just give a hand :-p )
>
> 3) security/privacy reasons: before the banana genome was published we
> had the cocoa genome published. Same kind of story with the coffee
> genome. We can't have a public database hosting private data (nb. in
> fact, we could but see my next point).
> We also have manual annotators on both instances and we don't want the
> coffee annotators to interfere with the banana annotators for instance.
>
> 4) CHADO Controller config: a few years ago, I create a CHADO module
> called CHADO Controller that enables 3 main features: annotation history
> (based on a slightly modified version of CHADO audit module), annotation
> inspector (reports mistakes to the annotators), and feature access
> restriction. Depending on the genomes, the needs were different and then
> CHADO Controller had to be installed differently on each instance and
> that couldn't be technically achieved on a single CHADO instance.
>
> 5) Maintenance: as I said before, each instance may be managed by
> different people. When updating a genome data, we don't want to break
> another genome at the same time. Of course, there are many ways to avoid
> that (for instance, a nice development environment with
> "dev-staging-prod" databases and sites, unit tests and such) but
> unfortunately, "we" don't have that here in our "real" world. :-/
> (...and it's really hard to change some people behaviors and bad habits...)
>
> I may have forgotten some other few reasons explaining why we have
> several CHADO instances in separate databases. Your feedback gave us the
> opportunity to think again about our system and how it will evolve but
> we don't think we will fusion our CHADO instances in 1.
>
>
> Now, next question is "why we would need to aggregate several CHADO
> schema in one database?". The original purpose of that was the project
> of our PhD student. Again, we discussed the needs and came to a new
> solution for her: no CHADO database at all. :) That eliminates the
> problem. Instead, she will work with JSON files generated from GFF3
> files and used by JBrowse and the REST API. I'd just like to add that I
> am not supervising that student and I'm just consulted from times to
> times for some technical solutions and advices...
>
>
> However, we still think that this feature (ie. being able to choose the
> name of the schema) would be a good and useful feature for Tripal. The
> only thing that changed is that it's not a priority and the use cases
> are different from the one I explained.
>
>
> And some thoughts about Lacey-Anne anwser:
>
>   >     Just looking at one part of Tripal: pages for chado content. You
>   >     would need to add a schema column to each chado_ table and update
>   >     the sync'ing code to allow specifying of the schema.
>
> Couldn't it be simply managed using "SET search_path to ...;" (just like
> in tripal_core.chado_general.api.inc in chado_set_active() function)
> since there shouldn't be table name collisions between CHADO and Drupal?
> (...well, there could be, I agree)
>
>   >     You would also
>   >     need to change chado_generate_var(), chado_query() and really the
>   >     whole Chado Query API to accept schema specification and change
>   >     the templates for the pages so they know to look for the schema
>   >     when retrieving the content.
>   > (...)
>   >     are you going to need to
>   >     make pages for data from different schema? Views Data listings
>   >     that span chado instances?
>
> For starter, I don't think I'd need those. If I needed to do
> cross-schema queries, I would do it with my own PHP code using the
> Tripal CHADO API (chado_query) and I would prefix myself the tables with
> the appropriate schema. But anyway, I should clearly state one or more
> new use cases which would show real useful use cases that we could
> discuss. So far, I have some in mind but for sure, those could be solved
> without that feature. The feature would just be a more convenient way
> that renaming schema, using SQL dumps or...
>
> To conclude: we are at least 2 people (! :-p ) definitely convinced that
> this feature would be very useful but we have to provide you at least a
> complete use case as a basis of discussion. So far, this feature is not
> as "urgent" as we though at first glance. ;-)
>
> Thank you again for your pertinent comments and advices. We'll try to
> write some use cases when time (/priorities...) will allow us to do so. ;)
> Regards,
>
>
>      Val
>
>
>
> On 17/09/2014 20:06, Mara Kim wrote:
>> I agree with Lacey about using multiple schemas.  It creates a lot of
>> problems, and doesn't scale well.  Here at Vanderbilt we have a single
>> Chado instance which contains over 200 eukaryotic genomes, and
>> performance is still fine.  This is done by using the organism table to
>> identify each genome separately.  As for the problem of multiple genomes
>> for a single organism, I have modified the organism table to include a
>> column for strain and version, which roughly correspond to biological
>> and computational replicates respectively (ie. different samples and
>> different assemblies).  Currently these columns will not show up in the
>> Tripal organisms view (although supposedly this should be an easy fix
>> with custom views).  I am hoping that this gains more traction and makes
>> it into the standard Chado schema.
>>
>> To modify the organism table in an existing Chado instance:
>>
>> SET search_path TO 'chado';
>> ALTER TABLE organism
>>     ADD COLUMN strain character varying(255) DEFAULT ''::character varying,
>>     ADD COLUMN version character varying(255) DEFAULT ''::character varying,
>>     DROP CONSTRAINT organism_c1;
>> ALTER TABLE organism ADD CONSTRAINT organism_c1 UNIQUE (genus, species,
>> strain, version);
>>
>> The resulting organism table will look like this:
>>
>>                                            Table "chado.organism"
>>       Column    |          Type          |
>> Modifiers
>> --------------+------------------------+----------------------------------------------------------------
>>    organism_id  | integer                | not null default
>> nextval('organism_organism_id_seq'::regclass)
>>    abbreviation | character varying(255) |
>>    genus        | character varying(255) | not null
>>    species      | character varying(255) | not null
>>    common_name  | character varying(255) |
>>    comment      | text                   |
>>    strain       | character varying(255) | default ''::character varying
>>    version      | character varying(255) | default ''::character varying
>> Indexes:
>>       "organism_pkey" PRIMARY KEY, btree (organism_id)
>>       "organism_c1" UNIQUE CONSTRAINT, btree (genus, species, strain,
>> version)
>>
>>
>> The nice thing about this design is that PostgreSQL supports rudimentary
>> table partitioning
>> (http://www.postgresql.org/docs/current/static/ddl-partitioning.html)
>> and eventually automatic partitioning
>> (https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress).
>>    That will mean that you could partition the feature table by
>> organism_id and get the benefits of separate tables for each organism
>> without breaking current Chado semantics.
>>
>>
>> On Wed, Sep 17, 2014 at 12:30 PM, Lacey-Anne Sanderson
>> <[hidden email] <mailto:[hidden email]>> wrote:
>>
>>      Hi Valentin,
>>
>>      First off, thanks for the compliment on Tripal :)
>>
>>      I'm curious as to why you don't feel storing all these genomes
>>      within the same chado instance will work for you. I myself work on a
>>      Tripal site that houses 4+ legume species including full genomes for
>>      3 of them and I know of a Monsanto group that has 100+ organisms in
>>      a single chado (although they're not using Tripal).  Not only does
>>      storing your data this way allow Tripal to work for you
>>      out-of-the-box but it also allows you to easily store relationships
>>      between the genomes using chado (i.e. feature_relationship to link
>>      homologous genes) and it can greatly simplify your queries resulting
>>      in faster load times. Even if you have multiple genomes for the same
>>      genus/species you can always differentiate the genomes by attaching
>>      each to an analysis record stating the differences. Furthermore, by
>>      adding a single non-standard chado table (feature_stock) you could
>>      link each feature record to the specific germplasm that was
>>      sequenced. Anyway, that's how I'm handling this particular problem ;-)
>>
>>      As far as extending Tripal to handle multiple chado instances, I
>>      think this would take more changes than you think since we have an
>>      entire API (Chado Query API
>>      <http://api.tripal.info/api/tripal/tripal_core!api!tripal_core.chado_query.api.inc/group/tripal_chado_query_api/2.x>)
>>      that interacts with chado and is used extensively throughout Tripal
>>      that would need quite a few changes. Although this depends on how
>>      much interaction you want between the chado instances.
>>
>>      If even after my convincing argument above, you still feel you will
>>      need multiple chado instances to handle your data, I will think a
>>      little more in-depth on how one might accomplish such a thing. That
>>      said, it's going to take a lot more work than changing the built-in
>>      "chado" schema name to not loose a lot of the built-in Tripal
>>      functionality and to tell you the truth I'm not sure it's even
>>      possible without a major refactoring of code :(
>>
>>      Just looking at one part of Tripal: pages for chado content. You
>>      would need to add a schema column to each chado_ table and update
>>      the sync'ing code to allow specifying of the schema. You would also
>>      need to change chado_generate_var(), chado_query() and really the
>>      whole Chado Query API to accept schema specification and change the
>>      templates for the pages so they know to look for the schema when
>>      retrieving the content.
>>
>>      Furthermore, I know for a fact that Drupal Views would not handle
>>      this gracefully (it was never meant for more than the single Drupal
>>      schema and a patch, as well as, an entire Tripal module has been
>>      needed to get it to handle a single second schema… More than that
>>      sounds like crazy talk :p And unfortunately a lot of Tripal
>>      functionality including administration and searching is now tied to
>>      Drupal Views to allow us to give you the kind of flexibility that
>>      we've been able to. I'm not saying its impossible but it's certainly
>>      non-trivial even with Tripal Views which will at least give you a
>>      single (large) place to make the changes.
>>
>>      Another, potentially easier way to go the multiple chado route would
>>      be to make a Tripal site per genome and then use web-services and
>>      links to make connections between them. You could actually do this
>>      while still having them in the same postgresql database by just
>>      pointing each site to the same database and using Drupal's built-in
>>      table prefixing with a different prefix per site -then we'd just
>>      need to change the hard-coded name of the schema which wouldn't be
>>      too much work and should likely be done anyway. Tripal doesn't yet
>>      have it's own web services although it will in the future (not soon
>>      enough for your PhD student though :s) but the built-in Drupal web
>>      services might be enough…
>>
>>      Anyway, please consider my proposal of using a single chado instance
>>      since it would greatly simplify things on the Tripal end and I think
>>      it would provide you with more functionality in the long run…
>>
>>      ~Lacey
>>
>>      PS. I'd be very interested to hear what organisms you're working on
>>      and if the "multiple genomes" are of the same species since we are
>>      going to need to handle this in the near future… Chickpea already
>>      has 2 genomes for the same species just different market classes and
>>      although we are currently only housing one, I would love to host
>>      both and compare!)
>>
>>      PSS. I would be better able to quantify the number of changes needed
>>      if you could elaborate a little more on how much interaction you
>>      need between chado instances. Additionally, are you going to need to
>>      make pages for data from different schema? Views Data listings that
>>      span chado instances?
>>
>>      PSSS. I'm the other main developer of Tripal and have done most of
>>      the work with Drupal Views integration -just for informations sake :)
>>
>>      ------------------------------------------------------
>>      *Lacey-Anne Sanderson*
>>      Bioinformaticist
>>      Pulse Crop Breeding and Genetics
>>      Phone: (306) 966-3208 <tel:%28306%29%20966-3208>
>>      Email: [hidden email] <mailto:[hidden email]>
>>      Room 2C33 Agriculture
>>      Department of Plant Sciences
>>      University of Saskatchewan
>>
>>      On Sep 17, 2014, at 8:52 AM, valentin.guignon
>>      <[hidden email] <mailto:[hidden email]>> wrote:
>>
>>>      Hi Stephen and the list,
>>>
>>>      first of all, thank you for Tripal which is simply awesome! :)
>>>
>>>      Recently, we had a discussion with colleagues about how to store
>>>      several
>>>      genomes within CHADO. Most people recommend 1 CHADO instance per
>>>      genome.
>>>      You can achieve that in 2 ways:
>>>      1) having separate CHADO databases;
>>>      2) having separate CHADO schema inside a same database.
>>>
>>>      The second way (ie. separate schema) offers several advantages,
>>>      one of
>>>      them being cross-schema querying. It would be useful for us to be
>>>      able
>>>      to compare genome directly in one query. Let me give you a simple use
>>>      case...
>>>
>>>      Let's say you have 2 genomes "A" and "B", A being stored in the
>>>      "chado_a" schema and B in "chado_b". You have loaded BLAST matches
>>>      for
>>>      both genomes with the genome "C". Now, if you want to get the all the
>>>      sequences of genome A and B that matched the same sequences of
>>>      genome C,
>>>      you could issue the following query:
>>>
>>>      SELECT
>>>        fma.uniquename AS "C match",
>>>        fa.feature_id,
>>>        fa.uniquename AS "A-C match",
>>>        fb.feature_id,
>>>        fb.uniquename AS "B-C match"
>>>      FROM
>>>        chado_a.feature fa
>>>          JOIN chado_a.featureloc fla ON fa.feature_id = fla.srcfeature_id
>>>          JOIN chado_a.feature fma ON fla.feature_id = fma.feature_id,
>>>        chado_b.feature fb
>>>          JOIN chado_b.featureloc flb ON fb.feature_id = flb.srcfeature_id
>>>          JOIN chado_b.feature fmb ON flb.feature_id = fmb.feature_id
>>>      WHERE
>>>        fma.uniquename = fmb.uniquename
>>>      ;
>>>
>>>      I didn't try that query and of course the one I could use for real
>>>      would
>>>      be more complicated (ie. with is_obsolete, type_id, etc.) but that's
>>>      just to give you a simple idea of what it could look-like. And
>>>      that's a
>>>      very simple use case but you can imagine many others where you'd
>>>      like to
>>>      be able to compare multiple genome data.
>>>
>>>      If you store your genomes in several databases (ie. "way 1)")
>>>      then, you
>>>      would have to "programmatically" compare many many records of each
>>>      database against each other. If your database is stored on another
>>>      server than the web one, then all those records would have to be
>>>      transferred from the database server to the web server while if
>>>      you used
>>>      the "way 2)", only relevant records would have been transferred.
>>>
>>>
>>>
>>>
>>>      So, here is my proposal/feature request to achieve that with Tripal:
>>>
>>>      "there should be a field in Tripal setup that allows us to specify
>>>      the
>>>      name of the (main) CHADO schema that is used by current tripal
>>>      instance.
>>>      By default (as currently), it would be 'chado'."
>>>
>>>
>>>
>>>
>>>      It doesn't sound very complicated since it would basically mean to
>>>      replace the hard-coded 'chado' string in the source code by a global
>>>      "$main_chado_schema" variable... but I know it usually is more
>>>      complicated than it seems! ;-)
>>>
>>>      That would be the first step. Then you can imagine many other
>>>      improvements on Tripal interface and API that would deal with
>>>      multiple
>>>      CHADO instances on a same site but so far, we just need to be able to
>>>      specify the name of the chado schema to use. Then we can setup
>>>      multiple
>>>      Drupal instances in a same database in the same public schema
>>>      using the
>>>      native Drupal feature that allows us to prefix Drupal tables. We
>>>      don't
>>>      need Tripal to do the cross-schema querying (but I think it could
>>>      do it
>>>      already though...) but we just need to have all our CHADO
>>>      instances in a
>>>      same database in order to run our other analysis tools (and we don't
>>>      want to maintain several copies since each copy could evolve on
>>>      its side
>>>      and that would mean sync issues...).
>>>
>>>      To sum up, 1 postgresql database, 2 genomes A and B stored in 2
>>>      distinct
>>>      shcema chado_a and chado_b, 1 public schema for 2 drupal/tripal
>>>      website
>>>      instances, one using the 'a_' table prefix and the other one using
>>>      the
>>>      'b_' table prefix --> 2 distinct websites, one for genome A with
>>>      default
>>>      tripal chado schema name option set to "chado_a" and another one for
>>>      genome B with the default tripal chado schema name option set to
>>>      "chado_b". Cross-schema querying achieved outside Drupal/Tripal (but
>>>      could be, in a far future, integrated to Tripal ;-) of course!).
>>>
>>>
>>>
>>>      -Who would be interested by such a feature?
>>>      -Stephen, how hard/long do you think it would be to implement that?
>>>      -Who would implement that?*
>>>      -Should I post this feature request on the Drupal project site?
>>>
>>>      Please give you opinions! Regards,
>>>
>>>
>>>         Valentin Guignon
>>>
>>>
>>>
>>>      *: I could do it but unfortunately not for 2014 and we need this
>>>      feature
>>>      ASAP for a 3rd year PhD student project... :-s
>>>
>>>      ------------------------------------------------------------------------------
>>>      Want excitement?
>>>      Manually upgrade your production database.
>>>      When you want reliability, choose Perforce
>>>      Perforce version control. Predictably reliable.
>>>      http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>>>      _______________________________________________
>>>      Gmod-tripal mailing list
>>>      [hidden email]
>>>      <mailto:[hidden email]>
>>>      https://lists.sourceforge.net/lists/listinfo/gmod-tripal
>>
>>      ------------------------------------------------------------------------------
>>      Want excitement?
>>      Manually upgrade your production database.
>>      When you want reliability, choose Perforce
>>      Perforce version control. Predictably reliable.
>>      http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>>      _______________________________________________
>>      Gmod-tripal mailing list
>>      [hidden email]
>>      <mailto:[hidden email]>
>>      https://lists.sourceforge.net/lists/listinfo/gmod-tripal
>>
>>
>>
>>
>> --
>> Mara Kim
>>
>> Ph.D. Candidate
>> Computational Biology
>> Vanderbilt University
>> Nashville, TN
> ------------------------------------------------------------------------------
> Slashdot TV.  Video for Nerds.  Stuff that Matters.
> http://pubads.g.doubleclick.net/gampad/clk?id=160591471&iu=/4140/ostg.clktrk
> _______________________________________________
> Gmod-tripal mailing list
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/gmod-tripal


------------------------------------------------------------------------------
Slashdot TV.  Video for Nerds.  Stuff that Matters.
http://pubads.g.doubleclick.net/gampad/clk?id=160591471&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal
Reply | Threaded
Open this post in threaded view
|

Re: multi-chado schema feature request - who would be interested?

Stephen Ficklin-2
In reply to this post by Mara Kim-2
Hi Mara,

In response to the comment below about not having customizations to
Chado tables show up in Drupal Views, my recommendation would be to not
adjust Chado at all, but to add custom tables as you might have
difficulties upgrading Chado in the future.   But, I understand the need
for flexibility, so I think we can easily make it so that Tripal will
support alterations to Chado.  There is a function in Tripal that
defines the table structure of every table in Chado.  That function is a
hook which can be re-used to change the structure definition, but it
needs some adjustments to let you redefine a table.

Here's the idea of how it could work.  First, you create a custom module
that just handles your Chado customizations.  It could have an install
file that automatically added all of the fields, changed types, etc.  
The benefit for this is that you can share your module with other folks
who want to borrow what you've done, and all they have to do is install
the module.

Second, inside that module you would need to implement the following
hook for each of the tables you modified:

[your_module_name]_chado_schema_v[version number]_[table name].

So, if you are using Chado 1.2 and your have changes to the 'organism'
table you would implement this function:

[your_module_name]_chado_schema_v1.2_organism

Then, inside of that function, you redefine the organism table. Once the
table is redefined, you should be able to reset the Views Integration
and viola! the field is available for Drupal Views. And you don't need
to use a custom table.

Again, I'd caution against modifying Chado tables, but if you want to
pursue this let me know and we'll see how we can adjust the function to
help.

Stephen





On 9/17/2014 2:06 PM, Mara Kim wrote:
> I have modified the organism table to include a column for strain and
> version, which roughly correspond to biological and computational
> replicates respectively (ie. different samples and different
> assemblies).  Currently these columns will not show up in the Tripal
> organisms view (although supposedly this should be an easy fix with
> custom views).


------------------------------------------------------------------------------
Slashdot TV.  Video for Nerds.  Stuff that Matters.
http://pubads.g.doubleclick.net/gampad/clk?id=160591471&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal
Reply | Threaded
Open this post in threaded view
|

Re: multi-chado schema feature request - who would be interested?

Guignon, Valentin (Bioversity-France)
In reply to this post by Stephen Ficklin-2
Hi Stephen,

It's not urgent but I'd be pleased to attend. How/where?


> Is this the same as your cacao, banana and coffee sites?

So far, yes.


> I have created a new Issue on Tripal's issues list
> (https://www.drupal.org/node/2341657), and I've created
> a new development branch named '7.x-2.x-multiple_chado-2341657',
> so that there is a place to explore it.

I'll have a look! ;-)

Thanks for your answers and by the way, since I'm working on a new project related to Tripal (but not a genome project and more a stock-centered and phenotyping/genotyping/natural-diversity project), I may contribute to some parts of Tripal through Drupal issue reporting system and provide patches. We keep in touch!


   Val



-----Original Message-----
From: Stephen Ficklin [mailto:[hidden email]]
Sent: vendredi 19 septembre 2014 21:24
To: valentin.guignon; GMOD Tripal
Cc: Guignon, Valentin (Bioversity-France)
Subject: Re: [Gmod-tripal] multi-chado schema feature request - who would be interested?

Hi Valentin,

We will be having a Tripal Developer's Meeting on October 7th at 6pm
Central European Summer time.   We would be happy to discuss these
questions in more depth if you would like to attend.  It's a bit hard by
email.   Or if this is an urgent need for you we can chat perhaps sooner
and setup an open meeting for discussion with anyone who is interested.
Just let me know.

At WSU we manage multiple Tripal sites for various communities and each one has it's own Chado instance.  They are all separate Drupal
instances. Is this the same as your cacao, bannana and coffee sites?    
We do have some work that will be started in the next few months to expand Tripal to support web services for the purpose of letting separate sites cross-query each other.  I suspect this may help and you are welcome to join us in our design discussions if you like.

As per multiple Chado schemas in the same PostgreSQL database, there are some challenges with that, but I think it can be done. Lacey is right about Drupal Views.  It's a shame to not be able to use it across schemas, but if you are willing to program your own custom modules you can perform the queries you want.  In some cases we do code our own search pages when they become too complex for Drupal Views, so I think that's a perfectly reasonable solution.

I think though, that we can make Tripal support multiple Chado schemas.
You might not have the full functionality of Drupal (i.e. Views) but I
think we can make Tripal handle it.   I have created a new Issue on
Tripal's issues list (https://www.drupal.org/node/2341657), and I've created a new development branch named '7.x-2.x-multiple_chado-2341657', so that there is a place to explore it.

Stephen



On 9/19/2014 11:03 AM, valentin.guignon wrote:

> Hi!
>
> Thank you for your interesting answers and sorry for my delayed answer.
> Be fore answering, we had another discussion with my colleagues
> regarding your points.
>
> To answer your first question about why we use separate instances,
> there a several reasons. I'd like to point out that having multiple
> instances of CHADO or just a single one is a real (separate) debate
> and I don't think tripal list is the place for that. However, I
> understand that's still a key question before asking the feature I described.
>
> Here are our reasons and their associated points:
> 1) performances: as you pointed out, in our case (3 complete genomes
> of banana, cocoa and coffee and maybe some more in a near future),
> that might not be a good reason. We heard from other partners that we
> could have performance issue and didn't take the time to really check
> that. We also don't have any postgresql config expert and didn't know
> about the table partitioning. Thanks for the trick, we are currently
> having a look into that! ;)
>
> 2) historical reasons: each instance belongs to a specific project
> lead by different people and sometimes, different institutes or
> partners. For instance, the banana genome is managed by CIRAD while
> the coffee genome is managed by IRD. We are a bioinformatics platform
> and collaborate so we have all our databases stored on a common server
> but in fact, they belong to different people. (nb.: I'm neither CIRAD
> nor IRD, I'm Bioversity International! ...and I just give a hand :-p )
>
> 3) security/privacy reasons: before the banana genome was published we
> had the cocoa genome published. Same kind of story with the coffee
> genome. We can't have a public database hosting private data (nb. in
> fact, we could but see my next point).
> We also have manual annotators on both instances and we don't want the
> coffee annotators to interfere with the banana annotators for instance.
>
> 4) CHADO Controller config: a few years ago, I create a CHADO module
> called CHADO Controller that enables 3 main features: annotation
> history (based on a slightly modified version of CHADO audit module),
> annotation inspector (reports mistakes to the annotators), and feature
> access restriction. Depending on the genomes, the needs were different
> and then CHADO Controller had to be installed differently on each
> instance and that couldn't be technically achieved on a single CHADO instance.
>
> 5) Maintenance: as I said before, each instance may be managed by
> different people. When updating a genome data, we don't want to break
> another genome at the same time. Of course, there are many ways to
> avoid that (for instance, a nice development environment with
> "dev-staging-prod" databases and sites, unit tests and such) but
> unfortunately, "we" don't have that here in our "real" world. :-/
> (...and it's really hard to change some people behaviors and bad
> habits...)
>
> I may have forgotten some other few reasons explaining why we have
> several CHADO instances in separate databases. Your feedback gave us
> the opportunity to think again about our system and how it will evolve
> but we don't think we will fusion our CHADO instances in 1.
>
>
> Now, next question is "why we would need to aggregate several CHADO
> schema in one database?". The original purpose of that was the project
> of our PhD student. Again, we discussed the needs and came to a new
> solution for her: no CHADO database at all. :) That eliminates the
> problem. Instead, she will work with JSON files generated from GFF3
> files and used by JBrowse and the REST API. I'd just like to add that
> I am not supervising that student and I'm just consulted from times to
> times for some technical solutions and advices...
>
>
> However, we still think that this feature (ie. being able to choose
> the name of the schema) would be a good and useful feature for Tripal.
> The only thing that changed is that it's not a priority and the use
> cases are different from the one I explained.
>
>
> And some thoughts about Lacey-Anne anwser:
>
>   >     Just looking at one part of Tripal: pages for chado content. You
>   >     would need to add a schema column to each chado_ table and update
>   >     the sync'ing code to allow specifying of the schema.
>
> Couldn't it be simply managed using "SET search_path to ...;" (just
> like in tripal_core.chado_general.api.inc in chado_set_active()
> function) since there shouldn't be table name collisions between CHADO and Drupal?
> (...well, there could be, I agree)
>
>   >     You would also
>   >     need to change chado_generate_var(), chado_query() and really the
>   >     whole Chado Query API to accept schema specification and change
>   >     the templates for the pages so they know to look for the schema
>   >     when retrieving the content.
>   > (...)
>   >     are you going to need to
>   >     make pages for data from different schema? Views Data listings
>   >     that span chado instances?
>
> For starter, I don't think I'd need those. If I needed to do
> cross-schema queries, I would do it with my own PHP code using the
> Tripal CHADO API (chado_query) and I would prefix myself the tables
> with the appropriate schema. But anyway, I should clearly state one or
> more new use cases which would show real useful use cases that we
> could discuss. So far, I have some in mind but for sure, those could
> be solved without that feature. The feature would just be a more
> convenient way that renaming schema, using SQL dumps or...
>
> To conclude: we are at least 2 people (! :-p ) definitely convinced
> that this feature would be very useful but we have to provide you at
> least a complete use case as a basis of discussion. So far, this
> feature is not as "urgent" as we though at first glance. ;-)
>
> Thank you again for your pertinent comments and advices. We'll try to
> write some use cases when time (/priorities...) will allow us to do
> so. ;) Regards,
>
>
>      Val
>
>
>
> On 17/09/2014 20:06, Mara Kim wrote:
>> I agree with Lacey about using multiple schemas.  It creates a lot of
>> problems, and doesn't scale well.  Here at Vanderbilt we have a
>> single Chado instance which contains over 200 eukaryotic genomes, and
>> performance is still fine.  This is done by using the organism table
>> to identify each genome separately.  As for the problem of multiple
>> genomes for a single organism, I have modified the organism table to
>> include a column for strain and version, which roughly correspond to
>> biological and computational replicates respectively (ie. different
>> samples and different assemblies).  Currently these columns will not
>> show up in the Tripal organisms view (although supposedly this should
>> be an easy fix with custom views).  I am hoping that this gains more
>> traction and makes it into the standard Chado schema.
>>
>> To modify the organism table in an existing Chado instance:
>>
>> SET search_path TO 'chado';
>> ALTER TABLE organism
>>     ADD COLUMN strain character varying(255) DEFAULT ''::character varying,
>>     ADD COLUMN version character varying(255) DEFAULT ''::character varying,
>>     DROP CONSTRAINT organism_c1;
>> ALTER TABLE organism ADD CONSTRAINT organism_c1 UNIQUE (genus,
>> species, strain, version);
>>
>> The resulting organism table will look like this:
>>
>>                                            Table "chado.organism"
>>       Column    |          Type          |
>> Modifiers
>> --------------+------------------------+-----------------------------
>> --------------+------------------------+-----------------------------
>> --------------+------------------------+------
>>    organism_id  | integer                | not null default
>> nextval('organism_organism_id_seq'::regclass)
>>    abbreviation | character varying(255) |
>>    genus        | character varying(255) | not null
>>    species      | character varying(255) | not null
>>    common_name  | character varying(255) |
>>    comment      | text                   |
>>    strain       | character varying(255) | default ''::character varying
>>    version      | character varying(255) | default ''::character varying
>> Indexes:
>>       "organism_pkey" PRIMARY KEY, btree (organism_id)
>>       "organism_c1" UNIQUE CONSTRAINT, btree (genus, species, strain,
>> version)
>>
>>
>> The nice thing about this design is that PostgreSQL supports
>> rudimentary table partitioning
>> (http://www.postgresql.org/docs/current/static/ddl-partitioning.html)
>> and eventually automatic partitioning
>> (https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress).
>>    That will mean that you could partition the feature table by
>> organism_id and get the benefits of separate tables for each organism
>> without breaking current Chado semantics.
>>
>>
>> On Wed, Sep 17, 2014 at 12:30 PM, Lacey-Anne Sanderson
>> <[hidden email] <mailto:[hidden email]>> wrote:
>>
>>      Hi Valentin,
>>
>>      First off, thanks for the compliment on Tripal :)
>>
>>      I'm curious as to why you don't feel storing all these genomes
>>      within the same chado instance will work for you. I myself work on a
>>      Tripal site that houses 4+ legume species including full genomes for
>>      3 of them and I know of a Monsanto group that has 100+ organisms in
>>      a single chado (although they're not using Tripal).  Not only does
>>      storing your data this way allow Tripal to work for you
>>      out-of-the-box but it also allows you to easily store relationships
>>      between the genomes using chado (i.e. feature_relationship to link
>>      homologous genes) and it can greatly simplify your queries resulting
>>      in faster load times. Even if you have multiple genomes for the same
>>      genus/species you can always differentiate the genomes by attaching
>>      each to an analysis record stating the differences. Furthermore, by
>>      adding a single non-standard chado table (feature_stock) you could
>>      link each feature record to the specific germplasm that was
>>      sequenced. Anyway, that's how I'm handling this particular
>> problem ;-)
>>
>>      As far as extending Tripal to handle multiple chado instances, I
>>      think this would take more changes than you think since we have an
>>      entire API (Chado Query API
>>      <http://api.tripal.info/api/tripal/tripal_core!api!tripal_core.chado_query.api.inc/group/tripal_chado_query_api/2.x>)
>>      that interacts with chado and is used extensively throughout Tripal
>>      that would need quite a few changes. Although this depends on how
>>      much interaction you want between the chado instances.
>>
>>      If even after my convincing argument above, you still feel you will
>>      need multiple chado instances to handle your data, I will think a
>>      little more in-depth on how one might accomplish such a thing. That
>>      said, it's going to take a lot more work than changing the built-in
>>      "chado" schema name to not loose a lot of the built-in Tripal
>>      functionality and to tell you the truth I'm not sure it's even
>>      possible without a major refactoring of code :(
>>
>>      Just looking at one part of Tripal: pages for chado content. You
>>      would need to add a schema column to each chado_ table and update
>>      the sync'ing code to allow specifying of the schema. You would also
>>      need to change chado_generate_var(), chado_query() and really the
>>      whole Chado Query API to accept schema specification and change the
>>      templates for the pages so they know to look for the schema when
>>      retrieving the content.
>>
>>      Furthermore, I know for a fact that Drupal Views would not handle
>>      this gracefully (it was never meant for more than the single Drupal
>>      schema and a patch, as well as, an entire Tripal module has been
>>      needed to get it to handle a single second schema… More than that
>>      sounds like crazy talk :p And unfortunately a lot of Tripal
>>      functionality including administration and searching is now tied to
>>      Drupal Views to allow us to give you the kind of flexibility that
>>      we've been able to. I'm not saying its impossible but it's certainly
>>      non-trivial even with Tripal Views which will at least give you a
>>      single (large) place to make the changes.
>>
>>      Another, potentially easier way to go the multiple chado route would
>>      be to make a Tripal site per genome and then use web-services and
>>      links to make connections between them. You could actually do this
>>      while still having them in the same postgresql database by just
>>      pointing each site to the same database and using Drupal's built-in
>>      table prefixing with a different prefix per site -then we'd just
>>      need to change the hard-coded name of the schema which wouldn't be
>>      too much work and should likely be done anyway. Tripal doesn't yet
>>      have it's own web services although it will in the future (not soon
>>      enough for your PhD student though :s) but the built-in Drupal web
>>      services might be enough…
>>
>>      Anyway, please consider my proposal of using a single chado instance
>>      since it would greatly simplify things on the Tripal end and I think
>>      it would provide you with more functionality in the long run…
>>
>>      ~Lacey
>>
>>      PS. I'd be very interested to hear what organisms you're working on
>>      and if the "multiple genomes" are of the same species since we are
>>      going to need to handle this in the near future… Chickpea already
>>      has 2 genomes for the same species just different market classes and
>>      although we are currently only housing one, I would love to host
>>      both and compare!)
>>
>>      PSS. I would be better able to quantify the number of changes needed
>>      if you could elaborate a little more on how much interaction you
>>      need between chado instances. Additionally, are you going to need to
>>      make pages for data from different schema? Views Data listings that
>>      span chado instances?
>>
>>      PSSS. I'm the other main developer of Tripal and have done most of
>>      the work with Drupal Views integration -just for informations
>> sake :)
>>
>>      ------------------------------------------------------
>>      *Lacey-Anne Sanderson*
>>      Bioinformaticist
>>      Pulse Crop Breeding and Genetics
>>      Phone: (306) 966-3208 <tel:%28306%29%20966-3208>
>>      Email: [hidden email] <mailto:[hidden email]>
>>      Room 2C33 Agriculture
>>      Department of Plant Sciences
>>      University of Saskatchewan
>>
>>      On Sep 17, 2014, at 8:52 AM, valentin.guignon
>>      <[hidden email] <mailto:[hidden email]>> wrote:
>>
>>>      Hi Stephen and the list,
>>>
>>>      first of all, thank you for Tripal which is simply awesome! :)
>>>
>>>      Recently, we had a discussion with colleagues about how to store
>>>      several
>>>      genomes within CHADO. Most people recommend 1 CHADO instance per
>>>      genome.
>>>      You can achieve that in 2 ways:
>>>      1) having separate CHADO databases;
>>>      2) having separate CHADO schema inside a same database.
>>>
>>>      The second way (ie. separate schema) offers several advantages,
>>>      one of
>>>      them being cross-schema querying. It would be useful for us to be
>>>      able
>>>      to compare genome directly in one query. Let me give you a simple use
>>>      case...
>>>
>>>      Let's say you have 2 genomes "A" and "B", A being stored in the
>>>      "chado_a" schema and B in "chado_b". You have loaded BLAST matches
>>>      for
>>>      both genomes with the genome "C". Now, if you want to get the all the
>>>      sequences of genome A and B that matched the same sequences of
>>>      genome C,
>>>      you could issue the following query:
>>>
>>>      SELECT
>>>        fma.uniquename AS "C match",
>>>        fa.feature_id,
>>>        fa.uniquename AS "A-C match",
>>>        fb.feature_id,
>>>        fb.uniquename AS "B-C match"
>>>      FROM
>>>        chado_a.feature fa
>>>          JOIN chado_a.featureloc fla ON fa.feature_id = fla.srcfeature_id
>>>          JOIN chado_a.feature fma ON fla.feature_id = fma.feature_id,
>>>        chado_b.feature fb
>>>          JOIN chado_b.featureloc flb ON fb.feature_id = flb.srcfeature_id
>>>          JOIN chado_b.feature fmb ON flb.feature_id = fmb.feature_id
>>>      WHERE
>>>        fma.uniquename = fmb.uniquename
>>>      ;
>>>
>>>      I didn't try that query and of course the one I could use for real
>>>      would
>>>      be more complicated (ie. with is_obsolete, type_id, etc.) but that's
>>>      just to give you a simple idea of what it could look-like. And
>>>      that's a
>>>      very simple use case but you can imagine many others where you'd
>>>      like to
>>>      be able to compare multiple genome data.
>>>
>>>      If you store your genomes in several databases (ie. "way 1)")
>>>      then, you
>>>      would have to "programmatically" compare many many records of each
>>>      database against each other. If your database is stored on another
>>>      server than the web one, then all those records would have to be
>>>      transferred from the database server to the web server while if
>>>      you used
>>>      the "way 2)", only relevant records would have been transferred.
>>>
>>>
>>>
>>>
>>>      So, here is my proposal/feature request to achieve that with Tripal:
>>>
>>>      "there should be a field in Tripal setup that allows us to specify
>>>      the
>>>      name of the (main) CHADO schema that is used by current tripal
>>>      instance.
>>>      By default (as currently), it would be 'chado'."
>>>
>>>
>>>
>>>
>>>      It doesn't sound very complicated since it would basically mean to
>>>      replace the hard-coded 'chado' string in the source code by a global
>>>      "$main_chado_schema" variable... but I know it usually is more
>>>      complicated than it seems! ;-)
>>>
>>>      That would be the first step. Then you can imagine many other
>>>      improvements on Tripal interface and API that would deal with
>>>      multiple
>>>      CHADO instances on a same site but so far, we just need to be able to
>>>      specify the name of the chado schema to use. Then we can setup
>>>      multiple
>>>      Drupal instances in a same database in the same public schema
>>>      using the
>>>      native Drupal feature that allows us to prefix Drupal tables. We
>>>      don't
>>>      need Tripal to do the cross-schema querying (but I think it could
>>>      do it
>>>      already though...) but we just need to have all our CHADO
>>>      instances in a
>>>      same database in order to run our other analysis tools (and we don't
>>>      want to maintain several copies since each copy could evolve on
>>>      its side
>>>      and that would mean sync issues...).
>>>
>>>      To sum up, 1 postgresql database, 2 genomes A and B stored in 2
>>>      distinct
>>>      shcema chado_a and chado_b, 1 public schema for 2 drupal/tripal
>>>      website
>>>      instances, one using the 'a_' table prefix and the other one using
>>>      the
>>>      'b_' table prefix --> 2 distinct websites, one for genome A with
>>>      default
>>>      tripal chado schema name option set to "chado_a" and another one for
>>>      genome B with the default tripal chado schema name option set to
>>>      "chado_b". Cross-schema querying achieved outside Drupal/Tripal (but
>>>      could be, in a far future, integrated to Tripal ;-) of course!).
>>>
>>>
>>>
>>>      -Who would be interested by such a feature?
>>>      -Stephen, how hard/long do you think it would be to implement that?
>>>      -Who would implement that?*
>>>      -Should I post this feature request on the Drupal project site?
>>>
>>>      Please give you opinions! Regards,
>>>
>>>
>>>         Valentin Guignon
>>>
>>>
>>>
>>>      *: I could do it but unfortunately not for 2014 and we need this
>>>      feature
>>>      ASAP for a 3rd year PhD student project... :-s
>>>
>>>      ------------------------------------------------------------------------------
>>>      Want excitement?
>>>      Manually upgrade your production database.
>>>      When you want reliability, choose Perforce
>>>      Perforce version control. Predictably reliable.
>>>      http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>>>      _______________________________________________
>>>      Gmod-tripal mailing list
>>>      [hidden email]
>>>      <mailto:[hidden email]>
>>>      https://lists.sourceforge.net/lists/listinfo/gmod-tripal
>>
>>      ------------------------------------------------------------------------------
>>      Want excitement?
>>      Manually upgrade your production database.
>>      When you want reliability, choose Perforce
>>      Perforce version control. Predictably reliable.
>>      http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>>      _______________________________________________
>>      Gmod-tripal mailing list
>>      [hidden email]
>>      <mailto:[hidden email]>
>>      https://lists.sourceforge.net/lists/listinfo/gmod-tripal
>>
>>
>>
>>
>> --
>> Mara Kim
>>
>> Ph.D. Candidate
>> Computational Biology
>> Vanderbilt University
>> Nashville, TN
> ----------------------------------------------------------------------
> -------- Slashdot TV.  Video for Nerds.  Stuff that Matters.
> http://pubads.g.doubleclick.net/gampad/clk?id=160591471&iu=/4140/ostg.
> clktrk _______________________________________________
> Gmod-tripal mailing list
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/gmod-tripal

------------------------------------------------------------------------------
Meet PCI DSS 3.0 Compliance Requirements with EventLog Analyzer
Achieve PCI DSS 3.0 Compliant Status with Out-of-the-box PCI DSS Reports
Are you Audit-Ready for PCI DSS 3.0 Compliance? Download White paper
Comply to PCI DSS 3.0 Requirement 10 and 11.5 with EventLog Analyzer
http://pubads.g.doubleclick.net/gampad/clk?id=154622311&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-tripal mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-tripal