collections and pedigree data in chado

collections and pedigree data in chado

Sanjuro Jogdeo
Hello all, 


I'm working on a project that is making extensive natural-population collections of about ten species of wild strawberry.  We plan to store the collection data and downstream analyses in a Chado database and make the data available through a Tripal-enabled website.

I'm trying to figure out the best way to load and store the initial collections, the various types of progeny that will follow (crosses, clones, induced polyploids, etc.), and the downstream analyses (genotyping, phenotyping, transcriptomes, etc.).  I have lots of questions.

1.  I'm assuming that the proper way to record progeny will be through the stock and stock_relationship tables.  One of the problems I anticipate is difficulty linking 3rd or 4th generation plants with the ancestral population data (location, elevation, GPS coords, etc.).  I would have to recursively travel up through several levels of stock relationships to get the population data, and if the progeny were a cross, this could get complicated.  I met Stephen Ficklin at a workshop this past week and he said that one solution would be similar to what has been done with the cvtermpath table.  That is, link each stock to each of its ancestors directly in something like a stock_relationship_path table.  This seems like a good solution to me, but I wanted to see if anyone out there had any experience (and perhaps even tools ;-) ) with this sort of thing.

2.  In general, I'm a little confused about how often nd_experiment is supposed to be referenced when gathering this sort of data.  For instance, does each new generation that is grown get linked to the nd_experiment table, even though it seems like the stock and stock_relationship tables can handle the data by themselves?

3.  We will be doing transcriptome, phenotyping and genotyping experiments on the progeny of our collections. Related to aforementioned confusion, I'm curious how these analyses get referenced in chado.  For instance, does a plant get a single entry in the nd_experiment table which is used to connect to transcriptome, genotype, and phenotype tables?  Or is there a separate nd_experiment entry for each type of downstream analysis performed?

4.  Have people already developed tools for entering this sort of data into a Chado database?  I just saw the Tripal general loader at a workshop and I can start to look at that (though I can't find it on my Tripal installation for some reason).  Are there others that are more specific to natural collections?

Many thanks, 

Sanjuro



------------------------------------------------------------------------------
See everything from the browser to the database with AppDynamics
Get end-to-end visibility with application monitoring from AppDynamics
Isolate bottlenecks and diagnose root cause in seconds.
Start your free trial of AppDynamics Pro today!
http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: collections and pedigree data in chado

Bob MacCallum
Hi Sanjuro,

Hopefully the other Natural Diversity users will chime in, but here's my 2p.

On Thu, Jul 25, 2013 at 8:22 PM, Sanjuro Jogdeo <[hidden email]> wrote:

1.  I'm assuming that the proper way to record progeny will be through the stock and stock_relationship tables.  One of the problems I anticipate is difficulty linking 3rd or 4th generation plants with the ancestral population data (location, elevation, GPS coords, etc.).  I would have to recursively travel up through several levels of stock relationships to get the population data, and if the progeny were a cross, this could get complicated.  I met Stephen Ficklin at a workshop this past week and he said that one solution would be similar to what has been done with the cvtermpath table.  That is, link each stock to each of its ancestors directly in something like a stock_relationship_path table.  This seems like a good solution to me, but I wanted to see if anyone out there had any experience (and perhaps even tools ;-) ) with this sort of thing.


I won't answer this, but it sounds like an interesting problem.

 
2.  In general, I'm a little confused about how often nd_experiment is supposed to be referenced when gathering this sort of data.  For instance, does each new generation that is grown get linked to the nd_experiment table, even though it seems like the stock and stock_relationship tables can handle the data by themselves?


You could link stocks without nd_experiment (which you can think of as nd_assay), but if you use it you can describe the cross in more detail (e.g. where it was done, protocols, performers, reagents, etc.).

We at VectorBase (in our population biology browser) use nd_experiment to describe field collections.  The nd_experiment_stock.type is "assay creates stock" (in our private CV).

Other assays, such as phenotypes, link to stock with nd_experiment_stock.type="assay uses stock".

In your case a cross would both use (parents) and create (offspring) stocks.

 
3.  We will be doing transcriptome, phenotyping and genotyping experiments on the progeny of our collections. Related to aforementioned confusion, I'm curious how these analyses get referenced in chado.  For instance, does a plant get a single entry in the nd_experiment table which is used to connect to transcriptome, genotype, and phenotype tables?  Or is there a separate nd_experiment entry for each type of downstream analysis performed?

I would use a separate nd_experiment entry for each type of assay because then nd_experiment.type can describe what kind of assay it is.

Again, think of nd_experiment as an assay, not an experiment.  Use project for the higher level.
 

4.  Have people already developed tools for entering this sort of data into a Chado database?  I just saw the Tripal general loader at a workshop and I can start to look at that (though I can't find it on my Tripal installation for some reason).  Are there others that are more specific to natural collections?


I can't speak for Tripal - haven't had time to try it.  We have a Perl-based loader on GitHub.  The input format is ISA-Tab.  It's not intended for other people to use, but you might be able to learn from our mistakes/successes:
https://github.com/bobular/VBPopBio/tree/master/api/Bio-Chado-VBPopBio

We basically wrapped all the result classes of Bio-Chado-Schema that we need - and then provide higher-level functionality (e.g. stock->best_species https://github.com/bobular/VBPopBio/blob/master/api/Bio-Chado-VBPopBio/lib/Bio/Chado/VBPopBio/Result/Stock.pm#L424) and an extra direct stock<->project relationship (with adding new tables, very cunning...)

A simpler option might have been just to fork Bio-Chado-Schema :-)

Oh, and by the way we don't do genotypes properly in Chado (e.g. linked to genome features etc) - that's because most of our genotypes are in another database (Ensembl).  But we do store some structural variations (inversions) as text with some SO terms to help describe them.

Chado is supposed to be flexible and there is definitely more than one way to do what you want to do - try not to let that put you off!

HTH,
cheers,
Bob

 

Re: collections and pedigree data in chado

Naama Menda-3

hi Sanjuro,

In general I agree with everything Bob mentioned. I'm adding a few more points inline:

1.  I'm assuming that the proper way to record progeny will be through the stock and stock_relationship tables.  One of the problems I anticipate is difficulty linking 3rd or 4th generation plants with the ancestral population data (location, elevation, GPS coords, etc.).  I would have to recursively travel up through several levels of stock relationships to get the population data, and if the progeny were a cross, this could get complicated.  I met Stephen Ficklin at a workshop this past week and he said that one solution would be similar to what has been done with the cvtermpath table.  That is, link each stock to each of its ancestors directly in something like a stock_relationship_path table.  This seems like a good solution to me, but I wanted to see if anyone out there had any experience (and perhaps even tools ;-) ) with this sort of thing.


 
A path table could solve this problem, but the question is how you would like to represent the data. If you want to access, from each descendant, the information of ancestors several generations upstream, then a path table may be best.
At SGN and CassavaBase we don't (yet) have more than 4 generations; however, each descendant plant has its own geolocation data, which is completely separate from the geolocation information of the parents. We don't display metadata of the parental population.
I don't see an easy way to traverse up and down stock_relationship with the current design.
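
For what it's worth, here is a minimal sketch of walking up stock_relationship with a recursive query rather than a path table. It is an illustration under stated assumptions, not a tested Chado recipe: it runs against an in-memory SQLite database with cut-down, hypothetical versions of the stock and stock_relationship tables (a plain text `type` column stands in for type_id/cvterm), it assumes the parent is the subject and the offspring is the object of 'maternal_parent_of'/'paternal_parent_of' rows, and it assumes the pedigree has no cycles. PostgreSQL also accepts WITH RECURSIVE, so the same query shape should carry over.

```python
import sqlite3

# Cut-down, hypothetical stand-ins for Chado's stock and stock_relationship.
SCHEMA = """
CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, uniquename TEXT);
CREATE TABLE stock_relationship (
    subject_id INTEGER,  -- assumed: the parent
    object_id  INTEGER,  -- assumed: the offspring
    type       TEXT      -- stands in for type_id -> cvterm
);
"""

# Walk from one stock up through every parent link (assumes an acyclic pedigree).
ANCESTORS_SQL = """
WITH RECURSIVE ancestry(ancestor_id, distance) AS (
    SELECT subject_id, 1
      FROM stock_relationship
     WHERE object_id = :stock_id
       AND type IN ('maternal_parent_of', 'paternal_parent_of')
    UNION
    SELECT sr.subject_id, a.distance + 1
      FROM stock_relationship sr
      JOIN ancestry a ON sr.object_id = a.ancestor_id
     WHERE sr.type IN ('maternal_parent_of', 'paternal_parent_of')
)
SELECT s.uniquename, MIN(a.distance) AS generations_up
  FROM ancestry a
  JOIN stock s ON s.stock_id = a.ancestor_id
 GROUP BY s.stock_id, s.uniquename
 ORDER BY generations_up, s.uniquename
"""


def ancestors(conn, stock_id):
    """Return (uniquename, generations_up) for every ancestor of stock_id."""
    return conn.execute(ANCESTORS_SQL, {"stock_id": stock_id}).fetchall()


def demo_db():
    """Build a made-up three-generation pedigree: F2 = F1 x wild_C, F1 = wild_A x wild_B."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO stock VALUES (?, ?)",
        [(1, "wild_A"), (2, "wild_B"), (3, "F1"), (4, "wild_C"), (5, "F2")],
    )
    conn.executemany(
        "INSERT INTO stock_relationship VALUES (?, ?, ?)",
        [
            (1, 3, "maternal_parent_of"),
            (2, 3, "paternal_parent_of"),
            (3, 5, "maternal_parent_of"),
            (4, 5, "paternal_parent_of"),
        ],
    )
    return conn


if __name__ == "__main__":
    for name, generations in ancestors(demo_db(), 5):
        print(name, generations)
```

The trade-off against a path table is the usual one: the recursive query needs no extra table to keep in sync, while a precomputed table makes every lookup a plain join.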
 
2.  In general, I'm a little confused about how often nd_experiment is supposed to be referenced when gathering this sort of data.  For instance, does each new generation that is grown get linked to the nd_experiment table, even though it seems like the stock and stock_relationship tables can handle the data by themselves?


For crosses we store an nd_experiment of type 'cross' (an internal cvterm), and all the properties of the cross are stored in nd_experimentprop.
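
To make the pieces concrete, here is a rough sketch of one cross recorded as a single nd_experiment of type 'cross', with its properties in nd_experimentprop and with the "assay uses stock" / "assay creates stock" link types from Bob's reply on the parent and offspring stocks. The tables are simplified, hypothetical stand-ins (text `type` columns instead of cvterm ids), the property names and values are made up, and SQLite is used only so the example runs on its own.

```python
import sqlite3

# Simplified, hypothetical stand-ins for the Chado tables named in the thread.
SCHEMA = """
CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, uniquename TEXT);
CREATE TABLE nd_experiment (nd_experiment_id INTEGER PRIMARY KEY, type TEXT);
CREATE TABLE nd_experimentprop (nd_experiment_id INTEGER, type TEXT, value TEXT);
CREATE TABLE nd_experiment_stock (nd_experiment_id INTEGER, stock_id INTEGER, type TEXT);
"""


def record_cross(conn, parent_ids, offspring_ids, props):
    """Store one cross: it uses the parent stocks and creates the offspring stocks."""
    experiment_id = conn.execute(
        "INSERT INTO nd_experiment (type) VALUES ('cross')"
    ).lastrowid
    conn.executemany(
        "INSERT INTO nd_experimentprop VALUES (?, ?, ?)",
        [(experiment_id, name, value) for name, value in props.items()],
    )
    links = [(experiment_id, sid, "assay uses stock") for sid in parent_ids]
    links += [(experiment_id, sid, "assay creates stock") for sid in offspring_ids]
    conn.executemany("INSERT INTO nd_experiment_stock VALUES (?, ?, ?)", links)
    return experiment_id


def parents_of(conn, offspring_id):
    """Find the stocks used by the cross that created offspring_id."""
    rows = conn.execute(
        """
        SELECT p.stock_id
          FROM nd_experiment_stock o
          JOIN nd_experiment e
            ON e.nd_experiment_id = o.nd_experiment_id AND e.type = 'cross'
          JOIN nd_experiment_stock p
            ON p.nd_experiment_id = o.nd_experiment_id
         WHERE o.stock_id = ?
           AND o.type = 'assay creates stock'
           AND p.type = 'assay uses stock'
         ORDER BY p.stock_id
        """,
        (offspring_id,),
    ).fetchall()
    return [stock_id for (stock_id,) in rows]


def demo_db():
    """Two made-up parents and one made-up offspring joined by a single cross."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO stock VALUES (?, ?)",
        [(1, "parent_A"), (2, "parent_B"), (3, "offspring_1")],
    )
    # Property names are illustrative only, not terms from any real CV.
    record_cross(conn, [1, 2], [3], {"cross_date": "2013-07-25", "site": "greenhouse_1"})
    return conn


if __name__ == "__main__":
    print(parents_of(demo_db(), 3))
```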

 
 
3.  We will be doing transcriptome, phenotyping and genotyping experiments on the progeny of our collections. Related to aforementioned confusion, I'm curious how these analyses get referenced in chado.  For instance, does a plant get a single entry in the nd_experiment table which is used to connect to transcriptome, genotype, and phenotype tables?  Or is there a separate nd_experiment entry for each type of downstream analysis performed?

 
We define an nd_experiment as an 'observation unit', usually a phenotyping or genotyping event performed on a certain date at a certain location by a certain person. So a single nd_experiment could hold a single phenotype, or a bunch of phenotypes recorded in the field during a single morning.
Genotyping assays, like a GBS run, would get a single nd_experiment id.


4.  Have people already developed tools for entering this sort of data into a Chado database?  I just saw the Tripal general loader at a workshop and I can start to look at that (though I can't find it on my Tripal installation for some reason).  Are there others that are more specific to natural collections?



SGN also uses Bio::Chado::Schema. Since most data comes from breeders, we have a wrapper for spreadsheets, and a certain format and certain fields are expected. This is how uploaded phenotyping data is handled: 


 




Hope this helps!
-Naama

 

Naama Menda
Boyce Thompson Institute for Plant Research
Tower Rd
Ithaca NY 14853
USA

(607) 254 3569
Sol Genomics Network
http://solgenomics.net/
[hidden email]


Re: collections and pedigree data in chado

Sook Jung
Hello,
I agree with what Bob and Naama wrote (since we developed the ND module together :). I work for GDR, so it would be beneficial to store your data the same way we do so that we can easily exchange data. I just wanted to add a bit more below.


 
3.  We will be doing transcriptome, phenotyping and genotyping experiments on the progeny of our collections. Related to aforementioned confusion, I'm curious how these analyses get referenced in chado.  For instance, does a plant get a single entry in the nd_experiment table which is used to connect to transcriptome, genotype, and phenotype tables?  Or is there a separate nd_experiment entry for each type of downstream analysis performed?


For phenotypic data we create a separate 'sample' entry in the stock table, create an nd_experiment entry, and associate them with distinct phenotypic data. We link samples to the germplasm using the stock_relationship table.
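
As a rough illustration of that layout, the sketch below creates a 'sample' stock for each measurement, links it to its germplasm through stock_relationship, and hangs the phenotype off a dedicated nd_experiment. Everything here is an assumption made for the example: the tables are simplified SQLite stand-ins for their Chado namesakes, the 'sample_of' relationship term and 'phenotyping' experiment type are hypothetical labels, and the trait values are invented.

```python
import sqlite3

# Simplified, hypothetical stand-ins for the Chado tables named in the thread.
SCHEMA = """
CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, uniquename TEXT, type TEXT);
CREATE TABLE stock_relationship (subject_id INTEGER, object_id INTEGER, type TEXT);
CREATE TABLE nd_experiment (nd_experiment_id INTEGER PRIMARY KEY, type TEXT);
CREATE TABLE nd_experiment_stock (nd_experiment_id INTEGER, stock_id INTEGER);
CREATE TABLE phenotype (phenotype_id INTEGER PRIMARY KEY, trait TEXT, value TEXT);
CREATE TABLE nd_experiment_phenotype (nd_experiment_id INTEGER, phenotype_id INTEGER);
"""


def record_phenotype(conn, germplasm_id, sample_name, trait, value):
    """Create a sample stock, tie it to its germplasm, and attach one phenotype."""
    sample_id = conn.execute(
        "INSERT INTO stock (uniquename, type) VALUES (?, 'sample')", (sample_name,)
    ).lastrowid
    # 'sample_of' is a hypothetical term: sample (subject) -> germplasm (object).
    conn.execute(
        "INSERT INTO stock_relationship VALUES (?, ?, 'sample_of')",
        (sample_id, germplasm_id),
    )
    experiment_id = conn.execute(
        "INSERT INTO nd_experiment (type) VALUES ('phenotyping')"
    ).lastrowid
    conn.execute(
        "INSERT INTO nd_experiment_stock VALUES (?, ?)", (experiment_id, sample_id)
    )
    phenotype_id = conn.execute(
        "INSERT INTO phenotype (trait, value) VALUES (?, ?)", (trait, value)
    ).lastrowid
    conn.execute(
        "INSERT INTO nd_experiment_phenotype VALUES (?, ?)",
        (experiment_id, phenotype_id),
    )
    return experiment_id


def phenotypes_for_germplasm(conn, germplasm_id):
    """Collect (sample, trait, value) for every sample of one germplasm."""
    return conn.execute(
        """
        SELECT s.uniquename, p.trait, p.value
          FROM stock_relationship sr
          JOIN stock s ON s.stock_id = sr.subject_id
          JOIN nd_experiment_stock es ON es.stock_id = s.stock_id
          JOIN nd_experiment_phenotype ep ON ep.nd_experiment_id = es.nd_experiment_id
          JOIN phenotype p ON p.phenotype_id = ep.phenotype_id
         WHERE sr.object_id = ? AND sr.type = 'sample_of'
         ORDER BY s.uniquename
        """,
        (germplasm_id,),
    ).fetchall()


def demo_db():
    """One made-up germplasm with two sampled measurements."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO stock (stock_id, uniquename, type) VALUES (1, 'accession_1', 'germplasm')"
    )
    record_phenotype(conn, 1, "accession_1_sample_A", "fruit_weight_g", "1.8")
    record_phenotype(conn, 1, "accession_1_sample_B", "fruit_weight_g", "2.1")
    return conn


if __name__ == "__main__":
    for row in phenotypes_for_germplasm(demo_db(), 1):
        print(row)
```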


4.  Have people already developed tools for entering this sort of data into a Chado database?  I just saw the Tripal general loader at a workshop and I can start to look at that (though I can't find it on my Tripal installation for some reason).  Are there others that are more specific to natural collections?


We use the Tripal bulk uploader for genomic data and have so far used our own uploader for ND data (we were doing this before the bulk uploader was ready), but we'll soon be using the bulk uploader. I think you should be able to create templates with the bulk uploader. Stephen is in our group.
Thanks
Sook


Re: collections and pedigree data in chado

Sook Jung
In reply to this post by Naama Menda-3
Hi Sanjuro,
I was thinking about your ancestor and progeny problem. One way of handling it is to store the ancestor as an 'ancestor' type and the other as 'progeny', or whatever terms you want to use. You can still link the direct parent and progeny using terms like 'maternal_parent_of'/'paternal_parent_of' in the stock_relationship table, but make an additional association using a term like 'ancestor_of' so that you can easily get the properties out.
It would be nice if everyone stored things the same way, and we should really come up with a stock ontology and/or a stock relationship ontology so that we can standardize things, but we all have deadlines to get the work done at the same time...
Sook
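
One way the 'ancestor_of' idea could be sketched: derive the extra rows from the direct parent links, then read ancestral properties with a single join. This is only an illustration under assumptions. The tables are simplified SQLite stand-ins, the parent is assumed to be the subject of each relationship, direct parents are also given an 'ancestor_of' row, and `collection_site` is a hypothetical column standing in for wherever the collection metadata really lives (in Chado that would normally be reached through nd_experiment and nd_geolocation).

```python
import sqlite3
from collections import defaultdict

PARENT_TYPES = ("maternal_parent_of", "paternal_parent_of")

# Simplified stand-ins; collection_site is a hypothetical shortcut column.
SCHEMA = """
CREATE TABLE stock (stock_id INTEGER PRIMARY KEY, uniquename TEXT, collection_site TEXT);
CREATE TABLE stock_relationship (subject_id INTEGER, object_id INTEGER, type TEXT);
"""


def add_ancestor_links(conn):
    """Rebuild all 'ancestor_of' rows from the direct parent links."""
    parents = defaultdict(set)
    rows = conn.execute(
        "SELECT subject_id, object_id FROM stock_relationship WHERE type IN (?, ?)",
        PARENT_TYPES,
    )
    for parent_id, child_id in rows:
        parents[child_id].add(parent_id)

    links = []
    for stock_id in list(parents):
        seen = set()
        frontier = set(parents[stock_id])
        while frontier:
            seen |= frontier
            # Step one generation up, skipping anything already visited.
            frontier = {gp for p in frontier for gp in parents.get(p, ())} - seen
        links.extend((ancestor, stock_id, "ancestor_of") for ancestor in sorted(seen))

    conn.execute("DELETE FROM stock_relationship WHERE type = 'ancestor_of'")
    conn.executemany("INSERT INTO stock_relationship VALUES (?, ?, ?)", links)
    return len(links)


def ancestral_sites(conn, stock_id):
    """Collection sites of every ancestor of stock_id, via one join."""
    return conn.execute(
        """
        SELECT a.uniquename, a.collection_site
          FROM stock_relationship sr
          JOIN stock a ON a.stock_id = sr.subject_id
         WHERE sr.object_id = ? AND sr.type = 'ancestor_of'
           AND a.collection_site IS NOT NULL
         ORDER BY a.uniquename
        """,
        (stock_id,),
    ).fetchall()


def demo_db():
    """Made-up pedigree: F2 = F1 x wild_C, F1 = wild_A x wild_B."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO stock VALUES (?, ?, ?)",
        [
            (1, "wild_A", "site_1"),
            (2, "wild_B", "site_2"),
            (3, "F1", None),
            (4, "wild_C", "site_3"),
            (5, "F2", None),
        ],
    )
    conn.executemany(
        "INSERT INTO stock_relationship VALUES (?, ?, ?)",
        [
            (1, 3, "maternal_parent_of"),
            (2, 3, "paternal_parent_of"),
            (3, 5, "maternal_parent_of"),
            (4, 5, "paternal_parent_of"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_db()
    add_ancestor_links(conn)
    print(ancestral_sites(conn, 5))
```

The derived rows have to be rebuilt whenever parent links change, which is why the function deletes and regenerates them rather than appending.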



Re: collections and pedigree data in chado

Sanjuro Jogdeo
I just realized that my last message bounced.  It was just to say thanks for the help, but I wanted to make sure it went through.  Thanks!

Sanjuro

On Mon, Jul 29, 2013 at 9:37 AM, Sanjuro Jogdeo <[hidden email]> wrote:
Thanks to everyone for all your comments and links.  They are extremely helpful!  I'm going to have some follow-up questions, but I think I'll try loading some data before asking them.

Thanks again!

Sanjuro



On Fri, Jul 26, 2013 at 11:43 AM, Sook Jung <[hidden email]> wrote:
Hi Sanjuro,
I was thinking about your ancestor and progeny problem and one way of doing that is storing the ancestor as 'ancestor' type and the other as 'progeny' or whatever term you want to use. You can still link the direct parent and progeny using terms like 'maternal_parent_of'/'paternal_parent_of' in stock_relationship table but make additional association using terms like 'ancestor_of' so that you can just easily get the property out of it..
It would be nice if everyone store things the same way and we should really come up with stock ontology and/or stock relationship ontology so that we can standardize things but we all have deadlines to get the work done at the same time...
Sook

 

On Thu, Jul 25, 2013 at 11:37 PM, Naama Menda <[hidden email]> wrote:

hi Sanjuro,

in general I agree with everything Bob mentioned . I'm adding inline a few more points:

On Thu, Jul 25, 2013 at 6:29 PM, Bob MacCallum <[hidden email]> wrote:
Hi Sanjuro,

Hopefully the other Natural Diversity users will chime in, but here's my 2p.

On Thu, Jul 25, 2013 at 8:22 PM, Sanjuro Jogdeo <[hidden email]> wrote:
Hello all, 


I'm working on a project that is doing extensive natural population collections of 10ish species of wild strawberry.  We plan to store collection data and downstream analysis in a chado database and make the data available via a Tripal enabled website. 

I'm trying to figure out the best way to load and store the initial collections, the various types of progeny that will follow (crosses, clones, induced polyploids, etc...), and the downstream analyses (genotyping, phenotyping, transcriptomes, etc...).  I have lots of questions.

1.  I'm assuming that the proper way to record progeny will be through the stock and stock relationship tables.  One of the problems I anticipate is difficulty linking 3rd or 4th generation plants with the ancestral population data (location, elevation, GPS coords, etc...).  I would have to recursively travel up through several levels of stock relationships to get the population data and if the progeny were a cross, this could get complicated.  I met Stephen Ficklin at a workshop this past week and he said that one solution would be similar to what has been done with the cvtermpath table.  That is, link each stock to each of it's ancestors directly in something like a stock_relationship_path table.  This seems like a good solution to me but I wanted to see if anyone out there had any experience (and perhaps even tools ;-)  )  for doing this sort of thing.  


won't answer this - but it sounds like an interesting problem

 
a path table could solve this problem, but the question is how would you like to represent the data? If you are interested to access from each descendant the information of parents several generations upstream, then maybe a path table would be best.
At SGN and CassavaBase  we don't have (yet) more than 4 generations, however, each descendant plant has its own geolocation data, which is completely separated from the geolocation information of the parents. We don't display metadata of the parental population.
I don't see an easy way for traversing up and down stock_relationship with the current design. 
 
2.  In general, I'm a little bit confused by how often nd_experment is supposed to be referenced when gathering this sort of data.  For instance, does each new generation that is grown get linked to the nd_experiment table, even though it seems like the stock and stock_relationship table can handle the data by themselves?  


you could link stocks without nd_experiment (which you can think of as nd_assay), but if you use it you can describe the cross in more detail (e.g. where it was done, protocols, performers, reagents etc).

We at VectorBase (in our population biology browser) use nd_experiment to describe field collections. The nd_experiment_stock.type is "assay creates stock" (in our private CV).

Other assays, such as phenotypes, link to stock with nd_experiment_stock.type="assay uses stock".

In your case a cross would both use (parents) and create (offspring) stocks.

For crosses we store an nd_experiment of type 'cross' (an internal cvterm), and all the properties of the cross are stored in nd_experimentprop.
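
To make the "uses"/"creates" idea concrete, here is a rough sketch that records a cross as one nd_experiment row. It assumes cut-down versions of the Chado tables in which type is stored as plain text rather than as a type_id pointing at cvterm, and the helper names record_cross and parents_of are made up for this example:

```python
import sqlite3

# Simplified stand-ins for the Chado tables (assumption: text types
# instead of cvterm foreign keys, and most columns left out).
CROSS_SCHEMA = """
CREATE TABLE nd_experiment (
    nd_experiment_id INTEGER PRIMARY KEY,
    type TEXT NOT NULL
);
CREATE TABLE nd_experiment_stock (
    nd_experiment_id INTEGER NOT NULL,
    stock_id INTEGER NOT NULL,
    type TEXT NOT NULL
);
CREATE TABLE nd_experimentprop (
    nd_experiment_id INTEGER NOT NULL,
    type TEXT NOT NULL,
    value TEXT
);
"""

def record_cross(conn, parent_ids, offspring_ids, props=None):
    """Store one cross: it uses the parents and creates the offspring."""
    exp_id = conn.execute(
        "INSERT INTO nd_experiment (type) VALUES ('cross')"
    ).lastrowid
    links = [(exp_id, s, "assay uses stock") for s in parent_ids]
    links += [(exp_id, s, "assay creates stock") for s in offspring_ids]
    conn.executemany("INSERT INTO nd_experiment_stock VALUES (?, ?, ?)", links)
    # Details of the cross (location, protocol, ...) go in the prop table.
    conn.executemany(
        "INSERT INTO nd_experimentprop VALUES (?, ?, ?)",
        [(exp_id, key, value) for key, value in (props or {}).items()],
    )
    return exp_id

def parents_of(conn, stock_id):
    """Find parents via the experiment that created the given stock."""
    rows = conn.execute(
        """
        SELECT used.stock_id
        FROM nd_experiment_stock AS made
        JOIN nd_experiment_stock AS used
          ON used.nd_experiment_id = made.nd_experiment_id
        WHERE made.stock_id = ?
          AND made.type = 'assay creates stock'
          AND used.type = 'assay uses stock'
        ORDER BY used.stock_id
        """,
        (stock_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

This is only an illustration of the linking pattern; a real load would also populate stock and stock_relationship, and would use cvterm ids for the types.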

 
 
3.  We will be doing transcriptome, phenotyping and genotyping experiments on the progeny of our collections. Related to aforementioned confusion, I'm curious how these analyses get referenced in chado.  For instance, does a plant get a single entry in the nd_experiment table which is used to connect to transcriptome, genotype, and phenotype tables?  Or is there a separate nd_experiment entry for each type of downstream analysis performed?

I would use a separate nd_experiment entry for each type of assay because then nd_experiment.type can describe what kind of assay it is.

Again, think of nd_experiment as an assay, not an experiment.  Use project for the higher level.
 
We define an nd_experiment as an 'observation unit', usually a phenotyping or genotyping event performed on a certain date at a certain location by a certain person. For that matter, a single phenotype could be a single nd_experiment, or a bunch of phenotypes recorded in the field during a single morning.
Genotyping assays, like a GBS run, would get a single nd_experiment id.
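
As a small illustration of "one nd_experiment per assay, project for the higher level", the sketch below again assumes simplified tables with text types instead of cvterm ids. nd_experiment_project is the Chado linking table between experiments and projects; add_assay and assay_types_for_stock are hypothetical helper names:

```python
import sqlite3

# Simplified stand-ins for the Chado tables (assumption: text types,
# most columns omitted).
ASSAY_SCHEMA = """
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE nd_experiment (
    nd_experiment_id INTEGER PRIMARY KEY,
    type TEXT NOT NULL
);
CREATE TABLE nd_experiment_project (
    nd_experiment_id INTEGER NOT NULL,
    project_id INTEGER NOT NULL
);
CREATE TABLE nd_experiment_stock (
    nd_experiment_id INTEGER NOT NULL,
    stock_id INTEGER NOT NULL,
    type TEXT NOT NULL
);
"""

def add_assay(conn, project_id, stock_ids, assay_type):
    """Create one nd_experiment for one assay and attach its stocks."""
    exp_id = conn.execute(
        "INSERT INTO nd_experiment (type) VALUES (?)", (assay_type,)
    ).lastrowid
    conn.execute(
        "INSERT INTO nd_experiment_project VALUES (?, ?)", (exp_id, project_id)
    )
    conn.executemany(
        "INSERT INTO nd_experiment_stock VALUES (?, ?, 'assay uses stock')",
        [(exp_id, stock_id) for stock_id in stock_ids],
    )
    return exp_id

def assay_types_for_stock(conn, stock_id):
    """List the types of all assays a stock took part in, oldest first."""
    rows = conn.execute(
        """
        SELECT e.type
        FROM nd_experiment AS e
        JOIN nd_experiment_stock AS es
          ON es.nd_experiment_id = e.nd_experiment_id
        WHERE es.stock_id = ?
        ORDER BY e.nd_experiment_id
        """,
        (stock_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

So a plant that is genotyped, phenotyped and sequenced ends up linked to three nd_experiment rows, while a single genotyping run covering many plants is one nd_experiment row linked to many stocks.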


4.  Have people already developed tools for entering this sort of data into a chado database?  I just saw the Tripal general loader at a workshop and I can start to look at that (though I can't find it on my Tripal installation for some reason).  Are there others that are more specific to natural collections?


I can't speak for Tripal - haven't had time to try it.  We have a Perl-based loader on github.  Input format is ISA-Tab.  It's not intended for other people to use, but you might be able to learn from our mistakes/successes:
https://github.com/bobular/VBPopBio/tree/master/api/Bio-Chado-VBPopBio

We basically wrapped all the result classes of Bio-Chado-Schema that we need - and then provide higher-level functionality (e.g. stock->best_species https://github.com/bobular/VBPopBio/blob/master/api/Bio-Chado-VBPopBio/lib/Bio/Chado/VBPopBio/Result/Stock.pm#L424) and an extra direct stock<->project relationship (with adding new tables, very cunning...)

A simpler option might have been just to fork Bio-Chado-Schema :-)

Oh, and by the way we don't do genotypes properly in Chado (e.g. linked to genome features etc) - that's because most of our genotypes are in another database (Ensembl).  But we do store some structural variations (inversions) as text with some SO terms to help describe them.

Chado is supposed to be flexible and there is definitely more than one way to do what you want to do - try not to let that put you off!

SGN also uses Bio::Chado::Schema. Since most data comes from breeders, we have a wrapper for spreadsheets, and a certain format and certain fields are expected. This is how uploaded phenotyping data is handled: 


 
HTH,
cheers,
Bob

 
Many thanks, 

Sanjuro




Hope this helps!
-Naama

 

Naama Menda
Boyce Thompson Institute for Plant Research
Tower Rd
Ithaca NY 14853
USA

(607) 254 3569
Sol Genomics Network
http://solgenomics.net/
[hidden email]

_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema




