gene families

gene families

Lukas A. Mueller
Hi,
We would like to store gene families, including alignments and trees, in Chado. So far, we store them in a separate database. What is the canonical way to store this information?
cheers
Lukas


------------------------------------------------------------------------------
ThinkGeek and WIRED's GeekDad team up for the Ultimate
GeekDad Father's Day Giveaway. ONE MASSIVE PRIZE to the
lucky parental unit.  See the prize list and enter to win:
http://p.sf.net/sfu/thinkgeek-promo
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema
Re: gene families

Scott Cain
Hi Lukas,

The silence probably gave you your answer: no, there isn't a way to
store MSAs or trees in Chado.

Scott





--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research

Re: gene families

Joshua Orvis
Lukas -

My guess is that it depends on what sorts of queries you want to run against these data once they are stored.  Scott's right that there may not be a canonical way to do it so far, but if storage is your primary concern there are definitely ways to do it.

We currently routinely store gene clusters (computed on protein similarity scores) and defining gene families could be done by a similar mechanism.  For example, each of your genes would be features within the feature table and then you could create a feature to represent the family, followed by 'member_of' or 'part_of' feature_relationship table entries for each of your genes to that family record.
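A minimal sketch of that pattern, assuming a simplified SQLite stand-in for the Chado feature, cvterm and feature_relationship tables. Real Chado runs on PostgreSQL and requires more columns (organism_id, uniquename, cv_id and so on); the 'gene_family' type term and the helper names here are hypothetical.

```python
import sqlite3

# Simplified stand-ins for three Chado tables; the real schema has more
# columns and constraints than shown here.
SCHEMA = """
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    name TEXT,
    type_id INTEGER REFERENCES cvterm(cvterm_id)
);
CREATE TABLE feature_relationship (
    subject_id INTEGER REFERENCES feature(feature_id),
    object_id INTEGER REFERENCES feature(feature_id),
    type_id INTEGER REFERENCES cvterm(cvterm_id),
    UNIQUE (subject_id, object_id, type_id)
);
"""


def term(db, name):
    """Return the cvterm_id for name, creating the term if it is missing."""
    db.execute("INSERT OR IGNORE INTO cvterm (name) VALUES (?)", (name,))
    row = db.execute("SELECT cvterm_id FROM cvterm WHERE name = ?", (name,))
    return row.fetchone()[0]


def add_feature(db, name, type_name):
    """Insert a feature row of the given type and return its feature_id."""
    cur = db.execute(
        "INSERT INTO feature (name, type_id) VALUES (?, ?)",
        (name, term(db, type_name)),
    )
    return cur.lastrowid


def add_member(db, gene_id, family_id):
    """Link a gene (subject) to a family (object) with a 'member_of' row."""
    db.execute(
        "INSERT INTO feature_relationship VALUES (?, ?, ?)",
        (gene_id, family_id, term(db, "member_of")),
    )


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

# One feature for the family itself ('gene_family' is a hypothetical term).
family = add_feature(db, "family_0001", "gene_family")
for gene_name in ("geneA", "geneB"):
    add_member(db, add_feature(db, gene_name, "gene"), family)

# List the members of the family by following the relationship rows.
members = [
    row[0]
    for row in db.execute(
        "SELECT g.name FROM feature g "
        "JOIN feature_relationship r ON r.subject_id = g.feature_id "
        "WHERE r.object_id = ? ORDER BY g.name",
        (family,),
    )
]
print(members)
```

The gene is the subject and the family is the object of each relationship, so listing a family's members is a single join on object_id.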

Storing alignments is commonplace, including even the alignment strings themselves in formats such as CIGAR.  As for the trees, they could be stored as a feature property of the gene family in something like PHYLIP tree format.  Again, the entire tree would be within a single text value in the database, so querying based on properties of the tree wouldn't be practical.

Anyway, there are probably ways to store what you're interested in but there just may not yet be a canonical way to do so.

Joshua


Re: gene families

Lukas A. Mueller
Hi Joshua,

Thanks for your answer! I agree that storing gene families using feature_relationship is probably a solution. However, each feature can be in many gene families (different levels of clustering, different input sequence sets, etc.). How could this be supported with this scheme?

In which table and column would you store the alignments and the trees? (Yes, in some common format like aligned FASTA and Newick, or the ones you suggest.)

cheers
Lukas

Re: gene families

Joshua Orvis
It's understandable that you might run multiple computes for gene family definitions, each with different parameters, and that any of your gene features can be in multiple families.  There's nothing in the schema that prevents this.  You will have one feature table entry for each gene and one for each family definition.  You can then define a one-to-many relationship between any one of your gene features and multiple family features (via feature_relationship).
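To illustrate, here is a small sketch of one gene belonging to families from two different clustering runs. It assumes a heavily simplified SQLite stand-in for the Chado tables: types are stored as plain text rather than cvterm_id references, and the feature names and the families_of helper are made up for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE feature_relationship (
    subject_id INTEGER REFERENCES feature(feature_id),
    object_id INTEGER REFERENCES feature(feature_id),
    type TEXT,
    UNIQUE (subject_id, object_id, type)
);
""")

features = [
    (1, "geneA", "gene"),
    (2, "geneB", "gene"),
    (10, "run1_family_7", "gene_family"),  # e.g. a strict clustering run
    (11, "run2_family_3", "gene_family"),  # e.g. a looser run, other inputs
]
db.executemany("INSERT INTO feature VALUES (?, ?, ?)", features)

# geneA is a member of both families; geneB only of the second one.
db.executemany(
    "INSERT INTO feature_relationship VALUES (?, ?, 'member_of')",
    [(1, 10), (1, 11), (2, 11)],
)


def families_of(gene_name):
    """Return the names of all family features the gene is a member of."""
    rows = db.execute(
        """
        SELECT fam.name
        FROM feature gene
        JOIN feature_relationship r ON r.subject_id = gene.feature_id
        JOIN feature fam ON fam.feature_id = r.object_id
        WHERE gene.name = ? AND r.type = 'member_of'
        ORDER BY fam.name
        """,
        (gene_name,),
    )
    return [name for (name,) in rows]


print(families_of("geneA"))
print(families_of("geneB"))
```

Because each compute produces its own family features, the parameters of a run could be attached to those features as properties without touching the gene records.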

Where to store your alignments and trees is debatable.  Depending on your planned usage, and if your alignment strings are large, it might be better to store them in files on disk and instead store their locations in the database.  We occasionally do this for large BLAST alignments.

If you want to store them in the database you could define them as properties of your family feature using the featureprop table.  In our case, we define a featureprop with cvterm_id corresponding to the term 'newick_tree'.
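Both options can be sketched with the same table. This assumes a simplified SQLite stand-in for cvterm, feature and featureprop; the 'alignment_file' term, the file path and the set_prop/get_prop helpers are hypothetical, and only the 'newick_tree' term name comes from the description above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE featureprop (
    feature_id INTEGER REFERENCES feature(feature_id),
    type_id INTEGER REFERENCES cvterm(cvterm_id),
    value TEXT,
    rank INTEGER DEFAULT 0,
    UNIQUE (feature_id, type_id, rank)
);
""")
db.executemany(
    "INSERT INTO cvterm VALUES (?, ?)",
    [(1, "newick_tree"), (2, "alignment_file")],  # second term is hypothetical
)
db.execute("INSERT INTO feature VALUES (1, 'family_0001')")


def set_prop(feature_id, term_name, value):
    """Attach a property of the named type to a feature."""
    type_id = db.execute(
        "SELECT cvterm_id FROM cvterm WHERE name = ?", (term_name,)
    ).fetchone()[0]
    db.execute(
        "INSERT INTO featureprop (feature_id, type_id, value) VALUES (?, ?, ?)",
        (feature_id, type_id, value),
    )


def get_prop(feature_id, term_name):
    """Return the property value, or None if the feature has no such property."""
    row = db.execute(
        """
        SELECT p.value
        FROM featureprop p
        JOIN cvterm t ON t.cvterm_id = p.type_id
        WHERE p.feature_id = ? AND t.name = ?
        """,
        (feature_id, term_name),
    ).fetchone()
    return row[0] if row else None


# Small tree stored inline as a single text value.
set_prop(1, "newick_tree", "((geneA:0.1,geneB:0.2):0.05,geneC:0.3);")
# Large alignment kept on disk; only its location goes into the database.
set_prop(1, "alignment_file", "/data/families/family_0001.aln.fasta")

print(get_prop(1, "newick_tree"))
print(get_prop(1, "alignment_file"))
```

Either way the database only sees an opaque text value, so any tree-aware querying has to happen in the application after the value is fetched.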

Joshua



Re: gene families

Robson de Souza
By the way, dummy question: I see all the discussion is focused on how
to represent the data in Chado but nobody has mentioned how to bring
it into the database, once a particular representation is chosen.

I understand that if the data is represented as gene annotation in a
GFF file, maybe the GFF loading scripts can be used to update gene
annotation, thus adding, e.g., the family assignments. Is this correct?
Or do you guys use custom software to populate the feature and
feature_relationship tables when adding this sort of information?

Best,
Robson

Re: gene families

David Emmert
In reply to this post by Lukas A. Mueller
Hi Lukas,

It's maybe a bit un-Chado to create feature records for things like gene families,
which aren't actually features.  That said, you gotta do what you gotta do, but
here are a couple of additional ideas you might consider.  I don't know what your
alignment and tree data look like, so I can't really say what's most appropriate.

1) Have you considered using the Library module?  FlyBase uses this for organizing
any kind of collection of features.   You could maybe store the alignments & trees
as library_featureprops or libraryprops.

2) Or I wonder if you might not consider trying to implement the families using
the same implementation pattern as for any evidence alignments?  If you could
envision handling your alignments part of your data that way, then maybe the
trees could be stored... how... as analysisprops?
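A rough sketch of the first idea only, assuming a simplified SQLite stand-in for the Library module tables (library, library_feature, libraryprop). The column sets are reduced, types are plain text instead of cvterm references, and the 'gene_family' and 'newick_tree' type names are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE library (library_id INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE library_feature (
    library_id INTEGER REFERENCES library(library_id),
    feature_id INTEGER REFERENCES feature(feature_id),
    UNIQUE (library_id, feature_id)
);
CREATE TABLE libraryprop (
    library_id INTEGER REFERENCES library(library_id),
    type TEXT,
    value TEXT
);
""")

db.executemany("INSERT INTO feature VALUES (?, ?)", [(1, "geneA"), (2, "geneB")])

# The family is a library (a collection of features), not a feature itself.
db.execute("INSERT INTO library VALUES (1, 'family_0001', 'gene_family')")
db.executemany("INSERT INTO library_feature VALUES (1, ?)", [(1,), (2,)])

# The tree hangs off the collection as a library property.
db.execute(
    "INSERT INTO libraryprop VALUES (1, 'newick_tree', '(geneA:0.1,geneB:0.2);')"
)

family = db.execute("""
    SELECT l.name, COUNT(lf.feature_id), p.value
    FROM library l
    JOIN library_feature lf ON lf.library_id = l.library_id
    JOIN libraryprop p ON p.library_id = l.library_id AND p.type = 'newick_tree'
    WHERE l.library_id = 1
    GROUP BY l.name, p.value
""").fetchone()
print(family)
```

Compared with the feature_relationship approach, this keeps the feature table limited to things that are actually features, at the cost of a second set of tables to join through.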

Hope this helps, or at least doesn't hurt,

-Dave

>> a><br>
>> &gt;<br>
>> <br>
>> </div></div></blockquote></div><br>


Re: gene families

Joshua Orvis
In reply to this post by Robson de Souza
Robson -

I can't speak to the official GMOD view, and I don't know enough about the GFF loaders to comment on them.  Since I was giving our examples: we are one of the groups that use Ergatis, which has one tool/component (initdb) that creates a Chado instance from the DDLs and ontology files, and another (bsml2chado) that loads our BSML output into a Chado instance.  Of course, this means that each of our components, such as clustal, which creates multiple sequence alignments, has a transformation step to convert its raw output into BSML.  In this way the same tool handles loading of gene predictions, BLAST alignments, etc.

Others can chime in on their own loading mechanisms if they use something other than the GFF loaders.

Joshua
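
For anyone taking the custom-software route rather than GFF or BSML, here is a minimal sketch of the representation discussed in this thread: one feature row per gene, one per family, 'member_of' rows in feature_relationship, and the tree as a featureprop of type 'newick_tree'. It is not what bsml2chado or the GFF loaders do. The tables are simplified stand-ins for the Chado ones (the real schema has more required columns), SQLite is used only so the example runs on its own, and the 'gene_family' type name, the helper functions, and the gene/family names are all hypothetical.

```python
import sqlite3

# Simplified stand-ins for four Chado tables. The real schema has more
# required columns (organism_id, cv_id, dbxref_id, rank, ...).
SCHEMA = """
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT UNIQUE NOT NULL,
    type_id INTEGER NOT NULL REFERENCES cvterm (cvterm_id)
);
CREATE TABLE feature_relationship (
    subject_id INTEGER NOT NULL REFERENCES feature (feature_id),
    object_id INTEGER NOT NULL REFERENCES feature (feature_id),
    type_id INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    UNIQUE (subject_id, object_id, type_id)
);
CREATE TABLE featureprop (
    feature_id INTEGER NOT NULL REFERENCES feature (feature_id),
    type_id INTEGER NOT NULL REFERENCES cvterm (cvterm_id),
    value TEXT
);
"""


def cvterm_id(conn, name):
    """Return the id of the named term, creating the term if needed."""
    conn.execute("INSERT OR IGNORE INTO cvterm (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT cvterm_id FROM cvterm WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def feature_id(conn, uniquename, type_name):
    """Return the id of the named feature, creating the feature if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO feature (uniquename, type_id) VALUES (?, ?)",
        (uniquename, cvterm_id(conn, type_name)),
    )
    row = conn.execute(
        "SELECT feature_id FROM feature WHERE uniquename = ?", (uniquename,)
    ).fetchone()
    return row[0]


def load_family(conn, family, genes, newick=None):
    """Store one family feature, its members, and optionally its tree."""
    fam = feature_id(conn, family, "gene_family")  # hypothetical type name
    member_of = cvterm_id(conn, "member_of")
    for gene in genes:
        # One feature_relationship row per (gene, family) membership.
        conn.execute(
            "INSERT OR IGNORE INTO feature_relationship "
            "(subject_id, object_id, type_id) VALUES (?, ?, ?)",
            (feature_id(conn, gene, "gene"), fam, member_of),
        )
    if newick is not None:
        # The whole tree is a single text value, so it cannot be queried
        # by its internal structure.
        conn.execute(
            "INSERT INTO featureprop (feature_id, type_id, value) "
            "VALUES (?, ?, ?)",
            (fam, cvterm_id(conn, "newick_tree"), newick),
        )


def families_of(conn, gene):
    """List the families a gene belongs to, sorted by family name."""
    rows = conn.execute(
        "SELECT fam.uniquename FROM feature_relationship fr "
        "JOIN feature g ON g.feature_id = fr.subject_id "
        "JOIN feature fam ON fam.feature_id = fr.object_id "
        "JOIN cvterm t ON t.cvterm_id = fr.type_id "
        "WHERE g.uniquename = ? AND t.name = 'member_of' "
        "ORDER BY fam.uniquename",
        (gene,),
    )
    return [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Two clustering runs with different parameters; geneA ends up in both.
load_family(conn, "fam_strict_1", ["geneA", "geneB"], newick="(geneA,geneB);")
load_family(conn, "fam_loose_1", ["geneA", "geneB", "geneC"])
print(families_of(conn, "geneA"))  # ['fam_loose_1', 'fam_strict_1']
```

Because membership lives in feature_relationship rather than on the gene row, the same gene can belong to any number of families, which is the case raised earlier in the thread.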



On Thu, Jun 24, 2010 at 8:41 AM, Robson de Souza <[hidden email]> wrote:
By the way, a basic question: I see that the discussion is focused on how
to represent the data in Chado, but nobody has mentioned how to bring
it into the database once a particular representation is chosen.

I understand that if the data is represented as gene annotation in a
GFF file, the GFF loading scripts could perhaps be used to update the gene
annotation, thus adding, e.g., the family assignments. Is this correct?
Or do you guys use custom software to populate the feature and
feature_relationship tables when adding this sort of information?

Best,
Robson

On Thu, Jun 24, 2010 at 1:07 AM, Joshua Orvis <[hidden email]> wrote:
> It's understandable that you might run multiple computes for gene family
> definitions, each with different parameters, and that any of your gene
> features can be in multiple families.  There's nothing in the schema that
> prevents this.  You will have one feature table entry for each gene and one
> for each family definition.  You can then relate any one of your gene
> features to multiple family features, with one feature_relationship row
> per membership.
>
> Where to store your alignments and trees is debatable.  Depending on your
> planned usage, and if your alignment strings are large, it might be better
> to store them in files on disk and instead store their locations in the
> database.  We occasionally do this for large BLAST alignments.
>
> If you want to store them in the database you could define them as
> properties of your family feature using the featureprop table.  In our case,
> we define a featureprop with cvterm_id corresponding to the term
> 'newick_tree'.
>
> Joshua
>
>
>
> On Wed, Jun 23, 2010 at 9:30 AM, Lukas Mueller <[hidden email]> wrote:
>>
>> Hi Joshua,
>>
>> thanks for your answer! I agree that storing gene families using
>> feature_relationship is probably a solution. However, each feature can be in
>> many gene families (different levels of clustering, different input sequence
>> sets, etc), how could this be supported with this scheme?
>>
>> In which table and column would you store the alignments and the trees?
>> (Yes, in some common format like aligned fasta and newick or the ones you
>> suggest).
>>
>> cheers
>> Lukas
>>
>> On Jun 23, 2010, at 9:38 AM, Joshua Orvis wrote:
>>
>> > Lukas -
>> >
>> > My guess here would be that it depends on what sorts of queries you want
>> > to be able to run against these data once they were stored.  Scott's right
>> > that there may not be a canonical way to do it so far, but if storage is
>> > your primary concern there are definitely ways to do it.
>> >
>> > We currently routinely store gene clusters (computed on protein
>> > similarity scores) and defining gene families could be done by a similar
>> > mechanism.  For example, each of your genes would be features within the
>> > feature table and then you could create a feature to represent the family,
>> > followed by 'member_of' or 'part_of' feature_relationship table entries for
>> > each of your genes to that family record.
>> >
>> > Storing alignments is commonplace, including even the alignment strings
>> > themselves in formats such as CIGAR.  As for the trees, they could be stored as
>> > a feature property of the gene family in something like PHYLIP tree format.
>> > Again, the entire tree here would be within a single text value in the
>> > database, so querying based on properties of the tree here wouldn't be
>> > practical.
>> >
>> > Anyway, there are probably ways to store what you're interested in but
>> > there just may not yet be a canonical way to do so.
>> >
>> > Joshua
>> >
>> >
>> > On Wed, Jun 23, 2010 at 8:23 AM, Scott Cain <[hidden email]> wrote:
>> > Hi Lukas,
>> >
>> > The silence probably gave you your answer: no, there isn't a way to
>> > store MSAs or trees in Chado.
>> >
>> > Scott
>> >
>> >
>> > On Fri, Jun 18, 2010 at 1:21 PM, Lukas Mueller <[hidden email]>
>> > wrote:
>> > > Hi,
>> > > we would like to store gene families, including alignments and trees,
>> > > in Chado. So far, we store it in a separate database. What is the canonical
>> > > way to store this information?
>> > > cheers
>> > > Lukas
>> > >
>> > >
>> > >
>> > >
>> >
>> >
>> >
>> > --
>> > ------------------------------------------------------------------------
>> > Scott Cain, Ph. D.                                   scott at scottcain dot net
>> > GMOD Coordinator (http://gmod.org/)                     216-392-3087
>> > Ontario Institute for Cancer Research
>> >
>> >
>> >
>>
>
>
>
>
